Python
Vampre, 2018-01-04 04:30:45

Is there a way to load Cyrillic URLs via urlopen?

I wrote a script that builds a sitemap. At first I loaded pages with urlopen(url), but when a URL with a Cyrillic domain (of the form "сайт.рф") turned up, it failed with UnicodeEncodeError: 'ascii' codec can't encode characters in position 23-26: ordinal not in range(128). I tried both urllib.quote_plus and urllib.quote, but then I got ValueError: unknown url type: 'http%3A%2F%2F....'
In the end I solved the problem by switching from urllib to requests: requests.get(url) loaded everything without problems.
I'm still curious how this is solved with urllib.


1 answer(s)
Artem Sovetnikov, 2018-01-05
@Sovetnikov

The domain name must be IDNA-encoded (Punycode) before the URL is passed to urlopen, so that the URL is already plain ASCII:

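from urllib.parse import urlunparse, urlparse
from urllib.request import urlopen

url = 'http://сайт.рф'
# Split the URL so that only the host part (netloc) gets re-encoded.
scheme, netloc, path, params, query, fragment = urlparse(url)
# The 'idna' codec returns ASCII bytes; decode them back to str.
# Note: this assumes netloc is a bare host name (no port or credentials).
url = urlunparse((scheme, netloc.encode('idna').decode('ascii'), path, params, query, fragment))
r = urlopen(url)
# Print the first 50 bytes of the response body.
print(r.read()[0:50])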
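
If the path or query string can also contain Cyrillic, the same idea can be extended roughly as follows. This is only a sketch: to_ascii_url is a hypothetical helper name, not part of urllib, and it assumes an absolute URL without a user:password part. It also shows why quoting the whole URL fails: quote_plus escapes the ":" and "/" after the scheme, so quoting has to be applied per component.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url):
    """Return an ASCII-only form of an absolute URL (hypothetical helper)."""
    parts = urlsplit(url)
    # IDNA-encode only the host name, so a port number is left untouched.
    # Assumption: the URL has a host and no user:password@ prefix.
    host = parts.hostname.encode('idna').decode('ascii')
    netloc = host
    if parts.port is not None:
        netloc = '{}:{}'.format(host, parts.port)
    # Percent-encode non-ASCII characters in the path and query as UTF-8.
    # The safe sets keep separators and already-escaped %XX sequences intact.
    path = quote(parts.path, safe='/%')
    query = quote(parts.query, safe='=&%+')
    return urlunsplit((parts.scheme, netloc, path, query, parts.fragment))
```

The result can then be passed to urlopen as usual, for example urlopen(to_ascii_url('http://сайт.рф/путь')).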
