Is there a way to load Cyrillic URLs via urlopen?

Python

Vampre, 2018-01-04 04:30:45

I wrote a script that generates a sitemap. Initially I loaded pages with urlopen(url), but when a URL with a Cyrillic domain such as "site-address.rf" came up, it failed with UnicodeEncodeError: 'ascii' codec can't encode characters in position 23-26: ordinal not in range(128). I tried both urllib.quote_plus and urllib.quote, but then I got ValueError: unknown url type: 'http%3A%2F%2F....'
In the end I solved the problem by switching from urllib to requests: requests.get(url) loaded everything without problems.
I'm just wondering how this is solved in urllib.

1 answer(s)

Artem Sovetnikov, 2018-01-05
@Sovetnikov

The domain name must be IDNA-encoded (converted to Punycode) before the URL is passed to urlopen. The encoded result is plain ASCII bytes, so decode it back to a string:

from urllib.parse import urlparse, urlunparse
from urllib.request import urlopen

url = 'http://сайт.рф'
scheme, netloc, path, params, query, fragment = urlparse(url)
# IDNA-encode the host part only; the result is pure ASCII (xn--...)
netloc = netloc.encode('idna').decode('ascii')
url = urlunparse((scheme, netloc, path, params, query, fragment))
r = urlopen(url)
print(r.read()[:50])  # first 50 bytes of the response body
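
As for why quoting did not help: urllib.quote and quote_plus, when applied to the whole URL, also escape the ":" (and, for quote_plus, the "/") after the scheme, which is where 'http%3A%2F%2F...' comes from. If the path or query can also contain Cyrillic, a possible approach is to encode each part separately. Below is a minimal sketch of that idea; to_ascii_url is a hypothetical helper name, not part of urllib, and it assumes an http(s) URL with a host name and no user:password@ part.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url):
    """Return an ASCII-only form of url (hypothetical helper, a sketch).

    Assumes the URL has a host name and no userinfo (user:password@).
    """
    parts = urlsplit(url)
    # IDNA-encode only the host name; re-attach the port if there is one.
    host = parts.hostname.encode('idna').decode('ascii')
    if parts.port is not None:
        host = f'{host}:{parts.port}'
    # Percent-encode non-ASCII characters, leaving the usual delimiters
    # and any already present % escapes untouched.
    path = quote(parts.path, safe='/%')
    query = quote(parts.query, safe='=&%+')
    fragment = quote(parts.fragment, safe='%')
    return urlunsplit((parts.scheme, host, path, query, fragment))
```

With this, urlopen(to_ascii_url('http://сайт.рф/путь?q=тест')) would receive a URL whose host ends in .xn--p1ai and whose path and query are percent-encoded UTF-8, while URLs that are already ASCII pass through unchanged.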