Is there a way to load Cyrillic URLs with urlopen?
I wrote a script that builds a sitemap. At first I loaded pages with urlopen(url), but as soon as a URL with a Cyrillic domain (in the .рф zone) came up, it failed with UnicodeEncodeError: 'ascii' codec can't encode characters in position 23-26: ordinal not in range(128). I tried both urllib.quote_plus and urllib.quote on the whole URL, but that also escapes the "://" after the scheme, so the result was ValueError: unknown url type: 'http%3A%2F%2F....'
In the end I worked around the problem by switching from urllib to requests: requests.get(url) loaded everything without issues.
I'm still curious, though: how is this problem solved in urllib itself?
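The domain name has to be IDNA-encoded (Punycode) before the URL is passed to urlopen. The encoded name is plain ASCII, so it can be decoded back to str and put into the URL in place of the Cyrillic host: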
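from urllib.parse import urlparse, urlunparse
from urllib.request import urlopen

url = 'http://сайт.рф'
# Split the URL so that only the host part is converted.
scheme, netloc, path, params, query, fragment = urlparse(url)
# IDNA (Punycode) turns the Cyrillic host into an ASCII-only 'xn--...' name.
netloc = netloc.encode('idna').decode('ascii')
url = urlunparse((scheme, netloc, path, params, query, fragment))
r = urlopen(url)
print(r.read()[0:50])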
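A sitemap script will probably also run into URLs with a port or with Cyrillic in the path and query, which the snippet above does not cover. Below is a rough sketch of how that could be handled with the standard library only. The helper name to_ascii_url is made up for this example, it assumes an absolute URL with a host, and it drops any user:password part, so treat it as a starting point rather than a complete solution.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url):
    """Return an ASCII-only version of an absolute URL (hypothetical helper)."""
    parts = urlsplit(url)
    # Encode only the host name with IDNA; the port must not go through the codec.
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Percent-encode non-ASCII characters as UTF-8. '%' is kept safe so that
    # sequences which are already escaped are not encoded a second time.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


if __name__ == "__main__":
    print(to_ascii_url("http://сайт.рф/путь?q=тест"))
```

The returned string can then be passed to urlopen as usual. URLs that are already pure ASCII should come back unchanged, so the helper can be applied to every link the crawler finds.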