How to quickly compare two urls?
Good day!
The task is to check new URL files entering the system for uniqueness. A file contains several hundred thousand unique URLs. As I understand it, comparing the strings in full is not very efficient, so I had the idea of comparing hashes of these lines instead; collisions can be neglected. Which database should I choose for storing the hashes? I could not find a way to store hashes in SQLite so that the hash of an incoming URL could be quickly compared against the stored ones. Maybe there are ready-made options, or do I need a different database?
tutunak And I have to compare each of them with the 100,000 previous ones (and the file grows with every new unique URL).
That's what databases have indexing for: they can build trees that optimize the search.
>> There was an idea that you can compare the hashes of these strings.
Why reinvent the wheel when databases already do this for you?
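To illustrate the point about letting the database do the work: a minimal sketch with SQLite, where a UNIQUE constraint plus INSERT OR IGNORE handles deduplication (the table and column names here are made up for illustration).

```python
import sqlite3

# A UNIQUE column makes SQLite build an index and reject duplicates itself.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url TEXT UNIQUE)")

def add_url(url):
    """Insert url; return True if it was new, False if already stored."""
    cur = conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
    return cur.rowcount == 1  # rowcount is 0 when the insert was ignored

print(add_url("http://google.ru/"))  # True: first time seen
print(add_url("http://google.ru/"))  # False: duplicate ignored
```

With an index in place, each lookup is a tree search rather than a scan of all previous URLs.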
Sorry, I was being dense there. Disregard everything I wrote before this.
If all the data is in a file then, as I understand it, we can read it into a single string.
Once we have read it, we don't even need to split it into an array of strings: there is a handy find method. If the substring is found, it returns the index of the character where it begins; if it is not found, it returns -1.
The entire code will look like this:
url = 'http://google.ru/'
f = open('urls.txt', 'r')
urls = f.read()
f.close()
if urls.find(url) == -1:
    f = open('urls.txt', 'a')  # 'a' appends; 'w' would overwrite the whole file
    f.write(url + '\n')
    f.close()
else:
    pass  # code that runs when the URL is found in the base
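One caveat with the approach above: find does substring matching, so a shorter URL will "match" inside a longer stored one. A small sketch of the pitfall and a whole-line check that avoids it:

```python
# find() matches substrings, which can report false "duplicates".
stored = "http://google.ru/page1\n"

# Returns 0 ("found") even though the exact URL http://google.ru/ was never stored:
print(stored.find("http://google.ru/"))

# Comparing whole lines avoids the false positive:
known = set(stored.splitlines())
print("http://google.ru/" in known)  # False: correctly treated as new
```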
Judging by the description, you need a primitive along the lines of:
x = []
for i in data.split(" "):
    if i not in x:
        x.append(i)
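With several hundred thousand URLs, `i not in x` on a list rescans the whole list on every check, which is quadratic overall. A set gives O(1) average membership tests; a sketch of the same loop with a set (the input string here is hypothetical):

```python
# Deduplicate while preserving order; the set makes each membership test O(1).
data = "http://a.ru http://b.ru http://a.ru"  # hypothetical input
seen = set()
unique = []
for i in data.split(" "):
    if i not in seen:
        seen.add(i)
        unique.append(i)
print(unique)  # ['http://a.ru', 'http://b.ru']
```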