T
T
tutunak2016-06-30 12:17:41
Python
tutunak, 2016-06-30 12:17:41

How to quickly compare two urls?

Good day!
The task is to compare new url files that enter the system for uniqueness. the number of unique urls in the file is several hundred thousand. As I understand it, comparing strings completely is not very efficient. There was an idea that it is possible to compare hashes of these lines. Collisions can be neglected. What database to choose for hash storage? I did not find the ability to store hashes in sqlite and it was enough to quickly compare the hash of the url to compare with the hash of the url. Maybe there are some ready-made options? Or do I need to use a different database?

Answer the question

In order to leave comments, you need to log in

3 answer(s)
A
Anton, 2016-06-30
@Gokudera

tutunak And I have to compare each of them with 100,000 previous ones (+ the same file will increase with each new unique url).
That's how databases have indexing, they can build trees that optimize the search ...
>> There was an idea that you can compare the hashes of these strings.
Why write a bicycle if the bases do it for you?

A
Alexey Milyuta, 2016-07-04
@Milyuta

I'm sorry, I blunted specifically. Anything before that doesn't count.
If all the data is in the file, then as I understand it, we can read it using the following code:
After we read it, we don’t even need to split it into an array of strings. There is a wonderful find function. If the string is found, it returns the character number from which the string begins; if it is not found, it returns -1.
The entire code will look like this:

url = 'http://google.ru/'
urls = open('urls.txt', 'r').read()
find = urls.find(url)
if find==-1:
   urls.close()
   urls = open('urls.txt', 'w').write(url)
else:
   #тут код который вызываем при нахождении урла в базе.

If there are several urls, then we cut them into an array, and we look for each in the database, the search speed goes off scale (if everything is written correctly).

I
Igor Nikolaev, 2016-06-30
@nightvich

Judging by the description, you need to make a primitive in the form:

x = []
for i in data.split(" "):
    if i not in x:
        x.append(i)

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question