C
C
cat_crash2012-12-18 10:05:54
Algorithms
cat_crash, 2012-12-18 10:05:54

Web page hash calculation (HTML)

Good day.

I ask habrasobshchestvo ideas for the implementation of the algorithm. The essence of the task is as follows: there is a web spider that collects html pages with content. To avoid duplicate pages (for example www.example.com and www.example.com/index.php ) you need to calculate its hash (md5, any other) to be sure that a similar page is already in the database.

It seems that the task is simple and easily solved in the forehead than a thread like md5 (file_get_contents ('http://www.example.com')) BUT it happens that literally 2-3 characters do not match (for example, an openx ad manager generates different banner IDs on server side). Accordingly, md5 will be fundamentally different. It may also be that the number of characters will also be different (for example, a banner ID can be 5 characters and 1 character).

The main task of the hash is to avoid duplicate pages, provided that there can be hundreds of thousands of pages in the database.
What is the algorithm for QUICK SEARCH in the database, taking into account that the similarity of pages can be 100-90%

The pages that the spider processes can be completely different and “dynamic” code inserts are not amenable to any algorithmization. Those. you can't cut them out of their HTML stream with some kind of regexp.

Answer the question

In order to leave comments, you need to log in

5 answer(s)
V
Vyacheslav Slinko, 2012-12-18
@KeepYourMind

we divide the pages into blocks of a fixed size,
hash the blocks
, compare the hashes of the blocks of two pages
in order, as a result, we have:
true, true, true, true, true, true, false, true, true, true, true, true, false, true
the whole task in a suitable coefficient.

C
CleverMouse, 2012-12-18
@CleverMouse

Read an article from Yandex just about this issue: download.yandex.ru/company/download/paper_65_v1.pdf . And links from there.

B
brevis, 2012-12-18
@brevis

Maybe something like this - similar_text() and levenshtein() ?

A
aktuba, 2012-12-18
@aktuba

We get only the content (it’s not so difficult to do this, with assumptions, of course) and we calculate the hash based on the received content.
To get content on the github, there are ready-made libraries for all languages ​​already, you just need to take it and modify it for yourself. Here is an example: aktuba.com/api.php

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question