What engine to use for a search engine using html code?

R

rodion-dev2015-03-13 20:24:46

Python

rodion-dev, 2015-03-13 20:24:46

What engine to use for a search engine using html code?
those. for example, I need to request < a class="link-class-href"
to get all the data where this info is.
the volume is supposed to be about 1-10 billion documents
with such a volume the sphinx will cope well, but as far as I remember there is no possibility of searching by html.

Reply

Answer the question

In order to leave comments, you need to log in

4 answer(s)

M

maaGames, 2015-03-13
@maaGames

Parse 10 BILLION pages in real time and in Java? Don't make fun of my slippers!
You need to index the tags (or classes, whatever you will be looking for) of all pages (offline, in any language), and only then look in these tables. Those. you will search not by the HTML itself, but by the database of tags (classes). Even assembler can't cope with a linear search in 10 billion pages. Unless, of course, the user is willing to wait a couple of hours before receiving the result.

R

rodion-dev, 2015-03-13
@rodion-dev

although 1 billion pages on one server, this of course will not work
, you need at least 20 nodes with a good configuration

O

Optimus, 2015-03-14
Pyan @marrk2

I know you're going to scrape ;) I wouldn't advise putting pages in the index, but I would advise you to visit the site every time because the data can change. On Habré, take a look at the Yandex blog where they have a parser parsing at a speed of 240 thousand per minute, which means that it will take you 69 hours for 1 billion. This is if you parse with the speed of C ++
If the average page weighs 30kb, then for 1 billion pages you will need 27 terabytes of space))

A

Alexander, 2015-03-13
Madzhugin @Suntechnic

xapian performed well when I tried it