Is it possible to write a universal site parser?
Good afternoon.
I need to write a universal site parser.
The task is the following.
The user enters the site address into a form field, the script fetches the site content via cURL and selects only the text from the output (from div, p, table, span, etc.).
But here's the question: every site has its own structure. How, having configured the parser only once, can I get data from any site without changing the parser settings for each one? Is that possible?
Right now I use PHP, cURL and HTML Purifier to get text from a site.
P.S. What exactly needs to be extracted?
Only the text needs to be extracted, preserving spelling and punctuation. There shouldn't be any tags. JS/jQuery code isn't needed either; the only thing needed from data loaded by JS/jQuery is the content of sliders, if there is text there.
Ideally, I need to get all the text from the page, and only the text. If the text is in a table, for example, then the text should be taken from the td cells, joined into one line and saved to a file (database). The next table row (tr > all its td) forms the next line appended to the file (database). As a result, the entire contents of one table become one paragraph in the file. The same goes for the rest of the tags.
Something like this)
P.P.S. An attempt at implementing the task: take everything from <body> to </body>.
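(Not from the original post: a minimal sketch of the workflow described above, assuming a hypothetical target URL and output file name. It fetches a page with cURL, parses it with DOMDocument, turns each table row into one line of plain text and appends paragraphs to a file.)

<?php
// Minimal sketch: fetch HTML with cURL, parse with DOMDocument,
// write each table row as one line and each <p> as one paragraph.
// $url and $outFile are placeholders.

$url     = 'https://example.com/';   // hypothetical target site
$outFile = 'parsed.txt';             // hypothetical output file

// 1. Fetch the raw HTML with cURL.
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);

if ($html === false) {
    die("Failed to download $url\n");
}

// 2. Build a DOM tree; suppress warnings about malformed markup.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$lines = [];

// 3. Every table row becomes one line: its cells' text joined by spaces.
foreach ($dom->getElementsByTagName('tr') as $tr) {
    $cells = [];
    foreach ($tr->getElementsByTagName('td') as $td) {
        $cells[] = trim($td->textContent);
    }
    if ($cells) {
        $lines[] = implode(' ', $cells);
    }
}

// 4. Plain paragraphs are taken as-is.
foreach ($dom->getElementsByTagName('p') as $p) {
    $text = trim($p->textContent);
    if ($text !== '') {
        $lines[] = $text;
    }
}

// 5. Append the result to the output file, one paragraph per line.
file_put_contents($outFile, implode(PHP_EOL, $lines) . PHP_EOL, FILE_APPEND);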
Yes and no: your wording is very vague. It is not clear how meaningful and how processed the final result should be, or how much garbage is acceptable.
Downloading a page, building a document tree and using some elementary heuristics to cut out the unnecessary parts (menus, sidebars, footers, ads, etc.) is relatively simple, but the result will be rather rough, with an unsatisfactory signal-to-noise ratio.
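(A sketch of what such elementary heuristics might look like in PHP; the list of "noise" tags is an assumption and a rough starting point, not a standard.)

<?php
// Strip tags that usually hold navigation, scripts and other non-content,
// then return whatever text is left, with whitespace collapsed.
function extractMainText(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    // Tags that rarely contain the main content (assumed list).
    $noise = ['script', 'style', 'nav', 'aside', 'footer', 'header', 'form'];

    foreach ($noise as $tag) {
        // getElementsByTagName returns a live list, so remove from the end.
        $list = $dom->getElementsByTagName($tag);
        for ($i = $list->length - 1; $i >= 0; $i--) {
            $node = $list->item($i);
            $node->parentNode->removeChild($node);
        }
    }

    $text = $dom->documentElement ? $dom->documentElement->textContent : '';
    return trim(preg_replace('/\s+/u', ' ', $text));
}

// Usage example:
// echo extractMainText(file_get_contents('https://example.com/'));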
To make the tool more universal, you will have to increase the number and complexity of these heuristics. You could also plug in machine learning so that they improve themselves.
At that point you are effectively writing something like a search spider. Imagine how much effort has gone into developing the Yandex or Google crawlers. Do you have those kinds of resources? And it's not enough just to write it; you also have to maintain it and keep up with new standards...
Website parsing is a task that people solve easily but robots handle poorly. From a business point of view, it is much cheaper and more efficient to assign a junior developer to write separate parsing rules for each site than to try to compete with Google.
I highly doubt this is possible.
Besides, you need structured data, not just a solid block of all the text on the site/page. And to get structured data, you need to know the structure and configure it in the parser so that it knows what to take and what to skip.
Also, cURL is not a cure-all. For example, it cannot get data that the site loads with JavaScript (hint: only something like PhantomJS will help in that case).
Search engines have been working on this for two decades now. They seem to manage, but consider the effort involved.
Of course it's possible: the DOM standard provides document.body.textContent, the same as for other DOM elements.
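(For reference, PHP's DOM extension exposes the same textContent property; a small illustration with a made-up HTML string. Note that textContent also includes the text inside <script> elements, so they are not stripped automatically.)

<?php
// Load an HTML snippet and print the body's textContent.
$html = '<html><body><p>Hello, <b>world</b>!</p><script>ignored();</script></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$body = $dom->getElementsByTagName('body')->item(0);
echo $body->textContent; // prints "Hello, world!ignored();"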