What is the architecture of a "standard" parser?

Programming

Oxoron, 2015-08-13 22:32:52

Good day.
Recently I had to write a parser for a website. The requirements were simple; the task was done and forgotten. Then a friend asked me to scrape another site, and naturally part of the code migrated to the new project.
Question: what are the "standard" (common, frequently encountered) requirements for parsers, and how are they reflected in the architecture?

3 answers

xmoonlight, 2015-08-13
@Oxoron

1. Parallel threads for downloading and processing data.
2. Error-level control: deciding whether to continue or abort processing of a resource.
3. Extracting and segmenting data from malformed or invalidly structured input (e.g. HTML/XML).
4. A "sieve" (rules) that stops further processing of a resource based on the data already received (the conditions are written in the config).
For example: if the content is larger than 5 KB and contains the word "toster", or the URL contains "toster.ru", skip it and move on to the next resource.
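To make points 1, 2 and 4 a bit more concrete, here is a minimal sketch in Python. It is one possible reading of the list, not a reference design: `SkipRule`, `FatalError`, `process` and `run` are hypothetical names invented for illustration, the `fetch` and `parse` callables are supplied by the caller, and the sieve interprets the example as "(larger than the limit and contains the word) or URL match".

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass


@dataclass
class SkipRule:
    """Hypothetical "sieve" rule, as it might be loaded from a config."""
    min_bytes: int   # size threshold, e.g. 5 * 1024
    body_word: str   # word to look for in the content
    url_part: str    # substring to look for in the URL

    def should_skip(self, url, body):
        # Skip when the content is large AND contains the word, OR the URL matches.
        size = len(body.encode("utf-8"))
        big_with_word = size > self.min_bytes and self.body_word in body
        return big_with_word or self.url_part in url


class FatalError(Exception):
    """Hypothetical error level: abort the whole run instead of continuing."""


def process(url, fetch, parse, rule):
    # One unit of work: download, apply the sieve, then parse.
    body = fetch(url)
    if rule.should_skip(url, body):
        return "skipped", None
    return "ok", parse(body)


def run(urls, fetch, parse, rule, workers=4):
    """Process URLs in parallel threads; return {url: (status, data)}."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process, u, fetch, parse, rule): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except FatalError:
                # Highest error level: propagate to the caller. Tasks that were
                # already submitted still finish while the executor shuts down.
                raise
            except Exception as exc:
                # Recoverable error: record it and continue with other resources.
                results[url] = ("error", str(exc))
    return results
```

Calling `run(urls, fetch, parse, SkipRule(5 * 1024, "toster", "toster.ru"))` returns a status for every URL, so the caller can decide what to retry. Passing `fetch` and `parse` in as plain functions keeps the site-specific code separate from the parts that migrate between projects.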

Valentine, 2015-08-14
@gephaest

There are various libraries and ready-made solutions on which you can build parsers, for example Grab.

otetz, 2015-08-20
@otetz

There are also ready-made solutions, such as HTTrack.
And there are plenty of less automated offline browsers.