What is the architecture of a "standard" parser?
Good day.
Not long ago I had to write a parser for one site. The requirements were simple; the task was done and forgotten. Then a friend asked me to scrape another site, and, of course, part of the code migrated to the new project.
Question: what are the "standard" (common, frequently encountered) requirements for parsers, and how are they reflected in the architecture?
1. Parallel threads for downloading and for processing data.
2. Error-level control, to decide whether to continue or abort processing of a resource.
3. Tolerant processing and extraction of data from malformed or invalid structured markup (e.g. HTML/XML).
4. A "sieve" (rules) that stops further processing of a resource based on the data already received (the conditions are written in the config).
For example: if the content is larger than 5 KB and contains the word "toster", or the URL contains "toster.ru", skip the resource and move on to the next one.
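To make the four points above more concrete, here is a minimal Python sketch, not a definitive implementation. It uses only the standard library. All names in it (SIEVE, MAX_ERRORS, LinkExtractor, skip_by_url, skip_by_content, process, crawl) are hypothetical, the abort threshold is an assumption, and the fetch function is passed in by the caller so that any HTTP client can be plugged in.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from html.parser import HTMLParser

# Hypothetical "sieve" config mirroring the example rule above.
SIEVE = {
    "blocked_url_parts": ["toster.ru"],
    "blocked_word": "toster",
    "max_bytes_with_word": 5 * 1024,
}
MAX_ERRORS = 3  # assumed threshold: abort the whole run after this many failures


class LinkExtractor(HTMLParser):
    """Point 3: html.parser keeps going on unclosed or malformed tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def skip_by_url(url, sieve=SIEVE):
    """Point 4: reject a resource before it is even downloaded."""
    return any(part in url for part in sieve["blocked_url_parts"])


def skip_by_content(body, sieve=SIEVE):
    """Point 4: reject a resource based on the data already received."""
    too_big = len(body.encode("utf-8")) > sieve["max_bytes_with_word"]
    return too_big and sieve["blocked_word"] in body


def process(url, fetch):
    """Download and parse one resource; return None if the sieve rejects it."""
    if skip_by_url(url):
        return None
    body = fetch(url)
    if skip_by_content(body):
        return None
    parser = LinkExtractor()
    parser.feed(body)
    parser.close()
    return parser.links


def crawl(urls, fetch, workers=4, max_errors=MAX_ERRORS):
    """Points 1 and 2: parallel workers plus an error budget."""
    results, errors = {}, []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process, url, fetch): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                links = future.result()
            except Exception as exc:
                # A single failure is recorded and processing continues...
                errors.append((url, exc))
                if len(errors) >= max_errors:
                    # ...until the budget is spent; then pending work is cancelled.
                    for pending in futures:
                        pending.cancel()
                    break
                continue
            if links is not None:
                results[url] = links
    return results, errors
```

Calling crawl(urls, fetch) returns a dict of extracted links per URL and a list of (url, exception) pairs. In a real project the sieve and the error threshold would be loaded from a config file rather than hard-coded.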
There are various libraries and ready-made solutions that you can build a parser on, for example Grab.