How to write a universal parser for many monotonous sites?
Good day.
The task is to write parsers for about 30 similar classified-ad sites.
Each site has a different structure and a different data set: some expose more data, some less.
Does a data-extraction parser have to be written separately for each site, or are there other options?
And what about the DB?
For example, I already have a working parser for one site. It starts from category entry URLs; on visiting them, it pulls the pagination URLs for each category.
Input URLs: table site_category
Category pagination URLs: table site_pagination
Next, the pagination URLs are parsed and the URLs of the ads themselves are extracted from them.
Ad URLs: table site_item
What if the next site has a different structure?
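A common way out of writing 30 bespoke parsers is one generic pipeline plus a per-site config that only lists the selectors. A minimal sketch in Python using the standard library; all site names and selectors below are invented, and it assumes well-formed markup (real-world HTML usually needs a lenient parser such as lxml):

```python
import xml.etree.ElementTree as ET

# Per-site configuration: the pipeline stages are identical for every
# site (category -> pagination -> item); only the selectors differ.
# All names and selectors here are invented for illustration.
CONFIGS = {
    "site-a": {
        "category": ".//nav/a",
        "pagination": ".//ul[@class='pages']/a",
        "item": ".//div[@class='ad']/a",
    },
    "site-b": {
        "category": ".//aside/a",
        "pagination": ".//div[@class='pager']/a",
        "item": ".//li[@class='item']/a",
    },
}

def extract_links(page: str, selector: str) -> list:
    """Return the href attributes matched by a site-specific selector."""
    root = ET.fromstring(page)
    return [a.get("href") for a in root.findall(selector)]

# A demo page in site-a's (invented) structure:
page = ("<html><body><nav>"
        "<a href='/cat/1'>Cars</a><a href='/cat/2'>Bikes</a>"
        "</nav></body></html>")
print(extract_links(page, CONFIGS["site-a"]["category"]))
# ['/cat/1', '/cat/2']
```

With this shape, the database can also stay uniform: instead of separate site_category / site_pagination / site_item tables per site, one set of tables with a site_id column serves all 30 sites.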
I'm dealing with exactly this problem right now.
At the moment only https://github.com/victorvsk/apify-core is available for download.
But there is already a server and a client (admin panel), and an ansible/chef/puppet recipe is being prepared, so in principle all you will need is a VPS. The syntax looks like this:
{
  "index": {
    "url": ["https://github.com/blog"],
    "js": false,
    "paginate": [
      "\\/?+$",
      "/?page=<% 1,2,1 %>"
    ]
  },
  "posts": {
    "from": "select('h2.blog-post-title a') from('index')",
    "js": false,
    "host": "http://github.com",
    "pattern": {
      "title": "<% .blog-title %>",
      "meta": {
        "calendar": "<% .blog-post-meta li:first %>",
        "author": "<% .blog-post-meta .vcard %>",
        "category": "<% .blog-post-meta li:last %>"
      },
      "body": "<% .blog-post-body %>"
    }
  }
}
{
  "html": "<html><head></head><body><div id='text'>Text</div></body></html>",
  "pattern": {
    "title": "Value",
    "title-2": "This is: <% #text %> <% html |first |html %>",
    "text-html": "<% #text | first | html %>"
  }
}
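For intuition, this is roughly how such a pattern config can be interpreted: each `<% … %>` placeholder is replaced with the result of running its selector against the page. A toy Python sketch, not apify-core's actual engine; it only understands `#id` selectors and deliberately ignores the `| first | html` filters:

```python
import re
import xml.etree.ElementTree as ET

def apply_pattern(html: str, pattern: dict) -> dict:
    """Resolve '<% selector | filters %>' placeholders against a page.
    Toy subset: only '#id' selectors; anything after '|' is ignored."""
    root = ET.fromstring(html)

    def resolve(match):
        parts = [p.strip() for p in match.group(1).split("|")]
        selector = parts[0]
        if selector.startswith("#"):
            node = root.find(".//*[@id='%s']" % selector[1:])
            if node is not None:
                return node.text or ""
        return ""  # unknown selector kinds resolve to nothing

    return {key: re.sub(r"<%(.*?)%>", resolve, template)
            for key, template in pattern.items()}

page = "<html><head></head><body><div id='text'>Text</div></body></html>"
print(apply_pattern(page, {"title": "Value", "title-2": "This: <% #text %>"}))
# {'title': 'Value', 'title-2': 'This: Text'}
```

Plain strings without placeholders ("Value") pass through unchanged, which matches how the config above mixes literals and selectors.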
I still have plenty of flaws in my own parsing; yesterday I came across a webinar, maybe it will be useful to someone: https://dmitrylavrik.ru/php-parser
If you don't need to run scripts, handle imports, or apply styles, you can simply take any ready-made XML parser and feed the pages to it.
Then all that remains is to find the data you need in the resulting structure.
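The two steps above can be sketched with the standard library's XML parser (the markup and field names are invented; the caveat of this approach is that it only works on well-formed markup):

```python
import xml.etree.ElementTree as ET

# Step 1: feed the page to a ready-made XML parser.
page = """<html><body>
  <div class='ad'>
    <h2>Bike for sale</h2>
    <span class='price'>120</span>
  </div>
</body></html>"""
root = ET.fromstring(page)

# Step 2: find the needed data in the resulting tree.
ad = root.find(".//div[@class='ad']")
item = {
    "title": ad.find("h2").text,
    "price": int(ad.find("span[@class='price']").text),
}
print(item)  # {'title': 'Bike for sale', 'price': 120}
```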
We made a "learning" parser. It doesn't care about the structure of the site and can extract product information from any store. Here is an online demo: https://fetch.ee/en/developers/