PHP
German Zvonchuk, 2015-01-05 22:56:31

How to write a universal parser for many similar sites?

Good day.
The task is to write a parser for about 30 similar classified-ad sites.
Each site has a different structure and a different data set: some expose more fields, some fewer.
Does a dedicated parser have to be written for each site, or are there other options?
And how should the database be organized?
For example, I already have a parser ready for one site. It starts from category entry URLs and, for each category, pulls the pagination URLs.
Input URLs: table site_category
Category pagination URLs: table site_pagination
Next, the pagination pages are parsed and the URLs of the ads themselves are pulled from them: table site_item
What if the next site has a different structure?
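One idea I'm considering: keep the three tables generic, reference the site, and let a JSON field absorb the differing data sets. A minimal sketch, with all table and column names illustrative (assuming SQLite for brevity):

<?php
// Sketch: one generic schema instead of per-site tables.
// All names here are illustrative assumptions, not a final design.

$pdo = new PDO('sqlite::memory:');

$statements = [
    "CREATE TABLE site (
        id   INTEGER PRIMARY KEY,
        host TEXT NOT NULL
    )",
    "CREATE TABLE category_url (
        id      INTEGER PRIMARY KEY,
        site_id INTEGER NOT NULL REFERENCES site(id),
        url     TEXT NOT NULL
    )",
    "CREATE TABLE pagination_url (
        id          INTEGER PRIMARY KEY,
        category_id INTEGER NOT NULL REFERENCES category_url(id),
        url         TEXT NOT NULL
    )",
    // 'data' holds a JSON blob, so sites with more or fewer
    // fields fit into the same table.
    "CREATE TABLE item (
        id            INTEGER PRIMARY KEY,
        pagination_id INTEGER NOT NULL REFERENCES pagination_url(id),
        url           TEXT NOT NULL,
        data          TEXT
    )",
];

foreach ($statements as $sql) {
    $pdo->exec($sql);
}

That way only the extraction rules differ per site, not the schema. But is this the right direction?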


4 answers
Viktor Vsk, 2015-01-05
@viktorvsk

I'm dealing with exactly this problem right now.
For now you can only download https://github.com/victorvsk/apify-core
But there is already a server and a client (admin panel), and an Ansible/Chef/Puppet recipe is in the works, so in principle all you will need is a VPS. The syntax looks like this:

{
  "index": {
    "url": ["https://github.com/blog"],
    "js": false,
    "paginate": [
      "\\/?+$",
      "/?page=<% 1,2,1 %>"
    ]
  },
  "posts": {
    "from": "select('h2.blog-post-title a') from('index')",
    "js": false,
    "host": "http://github.com",
    "pattern": {
      "title": "<% .blog-title %>",
      "meta": {
        "calendar": "<% .blog-post-meta li:first %>",
        "author": "<% .blog-post-meta .vcard %>",
        "category": "<% .blog-post-meta li:last %>"
      },
      "body": "<% .blog-post-body %>"
    }
  }
}

(This "code" downloads all the posts from the first two pages of the github https://github.com/blog)
The code is in ruby, but it is conceived as a standalone daemon, so you can either participate or wait for the finished solution.
The biggest difficulty, in the near future, is probably in the normal documentation.
PS In general, the point is, there are a bunch of parser instances on different servers (or on one, or locally, it doesn’t matter) and there is an admin area where you create entities (units) in which you describe the structure with such a pseudo-syntax (what to parse from) and in the end, specify the url where to send the finished result
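The receiving end of that last step can be tiny. A minimal sketch in PHP, assuming the daemon POSTs the finished result as a JSON body (the payload shape and the log file name are assumptions):

<?php
// receive_result.php: a minimal endpoint for the "URL where to send
// the finished result". Payload shape is an assumption.

$raw    = file_get_contents('php://input');
$result = json_decode($raw, true);

if ($result === null) {
    http_response_code(400);
    exit('Invalid JSON');
}

// Store the parsed records somewhere; here we just append to a log file.
file_put_contents(
    __DIR__ . '/results.log',
    json_encode($result, JSON_UNESCAPED_UNICODE) . PHP_EOL,
    FILE_APPEND
);

http_response_code(200);
echo 'OK';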
UPDATE: Deployed a rough version on Heroku.
If you're interested, you can experiment with it. It's hard without documentation, of course, but perhaps something will work out using the examples:
For example, the JSON above can be sent as a POST request with Content-Type: application/json.
To check the syntax directly, you can pass HTML instead of links. For example, here is the JSON:
{
  "html": "<html><head></head><body><div id='text'>Text</div></body></html>",
  "pattern": {
    "title": "Value",
    "title-2": "This is: <% #text %> <% html |first |html %>",
    "text-html": "<% #text | first | html %>"
  }
}

Send it as a POST request to https://agile-river-6704.herokuapp.com/parser?apif...
It's important to include ?apify_secret=secret in the address.
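For instance, the HTML test above could be sent from PHP with cURL. A sketch under one assumption: the full endpoint is the truncated /parser link above with the apify_secret parameter appended.

<?php
// Sends the HTML test JSON from above to the parser endpoint.
// The full URL is an assumption reconstructed from the truncated
// link and the apify_secret parameter mentioned above.

$payload = json_encode([
    'html'    => "<html><head></head><body><div id='text'>Text</div></body></html>",
    'pattern' => [
        'title'     => 'Value',
        'title-2'   => 'This is: <% #text %> <% html |first |html %>',
        'text-html' => '<% #text | first | html %>',
    ],
]);

$ch = curl_init('https://agile-river-6704.herokuapp.com/parser?apify_secret=secret');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
    CURLOPT_POSTFIELDS     => $payload,
    CURLOPT_RETURNTRANSFER => true,
]);

$response = curl_exec($ch);
curl_close($ch);

echo $response;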
Maybe someone will find this interesting.

Stasy_sin, 2015-09-11
@Stasy_sin

I run into a lot of problems with parsing myself. Yesterday I came across a webinar; maybe it will be useful for someone to sign up: https://dmitrylavrik.ru/php-parser

Dmitry Demin, 2015-01-05
@keksmen

If you don't need to run scripts, handle imports, or apply styles, you can simply take any ready-made XML/HTML parser and feed the pages to it.
Then all that remains is to find the necessary data in the resulting structure.
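A minimal sketch of that approach with PHP's built-in DOM extension; the URL and the XPath query are illustrative placeholders:

<?php
// Feed a page to the built-in DOM parser and pull data out of the
// resulting structure. URL and XPath query are placeholders.

$html = file_get_contents('https://example.com/ads?page=1');

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely valid XML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// e.g. every ad title link on a listing page
foreach ($xpath->query("//div[@class='listing']//a[@class='ad-title']") as $a) {
    echo $a->textContent, ' => ', $a->getAttribute('href'), PHP_EOL;
}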

Alexander Polyakov, 2016-11-02
@silenzushka

We have built a "learning" parser. It does not care about the site's structure and can extract product information from any store. Here is an online demo: https://fetch.ee/en/developers/
