Python
Grigory, 2017-03-31 14:48:33

How to design a "real-time" multi-site parser in Python?

Dear colleagues! Please point me in the right direction.
Let's say the task is:

A website written in Python has a search bar into which the user can type arbitrary text.
When the user submits a query, the site should parse the latest headlines from several local news sites and, if there are matches, return the text of the matching headlines along with links to the corresponding news items.

The question is very general: how should I approach designing the architecture of such an application?

A couple of more specific questions:
- if each site has its own small parser script, should the scripts run sequentially or in parallel? If in parallel, how is that done?
- how (in general terms) can search results be returned to the user as the sites are parsed?
(I badly lack the theoretical background a senior developer would have.)
I would be grateful for any input, including general considerations and links to reading on the topic!

1 answer(s)
Astrohas, 2017-03-31
@gpetrov

Real-time parsing is an evil that will slow down your entire system, because every search has to wait for the news sites to respond. Isn't it easier to index the sites into a database and return results from there? You can run the parsing tasks periodically, for example once every 10 minutes, which should be enough.
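The "index into a database" idea could look roughly like the sketch below. It is only an illustration under assumptions: the table layout, the function names (`create_index`, `store_headlines`, `search`) and the use of SQLite are all hypothetical choices, not something prescribed above. A periodic job (cron or any scheduler) would call `store_headlines` for each site, and the search page would only ever call `search`.

```python
import sqlite3


def create_index(conn):
    # Hypothetical schema: one row per headline, keyed by its URL.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS headlines ("
        "url TEXT PRIMARY KEY, site TEXT, title TEXT)"
    )


def store_headlines(conn, site, items):
    """Save (title, url) pairs produced by one site's parser.

    INSERT OR REPLACE keeps a single row per URL, so re-running the
    parser every few minutes does not create duplicates.
    """
    conn.executemany(
        "INSERT OR REPLACE INTO headlines (url, site, title) VALUES (?, ?, ?)",
        [(url, site, title) for title, url in items],
    )
    conn.commit()


def search(conn, query):
    """Return (title, url) pairs whose title contains the query text.

    Note: '%' and '_' inside the query act as LIKE wildcards here;
    a real application would probably escape them.
    """
    rows = conn.execute(
        "SELECT title, url FROM headlines WHERE title LIKE ? ORDER BY title",
        (f"%{query}%",),
    )
    return rows.fetchall()
```

With this split, a user request never touches the news sites directly; it only reads whatever the last parsing run stored.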
Concurrency is usually achieved with threads. You can read about it here: https://habrahabr.ru/post/229767/ and https://habrahabr.ru/post/78267/.
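As a minimal sketch of the thread-based approach, the standard library's `concurrent.futures` can run one parser per thread and hand back results in the order the sites finish. The `parse_all` function and the shape of `parsers` (a dict mapping a site name to a no-argument function returning `(title, url)` pairs) are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parse_all(parsers, query, max_workers=8):
    """Run each site's parser in a thread; yield (site, matches) as sites finish.

    `parsers` is assumed to map a site name to a callable that returns
    a list of (title, url) pairs for that site.
    """
    needle = query.lower()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(parser): site for site, parser in parsers.items()}
        # as_completed yields futures in completion order, not submission order.
        for future in as_completed(futures):
            site = futures[future]
            try:
                headlines = future.result()
            except Exception:
                # One failing site should not break the others.
                yield site, []
                continue
            yield site, [(t, u) for t, u in headlines if needle in t.lower()]
```

Because `parse_all` is a generator, the caller can forward each site's matches to the user as soon as that site is done, which is one possible answer to the "return results as the sites are parsed" part of the question.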
To organize tasks, you can use some kind of target
