How to optimize an API for a Python parser?
Hello, friends! As a warm-up exercise, I'm writing a small multi-threaded parser, along with an API for accessing it over the web. The resource I'll be parsing provides its own API, but it is heavily restricted for security reasons, so I want to parse everything myself and serve the data as XML/JSON. As tools I'll use requests + lxml + PostgreSQL + nginx + uWSGI, plus the standard threading module for multi-threaded fetching and page parsing. The question is how to cache data in the database so that a repeated request is served from the cache. Should I take the Last-Modified header from the server's response and compare it on each new request, so that the cache stays correct?
Thank you.
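A minimal sketch of the conditional-request caching the question describes, assuming a hypothetical PostgreSQL table page_cache(url TEXT PRIMARY KEY, last_modified TEXT, body TEXT); the connection string is a placeholder. The idea: store the Last-Modified value alongside the cached body, send it back as If-Modified-Since on the next fetch, and serve the cached copy when the server answers 304 Not Modified:

```python
import psycopg2
import requests

conn = psycopg2.connect("dbname=parser")  # placeholder connection string


def fetch_cached(url: str) -> str:
    """Return the page body, revalidating the cache via Last-Modified."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT last_modified, body FROM page_cache WHERE url = %s", (url,)
        )
        row = cur.fetchone()

    headers = {}
    if row and row[0]:
        # Ask the server whether the page changed since we cached it.
        headers["If-Modified-Since"] = row[0]

    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304 and row:
        return row[1]  # not modified: serve the cached copy

    # New or changed page: upsert the fresh body and its validator.
    with conn.cursor() as cur:
        cur.execute(
            """INSERT INTO page_cache (url, last_modified, body)
               VALUES (%s, %s, %s)
               ON CONFLICT (url) DO UPDATE
               SET last_modified = EXCLUDED.last_modified,
                   body = EXCLUDED.body""",
            (url, resp.headers.get("Last-Modified"), resp.text),
        )
    conn.commit()
    return resp.text
```

Note that not every server sends Last-Modified; when it is absent, the ETag header with If-None-Match works the same way, and if neither exists you have to fall back to a time-to-live of your own choosing.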
Rethink your outlook on life: why step on a rake that someone has already stepped on? Look at ready-made solutions, for example Scrapy, an excellent scraping framework that comes with its own server.
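To make the suggestion concrete, here is a minimal spider sketch; the domain and CSS selectors are placeholders, not the asker's actual resource:

```python
import scrapy


class PageSpider(scrapy.Spider):
    name = "pages"
    start_urls = ["https://example.com/catalog"]  # placeholder start page

    def parse(self, response):
        # Extract one record per listed item.
        for item in response.css("div.item"):
            yield {
                "title": item.css("a::text").get(),
                "url": response.urljoin(item.css("a::attr(href)").get()),
            }
        # Follow pagination, if the page has it.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Relevant to the original question: Scrapy ships an HTTP cache middleware, and enabling HTTPCACHE_ENABLED = True with HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy" in settings.py makes it honor Last-Modified/ETag revalidation without hand-rolled caching code.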
As tools, I will use ... the standard threading module for multi-threaded requests and page parsing.
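For reference, a minimal sketch of that approach using only the standard library's threading and queue modules plus requests (the URLs are placeholders). Threads suit this workload because fetching pages is I/O-bound and the GIL is released while waiting on the network:

```python
import queue
import threading

import requests

urls = queue.Queue()
results = {}
lock = threading.Lock()


def worker():
    # Each worker pulls URLs until the queue is drained.
    while True:
        url = urls.get()
        try:
            body = requests.get(url, timeout=10).text
            with lock:  # guard dict writes from several threads
                results[url] = body
        finally:
            urls.task_done()


# Daemon threads exit with the process once urls.join() returns.
threads = [threading.Thread(target=worker, daemon=True) for _ in range(4)]
for t in threads:
    t.start()

for u in ["https://example.com/a", "https://example.com/b"]:  # placeholder URLs
    urls.put(u)
urls.join()  # block until every queued page has been fetched
print(sorted(results))
```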