Site parsing. How would you do it?

Parsing

adbm17, 2019-08-26 15:10:28

Good day!
I am asking for help with a site parsing problem.
The target is the site of any betting company, for example 1xstavka.ru.
The task is to parse the pre-match line (sport, tournament, start time, participants, odds for as many markets as possible) and write the data to a database. Changes in odds and new events also need to be picked up as quickly as possible.
Questions of interest:
1. What technology stack would you use for this? (the best language for the code, the necessary libraries and frameworks)
2. How can I get around a possible IP ban caused by a large number of HTTP requests to the site?
3. How can I get changes in the odds for all sporting events on the site as quickly as possible?

2 answer(s)

Ivan Yakushenko, 2019-08-26
@kshnkvn

1. Python.
2. Proxies. If you can spend a little money, use Luminati (fast servers and a lot of them; I use them to pull data from 3 sports sites every minute). If there is no money, write a parser for sites that publish proxy lists, filter those proxies against the specific target site, and build a proxy rotator so that each request is sent from a different IP.
3. Ideally, parse the requests rather than the site. Open the site, go to dev tools > Network, and see what requests the site sends and receives. Very often such sites use something like an API, so you can load JSON/XML/etc. for the matches directly, which greatly speeds up parsing. If not, go back to point 1 and add lxml to parse the HTML.
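
Points 2 and 3 can be sketched roughly as follows, using only the standard library. This is a minimal sketch, not a finished parser: the proxy addresses are placeholders, `fetch_json` assumes you have already found a JSON endpoint in the Network tab, and the snapshot layout `{(event_id, market): odds}` is an assumed structure that you would have to build from whatever the real response looks like.

```python
import itertools
import json
import urllib.request

# Placeholder proxy addresses (assumption); replace with proxies you have
# checked against the target site.
PROXIES = ["http://203.0.113.10:8080", "http://203.0.113.11:8080"]
proxy_pool = itertools.cycle(PROXIES)


def fetch_json(url, timeout=10):
    """Fetch a JSON document, sending each call through the next proxy."""
    proxy = next(proxy_pool)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    with opener.open(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def diff_odds(previous, current):
    """Compare two snapshots shaped like {(event_id, market): odds}.

    Returns (new, changed):
      new     - entries that are in current but not in previous
      changed - entries whose odds differ, as (old_odds, new_odds)
    """
    new = {key: odds for key, odds in current.items() if key not in previous}
    changed = {
        key: (previous[key], odds)
        for key, odds in current.items()
        if key in previous and previous[key] != odds
    }
    return new, changed
```

In a polling loop you would call `fetch_json` on the endpoint, convert the response into a snapshot, pass the previous and current snapshots to `diff_odds`, and write only the new and changed entries to the database.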

Roman Andreev, 2019-08-26
@RexFack

Python. In my personal opinion the best parser is BeautifulSoup. You can get around the ban with proxies. These bookmakers are themselves constantly blocked in Russia, so you will either have to keep changing the bookmaker's address or use foreign proxies.
Here are some good BeautifulSoup tutorials that teach parsing from A to Z, up to recognizing digits in images from the data the script has parsed: proglib. I did not find anything better, though maybe I did not look hard enough.
upd: The same proglib course also explains how to bypass bans.
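
As a rough illustration of the BeautifulSoup approach, here is a minimal sketch that pulls participants and odds out of an HTML fragment. The markup in `SAMPLE_HTML` is invented for the example (the tag names, classes, and `data-` attributes are assumptions); a real bookmaker page will be structured differently, so the selectors have to be adapted after inspecting the page.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration only (assumption); inspect the real page
# to find the actual tags and classes.
SAMPLE_HTML = """
<div class="event" data-start="2019-08-27 19:00">
  <span class="team">Team A</span>
  <span class="team">Team B</span>
  <span class="odd" data-market="1">1.85</span>
  <span class="odd" data-market="X">3.40</span>
  <span class="odd" data-market="2">4.20</span>
</div>
"""


def parse_events(html):
    """Return one dict per event block found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    events = []
    for node in soup.find_all("div", class_="event"):
        teams = node.find_all("span", class_="team")
        odds = node.find_all("span", class_="odd")
        events.append({
            "start": node["data-start"],
            "participants": [t.get_text(strip=True) for t in teams],
            # Market label -> odds as a float.
            "odds": {o["data-market"]: float(o.get_text(strip=True)) for o in odds},
        })
    return events
```

Calling `parse_events(SAMPLE_HTML)` gives a list with one dict for the sample event, which can then be written to the database.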