PHP
Fridary, 2016-04-03 16:21:08

What is the best way to parse the solutions.fas.gov.ru database?

solutions.fas.gov.ru is a database of court decisions of the Federal Antimonopoly Service of Russia.
I need to download all of it and load it into Elasticsearch. The site can return results in JSON and XML.
How would you recommend doing this, given that there are more than a million documents and fetching even 10 of them in JSON/XML takes about 3 minutes (for some reason)?
I'm thinking of writing a script in Python or PHP (I know PHP better); my PHP approach would be something like this:

// download all documents for April 2 (19 of them)
$content = json_decode(file_get_contents("http://solutions.fas.gov.ru/search.json?action=search&doc_date_finish=02.04.2016&doc_date_start=02.04.2016"));
// ...save...

Question: is my approach efficient, or is there a better way to solve this? Would fetching via a cURL GET be faster to parse? Python or PHP?


1 answer
Roman Kitaev, 2016-04-03
@fridary

eventlet
Like this:

import eventlet
eventlet.monkey_patch()  # patch sockets before importing requests so they are cooperative
import requests


urls = ['http://solutions.fas.gov.ru/search.json?action=index&'
        'controller=documents&page=%s' % page for page in range(1, 29044)]

def fetch(url):
    return requests.get(url)

pool = eventlet.GreenPool()

for response in pool.imap(fetch, urls):
    # Put the response into ES
    print('gotcha')
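
The commented line is where each page would go into Elasticsearch. A minimal sketch of that step with the official elasticsearch-py client (the index name, the 'documents' payload key, and the localhost node are all assumptions; inspect a real search.json response and match your ES version before relying on it):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch('http://localhost:9200')  # assumes a local node

def to_actions(page_json):
    # 'documents' as the payload key is a guess about the search.json
    # structure; adjust after inspecting a real response.
    for doc in page_json.get('documents', []):
        yield {
            '_index': 'fas-solutions',  # hypothetical index name
            '_id': doc.get('id'),       # reuse the site's own id if present
            '_source': doc,
            # very old ES versions also need a '_type' key here
        }

# In the loop above, instead of print('gotcha'):
#     helpers.bulk(es, to_actions(response.json()))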

Or aiohttp.
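
For reference, the same fan-out with aiohttp might look roughly like this (a sketch in aiohttp 3.x / Python 3.7+ syntax, which postdates this answer; the concurrency cap of 50 is an arbitrary choice):

import asyncio
import aiohttp

URL = ('http://solutions.fas.gov.ru/search.json?action=index&'
       'controller=documents&page=%s')

async def fetch(session, semaphore, page):
    async with semaphore:  # cap concurrent requests so the site isn't hammered
        async with session.get(URL % page) as response:
            return await response.json()

async def main():
    semaphore = asyncio.Semaphore(50)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, semaphore, page) for page in range(1, 29044)]
        for coro in asyncio.as_completed(tasks):
            page_json = await coro
            print('gotcha')  # index page_json into ES here instead

asyncio.run(main())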
