Python
Mark Adams, 2016-04-07 12:12:18

How to run a Python HTML parser in multiple threads?

I have a dataset (>1k lines), and for each line the script makes a request with urllib.request.urlopen(). I split the data into 30 parts, and now I need to run 30 identical scripts simultaneously, one per part. How do I implement this?
I tried running 2 scripts at once in the terminal via "python_script_1 & python_script_2", and everything works fine, but what about 30 scripts? The most problematic part is organizing the directories (of course, I wrote os.path.abspath(os.curdir) in each script).
Please help with practical advice; there is no time to read up on multithreading.

3 answers
asd111, 2016-04-07
@ilyakmet

The simplest way to add multithreading:

from urllib.request import urlopen  # Python 3; the question already uses urllib.request
from multiprocessing.dummy import Pool as ThreadPool  # thread-based Pool, same API as multiprocessing.Pool

urls = [
    'http://www.python.org',
    'http://www.python.org/about/',
    'http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html',
    'http://www.python.org/doc/',
]

# Make the pool of 4 worker threads
pool = ThreadPool(4)

# Fetch the URLs in parallel; results holds the open
# responses in the same order as urls
results = pool.map(urlopen, urls)

# Close the pool and wait for the work to finish
pool.close()
pool.join()
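
The same fan-out is also available in the Python 3 standard library via concurrent.futures; a minimal sketch, reusing the urls list from the snippet above:

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# 30 worker threads, matching the 30 parts from the question
with ThreadPoolExecutor(max_workers=30) as executor:
    pages = list(executor.map(lambda u: urlopen(u).read(), urls))

Like pool.map, executor.map yields results in the same order as the input.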

Dimonchik, 2016-04-07
@dimonchik2013

Master MultiCurl. It's easy enough to generate names for the files you save: download everything first, then walk the directory and parse, even without any multiprocessing. The main time cost is fetching from the remote server anyway.
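
A minimal sketch of the MultiCurl approach (pycurl.CurlMulti) this answer hints at; the urls.txt input file and the page_NNNN.html naming scheme are assumptions for illustration:

import pycurl
from io import BytesIO

# hypothetical input: one URL per line in urls.txt
with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

multi = pycurl.CurlMulti()
handles = []
for i, url in enumerate(urls):
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    multi.add_handle(c)
    handles.append((c, buf, i))

# drive all transfers from a single thread until they finish
active = len(handles)
while active:
    while True:
        ret, active = multi.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    multi.select(1.0)  # block until some socket is ready

# save each body to a numbered file; parse the directory afterwards
for c, buf, i in handles:
    multi.remove_handle(c)
    c.close()
    with open('page_%04d.html' % i, 'wb') as out:
        out.write(buf.getvalue())

All transfers run on one thread, which is why the answer says no multiprocessing is needed.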

sim3x, 2016-04-07
@sim3x

cat list | parallel -j 30 ./script.py {}
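
Here list is a file with one item per line; GNU parallel substitutes each line for {} and keeps at most 30 jobs running at once. The script has to take its input as a command-line argument; a minimal worker sketch, assuming each line of list is a URL:

#!/usr/bin/env python3
# minimal worker for the parallel one-liner above;
# assumes the URL arrives as the first command-line argument
import sys
from urllib.request import urlopen

url = sys.argv[1]
html = urlopen(url).read()
# parse or save html here; print the size as a placeholder
print(url, len(html))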
