Python
stayHARD, 2015-08-29 18:24:28

Multi-threaded write to the database?

I'm trying to write a multi-threaded script that takes website links from a PostgreSQL database, visits each one, collects information and writes it back into the database.
My rough draft:

import urllib2
from bs4 import BeautifulSoup

import psycopg2
import threading

def scrape(link, id):
  # print link, id
  # connect to database
  connection = psycopg2.connect(database = "contacts", user = "???", password = "???", host="localhost", port="5432")
  # create new cursor
  curs = connection.cursor()

  # headers for opening links
  hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'}

  # try open link
  try:
    # connect to the page and get request
    req = urllib2.Request(link, headers = hdr)
    page = urllib2.urlopen(req, timeout = 30) # todo timeout
    html = page.read()

    #  get soup/html code of the page
    soup = BeautifulSoup(html, "html.parser")

    # finding title on the page
    try:
      title = soup.find('title')
      title = title.text
      print 'TITLE: ' + title
    except:
      title = ''
      print "Can't get title!"

    # finding meta keywords on the page
    try:
      meta_keywords = soup.find('meta', attrs = {"name" : "keywords"})
      meta_keywords = meta_keywords['content']
      print 'META KEYWORDS: ' + meta_keywords
    except:
      meta_keywords = ''
      print "Can't get meta keywords!"

    # finding meta description on the page
    try:
      meta_description = soup.find('meta', attrs = {"name" : "description"})
      meta_description = meta_description['content']
      print 'META DESCR:' + meta_description
    except:
      meta_description = ''
      print "Can't get meta description."



    # update database with new information
    query = "UPDATE app_contacts SET visited = %s, title = %s, meta_keywords = %s, meta_description = %s WHERE id = %s AND url = %s;"
    data = ("1", title, meta_keywords, meta_description, id, link[7:])
    curs.execute(query, data)
    connection.commit()
    connection.close()
  
  except:
    print "Can't open link!"




if __name__ == '__main__':
  conn = psycopg2.connect(database = "contacts", user = "???", password = "???", host="localhost", port="5432")
  c = conn.cursor()
  c.execute("SELECT id, url, role from app_contacts WHERE url!='' AND visited='0' order by id;")
  for item in c.fetchall():
    link = "http://" + item[1]
    id = item[0]
    t = threading.Thread(target = scrape, kwargs={"link":link, "id":id})
    t.start()

The problem is with the writing: after the script starts it behaves very strangely and refuses to write all of the collected information to the database. What am I doing wrong?


2 answers
angru, 2015-08-31
@angru

As I understand it, the problem is speed.

  1. You create a new database connection for every link; don't do that, it is very slow. Create one connection at the start of the script, or better yet, a connection pool right away (there is even a ThreadedConnectionPool if you want threads that badly). See the sketch after this list.
  2. Don't use print; from experience, printing a lot noticeably slows the script down. Use the standard logging module instead.
  3. It is better to use a thread pool and a queue. Instead of downloading thousands of links in a thousand threads at once, download them in small batches; you just need to play with the queue and pool sizes and pick the values that work best. This is also shown in the sketch below.
  4. If the pages are large, a SAX-style parser such as lxml will be faster than BeautifulSoup.
  5. As I said before, consider async frameworks (there is sample code in your last question) if after all the fixes you are still not happy with the speed. In this particular case threads give you little: you start them all at the same time and they all sit idle at the same time while the data is being downloaded.
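
A rough sketch of points 1-3 put together, in Python 2 to match the code in the question. The pool and worker sizes are guesses you would have to tune, and scrape_page is a hypothetical helper that stands in for the downloading/parsing part of the original scrape():

import logging
import Queue
import threading

import psycopg2.pool

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(threadName)s %(message)s")
log = logging.getLogger(__name__)

NUM_WORKERS = 10  # assumed value, tune for your setup

# one shared pool instead of a new connection per link
pool = psycopg2.pool.ThreadedConnectionPool(
  1, NUM_WORKERS,
  database="contacts", user="???", password="???",
  host="localhost", port="5432")

tasks = Queue.Queue()

def worker():
  while True:
    item = tasks.get()
    if item is None:  # poison pill: no more work
      break
    link, id = item
    conn = pool.getconn()  # borrow a connection from the pool
    try:
      # hypothetical helper: the request + BeautifulSoup part of scrape()
      title, keywords, description = scrape_page(link)
      curs = conn.cursor()
      curs.execute(
        "UPDATE app_contacts SET visited = %s, title = %s, "
        "meta_keywords = %s, meta_description = %s "
        "WHERE id = %s AND url = %s;",
        ("1", title, keywords, description, id, link[7:]))
      conn.commit()
      log.info("updated %s", link)
    except Exception:
      conn.rollback()
      log.exception("failed on %s", link)
    finally:
      pool.putconn(conn)  # always give the connection back
      tasks.task_done()

if __name__ == '__main__':
  # fill the queue first, then start the workers
  conn = pool.getconn()
  curs = conn.cursor()
  curs.execute("SELECT id, url FROM app_contacts "
               "WHERE url != '' AND visited = '0' ORDER BY id;")
  for row_id, url in curs.fetchall():
    tasks.put(("http://" + url, row_id))
  pool.putconn(conn)

  threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
  for t in threads:
    t.start()

  tasks.join()  # wait until every queued link has been processed
  for _ in threads:
    tasks.put(None)  # tell the workers to exit
  for t in threads:
    t.join()

This way only NUM_WORKERS pages are in flight at a time, the single shared pool replaces the per-link connect(), and logging replaces all the prints.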

Dmitry, 2015-08-29
@EvilsInterrupt

/offtop: Before writing anything multi-threaded in Python, I strongly recommend reading about the GIL. This is extremely important! There is a translation on Habr of David Beazley's article on how the GIL works. I strongly recommend reading it.
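
To illustrate the point, a small sketch (roughly Beazley's classic example, not a real benchmark): pure-Python CPU work does not get faster with threads on CPython, while I/O-bound work like the downloads in the question does benefit, because the GIL is released while waiting on the network.

import threading
import time

def count(n):
  # pure-Python CPU work; the GIL lets only one thread execute it at a time
  while n > 0:
    n -= 1

N = 10 ** 7

start = time.time()
count(N)
count(N)
print "sequential:  %.2fs" % (time.time() - start)

start = time.time()
threads = [threading.Thread(target=count, args=(N,)) for _ in range(2)]
for t in threads:
  t.start()
for t in threads:
  t.join()
print "two threads: %.2fs" % (time.time() - start)
# on CPython the threaded run is usually no faster, and often slower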
