Python: "referenced before assignment" while parsing a site, how do I track down the error?


Django

dedal_0, 2019-01-13 21:35:16

Good day to all. I am writing a parser for a website in Django, using the requests and BeautifulSoup libraries. It is simple but long-running: it collects information and stores it in the database through models.
The gist is that the job has to:
- collect a list of main objects from the first HTML page, walk through the n follow-up pages of that listing to complete the list, collect information, and save it to the database
- then go to the HTML page of each object, collect and save information and, if certain conditions are met, go on to the next HTML page and "work" there
URLs are fetched with requests. Content is parsed with BeautifulSoup.
Sometimes a URL does not respond, and a requests.exceptions.ReadTimeout or ConnectTimeout exception is raised.
I had to build something like this around every request to a URL:
`

import requests
from time import sleep
from requests.exceptions import Timeout
from bs4 import BeautifulSoup as bs
MIMIC_HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
...
read_fail = True
while read_fail:
    try:
        sleep(1)
        response = session.get(start_url, timeout=10, headers=MIMIC_HEADERS)
        html_bs = bs(response.content, 'html.parser')
    except Timeout:
        read_fail = True
    except UnboundLocalError:
        read_fail = True
    finally:
        read_fail = False
        ...  # other actions on html_bs
`

I did not arrive at this construct right away. An UnboundLocalError exception was raised when trying to work with the response or html_bs variables.
All of this runs periodically under django-carrot (for those not familiar with it, a lighter analogue of celery).
I had to abandon celery itself because a serious bug was found in the current version (4.2): it overflows the message queue when working with scheduled tasks.
In a good run the task can last up to 8 hours and collect over 2000 objects.
The latest surprise was the task ending prematurely with the message `html_bs referenced before assignment`. Before that, the task crashed with an UnboundLocalError exception.
Sometimes lines like this appear in carrot's log:
Unable to find MessageLog matching the uuid . Ignoring this task
The broker used is RabbitMQ.
Question 1 - did I approach the problem correctly?
Question 2 - if the error is not in this wonderful construct but, say, in the URL request itself, how do I find out what it is when the logs of the task, carrot and even RabbitMQ are all silent about it?
Operating system - Debian 9
Python version - 3.6
Framework - Django 2.1

List of installed packages in the virtual environment:

amqp==2.3.2
asn1crypto==0.24.0
Babel==2.6.0
backcall==0.1.0
beautifulsoup4==4.6.3
billiard==3.5.0.5
celery==4.2.0
certifi==2018.11.29
cffi==1.11.5
chardet==3.0.4
cryptography==2.4.2
decorator==4.3.0
Django==2.1.4
django-carrot==1.3.3
django-compat==1.0.15
django-grappelli==2.12.1
django-timezone-field==3.0
djangorestframework==3.9.0
fake-useragent==0.1.11
gevent==1.3.7
gevent-eventemitter==2.0
greenlet==0.4.15
gunicorn==19.9.0
idna==2.8
ipython==7.2.0
ipython-genutils==0.2.0
jedi==0.13.1
json2html==1.2.1
kombu==4.2.2
parso==0.3.1
pexpect==4.6.0
pickleshare==0.7.5
pika==0.12.0
Pillow==5.3.0
prompt-toolkit==2.0.7
protobuf==3.6.1
psutil==5.4.8
psycopg2-binary==2.7.6.1
ptyprocess==0.6.0
pycparser==2.19
Pygments==2.3.0
python-crontab==2.3.5
python-dateutil==2.7.5
pytz==2018.7
requests==2.20.1
selenium==3.141.0
six==1.11.0
tornado==5.1.1
traitlets==4.3.2
urllib3==1.24.1
vdf==2.4
vine==1.1.4
wcwidth==0.1.7


1 answer(s)


dedal_0, 2019-01-15
@dedal_0

Exactly. I had not read the documentation carefully. The code in finally is executed under all conditions, not only when try completes successfully, as I had assumed.
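To illustrate the difference, here is a minimal sketch. The function name `run` and the simulated `TimeoutError` are made up for the example and are not taken from the parser. `else` runs only when `try` raised nothing, while `finally` runs on every pass. In the loop from the question this means the work on html_bs inside `finally` also ran after a timeout, when html_bs had not been assigned yet, and an exception raised inside `finally` is not caught by the `except` clauses of the same `try`.

```python
def run(should_fail):
    """Return the order in which the try/except/else/finally branches ran."""
    steps = []
    try:
        steps.append("try")
        if should_fail:
            # Stand-in for a request that times out (hypothetical).
            raise TimeoutError("simulated timeout")
    except TimeoutError:
        steps.append("except")
    else:
        # Reached only when the try body raised nothing.
        steps.append("else")
    finally:
        # Reached on every pass, with or without an exception.
        steps.append("finally")
    return steps


print(run(False))  # ['try', 'else', 'finally']
print(run(True))   # ['try', 'except', 'finally']
```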
Thanks to all.
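
For anyone landing here with the same problem, one possible way to restructure the retry loop is sketched below. It is an example under assumptions, not the code from the parser: `fetch_with_retries`, `max_attempts` and `delay` are hypothetical names, and the request is passed in as a callable so the snippet runs without network access. Putting the success path in `else` keeps it from running after a failure, the attempt limit keeps the loop from spinning forever, and `logger.exception` writes the full traceback to the log, which may help with the second question.

```python
import logging
import time

logger = logging.getLogger(__name__)


def fetch_with_retries(fetch, retry_on, max_attempts=5, delay=1.0):
    """Call fetch() until it succeeds or max_attempts is exhausted.

    fetch    -- zero-argument callable that does the request and the parsing
    retry_on -- exception class (or tuple of classes) that triggers a retry
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = fetch()
        except retry_on:
            # Logs the message together with the full traceback.
            logger.exception("attempt %d of %d failed", attempt, max_attempts)
            time.sleep(delay)
        else:
            # Only reached when fetch() raised nothing, so result is bound.
            return result
    raise RuntimeError("giving up after %d attempts" % max_attempts)
```

With requests, `fetch` could be something like `lambda: bs(session.get(start_url, timeout=10, headers=MIMIC_HEADERS).content, 'html.parser')`, and `retry_on` could be `(Timeout, ConnectionError)` imported from `requests.exceptions`.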