Django
Mark Rosenthal, 2015-02-03 03:34:01

Which Python library should I use to parse HTML?

Hey!
There are N sites with roughly the same information, and each one displays it in a table. I want to collect all of it into a single table on one site. You get the idea: something like a news aggregator.
Which library is best suited for this?
How do I avoid getting banned for this kind of activity?


4 answer(s)
Andrey Myvrenik, 2015-02-03
@gim0

beautifulsoup4 - www.crummy.com/software/BeautifulSoup/index.html

Ilya, 2015-02-03
@FireGM

For Python 3: Grab. I work with it and use SQLAlchemy alongside it, and the combination works great.

Alexey, 2015-02-03
@gentee

This solution has recently been recommended for scraping sites:
scrapy.org

throughtheether, 2015-02-03
@throughtheether

I handled a similar situation (about 10 source sites with different data structures) using requests, lxml, and XPath expressions.
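
A minimal sketch of what the lxml and XPath part of this could look like, assuming each site gets its own XPath expression for locating table rows. The sample HTML, the `extract_rows` helper, and the `SITE_XPATHS` mapping are hypothetical names made up for illustration; in practice the page text would come from a response fetched with requests.

```python
from lxml import html

# Hypothetical page content; in a real run this would be the text of a
# response fetched with requests.
SAMPLE_PAGE = """
<html><body>
<table id="prices">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Apple</td><td>1.20</td></tr>
  <tr><td>Pear</td><td>0.95</td></tr>
</table>
</body></html>
"""

# Assumption: one XPath expression per source site, because each site
# lays out its table differently.
SITE_XPATHS = {"example-site": '//table[@id="prices"]//tr'}


def extract_rows(page_text, row_xpath):
    """Return each matching table row as a list of stripped cell texts."""
    tree = html.fromstring(page_text)
    rows = []
    for tr in tree.xpath(row_xpath):
        cells = [td.text_content().strip() for td in tr.xpath("./td")]
        if cells:  # header rows contain only <th> cells, so they are skipped
            rows.append(cells)
    return rows


rows = extract_rows(SAMPLE_PAGE, SITE_XPATHS["example-site"])
```

Keeping the per-site differences in a mapping of XPath expressions means that adding an eleventh site should only require a new entry, not new parsing code.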

How do I avoid getting banned for this kind of activity?
If you use a synchronous library (requests), then in my opinion you don't have to worry much about being blocked, provided the servers hosting the sites are properly configured and you don't hit the sites too often. Just in case, you can set an inconspicuous User-Agent.
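
A rough sketch of both precautions, setting a User-Agent and pausing between requests. It uses the standard library's `urllib.request` in place of requests so that it runs without extra packages. The User-Agent string, the 5-second delay, and the function names are all assumptions chosen for illustration, not recommended values.

```python
import time
import urllib.request

# Hypothetical values, picked only for illustration.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/40.0 Safari/537.36"
)
DELAY_SECONDS = 5.0


def build_request(url, user_agent=USER_AGENT):
    """Create a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(request):
    """Perform the request synchronously and return the raw response body."""
    with urllib.request.urlopen(request) as response:
        return response.read()


def fetch_all(urls, delay=DELAY_SECONDS, fetch=download, sleep=time.sleep):
    """Fetch the URLs one at a time, pausing between consecutive requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        pages[url] = fetch(build_request(url))
    return pages
```

Calling `fetch_all(["http://example.com/a", "http://example.com/b"])` would download the two pages one after the other with a pause in between. The `fetch` and `sleep` parameters exist only so the loop can be exercised without touching the network.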
