Python
Anlight, 2015-12-25 08:56:04

Web crawling: where to start?

I'm interested in this area, but I can't figure out where to start digging. As I understand it, I should look at the grab and scrapy libraries, but there is practically no information in Russian, and what does turn up is badly outdated. There is documentation, of course, but that is reference material, and what I'm after is a tutorial.


4 answer(s)
Vladimir, 2015-12-25
@Anlight

Start with plain HTTP requests (for example, the requests library) to fetch a page's HTML, and regular expressions to parse it.
Then try BeautifulSoup: you will see the difference and understand the value of a specialized library.
Then try Scrapy, and again draw your own conclusions.
After that, go to a freelance exchange, pick any parsing order, and do it with whichever tool you understand best. It can even be a long-closed order: the goal is not to make money but to complete a real task.
After that, you can start offering your services for a small fee on the same freelance exchange.
This is the path of a beginner Jedi. It will be difficult but interesting :)
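
To make the first two steps above concrete, here is a minimal sketch of the difference between regex parsing and a real HTML parser. It is only an illustration under some assumptions: the sample HTML is made up, and the standard-library html.parser stands in for a specialized library such as BeautifulSoup so that the snippet runs without installing anything.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page; in practice this would be the body of an HTTP response.
HTML = """
<html><body>
  <h2 class="title">First post</h2>
  <h2 class='title' id="x">Second post</h2>
  <h2 class="other">Sidebar</h2>
</body></html>
"""

# Step 1: a regular expression. It hard-codes the quoting style and the
# attribute order, so it only finds the first heading.
regex_titles = re.findall(r'<h2 class="title">(.*?)</h2>', HTML)


class TitleParser(HTMLParser):
    """Collect the text of every <h2> whose class attribute is 'title'."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs, already unquoted.
        if tag == "h2" and dict(attrs).get("class") == "title":
            self.in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles[-1] += data


# Step 2: a parser. It does not care about quoting or extra attributes.
parser = TitleParser()
parser.feed(HTML)
parser_titles = parser.titles

print(regex_titles)   # ['First post']
print(parser_titles)  # ['First post', 'Second post']
```

The regex silently misses the second heading because of the single quotes and the extra id attribute, and this is exactly the kind of breakage that makes a dedicated parsing library worth learning.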

Nerevar_soul, 2015-12-25
@Nerevar_soul

In Russian, you can search for articles on Habr. There are articles about both grab and scrapy. But in general, you need to know English well enough to read documentation. Without that, it will be very difficult.
In English, by the way, there is a pretty good book. It mostly uses BeautifulSoup and standard Python modules, which I think is better for a beginner. There is also a bit about scrapy.
And the best way is to pick some site and parse some data from it. Look up everything that is unclear in the documentation and on Stack Overflow (or, if your English is really weak, on Toster and various Python forums).
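
If you want a starting point for the "pick some site and parse some data" exercise, a rough sketch using only standard Python modules might look like this. The function names (extract_links, fetch, crawl) and the max_pages limit are my own choices for illustration, not from any library, and a real crawler would also need to respect robots.txt and rate limits.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html, base_url):
    """Return absolute URLs found in html, with #fragments removed."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def fetch(url):
    """Download a page over HTTP and return its text (assumes UTF-8)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=10):
    """Breadth-first crawl of one site; returns the URLs visited, in order."""
    site = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to load
        visited.append(url)
        for link in extract_links(html, url):
            # Stay on the starting site and never queue a URL twice.
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing the fetch function in as a parameter means you can try the crawl logic on a dictionary of fake pages first, for example crawl("http://example.test/", fetch_page=pages.__getitem__), before pointing it at a real site.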

Ashot Ogoltsov, 2015-12-25
@Prenom

The simplest crawler can be put together easily with grab. After that, dig deeper as the need arises. By the way, the author of this library is very responsive on forums and elsewhere. He has also written his own articles on Habr (see everything by the Habr user lorien).

devel787, 2015-12-28
@devel787

You might be interested in this talk:
"Alexander Sibiryakov - Frontera: a distributed robot for crawling the web at large scale"
https://youtu.be/hV929rp1YmI
