Python
JRazor, 2014-02-11 13:55:04

Question for experienced Python and Scrapy users

Hello. I'm having trouble understanding Scrapy. I used to write parsers with lxml and similar libraries, but now I've decided to try asynchronous Scrapy. I have some questions I can't find answers to, apparently because I don't understand the technology:
1) I need to parse a site into categories first, then each category into subcategories, and then parse those subcategories as well. How is this case usually handled in Scrapy? Does everyone put it all into one spider, or call one spider from another?
2) I've noticed the practice of putting all the Scrapy code into one file. Quite practical. In theory, this shouldn't affect performance in any way. Is that true?
3) Scrapy is quite slow. Does it have settings that speed up parsing?
Thank you very much in advance.


2 answer(s)
Stepan, 2014-02-11
@JRazor

Use Grab.
Your questions are strange. It all depends on the structure of the site you are parsing.
The spider's behavior depends only on where you point it.
In your example, you only need to write 3 tasks:
1. Parsing the categories
2. Parsing each category into subcategories
3. Parsing the subcategory data
Why is 1 file convenient? Because everything is interconnected.
We start the spider; it runs task 1 and passes the categories to task 2.
Task 2 parses the subcategories and passes them to task 3.
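
The three-task chain can be sketched without any scraping library. This is only an illustration of the pattern, not Grab's or Scrapy's actual API: the SITE dictionary, fetch(), the task_* handler names and crawl() are all hypothetical stand-ins for real HTTP requests and HTML parsing.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> list of child links, or a data record.
SITE = {
    "/": ["/cat/books", "/cat/music"],
    "/cat/books": ["/cat/books/fiction", "/cat/books/science"],
    "/cat/music": ["/cat/music/jazz"],
    "/cat/books/fiction": {"items": 3},
    "/cat/books/science": {"items": 5},
    "/cat/music/jazz": {"items": 2},
}


def fetch(url):
    """Stand-in for a real HTTP request plus HTML parsing (assumption)."""
    return SITE[url]


def task_categories(url):
    # Task 1: the start page yields one follow-up task per category.
    for link in fetch(url):
        yield ("subcategories", link)


def task_subcategories(url):
    # Task 2: each category page yields one follow-up task per subcategory.
    for link in fetch(url):
        yield ("data", link)


def task_data(url):
    # Task 3: a subcategory page yields a scraped record and no further tasks.
    yield ("result", {"url": url, **fetch(url)})


HANDLERS = {
    "categories": task_categories,
    "subcategories": task_subcategories,
    "data": task_data,
}


def crawl(start_url):
    """Run the task queue until it is empty and collect the records."""
    queue = deque([("categories", start_url)])
    results = []
    while queue:
        name, payload = queue.popleft()
        if name == "result":
            results.append(payload)
        else:
            # Each handler passes its output on to the next task in the chain.
            queue.extend(HANDLERS[name](payload))
    return results


if __name__ == "__main__":
    for record in crawl("/"):
        print(record)
```

In Scrapy the same shape is usually a single spider with three callbacks, where each callback yields requests whose callback is the next stage, so there is no need to launch one spider from another.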
As for speed, it all depends on the site you are parsing. My spider with 200 threads easily parsed 5 million pages in less than an hour.
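
On the settings part of the question: Scrapy does expose concurrency and timeout options in the project's settings.py. The fragment below is a sketch only. The setting names are Scrapy's, but the values are illustrative assumptions, not recommendations, and the right numbers depend on what the target site tolerates.

```python
# settings.py (sketch; values are illustrative assumptions)
CONCURRENT_REQUESTS = 64             # total parallel requests (Scrapy default: 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 32  # per-site cap (Scrapy default: 8)
DOWNLOAD_DELAY = 0                   # no fixed pause between requests (default: 0)
DOWNLOAD_TIMEOUT = 15                # drop slow responses sooner (default: 180 s)
AUTOTHROTTLE_ENABLED = False         # AutoThrottle trades speed for politeness
COOKIES_ENABLED = False              # skip cookie handling if the site does not need it
RETRY_ENABLED = False                # do not re-request failed pages
LOG_LEVEL = "INFO"                   # less logging overhead than the default DEBUG
```

Raising the per-domain cap matters most when crawling a single site, since the global limit alone does not lift it.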

Jaz Bek, 2016-12-05
@smile_desu

By the way, I have a question. How can you send requests to a site's built-in search engine? When you query, for example, azb, it shows a list of the closest matches for that name, but limited to only 50 entries. How do I set up requests to this search engine so that they go through the entire database? Another problem is that in the browser the internal search takes about 2-5 seconds to process a query.
