Python
Anton Misyagin, 2019-08-08 10:46:38

How does Scrapy crawl pages?

Good afternoon, everyone. I want to understand how Scrapy downloads pages. In my project I initially listed a few start pages in the settings. As I understand it, Scrapy follows the links found on those pages, downloading several at a time in parallel. It runs, and then, bam, it stops. What happened? Who decided that was everything? Presumably it crawled everything reachable, though perhaps not every page I wanted; I reassure myself that it got them all. Then I run it on a schedule, so if pages are added to the site, I pick up the new data.

Describing a good list of start pages by hand is difficult, so I rewrote the start-URL generation block: now I take the URLs from the sitemap. But that gives me millions of start pages when I launch the crawl. If I start it with that many URLs, will it really run uninterrupted until it finishes them all? How do I control the number of pages crawled per Scrapy run?

How do you handle this? Do you save already-visited URLs to a file or database yourself, or can Scrapy do that on its own? By the time the whole list has been crawled, the information will be stale. It seems sensible to download the sitemap, crawl it for a week or so, then fetch a fresh sitemap and start over. Has anyone wrestled with the same questions? How did you solve it?
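For what it's worth, here is a minimal sketch of the setup described above, using Scrapy's stock SitemapSpider and the built-in CloseSpider extension to cap the number of pages per run. The sitemap URL, limits, and selectors are placeholders to adapt, not a definitive implementation:

```python
import scrapy
from scrapy.spiders import SitemapSpider


class SiteSpider(SitemapSpider):
    """Crawl pages discovered through the site's sitemap."""

    name = "site"
    # Placeholder sitemap URL; replace with the real one.
    sitemap_urls = ["https://example.com/sitemap.xml"]

    custom_settings = {
        # Stop the crawl after roughly this many downloaded responses
        # (enforced by the built-in CloseSpider extension).
        "CLOSESPIDER_PAGECOUNT": 10000,
        # Optionally also stop after a fixed time, in seconds.
        "CLOSESPIDER_TIMEOUT": 3600,
    }

    def parse(self, response):
        # Placeholder extraction; adjust selectors to the target pages.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
```

To persist the set of already-visited URLs between runs, Scrapy's own JOBDIR setting stores the scheduler queue and the duplicate filter on disk, e.g. `scrapy crawl site -s JOBDIR=crawls/site-1`; restarting with the same JOBDIR resumes where the previous run stopped instead of revisiting seen pages, so a separate file or database is not strictly required.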

