Wget
Eugene, 2017-03-02 18:12:07

How to get a list of site URLs (more than 2 million pages)?

There is a site, and I need to build a sitemap for its filter block. There are definitely more than 1-2 million pages.
Essentially all I need is a list of links in a text file.
The data I have:
1. The starting URL: https://www.site.com/category/
2. Fragments that must appear in the URLs I need: *tip-*, *vid-*, *shema-*, etc.
3. Fragments that must not appear in my URLs: *page=*
Items 2 and 3 apply both to the pages that are crawled for links and to the final list of URLs.
4. There is a VPS where I can put a copy of the site and run the crawler.
How do I solve this? It seems doable with wget; please help me put together a wget command (see the sketch below).
Initially I did it with ContentDownloader, but after about 1 million links it runs out of memory.
There is also a PHP + database option that would re-check link freshness at the required interval, add new links, delete stale ones and, when needed, export the up-to-date URLs for the current day. But that is also labor-intensive, even if it only means adapting code that is already 95% ready.
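A minimal wget sketch based on the constraints above, assuming a wget new enough to support --accept-regex/--reject-regex (GNU Wget 1.14+); the regexes, the log file name, and the grep step are illustrative, and the exact log format varies between wget versions:

# Crawl recursively without saving pages (--spider), fetch only URLs that
# contain the wanted fragments, skip paginated URLs, and log everything.
wget --spider --recursive --level=inf --no-verbose \
     --accept-regex='(tip-|vid-|shema-)' \
     --reject-regex='page=' \
     --output-file=crawl.log \
     https://www.site.com/category/

# Pull the unique URLs out of the log into a plain text file.
grep -oE 'https://www\.site\.com/[^ ]+' crawl.log | sort -u > urls.txt

Because --accept-regex/--reject-regex decide which URLs wget fetches at all, the same rules end up applying both to the pages being crawled and to the collected URLs, which is what items 2 and 3 ask for.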

1 answer
Dimonchik, 2017-03-02
@dimonchik2013

The easiest option is Scrapy (a minimal spider sketch follows below).
The cheapest is Wget, but with Wget you still have to post-process the output, it is single-threaded, and who knows what crawling algorithm it uses.
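A sketch of a Scrapy CrawlSpider that collects matching URLs; the spider name, domain, and URL fragments are placeholders taken from the question, not a tested implementation:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CategoryUrlSpider(CrawlSpider):
    name = "category_urls"                          # placeholder name
    allowed_domains = ["www.site.com"]              # domain from the question
    start_urls = ["https://www.site.com/category/"]

    rules = (
        # Follow and record only URLs containing the wanted fragments,
        # and skip anything paginated (page=).
        Rule(
            LinkExtractor(allow=(r"tip-", r"vid-", r"shema-"), deny=(r"page=",)),
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):
        # Yield just the URL; export it with, for example:
        #   scrapy runspider category_spider.py -o urls.csv
        yield {"url": response.url}

Exporting with -o gives the plain list of links the question asks for, and Scrapy's built-in request deduplication and concurrency are what should make it comfortable at the 1-2 million page scale mentioned above.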
