Sergey Bard, 2017-05-15 11:29:52
Google

How to crawl a Sitemap?

Hello. I'm building a service for my own use.
When a site is first added, I collect the following data for it: title, description, page load speed, and so on, and store everything in a database.
I get the speed from these URLs:

https://www.googleapis.com/pagespeedonline/v2/runPagespeed?strategy=desktop&url=site.com
https://www.googleapis.com/pagespeedonline/v2/runPagespeed?strategy=mobile&url=site.com

I collect the title, description, and all other information using phpQuery.
That is the first scan, done when a site is added. After that I scan every day (via cron) using the same scheme and compare the results with the values in the database. If anything has changed, I write it to a changes table and overwrite yesterday's scan. I designed it this way so that I can track changes to the sitemap every day without writing a full new set of data each day and making the database bigger and bigger; only the changes relative to yesterday are recorded.
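The compare-and-overwrite scheme described above can be sketched roughly as follows. This is a minimal illustration in Python rather than the PHP used in the question, with plain dicts and a list standing in for the database tables; the field names and the `diff_scan` / `apply_scan` helpers are assumptions made for the example, not part of the original setup.

```python
from datetime import date

# Fields compared between scans; these names are illustrative assumptions.
TRACKED_FIELDS = ("title", "description", "load_time_ms")


def diff_scan(previous, current, scan_date):
    """Return one change record per tracked field whose value differs."""
    changes = []
    for field in TRACKED_FIELDS:
        old, new = previous.get(field), current.get(field)
        if old != new:
            changes.append(
                {"date": scan_date, "field": field, "old": old, "new": new}
            )
    return changes


def apply_scan(snapshots, change_log, url, current, scan_date=None):
    """Log changes for `url`, then overwrite its stored snapshot."""
    scan_date = scan_date or date.today().isoformat()
    previous = snapshots.get(url)
    if previous is not None:
        for change in diff_scan(previous, current, scan_date):
            change_log.append({"url": url, **change})
    # The snapshot store always holds only the latest scan per URL.
    snapshots[url] = dict(current)
```

With this shape the snapshot store stays at one row per URL, and only the change log grows, in proportion to how often pages actually change.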
Everything works fine! But until today I had only run it on sites whose sitemap has no more than 1,000 URLs. Now I need to do the same for sites with more than 100,000 links in the sitemap. Scanning one site with ~1,000 URLs already takes quite a long time (I haven't measured it exactly), so with 5 sites of 100,000 URLs and 5 of 1,000, scanning them all will take a very long time (a day won't be enough).
Question 1: What is the best way to scan large sites? Once every three days, or once a week? Maybe someone has other ideas. And how will running such a script affect the server? If I run a script that scans more than 500,000 links, it will put a heavy load on the server, and who knows how that will affect the sites hosted on it.
Question 2: How else can I get the page load speed? Most of the crawl time is spent on the PageSpeed requests:
https://www.googleapis.com/pagespeedonline/v2/runPagespeed?strategy=desktop&url=site.com
https://www.googleapis.com/pagespeedonline/v2/runPagespeed?strategy=mobile&url=site.com

These requests take a very long time to return results.
I'd also appreciate advice on the overall scheme: is this a reasonable way to do it, or are there better options? I'd love to hear from everyone!


1 answer(s)
Eugene Volf, 2017-05-15
@Wolfnsex

"Who can tell me the best way to scan large sites?"
Don't laugh, but if you need maximum performance from this kind of processing, it is better to do it in C and/or in several threads.
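As a rough illustration of the "several threads" part only (not the C part), here is a minimal Python sketch using a thread pool. The `crawl` helper, its `fetch_page` parameter, and the worker count are assumptions made for this example; the work is mostly waiting on the network, which is why threads help even without rewriting anything in C.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl(urls, fetch_page=fetch, max_workers=8):
    """Fetch every URL in a thread pool; return (results, errors) dicts."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_page, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one bad URL must not stop the crawl
                errors[url] = exc
    return results, errors
```

For a sitemap with 100,000+ URLs it would probably make sense to feed the pool in batches and process each result as it arrives instead of keeping every page body in memory.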
"And how will running such a script affect the server?"
You can limit how much load the process puts on the server (machine), for example with (re)nice.
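The same effect as nice can be requested from inside the script itself. A minimal sketch, assuming a Unix-like system and a Python crawler; `lower_priority` is a hypothetical helper name, while `os.nice` is the standard-library call that adjusts the niceness of the current process.

```python
import os


def lower_priority(increment=10):
    """Raise this process's niceness by `increment`; return the new value.

    Unix only. A higher niceness means a lower CPU scheduling priority,
    so other processes on the machine get the CPU first.
    """
    return os.nice(increment)
```

Calling `lower_priority()` once at the start of the crawl is enough. Note that niceness only affects CPU scheduling, so it does not by itself limit network or disk usage.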
"Question 2: How else can I get the page load speed?"
You cannot get exactly the same result as GPS (Google PageSpeed). However you measure it, your result will differ from the GPS result for a number of reasons (I think they are obvious enough that I don't need to spell them out). But in general the logic of the process is quite simple:
0. Decide what you want to measure: the load time of the entire page or only the download time of the page's HTML.
1. Start a timer (for example, as described here for PHP).
2. Download the HTML.
3. If you need the time for the page to load fully, find all the links on the page that you care about and load them in a loop.
4. Stop the timer and read the result.
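The steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `measure_load_time`, `ResourceCollector`, and the choice of which tags count as "links" (images, scripts, stylesheets) are made up for the example, and resources are loaded one after another, so the number will not match what a real browser or GPS reports.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class ResourceCollector(HTMLParser):
    """Collect URLs of sub-resources: images, scripts, stylesheets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            if "stylesheet" in (attrs.get("rel") or "").lower():
                self.resources.append(urljoin(self.base_url, attrs["href"]))


def fetch(url, timeout=10):
    """Download one URL and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def measure_load_time(url, full_page=False, fetch_page=fetch):
    """Return seconds spent downloading the HTML and, optionally, resources."""
    start = time.perf_counter()            # step 1: start the timer
    html = fetch_page(url)                 # step 2: download the HTML
    if full_page:                          # step 3: load linked resources in a loop
        collector = ResourceCollector(url)
        collector.feed(html.decode("utf-8", errors="replace"))
        for resource in collector.resources:
            fetch_page(resource)
    return time.perf_counter() - start     # step 4: stop the timer
```

Step 0 corresponds to the `full_page` flag: leave it off to time only the HTML download, or turn it on to include the linked resources.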
P.S. What GPS uses is presumably (most likely) based on the Chromium browser and does not work as simply as you might expect. This is another reason why your timing and the GPS timing will differ; the real question is which time you want to measure. Within a single Toster answer (or any other answer, for that matter), it is hard to describe all the principles behind such systems. That would take at least a series of articles and, very likely, a good knowledge of C/C++ on the asker's part (in order to modify Chromium's sources appropriately).
