What is the easiest way to collect a bunch of links from a site?
Please suggest a wildly undemanding site parser in PHP. The task is utterly primitive: I need to collect product links and descriptions from an online store. I'm not doing this to steal content, but to build an XML file for Regmarkets.
I had already done this successfully with simple_html_dom, but on a different computer. Now only a mega-weak, old machine is available, so the library grinds away for about five minutes with nothing to show for it. The hang happens at the stage of parsing the code and searching it for the needed tags. I tried it on Denver and OpenServer; it does not depend on the server.
Perhaps it's worth writing one from scratch, but I've never written parsers, and it's probably faster to use a ready-made solution, as long as it is something very simple. What I need: get the product links from the catalog, follow each link, take the description from the desired div there, and save it all to Excel.
I don't know; I, on the contrary, have never used simple_html_dom and write everything with regular expressions. I find it very convenient and fast.
If you need to save all of this to Excel on the local machine, I would do it directly with Excel's own tools. Using WinHttp.WinHttpRequest.5.1, we fetch the page data:
'---------------------------------------------------------------------------------------
' Purpose   : Hit the server and fetch the result
'---------------------------------------------------------------------------------------
' sQuery    - request URL
' sResponse - response text, returned by reference
Function Runhttp(sQuery As String, ByRef sResponse As String) As Boolean
    On Error GoTo ErrorHandler
    Dim oHttp As Object

    Set oHttp = CreateObject("WinHttp.WinHttpRequest.5.1")
    With oHttp
        ' Synchronous GET request with browser-like headers
        .Open "GET", sQuery, False
        .SetRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.137 YaBrowser/17.4.1.955 Yowser/2.5 Safari/537.36"
        .SetRequestHeader "Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
        .SetRequestHeader "Accept-Language", "uk,ru;q=0.8,en;q=0.6"
        .SetRequestHeader "Connection", "keep-alive"
        .Send ""
    End With

    If oHttp.Status = 200 Then
        sResponse = oHttp.responseText
        Runhttp = True
    Else
        sResponse = CStr(oHttp.Status)
        Runhttp = False
    End If

ErrorExit:
    Set oHttp = Nothing
    On Error GoTo 0
    Exit Function

ErrorHandler:
    If Err.Number = -2147012889 Then ' no-connection error
        sResponse = "No connection"
    End If
    Runhttp = False
    Resume ErrorExit
End Function
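
For completeness, here is a minimal usage sketch showing how Runhttp might be paired with VBScript.RegExp to drop the collected links onto the active sheet. The catalog URL and the href pattern below are placeholders, not something taken from the store in question, so adjust them to the real markup.

' Minimal sketch: the URL and the regex pattern are placeholders
Sub CollectProductLinks()
    Dim sHtml As String
    Dim oRegExp As Object, oMatch As Object
    Dim lRow As Long

    ' Fetch the catalog page with the Runhttp function above
    If Not Runhttp("https://example.com/catalog", sHtml) Then Exit Sub

    ' Pull href values out of the HTML with a regular expression
    Set oRegExp = CreateObject("VBScript.RegExp")
    oRegExp.Global = True
    oRegExp.IgnoreCase = True
    oRegExp.Pattern = "<a[^>]+href=""([^""]+)"""

    ' Write each link found into column A of the active sheet
    lRow = 1
    For Each oMatch In oRegExp.Execute(sHtml)
        ActiveSheet.Cells(lRow, 1).Value = oMatch.SubMatches(0)
        lRow = lRow + 1
    Next oMatch

    Set oRegExp = Nothing
End Sub

Following each collected link and pulling the description out of the desired div works the same way: call Runhttp on every URL and run a second regular expression over the response.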