Programming
jaffrey, 2016-11-20 21:34:05

How to parse site pages?

You need to parse values from a bunch of site pages and write them to MySQL. The divs where the values are located are known in advance and their id/class does not change (the pages are static; only the information differs from page to page). Please tell me the easiest way to do this nowadays (maybe there are tools that simplify it).
I am superficially familiar with PHP, so links to materials on parsing requests / responses, etc. are very welcome.
Thank you.


4 answer(s)
bnytiki, 2016-11-20
@bnytiki

You are the fourth one this week.
But since you don't know how to use search, then...
Scrapy, for example, is designed exactly for this (getting information from sites; writing to MySQL is a separate task that Scrapy does not solve).
https://scrapy.org/
But this is for Python.
There is for Go
https://github.com/PuerkitoBio/gocrawl
https://github.com/PuerkitoBio/goquery
Surely there is something similar for PHP.
And you can also use ready-made services:
80legs, Mozenda.
They will scrape everything to your specification and hand it over in a convenient format; you then load the data from that format wherever you need it.
They have free trial plans.

Elena Stepanova, 2016-11-21
@Insolita

guzzle + phpQuery/nokogiri

Artyom, 2016-11-21
@Llaminator

You take python, you take xml, you watch tutorials, you're done.

APaMazur, 2016-11-24
@APaMazur

I would say that PHP is not the best fit for this task.
First, check whether the site exposes a normal AJAX API; you can see that in the browser's developer console.
If it doesn't and you have to parse the HTML, then the sensible approach today is probably Python + requests + BeautifulSoup (there are alternatives, but this combination definitely works, and works well):
Install Python (I prefer 2.7, but it doesn't matter much)
Install requests and BeautifulSoup
Install lxml
Then write something like this:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.mysite.com/1').content    # fetch the raw HTML
page = BeautifulSoup(page, 'lxml')    # parse it into a navigable tree
parsedData = page.findAll('div', {'class': 'my-data-class'})    # select tags by attribute (a class, in this example)
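The requests call above needs a live site. If you would rather avoid third-party packages, the same div extraction can be sketched self-contained with only the standard library's html.parser; the class name and markup here are made-up examples:

```python
# Minimal sketch, standard library only: collect the text of every
# <div class="my-data-class"> from an HTML string.
from html.parser import HTMLParser

class DivTextCollector(HTMLParser):
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # >0 while inside a matching <div>
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.depth or self.target_class in classes:
                self.depth += 1
                if self.depth == 1:
                    self.texts.append("")   # start a new entry

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data          # accumulate text inside the div

html = '<div class="my-data-class">42</div><div class="other">x</div>'
parser = DivTextCollector("my-data-class")
parser.feed(html)
print(parser.texts)  # ['42']
```

This is more work than BeautifulSoup for anything non-trivial, but it has zero dependencies.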

To export the data, if there is not much of it, CSV works well, for example:
import csv

csvfile = open('myfile.csv', 'w', newline='')   # on Python 2, use open('myfile.csv', 'wb')
writer = csv.writer(csvfile, delimiter=';', quotechar='"', quoting=csv.QUOTE_MINIMAL)
for tag in parsedData:
    writer.writerow([tag.get_text(strip=True)])   # writerow expects a list of fields
csvfile.close()
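Since the question asks to write the values into MySQL, here is a hedged sketch of the database step. It uses the standard-library sqlite3 module so it runs anywhere; with PyMySQL or mysqlclient the calls are nearly identical, except the parameter placeholder is %s instead of ?. The table and column names are invented for the example:

```python
import sqlite3

values = ["first value", "second value"]    # e.g. results of tag.get_text()

conn = sqlite3.connect(":memory:")          # for MySQL: pymysql.connect(host=..., db=...)
conn.execute("CREATE TABLE IF NOT EXISTS parsed (value TEXT)")
# Parameterized inserts avoid SQL injection from scraped text
conn.executemany("INSERT INTO parsed (value) VALUES (?)", [(v,) for v in values])
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM parsed").fetchone()[0])  # 2
```

Always use parameterized queries here: scraped text is untrusted input.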

RegExp and string operations may also be needed, but this is also simple and easy to google
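For instance, pulling a number out of an extracted string with the standard re module (the input string is a made-up example):

```python
import re

raw = "Price: 1 234 rub."
# Strip everything except digits, then convert to an integer
number = int(re.sub(r"[^\d]", "", raw))
print(number)  # 1234
```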
