Algorithms
Wade2k, 2017-07-25 22:59:33

Creating a spider robot to collect data - where to look for information?

The task is to develop a spider to collect data from websites.
It needs to crawl sites, extract the data, and store it in a database.
Are there ready-made solutions or frameworks, so as not to reinvent the wheel?
How have such problems been solved before?

3 answer(s)
Konstantin Chuvanov, 2017-07-26
@Wade2k

If you don't mind the python-way, then Flask + BeautifulSoup + SQLAlchemy
A book dedicated to your question
Flask guide on Habré
BeautifulSoup guide in Russian
SQLAlchemy guide in Russian
It was enough for me to import bs4 and pull the data directly in views.py:

from flask import render_template
from urllib.request import urlopen 
from urllib.error import HTTPError
from bs4 import BeautifulSoup

@app.route("/links/")
def parse():
  try:
    html = urlopen("http://www.site.ru/").read()
  except HTTPError as e:
    print(e)
    return render_template('template.html', links=[])  # html is undefined past this point, so bail out

  # 'lxml' requires the lxml package; the stdlib 'html.parser' also works
  soup = BeautifulSoup(html, 'lxml')
  links = soup.find_all('a')

  return render_template('template.html', links=links)

Season it with SQLAlchemy and you have an almost-ready RESTful microservice, or build an entire web application around it; Flask allows both.
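To round out the "season it with SQLAlchemy" step, here is a minimal sketch of persisting the scraped links. The thread suggests SQLAlchemy; for brevity this sketch uses the stdlib sqlite3 module instead, and the table name and columns are illustrative assumptions, not something from the thread.

```python
# Sketch: persisting scraped links. Table name and columns are
# assumptions for illustration; swap in SQLAlchemy models as needed.
import sqlite3

def save_links(conn, links):
    """links: iterable of (href, text) pairs collected by the spider."""
    conn.execute("CREATE TABLE IF NOT EXISTS links (href TEXT, text TEXT)")
    conn.executemany("INSERT INTO links VALUES (?, ?)", links)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM links").fetchone()[0]

conn = sqlite3.connect(":memory:")
n = save_links(conn, [("/a", "one"), ("/b", "two")])
print(n)  # → 2
```

In a real spider you would pass a file-backed connection instead of `:memory:` and call `save_links` from the view or a background job.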

Anton Reytarovsky, 2017-07-25
@Antonchik

Read about parsing, and about libraries that make it convenient to parse HTML.
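For a feel of what HTML parsing involves, here is a stdlib-only sketch that extracts `<a href>` values with `html.parser`; in practice, libraries like BeautifulSoup or lxml make this far more convenient. The sample HTML is made up for illustration.

```python
# Extract href attributes from <a> tags using only the stdlib.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

parser = LinkExtractor()
parser.feed('<p><a href="/a">one</a> <a href="/b">two</a></p>')
print(parser.links)  # → ['/a', '/b']
```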
