Parsing
n1kto31, 2020-10-03 23:40:40

How to find a word on a multi-page site?

Good day. I need to find all the pages of a site that contain a given word. For example, I need to find all the pages on Habr that contain the word "question". How can this be implemented? Please explain it as you would to a complete beginner.


3 answer(s)
LaraLover, 2020-10-03
@LaraLover

Option 1
Open the site and look for its sitemap; its location may be listed in robots.txt. Then open each page and search it for the word you need. Ideally, collect every link, parse each page for further links, and save them all in a database.
Option 2
Search Google with the query site:habr.com "question"
This returns all the INDEXED pages; save them to a database and parse them.
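
For the "search each page for the word" step in Option 1, here is a minimal sketch in Python using only the standard library. The function name page_contains_word is made up for illustration, and it assumes you already have the page's HTML as a string. It strips the tags, ignores script and style contents, and looks for the word as a whole word, case-insensitively.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style blocks.
        if self._skip_depth == 0:
            self.chunks.append(data)


def page_contains_word(html, word):
    """Hypothetical helper: True if `word` occurs as a whole word in the page text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None
```

Whole-word matching is a design choice here: "questionable" does not count as a hit for "question". Drop the \b markers if you want substring matches instead.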

Alexey Gnevyshev, 2020-10-04
@iResource

As already noted, you can analyze the sitemap, if one exists. It is usually the sitemap.xml file at the root of the site, but the file name may be different.
Once you have the list of pages from the sitemap, you can automate scanning them.
This script is designed for exactly that: blog.inform-resource.ru
I have used it more than once and it works well. It may help you too.
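
To illustrate getting the list of pages from a sitemap, here is a rough Python sketch (not the script mentioned above). It assumes the sitemap follows the standard sitemaps.org XML format and that you have already downloaded it as text; the function name urls_from_sitemap is made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text):
    """Hypothetical helper: return every <loc> value found in a sitemap.

    For a regular sitemap these are page URLs. For a sitemap index file
    they are URLs of nested sitemaps, which would need to be downloaded
    and parsed the same way.
    """
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
```

Each returned URL can then be downloaded and checked for the word you are looking for.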

Uno, 2020-10-09
@Noizefan

Sitemaps and so on are wonderful, of course.
But what you need to implement is a crawler: a spider that takes a single link to the site as input, extracts all the internal links from that page, repeats the same with each of them, and so on until the whole site has been indexed.
Voila: you have a database with all the pages of the site, and you can do whatever you want with it.
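
A spider like the one described could look roughly like this in Python, standard library only. This is a sketch under assumptions, not a finished tool: the names crawl and fetch_html and the max_pages limit are invented for the example, "internal" is taken to mean "same host as the start URL", and the pages are kept in a dictionary rather than a real database.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class _LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Download a page; return '' if the request fails."""
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return ""


def crawl(start_url, fetch=fetch_html, max_pages=100):
    """Breadth-first crawl of same-host links; returns {url: html}."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        parser = _LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop the #fragment part.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

After crawl() returns, filtering is a one-liner, for example [u for u, html in pages.items() if "question" in html.lower()]. On a real site you would also want a pause between requests and a sensible max_pages so the crawl does not hammer the server.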
