How to find a word on a multi-page site?
Good day. I need to find every page on a site that contains a given word. For example, I need to find all the pages on Habr that contain the word "question". How can this be done? Please explain it as you would to a complete beginner.
Option 1
Open the site and look for its sitemap; its location may be listed in robots.txt. Then open each page from the sitemap and check whether it contains the word. Better still, collect all the links, parse each page for further links, and save them in a database.
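A minimal sketch of the two building blocks of Option 1, assuming Python and only the standard library. The function names `sitemap_urls_from_robots` and `page_contains_word` are made up for this illustration; downloading the pages is left out here.

```python
import re
from html.parser import HTMLParser


def sitemap_urls_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs listed on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "https:" stays intact.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def page_contains_word(html: str, word: str) -> bool:
    """Case-insensitive whole-word search in the visible text of a page."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None
```

The whole-word match is a design choice: "question" will not match "questions". Drop the `\b` anchors if substring matches are wanted.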
Option 2
Search Google for site:habr.com "question"
This returns only the pages Google has INDEXED; save those results to a database and parse them.
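For illustration, the search query from Option 2 can be turned into a link that opens in a browser. This is only a sketch that builds the URL; the helper name `google_site_search_url` is hypothetical, and it does not download or parse the results.

```python
from urllib.parse import urlencode


def google_site_search_url(site: str, word: str) -> str:
    """Build a Google search URL restricted to one site and an exact phrase."""
    query = f'site:{site} "{word}"'
    return "https://www.google.com/search?" + urlencode({"q": query})


print(google_site_search_url("habr.com", "question"))
```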
As already noted, you can analyze the sitemap, if one exists. It is usually the sitemap.xml file at the root of the site, but it may have a different name or location.
Once you have the list of pages from the sitemap, you can automate scanning them.
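The scan over a sitemap could look roughly like this. It is a sketch under several assumptions: the sitemap follows the standard sitemaps.org XML format, it is not gzip-compressed, and a plain case-insensitive substring check on the raw HTML is good enough. The names `fetch`, `parse_sitemap` and `pages_with_word` are invented for the example.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol, in ElementTree's {uri}tag form.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def fetch(url: str) -> str:
    """Download a URL and return its body as text."""
    req = urllib.request.Request(url, headers={"User-Agent": "word-finder-sketch"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def parse_sitemap(xml_text: str) -> tuple[list[str], list[str]]:
    """Return (page URLs, nested sitemap URLs) found in one sitemap file."""
    root = ET.fromstring(xml_text)
    locs = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]
    if root.tag == SITEMAP_NS + "sitemapindex":
        return [], locs  # a sitemap index only points at other sitemaps
    return locs, []


def pages_with_word(sitemap_url: str, word: str, fetch_text=fetch) -> list[str]:
    """Walk a sitemap (and any nested sitemaps) and list pages containing word."""
    matches, seen, queue = [], set(), [sitemap_url]
    while queue:
        current = queue.pop()
        if current in seen:
            continue
        seen.add(current)
        pages, children = parse_sitemap(fetch_text(current))
        queue.extend(children)
        for page in pages:
            # Substring check on raw HTML: simple, but also matches markup.
            if word.lower() in fetch_text(page).lower():
                matches.append(page)
    return matches
```

A call such as `pages_with_word("https://example.com/sitemap.xml", "question")` would then return the matching page URLs. On a large site it is polite to add a pause between requests.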
This script is designed for exactly that: blog.inform-resource.ru
I have used it more than once and it works well. It may help you too.
Sitemaps and so on are all very well, of course.
What you need is a crawler (a spider): a program that takes one link to the site as input, extracts all internal links from that page, repeats the same for each of them, and so on until the whole site is indexed.
Voila: you have a database with all the pages of the site, and you can do whatever you want with it.
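A bare-bones version of such a spider might look like the sketch below. The assumptions: "internal" means "same host as the start URL", a page cap (`max_pages`, a made-up parameter) stops the crawl on big sites, and matching is a simple case-insensitive substring check. It ignores robots.txt rules and rate limiting, which a real crawler should respect. All function and class names are invented for the example.

```python
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def download(url: str) -> str:
    """Download a URL and return its body as text."""
    req = urllib.request.Request(url, headers={"User-Agent": "word-finder-sketch"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url: str, word: str, fetch_text=download, max_pages: int = 200) -> list[str]:
    """Breadth-first crawl of one host; return URLs whose HTML contains word."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    matches = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        try:
            html = fetch_text(url)
        except OSError:
            continue  # network or HTTP error: skip this page
        if word.lower() in html.lower():
            matches.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute = urldefrag(urljoin(url, href)).url
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return matches
```

The `seen` set is what keeps the spider from looping forever when pages link back to each other.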