Does such an xml or html parser exist?

HTML

itfan, 2020-10-30 21:49:19

Many languages have libraries for parsing HTML or XML, but is there a ready-made product for this case? Say I download 1000 HTML pages with something like Teleport Pro. I then point the program at the folder with the files and give it a template for extraction, for example: take the contents of the headings or lists from each file. Has anyone seen such a tool? I'm also interested in a ready-made solution for extracting data from XML.

3 answer(s)

Nadim Zakirov, 2020-10-30
@zkrvndm

This problem can be solved in any programming language, but you will not find a ready-made solution; you will have to write it yourself. I would write such a parser in JavaScript and simply package it as a small local HTML file: open the file in a browser, pick the folder from disk in an input type="file" field, then read all the files from the selected folder with JavaScript and parse them with new DOMParser().
Why JavaScript and not PHP or Python? Simply because JavaScript is the best-suited language for parsing HTML. The browser gives you a rich set of tools for working with HTML out of the box, and no other language works with HTML as well as JavaScript; after all, it was literally created for this.

Sergey Karbivnichy, 2020-10-30
@hottabxp

Maybe there is, but there is no "magic" button. You need to know a little about the structure of an HTML document. I do this in Python. You can learn parsing in Python in a week or two, and faster if you already know other programming languages. If you write such a parser yourself, it will have unlimited possibilities.
Here is an example:

import os
from bs4 import BeautifulSoup

def parsing(filename):
  # Read one saved page; the pages are assumed to be UTF-8
  with open(filename, encoding='utf-8') as file:
    data = file.read()

  soup = BeautifulSoup(data, "html.parser")
  # The question title on this site is an <h1 class="question__title">
  title = soup.find('h1', class_='question__title')
  if title is not None:  # skip files without such a heading
    print(title.text.strip())

os.chdir('html')
fileList = os.listdir('./')

for file in fileList:
  parsing(file)

We download several pages from this site into the html folder, run the script, and it prints the question titles to the console. You can save the titles (and other data) to a file or a database.
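
The same loop can be generalized toward what the question asks for: a folder plus a "template" saying which elements to pull out. Below is a minimal sketch that uses only the standard library's html.parser instead of BeautifulSoup. The names TagTextCollector and sample_folder, the default tag list, and the *.html file pattern are illustrative assumptions, not part of any existing tool. It also assumes the pages close their tags explicitly (an unclosed li will confuse it) and are readable as UTF-8.

```python
from html.parser import HTMLParser
from pathlib import Path


class TagTextCollector(HTMLParser):
    """Collect the text inside every element whose tag name is in `wanted`.

    Hypothetical helper for illustration only.
    """

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.depth = 0       # > 0 while we are inside a wanted element
        self.current = None  # tag name of the outermost wanted element
        self.buffer = []     # text pieces of the element being read
        self.results = []    # (tag, text) pairs found so far

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            if self.depth == 0:
                self.current = tag
                self.buffer = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.wanted and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace into single spaces
                text = " ".join("".join(self.buffer).split())
                self.results.append((self.current, text))

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def sample_folder(folder, wanted=("h1", "h2", "h3", "li")):
    """Yield (file name, tag, text) for every wanted element in folder/*.html."""
    for path in sorted(Path(folder).glob("*.html")):
        collector = TagTextCollector(wanted)
        # Assumption: pages are UTF-8; undecodable bytes are replaced
        collector.feed(path.read_text(encoding="utf-8", errors="replace"))
        collector.close()
        for tag, text in collector.results:
            yield path.name, tag, text
```

Iterating over sample_folder("html") would then give one (file name, tag, text) row per heading or list item, which can be printed or written to a file. Nested wanted elements (a list inside a list item) are merged into the outer element's text. For anything beyond tag names, such as classes or CSS selectors, BeautifulSoup as in the script above is the more comfortable route.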

Dimonchik, 2020-10-31
@dimonchik2013

As I recall, Content Downloader had something similar, but it is simpler to do this with scripts.
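
For the XML half of the question, the script route might look like the sketch below. It uses Python's standard xml.etree.ElementTree, with the "template" being one of the limited XPath expressions that findall accepts. The function name sample_xml and the *.xml file pattern are assumptions made for illustration. Files that are not well-formed XML are skipped, and documents that use namespaces would need namespace-qualified paths.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def sample_xml(folder, path_expr):
    """Yield (file name, text) for each element matching path_expr in folder/*.xml.

    Hypothetical helper for illustration only.
    """
    for path in sorted(Path(folder).glob("*.xml")):
        try:
            root = ET.parse(str(path)).getroot()
        except ET.ParseError as err:
            # Not well-formed XML: report it and move on to the next file
            print(f"skipping {path.name}: {err}")
            continue
        for element in root.findall(path_expr):
            # Join the text of the element and its children, tidy whitespace
            text = " ".join("".join(element.itertext()).split())
            yield path.name, text
```

For example, sample_xml("feeds", ".//item/title") would yield the title text of every item element in each file of a feeds folder (the folder name here is only an example).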