What can be used to extract the contents of certain tags from hundreds of HTML documents and place them in one text document?
Greetings, Habr dwellers.
I'm asking knowledgeable people for advice.
I have the following problem.
There are hundreds of similar HTML documents.
I need to extract the entire contents of the body tag (XPath html/body/text()) from each of them
and put it all into one text file.
Then, in that file, apply a dozen or so automatic replacements to
bring the formatting to the desired form,
for example s/<br>/<br>\n/
They advise different things. Learn Perl or PHP.
Learn shell.
Please advise what is the best way to do this.
I just don't want to hammer nails with a microscope.
Depends on the complexity of the pages - maybe you can get by with a simple grep.
Regexps, XML... don't overcomplicate things. BeautifulSoup solves this problem with a bang. It parses any HTML, even the most mangled.
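A minimal sketch of that approach (the file and directory names are just an illustration, assuming the HTML files sit in the current directory):

from bs4 import BeautifulSoup  # pip install beautifulsoup4
import glob

with open("results.txt", "w", encoding="utf-8") as out:
    for fname in glob.glob("*.html"):
        with open(fname, encoding="utf-8") as f:
            soup = BeautifulSoup(f, "html.parser")  # copes with broken markup as well
        if soup.body is not None:            # skip files without a <body>
            out.write(soup.body.get_text())
            out.write("\n")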
With the help of HTML parsers in Java.
For example, here are a couple of them:
HTMLParser
Jericho HTML Parser
If you go without parsers and the pages are not very complex, it can be done in any language that supports regular expressions: Perl, Python, PHP and others... I personally love Perl... though, to be honest, I use PHP more)))
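If the pages really are that simple, a regex-only version in Python might look roughly like this (the pattern assumes one well-formed <body> per file; the names are illustrative):

import glob
import re

body_re = re.compile(r"<body[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)

with open("results.txt", "w", encoding="utf-8") as out:
    for fname in glob.glob("*.html"):
        text = open(fname, encoding="utf-8").read()
        m = body_re.search(text)
        if not m:
            continue
        body = m.group(1)
        # the kind of clean-up substitution mentioned in the question
        body = re.sub(r"<br\s*/?>", "<br>\n", body, flags=re.IGNORECASE)
        out.write(body)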
It seems to me that a DOM tree is just the thing here. The easiest way is to implement it in Java.
hmm… lxml.de/lxmlhtml.html
from lxml import html
import os

with open("../results.txt", "w") as f:
    for fname in os.listdir('./'):
        if not fname.endswith(".html"):  # only the HTML documents
            continue
        tree = html.parse(fname)
        body_content = tree.xpath("//body")[0]
        all_body_text = body_content.text_content()  # only the text of all descendant-or-self nodes
        body_content_with_markup = html.tostring(body_content, encoding="unicode")  # text plus markup of descendant-or-self
        result = some_processing(all_body_text)  # some_processing stands for the extra replacements and manipulations on either of the previous results
        f.write(result)  # dump everything into one file
xmllint --html --xpath "//body" file.html
Learn sed or perl. At least you won't be acquiring useless knowledge that nobody needs once this particular problem is solved.
Why learn anything for a task like this? Order it from a freelancer; you'll get a program for 10 bucks.
It seems to me that the basic skill for working with piles of text is regular expressions (see J. Friedl's books). And then there are tools that let you work with regexps more or less conveniently. Under Windows, my hero is the PowerGREP mega-combine!
And if you don't want to learn anything at all, you can combine all the files into one with "copy *.html alltext.txt" and then torture the result in text editors and sort it in Excel.
Python with Scrapy:
scrapy.org
There's even a way to watch how the daemon is doing through a web interface :)
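For the record, a rough idea of what a minimal spider could look like with a recent Scrapy version (the file:// URLs are placeholders for the local documents):

import scrapy

class BodyTextSpider(scrapy.Spider):
    name = "bodytext"
    # point these at the actual local files
    start_urls = ["file:///path/to/page1.html", "file:///path/to/page2.html"]

    def parse(self, response):
        # //body//text() collects every text node under <body>
        text = " ".join(response.xpath("//body//text()").getall())
        yield {"file": response.url, "text": text}

Run it with something like "scrapy runspider spider.py -o result.jl" and post-process the output into a single text file.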