What can be used to extract the contents of certain tags from hundreds of HTML documents and place them in one text document?
Greetings, Habr dwellers.
I'm asking knowledgeable people for advice.
I have the following problem.
There are hundreds of similar HTML documents.
I need to extract the entire contents of the body tag (XPath html/body/text()) from each of them
and put it all into one text file.
Then, in that file, apply a dozen or so automatic replacements to
bring the formatting to the desired form,
for example s/<br>/<br>\n/
They advise different things. Learn Perl or PHP.
Learn shell.
Please advise what is the best way to do this.
I just don't want to hammer nails with a microscope.
Depends on the complexity of the pages - maybe you can get by with a simple grep.
Regexps, XML... don't overcomplicate things. BeautifulSoup solves this problem with a bang. It parses any HTML, even the most mangled.
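A minimal sketch of that approach (the file and directory names are just an illustration, assuming the HTML files sit in the current directory):

from bs4 import BeautifulSoup  # pip install beautifulsoup4
import glob

with open("results.txt", "w", encoding="utf-8") as out:
    for fname in glob.glob("*.html"):
        with open(fname, encoding="utf-8") as f:
            soup = BeautifulSoup(f, "html.parser")  # copes with broken markup as well
        if soup.body is not None:            # skip files without a <body>
            out.write(soup.body.get_text())
            out.write("\n")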
With the help of HTML parsers in Java.
For example, here are a couple of them:
HTMLParser
Jericho HTML Parser
If you go without parsers and the pages are not very complex, it can be done in any language that supports regular expressions: Perl, Python, PHP and others... I personally love Perl... though, to be honest, I use PHP more)))
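If the pages really are that simple, a regex-only version in Python might look roughly like this (the pattern assumes one well-formed <body> per file; the names are illustrative):

import glob
import re

body_re = re.compile(r"<body[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)

with open("results.txt", "w", encoding="utf-8") as out:
    for fname in glob.glob("*.html"):
        text = open(fname, encoding="utf-8").read()
        m = body_re.search(text)
        if not m:
            continue
        body = m.group(1)
        # the kind of clean-up substitution mentioned in the question
        body = re.sub(r"<br\s*/?>", "<br>\n", body, flags=re.IGNORECASE)
        out.write(body)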
It seems to me that a DOM tree is just the thing here. The easiest way is to implement it in Java.
hmm… lxml.de/lxmlhtml.html
from lxml import html
import os

with open("../results.txt", "w") as f:
    for fname in os.listdir('./'):
        if not fname.endswith(".html"):  # only the HTML documents
            continue
        tree = html.parse(fname)
        body_content = tree.xpath("//body")[0]
        all_body_text = body_content.text_content()  # only the text of all descendant-or-self nodes
        body_content_with_markup = html.tostring(body_content, encoding="unicode")  # text plus markup of descendant-or-self
        result = some_processing(all_body_text)  # some_processing stands for the extra replacements and manipulations on either of the previous results
        f.write(result)  # dump everything into one file
xmllint --html --xpath "//body" file.html
Learn sed or perl. At least you won't be acquiring useless knowledge that nobody needs once this particular problem is solved.
Why learn anything for a task like this? Order it from a freelancer; you'll get a program for 10 bucks.
It seems to me that the basic skill for working with piles of text is regular expressions (see J. Friedl's books). And then there are tools that let you work with regexps more or less conveniently. Under Windows, my hero is the PowerGREP mega-combine!
And if you don't want to learn anything at all, you can combine all the files into one with "copy *.html alltext.txt" and then torture the result in text editors and sort it in Excel.
Python with Scrapy:
scrapy.org
There's even a way to watch how the daemon is doing through a web interface :)
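For the record, a rough idea of what a minimal spider could look like with a recent Scrapy version (the file:// URLs are placeholders for the local documents):

import scrapy

class BodyTextSpider(scrapy.Spider):
    name = "bodytext"
    # point these at the actual local files
    start_urls = ["file:///path/to/page1.html", "file:///path/to/page2.html"]

    def parse(self, response):
        # //body//text() collects every text node under <body>
        text = " ".join(response.xpath("//body//text()").getall())
        yield {"file": response.url, "text": text}

Run it with something like "scrapy runspider spider.py -o result.jl" and post-process the output into a single text file.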