HTML
codecity, 2016-02-07 03:30:34

What's new about HTML parsing?

How are things with HTML parsing these days? For automating requests, Selenium WebDriver has gained popularity: with its Firefox plugin, a sequence of requests can be recorded automatically.
Parsing is harder, though: XPath expressions have to be written by hand. For example, saving a table that is spread across several pages is no longer so simple.
I seem to remember more convenient tools with visual configuration: you click on the desired element and the program generates the XPath itself. But now I can't find anything like that.
Can anyone recommend something?
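(For illustration only: a minimal Python/lxml sketch of the kind of hand-written XPath scraping meant here. The URL template, the page count and the table markup are made-up placeholders, not a real site.)

# Sketch: scraping a paginated table with hand-written XPath.
# The URL template, the number of pages and the //table[@id='data'] markup
# are hypothetical placeholders.
import requests
from lxml import html

rows = []
for page in range(1, 4):
    resp = requests.get(f"https://example.com/list?page={page}")
    tree = html.fromstring(resp.text)
    # The expression below has to be written by hand and breaks
    # whenever the markup changes - the inconvenience described above.
    for tr in tree.xpath("//table[@id='data']//tr[td]"):
        rows.append([td.text_content().strip() for td in tr.xpath("./td")])
print(rows)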

2 answers
Alexander Taratin, 2016-02-07
@Taraflex

> XPath expressions have to be written by hand

CSS selectors are easier.
For PHP: https://github.com/olamedia/nokogiri
For other languages, googling analogues is not a problem either.
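A rough sketch of the same CSS-selector idea; BeautifulSoup in Python stands in here for the PHP nokogiri port linked above, and the table markup is invented:

# Sketch: extracting a table with CSS selectors instead of hand-written XPath.
# BeautifulSoup is a substitution for the PHP library above; markup is made up.
from bs4 import BeautifulSoup

html_doc = """
<table id="data">
  <tr><td class="name">Alice</td><td class="age">30</td></tr>
  <tr><td class="name">Bob</td><td class="age">25</td></tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
for row in soup.select("table#data tr"):
    print([td.get_text(strip=True) for td in row.select("td")])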

VZVZ, 2016-02-07
@VZVZ

> For automating requests, Selenium WebDriver has gained popularity.
I like PhantomJS more.
There is also the option of bare HTTP requests: take a sniffer like Fiddler, track the requests, and then imitate them in your programming language. Under Windows, C# is usually used for this.
That takes more effort to withstand bot detection by the server, and it may be less robust to changes on the server, but it is fast, nothing extra has to be installed on the machine, and the requests can be made from any familiar language without crutches.
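A rough sketch of that "imitate the requests" approach, with Python's requests standing in for C#; the URL, headers and cookie values are placeholders for whatever the sniffer actually showed:

# Sketch: replaying a sniffed request directly, without a browser.
# Endpoint, headers and cookie values are hypothetical - in practice
# they are copied from what Fiddler (or another sniffer) captured.
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0",            # pretend to be a normal browser
    "Referer": "https://example.com/list",
})
session.cookies.set("sessionid", "value-copied-from-the-sniffer")

resp = session.post(
    "https://example.com/api/search",
    data={"query": "test", "page": 1},
)
print(resp.status_code, resp.text[:200])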
Don't try to oversimplify everything.
In many fields there are two paths. You can be a specialist who knows things from the inside and uses tools at different levels, or you can only poke buttons, record macros and so on, i.e. only use ready-made ultra-high-level tools.
With the second approach you won't get far.
That in no way means a specialist always does everything hardcore, writes in assembly, and so on. Real specialists also value convenience, quality and abstraction. But if the simple super-high-level tool they need simply does not exist for the task, a specialist can write it themselves (as the saying goes, "give a person what he needs and he will want comfort"), whereas what will an amateur do, who has only skimmed the surface and can only use ready-made tools? Nothing.
> Parsing is harder, though: XPath expressions have to be written by hand
What are you talking about?
Where do you want to parse HTML from?
I don’t know about Selenium, but all the features of the JavaScript DOM API are available from the PhantomJS bot.
And for C# there is a good library for parsing HTML and CSS, AngleSharp: it has not only GetElementById but also lookup by class, by tag and by CSS selectors, and in general everything seems to be the same as in the DOM standard. True, it is slower than the usual HtmlAgilityPack (where everything except GetElementById is done with XPath).
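Not C#, but the same contrast sketched in Python with lxml (a substitution for AngleSharp / HtmlAgilityPack): the same cell reached once through a CSS selector and once through XPath, on invented markup.

# Sketch: the same element located via a CSS selector and via XPath.
# lxml stands in for AngleSharp / HtmlAgilityPack; the markup is invented.
# cssselect() additionally requires the 'cssselect' package.
from lxml import html

doc = html.fromstring('<div id="prices"><span class="price">42</span></div>')
print(doc.cssselect("#prices span.price")[0].text)                      # CSS selector
print(doc.xpath('//div[@id="prices"]/span[@class="price"]/text()')[0])  # XPath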
