Open-source HTML parsers?
I use libxml2 to parse HTML. On the whole I'm satisfied with it, but I'd like something faster.
I've looked at some open-source search engines (Xapian, Dataparksearch); they have their own parsers. I'm not yet ready to dig into their source code and adapt it to my needs, though I'm getting close to that.
Does anyone know of other open-source parsers that are lighter and faster than libxml2? Neither Google nor Yandex has been much help; maybe I'm just not searching for the right thing.
Why not use regular expressions if you just need to pull pieces out of the page? Getting the title is /&lt;title&gt;(\w+)&lt;\/title&gt;/gi, and collecting links is something like /&lt;a[^&gt;]*href="([^&gt;"]*)"[^&gt;]*&gt;(\w+)&lt;\/a&gt;/gi (though this one breaks if the link text contains other tags). Sit down and brainstorm over them, and it will probably work.
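For illustration, here is the regex approach above sketched in Python (the HTML snippet is a made-up example; as the answer notes, `\w+` only matches single-word titles and link text with no nested tags):

```python
import re

# Toy HTML snippet for demonstration (made-up example).
html = ('<html><head><title>Example</title></head>'
        '<body><a href="/about">About</a> <a href="/news">News</a></body></html>')

# Pull out the <title> contents (single-word titles only).
title = re.search(r'<title>(\w+)</title>', html, re.I)

# Collect (href, text) pairs; fails if the link text contains other tags.
links = re.findall(r'<a[^>]*href="([^>"]*)"[^>]*>(\w+)</a>', html, re.I)

print(title.group(1))  # Example
print(links)           # [('/about', 'About'), ('/news', 'News')]
```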
You're unlikely to get anything faster than a parser you write yourself, tailored to a specific purpose.
Do you have some very specific and complex task that requires libxml? Maybe it's just my crooked hands, but every time I've tried to parse complex XML with it, I found that doing it by hand was both faster and more reliable :)
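As a rough sketch of the "do it by hand" idea: Python's standard-library `html.parser` lets you write a tiny, purpose-built extractor without a full DOM. This is only an illustration (the input string is invented), assuming all you need is the `href` of every link:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Minimal hand-rolled extractor: collects href values of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)

# Made-up input for demonstration.
parser = LinkCollector()
parser.feed('<p><a href="/a">one</a> <a class="x" href="/b">two</a></p>')
print(parser.links)  # ['/a', '/b']
```

Because the parser does only one thing, there is no tree to build or traverse, which is where the speed of a purpose-built parser comes from.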
phpQuery has a lot of functionality, but its speed leaves something to be desired. It's better to convert the HTML to XML and process it with XSLT; I find the speed quite satisfactory.
I'm also interested in parsers. These might work for you: Grab, Scrapy, or PHP HTML DOM parser.