Open-source HTML parsers?
I use libxml2 to parse HTML. On the whole I'm satisfied with it, but I'd like something faster.
I've looked at some open-source search engines (Xapian, Dataparksearch); they have their own parsers. I'm not yet ready to dig into their source code and adapt it to my needs, though I'm getting close to that.
Does anyone know of other open-source parsers that are lighter and faster than libxml2? Neither Google nor Yandex has been much help; maybe I'm just not searching for the right thing.
Why not use regular expressions if you just need to pull pieces out of the page? Getting the title is /&lt;title&gt;(\w+)&lt;\/title&gt;/gi, and collecting links is something like /&lt;a[^&gt;]*href="([^&gt;"]*)"[^&gt;]*&gt;(\w+)&lt;\/a&gt;/gi (though this one breaks if the link text contains other tags). Sit down and brainstorm over them, and it will probably work.
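For illustration, here is the regex approach above sketched in Python (the HTML snippet is a made-up example; as the answer notes, `\w+` only matches single-word titles and link text with no nested tags):

```python
import re

# Toy HTML snippet for demonstration (made-up example).
html = ('<html><head><title>Example</title></head>'
        '<body><a href="/about">About</a> <a href="/news">News</a></body></html>')

# Pull out the <title> contents (single-word titles only).
title = re.search(r'<title>(\w+)</title>', html, re.I)

# Collect (href, text) pairs; fails if the link text contains other tags.
links = re.findall(r'<a[^>]*href="([^>"]*)"[^>]*>(\w+)</a>', html, re.I)

print(title.group(1))  # Example
print(links)           # [('/about', 'About'), ('/news', 'News')]
```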
You're unlikely to get anything faster than a parser you write yourself, tailored to a specific purpose.
Do you have some very specific and complex task that requires libxml? Maybe it's just my crooked hands, but every time I've tried to parse complex XML with it, I found that doing it by hand was both faster and more reliable :)
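As a rough sketch of the "do it by hand" idea: Python's standard-library `html.parser` lets you write a tiny, purpose-built extractor without a full DOM. This is only an illustration (the input string is invented), assuming all you need is the `href` of every link:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Minimal hand-rolled extractor: collects href values of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)

# Made-up input for demonstration.
parser = LinkCollector()
parser.feed('<p><a href="/a">one</a> <a class="x" href="/b">two</a></p>')
print(parser.links)  # ['/a', '/b']
```

Because the parser does only one thing, there is no tree to build or traverse, which is where the speed of a purpose-built parser comes from.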
phpQuery has a lot of functionality, but its speed leaves something to be desired. It's better to convert the HTML to XML and process it with XSLT; I find the speed quite satisfactory.
I'm also interested in parsers. These might work for you: Grab, Scrapy, or PHP HTML DOM parser.