Python
Denis Egorov, 2012-05-11 18:37:29

Beautiful Soup, html5lib or lxml?

It will be used to parse user-generated content, so the main requirement is correct handling of broken HTML. Speed is not critical. The lxml documentation mentions:

  1. BeautifulSoup Parser
  2. html5lib Parser

I.e. lxml can parse with these libraries and still return an lxml tree. The html5lib docs say:
Support for minidom, ElementTree (including cElementTree and lxml.etree), BeautifulSoup (deprecated) and custom simpletree output formats
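For illustration, the html5lib-to-lxml combination might look like this (a sketch, assuming html5lib and lxml are installed; note that html5lib's lxml treebuilder puts elements into the XHTML namespace by default):

```python
# Parse broken HTML with html5lib's spec-compliant algorithm,
# but get an lxml tree back for further processing.
import html5lib

NS = "{http://www.w3.org/1999/xhtml}"  # default namespace of the lxml treebuilder

tree = html5lib.parse("<p>first<p>second <b>bold", treebuilder="lxml")
body = tree.getroot().find(NS + "body")

# html5lib has added the implied <html>/<head>/<body> structure
# and closed the unclosed <p> and <b> tags.
tags = [el.tag.replace(NS, "") for el in body]
```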

I will most likely need to walk the entire DOM, so I think SAX would be convenient. Something like this: run through the document with SAX and build a new tree, applying filtering/transformation rules and text typography along the way.

So I'm undecided. What would you recommend?
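For context, the SAX-style pass I have in mind might look like this minimal sketch, using only the standard library's lenient html.parser (the upper-casing of text inside <b> is just a stand-in for real typography rules):

```python
# Event-driven walk over possibly-broken HTML: html.parser is lenient,
# so callbacks fire even for unclosed tags.
from html.parser import HTMLParser

class Typograph(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_bold = False
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_bold = False

    def handle_data(self, data):
        # illustrative transform rule: shout inside <b>
        self.out.append(data.upper() if self.in_bold else data)

p = Typograph()
p.feed("<p>plain <b>loud")   # note: neither <p> nor <b> is ever closed
print("".join(p.out))        # → plain LOUD
```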


4 answer(s)
kmike, 2012-05-11
@ur001

html5lib has the most correct and reliable parser (it follows the spec), but it is slow. lxml is the fastest and parses quite well. Instead of SAX you can use iterparse; it is often more convenient and often faster.
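The iterparse approach could look like this (a sketch, assuming lxml is installed; the upper-casing rule is just an illustration):

```python
# iterparse gives SAX-like events but hands you real elements,
# so transform rules can modify the tree in place while it is built.
from io import BytesIO
from lxml import etree

broken = b"<p>user <b>content<p>next paragraph"
ctx = etree.iterparse(BytesIO(broken), events=("end",), html=True)
for event, el in ctx:
    if el.tag == "b":
        el.text = (el.text or "").upper()  # illustrative transform rule

root = ctx.root  # the repaired tree is available after the run
```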

alternativshik, 2012-05-11
@alternativshik

lxml for sure.

pawnhearts, 2012-05-12
@pawnhearts

Actually, BeautifulSoup was designed specifically for broken HTML; I don't know why people advise against it.
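For example, BeautifulSoup swallows unclosed tags without complaint even with the stdlib html.parser backend (a sketch, assuming beautifulsoup4 is installed; it can also use lxml or html5lib as backends):

```python
# BeautifulSoup recovers a usable tree from markup that never
# closes its tags.
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>first<p>second <b>bold", "html.parser")
paragraphs = soup.find_all("p")  # both <p> elements are recovered
```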

kmike, 2012-05-21
@kmike

@ur001, I looked at the lxml code again; there are indeed a lot of regexps scattered across the modules ( github.com/lxml/lxml/blob/master/src/lxml/html/clean.py#L62 ), but I was completely wrong: HTML parsing in lxml is based mostly on XML parsing.
I.e. lxml treats HTML as just invalid XML that can be fixed. From a theoretical point of view that assumption is incorrect (parsing HTML5 and parsing XML require completely different parsers), but in practice it often works.
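A sketch of that "invalid XML that can be fixed" behaviour (assuming lxml is installed): libxml2's recovering parser closes dangling tags and inserts the implied document structure.

```python
# lxml.html repairs broken markup on the fly rather than rejecting it.
import lxml.html

root = lxml.html.document_fromstring("<p>one<p>two <b>bold")
print(lxml.html.tostring(root))  # serialized, repaired tree
```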
