Beautiful Soup, html5lib or lxml?
The parser will be used on user-generated content, so the main requirement is correct handling of broken HTML. Speed is not critical. The lxml documentation has this:
Support for minidom, ElementTree (including cElementTree and lxml.etree), BeautifulSoup (deprecated) and custom simpletree output formats
html5lib has the most correct and reliable parser (it follows the spec), but it is slow. lxml is the fastest and parses quite well. With lxml you can use iterparse instead of SAX; it is often more convenient and often faster.
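As a rough illustration of the iterparse suggestion, here is a minimal sketch that pulls link targets out of broken markup while it is being parsed. It assumes lxml is installed; the helper name `extract_hrefs` and the sample markup are made up for this example.

```python
import io

from lxml import etree


def extract_hrefs(stream):
    """Collect href values from <a> elements as the parser reaches them."""
    hrefs = []
    # html=True makes iterparse use lxml's forgiving HTML parser;
    # tag="a" limits the reported events to <a> elements.
    for _event, elem in etree.iterparse(stream, events=("end",), tag="a", html=True):
        href = elem.get("href")
        if href is not None:
            hrefs.append(href)
        elem.clear()  # drop the element's content once it has been handled
    return hrefs


# Hypothetical user-generated snippet: the <li> elements are never closed.
broken = b'<ul><li><a href="/a">first</a><li><a href="/b">second</a></ul>'
print(extract_hrefs(io.BytesIO(broken)))
```

Unlike SAX, there are no handler classes to write: the loop body sees ordinary element objects, which is probably what makes it more convenient in practice.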
In fact, BeautifulSoup was developed specifically for broken HTML; I don't know why people are steered away from it.
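For what it is worth, BeautifulSoup 4 can sit on top of several parsers, so it does not have to be an either/or choice. The sketch below assumes the `bs4` package is installed; `visible_text` is a hypothetical helper, and the `lxml` and `html5lib` backends are only tried if they happen to be available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def visible_text(markup, parser="html.parser"):
    """Parse possibly broken markup and return its text content."""
    return BeautifulSoup(markup, parser).get_text()


# Hypothetical user input: neither <p> nor <b> is ever closed.
broken = "<p>Hello <b>world"

for parser in ("html.parser", "lxml", "html5lib"):
    try:
        print(parser, repr(visible_text(broken, parser)))
    except FeatureNotFound:
        # Raised by bs4 when the requested backend is not installed.
        print(parser, "is not installed")
```

`html.parser` ships with Python, so the first iteration runs without extra dependencies; the other two backends trade speed against spec conformance as described in the answer above.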
@ur001, I looked at the lxml code again. There are a lot of regexps scattered across the modules (github.com/lxml/lxml/blob/master/src/lxml/html/clean.py#L62), but I was completely wrong: HTML parsing in lxml is based mostly on XML parsing.
That is, lxml treats HTML as just invalid XML that can be fixed. From a theoretical point of view the assumption is incorrect, since HTML5 and XML need completely different parsers, but in practice it often works.
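A small sketch of the "in practice it often works" point: on a simple piece of broken markup, lxml and html5lib recover the same paragraphs. It assumes both libraries are installed; the two helper functions are made up for this comparison, and harder inputs may well come out differently in the two parsers.

```python
import html5lib
from lxml import html as lxml_html


def paragraphs_lxml(markup):
    """Texts of all <p> elements as recovered by lxml's HTML parser."""
    root = lxml_html.document_fromstring(markup)
    return [p.text for p in root.iter("p")]


def paragraphs_html5lib(markup):
    """Texts of all <p> elements as recovered by html5lib."""
    # namespaceHTMLElements=False keeps tag names plain ("p", not "{ns}p").
    root = html5lib.parse(markup, namespaceHTMLElements=False)
    return [p.text for p in root.iter("p")]


# Hypothetical input: the <p> elements are never closed.
broken = "<p>one<p>two"
print(paragraphs_lxml(broken))
print(paragraphs_html5lib(broken))
```

Running both parsers over a sample of real user content and comparing the results like this is one way to decide whether lxml's repairs are good enough for a given site.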