How to parse a large XML file (>500 MB) containing errors in Python?
Conditions
1) There is a large XML file, so loading it entirely into memory is not an option.
2) The file contains errors: for example, tag content may include unescaped HTML tags that are never closed.
lxml and SAX can parse the file incrementally, but they fail on unclosed tags inside the elements we need.
BeautifulSoup does not fail on unclosed tags, but it appears to load the entire file into memory at once.
Are there ready-made solutions suitable for these conditions?
Does lxml not work even in its "parse at any cost" mode?
from lxml import etree
parser = etree.XMLParser(recover=True, huge_tree=True)
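For streaming, a minimal sketch: lxml's iterparse accepts the same recover and huge_tree options, so incremental reading can be combined with lenient parsing. The tag name "item" and the child "title" below are assumptions; substitute the elements you actually need.

from lxml import etree

# Stream the file and handle one element at a time; recover=True lets
# the parser skip over malformed markup instead of raising an error.
context = etree.iterparse(
    "file.xml",
    events=("end",),
    tag="item",      # hypothetical element of interest
    recover=True,
    huge_tree=True,  # lift lxml's limits for very large documents
)
for _, elem in context:
    print(elem.findtext("title"))  # hypothetical child element
    elem.clear()                   # free memory for processed elements
    while elem.getprevious() is not None:
        del elem.getparent()[0]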
You can first treat the XML as a plain text file, put the tag structure in order with string-manipulation functions, and only then open it as XML.
For example, you can replace all HTML tags with "non-tags" (escaped entities):
sed -e 's/<p/\&lt;p/g' -e 's/<\/p/\&lt;\/p/g' file.xml > new.xml
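A rough Python equivalent of the same preprocessing, streaming line by line so the file never has to fit in memory (the set of stray HTML tags is hypothetical; adjust it to what the inventory below reports):

import re

HTML_TAGS = ("p", "br", "b", "i")  # hypothetical set of stray HTML tags
pattern = re.compile(r"</?(%s)\b" % "|".join(HTML_TAGS))

with open("file.xml", encoding="utf-8") as src, open("new.xml", "w", encoding="utf-8") as dst:
    for line in src:
        # Escape the leading "<" so the XML parser sees text, not markup
        dst.write(pattern.sub(lambda m: "&lt;" + m.group(0)[1:], line))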
To see which tags occur in the file and how often, you can take an inventory first:
grep -o '<[^>]*>' file.xml | cut -f 1 -d ' ' | sort | uniq -c
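If shell tools are unavailable, roughly the same inventory can be done in Python, again without loading the whole file (a sketch; the regex approximates the grep above):

import re
from collections import Counter

counts = Counter()
tag_re = re.compile(r"<(/?[A-Za-z][\w:.-]*)")
with open("file.xml", encoding="utf-8") as f:
    for line in f:
        counts.update(m.group(1) for m in tag_re.finditer(line))

for tag, n in counts.most_common():
    print(n, tag)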