Python
un1t, 2012-11-16 10:39:53

How to parse a large XML file (>500Mb) containing errors in Python?

Conditions:
1) The file is large, so loading it entirely into memory is not an option.
2) The file contains errors: for example, element text may contain unescaped HTML tags that are never closed.

lxml and sax can parse the file incrementally, but they choke on the unclosed tags inside the elements we need. BeautifulSoup does not choke on unclosed tags, but it appears to load the entire file into memory at once.
Are there ready-made solutions that fit these conditions?

3 answers
egorinsk, 2012-11-16

If the XML file contains errors, then it is no longer an XML file.

DmZ, 2012-11-16

Doesn't lxml work even in its "parse at any cost" mode?

from lxml import etree

parser = etree.XMLParser(recover=True, huge_tree=True)

In this mode, it will try to work around unclosed tags and invalid XML for as long as it can.
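
To keep memory bounded at the same time, recovery can be combined with incremental parsing. A minimal sketch, assuming a recent enough lxml whose iterparse() accepts the recover and huge_tree flags, and a hypothetical <item> element that holds the records you actually need:

from lxml import etree

def process(elem):
    # hypothetical handler for one record
    print(elem.findtext('title'))

for event, elem in etree.iterparse('file.xml', events=('end',), tag='item',
                                   recover=True, huge_tree=True):
    process(elem)
    elem.clear()                           # free the subtree just handled
    while elem.getprevious() is not None:
        del elem.getparent()[0]            # drop already-processed siblings

Clearing each element after it has been handled is what keeps the memory footprint flat even on a file of 500 MB and more.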

skomoroh, 2012-11-17

You can first treat the XML as a plain text file, fix up the tag structure with string-manipulation functions, and only then open it as XML. For example, you can replace all embedded HTML tags with "non-tags":

sed -e 's/<p/\&lt;p/g' -e 's/<\/p/\&lt;\/p/g' file.xml > new.xml
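
A rough Python equivalent of that preprocessing, assuming a hand-made list of HTML tag names to neutralize (the list here is hypothetical and deliberately incomplete):

import re

HTML_TAGS = ('p', 'br', 'b', 'i')  # hypothetical; extend with the googled list
pattern = re.compile(r'</?(?:%s)\b' % '|'.join(HTML_TAGS))

with open('file.xml', encoding='utf-8', errors='replace') as src, \
     open('new.xml', 'w', encoding='utf-8') as dst:
    for line in src:
        # turn '<p' / '</p' into '&lt;p' / '&lt;/p'
        dst.write(pattern.sub(lambda m: '&lt;' + m.group(0)[1:], line))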

A list of all HTML tags is easy to google.
You can also compare the numbers of opening and closing tags beforehand:
grep -o '<[^>]*>' file.xml | cut -f 1 -d ' ' | sort | uniq -c
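
The same tally in pure Python, a sketch that mirrors the grep pipeline (the file name and the lenient decoding are assumptions):

import re
from collections import Counter

counts = Counter()
with open('file.xml', encoding='utf-8', errors='replace') as f:
    for line in f:
        for tag in re.findall(r'<[^>]*>', line):
            counts[tag.split()[0]] += 1    # first field, like cut -f 1 -d ' '
for tag, n in counts.most_common():
    print(n, tag)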
