How to learn to write parsers?

Parsing

beduin01, 2019-08-15 10:17:01

I have highly non-standard XML documents with an irregular structure.
There is a set of tags that I expect to find in them.
The target data can be nested to an arbitrary depth, and the tag names vary widely.
It is impossible to determine the structure of all documents in advance.
Which approach should I use? I have heard that something like a state machine is needed here, but are there other approaches? And how should it all be organized?

2 answers

tsarevfs, 2019-08-15
@beduin01

Any XML library can parse the document and give you a parse tree.
https://pep8.ru/doc/dive-into-python-3/14.html
Then you traverse the tree and check each node to see whether it is one you need.
https://ru.wikipedia.org/wiki/%D0%9E%D0%B1%D1%85%D...
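As a rough illustration of these two steps, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The sample document and its tag names are made up, and `walk` is a hypothetical helper, not something from the linked pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the tag names are invented for illustration.
SAMPLE = """
<root>
  <part>
    <segment>
      <item id="1">first</item>
    </segment>
  </part>
  <other>
    <item id="2">second</item>
  </other>
</root>
""".strip()


def walk(node, path=""):
    """Depth-first traversal yielding (path_from_root, element) pairs."""
    current = f"{path}/{node.tag}" if path else node.tag
    yield current, node
    for child in node:
        yield from walk(child, current)


root = ET.fromstring(SAMPLE)
paths = [path for path, _ in walk(root)]
for p in paths:
    print(p)
```

Carrying the path along during the traversal means the check for each node can later look at where the node sits, not only at the node itself.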
All the magic should be in the check function. You may be able to build a heuristic rule from several parameters, for example:
*path from the root (root/part/segment/item)
*tag name
*tag attribute values
*child tag names
*...
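One possible shape for such a check function is a score built from the signals listed above. This is only a sketch: the wanted tag names, the `type="product"` attribute, the weights, and the threshold are all assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical expectations; replace with the tags you actually look for.
WANTED_TAGS = {"item", "entry"}
WANTED_CHILDREN = {"name", "price"}


def score(path, node):
    """Combine several weak signals into one score (weights are arbitrary)."""
    points = 0
    if node.tag.lower() in WANTED_TAGS:          # tag name
        points += 2
    if "segment" in path.split("/"):             # path from the root
        points += 1
    if node.get("type") == "product":            # attribute value
        points += 1
    child_tags = {child.tag.lower() for child in node}
    points += len(child_tags & WANTED_CHILDREN)  # child tag names
    return points


def is_match(path, node, threshold=4):
    """The check function: accept a node once enough signals agree."""
    return score(path, node) >= threshold


def find_matches(node, path=""):
    """Walk the tree and yield (path, element) for every accepted node."""
    current = f"{path}/{node.tag}" if path else node.tag
    if is_match(current, node):
        yield current, node
    for child in node:
        yield from find_matches(child, current)


doc = ET.fromstring(
    "<root><part><segment>"
    "<Item type='product'><name>A</name><price>1</price></Item>"
    "<item><note>x</note></item>"
    "</segment></part></root>"
)
matches = [path for path, _ in find_matches(doc)]
print(matches)
```

Scoring instead of requiring an exact match tolerates documents where one signal is missing or a name is spelled differently, at the cost of having to tune the weights on real samples.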
If necessary, and if you know something about the data, you can speed the process up: instead of traversing the whole tree, discard the branches that clearly do not contain what you need.
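A minimal sketch of such pruning, assuming you already know that certain branches never hold target data. The tag names in `SKIP_TAGS` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names for branches assumed never to contain target data.
SKIP_TAGS = {"metadata", "signature"}


def walk_pruned(node, visited):
    """Depth-first walk that does not descend into SKIP_TAGS subtrees."""
    if node.tag in SKIP_TAGS:
        return  # discard the whole branch without visiting its children
    visited.append(node.tag)
    for child in node:
        walk_pruned(child, visited)


doc = ET.fromstring(
    "<root>"
    "<metadata><item>ignored</item></metadata>"
    "<body><item>kept</item></body>"
    "</root>"
)
visited = []
walk_pruned(doc, visited)
print(visited)
```

Note that `ET.fromstring` still parses the whole document; pruning of this kind only saves the traversal and the per-node checks.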
If there is a lot of data and the variability is very high (for example, when looking for ads on web pages), you can use machine learning. That is a separate, complex topic beyond the scope of this question.

Antonio Solo, 2019-08-15
@solotony

Beautiful Soup