Parsing
xebav10477, 2020-03-10 11:24:46

How to parse very complex text?

How to parse very complex, illogical text? For example, a delivery service's website lists the areas where it operates:

Moscow city: st. Tallinn; Tvardovsky street, Turkmensky prospect (houses from 1 to 31). Serebryanobor forestry


Here the streets are separated inconsistently; there is no definite delimiter. All sorts of notes like "houses from 1 to 31" are wedged in as well, sometimes in brackets, sometimes after a comma, as if they were a new street. Arbitrary noise can also be included: "on the banks of the Moscow River...", etc.

The task is to extract these streets, together with their notes, while ignoring the noise.

As I understand it, regular expressions are hardly feasible here? Should I then dig towards neural networks and ML? I tried NER (Named Entity Recognition) from Azure, but it seems you can't train it there: it only finds what the developers trained it on. Train my own model then? Or are there ready-made solutions?
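
(For illustration only: a minimal sketch of what training your own NER could look like, e.g. with spaCy. The labels, character offsets and the toy training loop are hypothetical, not a tested solution.)

import spacy
from spacy.training import Example

# Blank English pipeline with a fresh NER component (hypothetical labels).
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("STREET")
ner.add_label("HOUSE_RANGE")

# One hand-labeled example; a real model needs hundreds of these.
text = "Tvardovsky street, Turkmensky prospect (houses from 1 to 31)"
entities = [(0, 17, "STREET"), (19, 38, "STREET"), (40, 59, "HOUSE_RANGE")]
example = Example.from_dict(nlp.make_doc(text), {"entities": entities})

optimizer = nlp.initialize()
for _ in range(20):  # toy training loop
    nlp.update([example], sgd=optimizer, losses={})

print(nlp(text).ents)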


4 answers
Sergey Pankov, 2020-03-10
@xebav10477

If you want to parse something, that implies there is very, very much of this "something". Especially if you want to use ML: in that case you also need a training sample with many, many examples of the input data in labeled form.

For example, a delivery service's website lists the areas where it operates.

It's not clear from the question how much "dirty" data there is. Are there really THAT many addresses in this form on one single delivery-service site? Or are there a million such sites?
Parsers and ML only outperform manual processing at a large, very large scale.
From your question it sounds like you are asking how to build a parser that will parse ANY dirty data of any kind and type, in any quantity.
Detail your question and show more examples.
Strong AI does not exist yet (and when it does, it will be no less lazy than those it is designed to help).
For such a general question there are only general recommendations:
  • extract the data in text form;
  • look through it with your own eyes: if writing a parser takes longer than shoveling through the data manually, and replenishing the dataset may bring a completely new kind of garbage, then a parser is not needed; process it by hand;
  • set up dataset processing as a stepwise scheme, so that each step makes minimal, non-crippling changes to as much data as possible, and its output is passed on to the next step without losing the data of the previous step (a sketch of such a scheme follows this list);
  • review the changes made by each step; if you see data corruption, add intermediate steps;
  • make the text more structured through a series of simple pattern and regexp replacements: normalize the separators, escape or delete the contents of brackets, unify the quote characters, remove junk elements that are guaranteed to carry no information about the useful data, expand contractions and abbreviations, reduce synonyms to a single variant, and so on;
  • at some stage your input dataset must turn from monolithic text into a CSV stream with one unit of data per line;
  • then run the same stepwise cleaning and deduplication over that stream, split the records into separate fields, and extract new attributes.
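
A minimal sketch of such a stepwise scheme in Python; the patterns are invented for the single example above, and a real pipeline would accumulate many more steps:

import re

# Each step is one small, reviewable transformation: (description, pattern, replacement).
STEPS = [
    ("normalize ; and , separators", re.compile(r"\s*[;,]\s*"), "; "),
    ("expand the 'st.' abbreviation", re.compile(r"\bst\.\s+"), "street "),
    ("split bracketed notes into a separate field", re.compile(r"\s*\((.*?)\)"), r" | \1"),
]

def run_pipeline(text):
    """Apply the steps in order, keeping every intermediate state for diffing."""
    history = [("input", text)]
    for description, pattern, replacement in STEPS:
        text = pattern.sub(replacement, text)
        history.append((description, text))
    return history

raw = ("Moscow city: st. Tallinn; Tvardovsky street, "
       "Turkmensky prospect (houses from 1 to 31). Serebryanobor forestry")
for step, state in run_pipeline(raw):
    print(f"{step}:\n  {state}")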

Build the pipeline so that the changes made by each step can be analyzed as a diff. Make a separate analytical "view" that shows only the data affected by the step's edits; you will immediately spot mistakes and crippling changes.
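
The per-step diff can be as simple as difflib over the intermediate states kept above (assuming one record per line):

import difflib

before = ["Moscow city: st. Tallinn", "Tvardovsky street"]
after = ["Moscow city: street Tallinn", "Tvardovsky street"]

# Show only what the step actually changed.
for line in difflib.unified_diff(before, after, "before-step", "after-step", lineterm=""):
    print(line)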
Collect edit statistics for each step and react to extreme cases.
It is important that the entropy of the dataset decreases at each stage. Any loss of data is harmful: noticing only at the end of processing that, say, "r." denotes not only a city but also, for example, "civilian" in a large number of toponyms is fatal; you will get confused. Put the removed data into separate fields and keep the ability to analyze it at later stages.
Keep the parser code in a version control system and make frequent commits with clear descriptions.
And may Knuth and Stroustrup keep you.

alternativshik, 2020-03-10
@alternativshik

It is easier and faster to hire 10 schoolchildren who will copy the text and split it into the required street and house entries by hand.

Pavel Shvedov, 2020-03-10
@mmmaaak

You can use services like dadata. Out of the corner of my eye I saw in one project that they can parse such addresses into components; then you just pull the fields you need out of the resulting object. You only need to check whether it suits you financially; maybe they have a free request limit. If cost is critical, look for their analogues.
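
For example, with the official dadata Python client it could look roughly like this; the keys are placeholders, and the response field names should be checked against the current docs:

from dadata import Dadata

token = "YOUR_API_KEY"      # placeholder
secret = "YOUR_SECRET_KEY"  # placeholder, required for the "clean" API

dadata = Dadata(token, secret)
# The "clean" endpoint splits a free-form address into components.
result = dadata.clean(name="address", source="Moscow, Tvardovsky street, 31")
print(result.get("city"), result.get("street"), result.get("house"))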

Antonio Solo, 2020-03-11
@solotony

it is called NLP or NLU
I am now solving a similar problem and see the following approach (if there is a guru here, criticize):
Moscow city: st. Tallinn; Tvardovsky street, Turkmensky prospect (houses from 1 to 31). Serebryanobor forestry
1) text pre-processing by filters:
city [name] [.] street [name] [.] street [name] [.] [name] avenue [.] houses from 1 to 31 [.] [name] forestry
2) pulling out the values:
city [name] [.] street [name] [.] street [name] [.] [name] avenue house [range] [.] [name] forestry
3) normalization:
city [name] [.] street [name] [.] street [name] [.] [name] avenue building [range] [.] [name] forestry
If your subject area is fixed, you can select marker words; in your case these are the types of geo-objects:
[geoobject] [name] [.] [geoobject] [name] [.] [geoobject] [name] [.] [name] [geoobject] [geoobject] [range] [.] [name] [geoobject]
The last step can be solved either algorithmically or with the help of a neural network; it depends on how regular the text itself is. Since your specific example has delimiters, everything is solved trivially.
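
A crude sketch of that marker-word tagging, applied to the one example above (the marker list and patterns are just a starting point, not a general solution):

import re

# Types of geo-objects act as marker words (extend for your subject area).
GEO_MARKERS = re.compile(r"\b(city|street|st\.|avenue|prospect|forestry)", re.IGNORECASE)

def tag_tokens(text):
    # Pull out house ranges first, so the digits survive as one [range] token.
    text = re.sub(r"houses?\s+from\s+(\d+)\s+to\s+(\d+)", r"[range \1-\2]", text)
    # Replace marker words with a [geoobject] tag.
    text = GEO_MARKERS.sub("[geoobject]", text)
    # Tag the remaining capitalized words as [name] candidates.
    text = re.sub(r"\b([A-Z][a-z]+)\b", r"[name \1]", text)
    return text

print(tag_tokens("Moscow city: st. Tallinn; Tvardovsky street, "
                 "Turkmensky prospect (houses from 1 to 31). Serebryanobor forestry"))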
