How to parse very complex text?
How can I parse very complex, irregularly structured text? For example, a delivery service's website lists the areas it serves:
Moscow city: st. Tallinn; Tvardovsky street, Turkmensky prospect (houses from 1 to 31). Serebryanobor forestry
If you want to parse something, it means there is a great deal of that "something", especially if you want to use ML. In that case you also need a training set with a large number of labeled input examples.
For example, on the website of the delivery service there are areas where they work.
It is easier and faster to hire 10 schoolchildren who will copy the text and split it into streets and house numbers by hand.
You can use a service like dadata. I noticed in passing in one project that it can parse such addresses into components; you then just pull the fields you need out of the returned object. You only need to check whether the pricing suits you; maybe they have a free request limit. If cost is critical, look for alternatives.
This is called NLP (natural language processing) or NLU (natural language understanding).
I am solving a similar problem right now, and I see the following approach (if there is a guru here, please criticize):
Moscow city: st. Tallinn; Tvardovsky street, Turkmensky prospect (houses from 1 to 31). Serebryanobor forestry
1) pre-processing the text with filters
city [name] [.] street [name] [.] street [name] [.] [name] avenue houses from 1 to 31 [.] [name] forestry
2) extracting values
city [name] [.] street [name] [.] street [name] [.] [name] avenue house [range] [.] [name] forestry
3) normalization
city [name] [.] street [name] [.] street [name] [.] [name] avenue building [range] [.] [name] forestry
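The three steps above can be sketched roughly like this in Python. This is only an illustration for the single example string, not a general solution: the abbreviation table, the function names (`preprocess`, `extract_ranges`, `normalize`, `parse`) and the choice of " . " as the unified delimiter are all my own assumptions.

```python
import re

EXAMPLE = ("Moscow city: st. Tallinn; Tvardovsky street, "
           "Turkmensky prospect (houses from 1 to 31). Serebryanobor forestry")

# Hypothetical abbreviation table; a real one would be much longer.
ABBREVIATIONS = [(r"\bst\.", "street"), (r"\bprospect\b", "avenue")]


def preprocess(text: str) -> str:
    """Step 1: lowercase, expand abbreviations, unify delimiters to ' . '."""
    text = text.lower()
    for pattern, full in ABBREVIATIONS:
        text = re.sub(pattern, full, text)
    # Drop parentheses so the house range stays attached to its street.
    text = re.sub(r"[()]", " ", text)
    # Treat ':', ';', ',' and '.' as the same delimiter.
    text = re.sub(r"[:;,.]+", " . ", text)
    return re.sub(r"\s+", " ", text).strip(" .")


def extract_ranges(text: str) -> str:
    """Step 2: turn 'houses from 1 to 31' into the compact value 'houses 1-31'."""
    return re.sub(r"\bhouses? from (\d+) to (\d+)", r"houses \1-\2", text)


def normalize(text: str) -> list[str]:
    """Step 3: use one canonical word for houses and split into segments."""
    text = re.sub(r"\bhouses?\b", "building", text)
    return [seg.strip() for seg in text.split(" . ") if seg.strip()]


def parse(text: str) -> list[str]:
    return normalize(extract_ranges(preprocess(text)))


if __name__ == "__main__":
    for segment in parse(EXAMPLE):
        print(segment)
```

Abbreviations are expanded before the delimiters are unified, because "st." itself contains a dot that would otherwise be read as a separator.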
If your subject area is fixed, you can pick out some marker words; in your case these are the types of geo-objects:
[geoobject] [name] [.] [geoobject] [name] [.] [geoobject] [name] [.] [name] [geoobject] [geoobject] [range] [.] [name] [geoobject]
The last point can be solved either algorithmically or with a neural network; it depends on how regular the text itself is. Since your specific example has delimiters, everything is solved trivially.
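A minimal sketch of the marker-word idea, assuming the text has already been cleaned so that "." is the only delimiter. The `GEO_TYPES` set, the sample string and the record layout are hypothetical and chosen only for this example.

```python
import re

# Hypothetical marker words for a fixed subject area (types of geo-objects).
GEO_TYPES = {"city", "street", "avenue", "forestry", "building"}
TOKEN_RE = re.compile(r"\d+-\d+|\w+|\.")

SAMPLE = ("city moscow . street tallinn . street tvardovsky . "
          "turkmensky avenue building 1-31 . serebryanobor forestry")


def tag_tokens(text):
    """Label every token as [geoobject], [name], [range] or [.]."""
    tagged = []
    for token in TOKEN_RE.findall(text.lower()):
        if token == ".":
            tagged.append(("[.]", token))
        elif token in GEO_TYPES:
            tagged.append(("[geoobject]", token))
        elif re.fullmatch(r"\d+-\d+", token):
            tagged.append(("[range]", token))
        else:
            tagged.append(("[name]", token))
    return tagged


def to_records(tagged):
    """Group tagged tokens into one record per delimiter-separated segment."""
    records = []
    current = {"type": None, "name": [], "range": None}
    # A trailing delimiter is appended so the last segment is flushed too.
    for tag, token in tagged + [("[.]", ".")]:
        if tag == "[.]":
            if current["type"] or current["name"]:
                records.append({"type": current["type"],
                                "name": " ".join(current["name"]),
                                "range": current["range"]})
            current = {"type": None, "name": [], "range": None}
        elif tag == "[geoobject]":
            # Keep only the first type word in a segment (e.g. "avenue").
            if current["type"] is None:
                current["type"] = token
        elif tag == "[range]":
            low, high = token.split("-")
            current["range"] = (int(low), int(high))
        else:
            current["name"].append(token)
    return records


if __name__ == "__main__":
    tagged = tag_tokens(SAMPLE)
    print(" ".join(tag for tag, _ in tagged))
    for record in to_records(tagged):
        print(record)
```

Because the delimiters do the splitting, no learning is needed for this input; a model would only become interesting when the segments stop being this regular.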