Artificial intelligence
for7raid, 2014-05-31 10:50:51

AI algorithm for processing text and extracting columns of data

I have structured text laid out as a table, like this:

Position          Code   C1   C2     C3
Cats              1000   20   30     45
Dogs              2000   13   49    -40
Parrots           3000   45         -90
Pigs              4000               10
Hamsters          5000        67

I need to extract the data from the Code, C1, C2 and C3 columns.
The problem is that the gap between columns varies from 5 to 40 spaces from one table to another, and, as the example shows, some cells may contain no data at all. Text within a column may be left-aligned, right-aligned or centered.
Under these conditions regular expressions do not always give the expected result, and values can shift from one column into another.
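
To make the shifting concrete, here is a minimal Python sketch that splits the parrots row (label written in English) on runs of whitespace. The `headers` list and the index-based mapping are illustrative assumptions, not the exact expressions tried:

```python
import re

row = "Parrots           3000   45         -90"

# Naive approach: treat any run of whitespace as a column separator.
fields = re.split(r"\s+", row.strip())
print(fields)  # ['Parrots', '3000', '45', '-90']

# Mapping fields to headers by index loses the empty C2 cell:
# -90 lands under C2 and C3 is never filled.
headers = ["Position", "Code", "C1", "C2", "C3"]
naive = dict(zip(headers, fields))
print(naive)
```

The empty cell leaves no trace after the split, so every value to its right moves one column to the left.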
My idea is to train some algorithm to split the text into columns by drawing an imaginary boundary between them, as a person would, and so obtain four arrays from which the required value can be taken by row index.
I am not strong in AI and don't even know where to start digging, so please suggest a direction to study: which algorithm to choose, how best to implement it, how to train it, and so on.

3 answer(s)
Sergey Protko @Fesor, 2014-05-31

AI has nothing to do with this. What you need is an algorithm, an ordinary, simple-minded one.
I would try to solve the problem like this:
- The first column is always present; the data follows it.
- For each line, record the positions of all the values in it. In the parrots line, for example, one value sits far away from the previous one, which means a value is missing before it. In the same way, the distances between values let you infer which column each value belongs to.
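
A minimal Python sketch of this position-based idea. It assumes the header line is present and uses the horizontal centers of the header labels as column anchors; that choice, the function names and the English labels in `TABLE` are all assumptions made for illustration. It also assumes the first column contains no spaces.

```python
import re

TABLE = """\
Position          Code   C1   C2     C3
Cats              1000   20   30     45
Dogs              2000   13   49    -40
Parrots           3000   45         -90
Pigs              4000               10
Hamsters          5000        67
"""

def token_spans(line):
    """Return (text, start, end) for every run of non-space characters."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", line)]

def parse_by_nearest_header(text):
    lines = [line.expandtabs() for line in text.splitlines() if line.strip()]
    header = token_spans(lines[0])
    names = [name for name, _, _ in header]
    centers = [(start + end) / 2 for _, start, end in header]
    rows = []
    for line in lines[1:]:
        row = dict.fromkeys(names)  # cells with no value stay None
        for value, start, end in token_spans(line):
            mid = (start + end) / 2
            # Assign the value to the column whose header center is closest.
            nearest = min(range(len(centers)), key=lambda i: abs(centers[i] - mid))
            row[names[nearest]] = value
        rows.append(row)
    return rows

for row in parse_by_nearest_header(TABLE):
    print(row)
```

Comparing centers rather than start positions keeps the assignment stable whether a value is left-aligned, right-aligned or centered, as long as it stays closer to its own header than to a neighbouring one.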

Andrew, 2014-05-31

AI has nothing to do with this.
If the columns never drift sideways so far that a value ends up under the header of the neighbouring column, you only need to find the character positions that contain a space in every line of the file, from the very top to the very bottom ("whitespace columns"). Then merge adjacent whitespace columns into gaps, split each line at those gaps, and look in each piece for either a number or nothing (the value is missing). This algorithm is deterministic and has no parameters: there is nothing to tune.
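
A minimal Python sketch of the whitespace-column approach, with the sample table's labels written in English. The function names are made up for illustration, and tabs are expanded with the default tab stop of 8, which is an assumption about how the source text was laid out.

```python
TABLE = """\
Position          Code   C1   C2     C3
Cats              1000   20   30     45
Dogs              2000   13   49    -40
Parrots           3000   45         -90
Pigs              4000               10
Hamsters          5000        67
"""

def gap_positions(lines):
    """Character positions that are blank (or past the end) in every line."""
    width = max(map(len, lines))
    return {i for i in range(width)
            if all(i >= len(line) or line[i] == " " for line in lines)}

def field_ranges(lines):
    """Merge adjacent blank positions and return the (start, end) ranges between them."""
    width = max(map(len, lines))
    gaps = gap_positions(lines)
    ranges, start = [], None
    for i in range(width + 1):  # i == width acts as a closing sentinel
        in_field = i < width and i not in gaps
        if in_field and start is None:
            start = i
        elif not in_field and start is not None:
            ranges.append((start, i))
            start = None
    return ranges

def split_table(text):
    lines = [line.expandtabs() for line in text.splitlines() if line.strip()]
    ranges = field_ranges(lines)
    # An all-blank slice means the cell is empty.
    return [[line[a:b].strip() or None for a, b in ranges] for line in lines]

for row in split_table(TABLE):
    print(row)
```

Because the header line is included in the scan, a column that is empty in most rows still produces its own range.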
If that condition does not hold and the columns drift a lot, you can run the same algorithm locally rather than over the whole file, for example on 3 to 5 neighbouring lines. This matches a human reader's intuition that within 5 lines a column cannot move into its neighbour's place. The local version may need some parameter tuning (the number of consecutive lines considered, the maximum sideways shift, and so on).
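
A rough Python sketch of the local variant. The `DRIFTING` table is invented for illustration: each row is shifted one character to the right of the previous one, so over the whole file the columns overlap. The `radius` parameter (lines taken above and below the current one) is also an assumption. The sketch matches columns by order, which only works if every column has at least one value inside each window.

```python
DRIFTING = """\
A   10   20   30
B    11   21   31
C     12   22   32
D      13        33
E       14   24   34
F        15   25   35
"""

def text_ranges(lines):
    """Ranges of positions where at least one of the lines has a non-space."""
    width = max(len(line) for line in lines)
    used = [any(i < len(l) and l[i] != " " for l in lines) for i in range(width)]
    ranges, start = [], None
    for i, flag in enumerate(used + [False]):  # sentinel closes the last range
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            ranges.append((start, i))
            start = None
    return ranges

def split_locally(text, radius=1):
    lines = [l.expandtabs() for l in text.splitlines() if l.strip()]
    rows = []
    for n, line in enumerate(lines):
        # Only the neighbouring lines decide where this line's boundaries are.
        window = lines[max(0, n - radius): n + radius + 1]
        rows.append([line[a:b].strip() or None for a, b in text_ranges(window)])
    return rows

print(len(text_ranges(DRIFTING.splitlines())))  # global scan: columns merge
for row in split_locally(DRIFTING, radius=1):
    print(row)
```

On this input the global scan finds only two ranges, while a window of three lines still separates all four columns, including the empty cell in row D.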

Yuri Morozov, 2014-05-31

Use import.io and don't bother.
