C
C
cbv2016-03-30 17:38:19
PHP
cbv, 2016-03-30 17:38:19

Who can criticize/improve the price list recognition algorithm?

There are providers with different price list formats (which columns are available, which ones are not, column names). Moreover, price list formats for one supplier are constantly changing, so you need to write a function that will determine what is contained in a given price list column.
The following columns are possible: article, code, barcode, name, balance, guarantee, wholesale price, small wholesale price, retail price. The value in some cell may be missing.
I propose the following algorithm:
1. Determine the columns in which there is data, since there are empty columns.
2. Make a query to the database - search for values ​​from the first 50 price lines in each field. Where the result will be close to 100% - there is the right field.
3. The columns with prices and balance will remain, to determine what is in them, use the following filters to determine where the price is and where the balance is:
1) quantity is not a number, but an approximate designation <5, ** and t .d.
2) the price, as a rule, contains the currency
3) if we have, for example, 4 columns, then you can find the values ​​in which columns are connected and in which they are not
4) if there are 2 columns - the quantity is less.
How to improve it?

Answer the question

In order to leave comments, you need to log in

6 answer(s)
S
sim3x, 2016-03-30
@sim3x

Too many side effects will have to be processed. Automation will be
too expensive.
If you have a service for parsing price lists, then it's worth it
. It's easier for you to introduce the concept of preset, those linking the column number to the data type.
You show the user 20 random lines from the file
He selects a preset
Then you process
If you put everything in a database transaction with a rollback, then you won’t even spoil the data If
the volume of such work will increase, then you can start collecting statistics
bd
And here it will already be possible to think about fashionable technologies

T
ThunderCat, 2016-03-30
@ThunderCat

If it's Excel, it's easier to spend 3 minutes and rearrange the columns manually. Select the entire column and drag with the mouse - 5 sec. Yes, this is a monkey job, but it's easier than cleaning up the database from erroneous prsing later. On the edge, load columns in the admin panel (say, the first 10 values ​​from the price list) and arrange what is what with selects.
For an automaton (if you have already decided to write), it will be easier to use regular expressions than a comparison with the base. For example, the article is always a certain length? Or a barcode? Does the price have 2 decimal places? Does the title start with letters normally, and only contain letters, numbers, and whitespace characters? In short, determine the uniqueness of the field and use it in the regular expression.

P
Pavel O, 2016-03-30
@gans5131

What if you don’t get attached to the order of the columns, but display the price list as it is in Excel. Accordingly, we write to the database not each column / cell separately, but the entire table in the form of a serialized array.
I can even customize the plugin for WP

D
dmitriy, 2016-03-30
@dmitriylanets

I see the option of processing the price header and determining the location of the columns compared to the current markup configuration, that is, if the config indicates that the 2nd column is the article, then the new price from the supplier you check the 2nd column and specifically the header for the options article, code, article, code, etc. if it does not match, then display a message for manually reconfiguring the markup.
If there is no header, then only the analysis of the first 100-200 lines for the expected data using regular expressions, but here somehow everything is not stable, but I don’t see another option.

I
InExSu, 2020-06-27
@InExSu

Hey!
Already done

O
Oleh_Oleinikov, 2021-09-22
@Oleh_Oleinikov

If it's still relevant. I load it into pandas, then I pass it to the "column recognizer" (looks for headings in the first n lines (by the number of keyword matches) and renames them to the names accepted in the program (to be able to merge several documents from different suppliers), reports columns that are not in the dictionary (ConfigParser), if it is critical - I supplement the config with new signatures).
Depending on the set of columns, processing methods are applied to each (validation, conversion to a single format).
The result of processing, errors, warnings, logs are accumulated in a dataclass instance. For merging, analysis, sampling, plotting - I load everything into a singleton, and then - pythondocx, matplotlib, pyqt, export by filters to Excel ...
I wanted to try to use machine learning to recognize columns, but I didn’t find anything similar on the Internet and have no experience (((. I assumed that weights can be set by indirect signs (unique values ​​in a column, number of characters, types, dictionary match, relative position etc.) If anyone has any ideas or directions to google, I'd be grateful.

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question