How do I correctly define the task for developing an application, or which ready-made application should I use?

data mining

YaTe, 2018-11-04 21:14:25

The task is:
Parse documents that are relatively unstructured and transform the data in them into a structured form for loading into a database. The list of fields in the database is finite and known in advance.
Nuances:
1) Documents can be in different formats, such as Excel or PDF, and sometimes the source is just a web page on a site
2) As a result, data from different sources is presented in different ways. In Excel the data is more structured (almost everything is in a table), while in PDF part of the data can be given as free text and apply to all the items listed
3) The task is repetitive, that is, over time there will be several documents in each format
4) Some documents may not contain enough data to fill the database record completely, so ideally it should be possible to supply an additional document
To make it clearer, let's take an example: the specifications of some hardware, say hard drives. We want to create a database of drive specs. There is a certain set of parameters that we want to store in our database.
Samsung has PDFs on its website with the specifications of its drives (as an example, the 983 series). Most of the necessary specifications are in a table. The document covers drives in two form factors at once, so, for example, the dimensions will be shared within each of the two families: 2.5" and M.2 drives. Some information, for example the interface or MTBF, will be common to all drives. In these cases the cells will be merged. And some of the information will appear only in the text, not in the table, and it must be extracted from there as well.
Toshiba's specifications, on the other hand, are presented directly on the site. But the information there is structured differently, and its volume is different too. And a hypothetical manufacturer XYZ will have Excel files on its site, with the information structured differently again.
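The merged-cell case from the Samsung example can be sketched roughly as follows. This is only a minimal illustration, assuming the table has already been extracted into rows of equal length in which a vertically merged cell appears once and then as `None` in the rows it spans. The function name `fill_merged_cells` and all values in the sample are made up for illustration, they are not real specifications.

```python
from typing import Optional


def fill_merged_cells(rows: list[list[Optional[str]]]) -> list[list[Optional[str]]]:
    """Copy each vertically merged value down into the rows it spans.

    Assumption: a merged cell is represented by its value in the first row
    and by None in every following row that it covers.
    """
    filled: list[list[Optional[str]]] = []
    previous: Optional[list[Optional[str]]] = None
    for row in rows:
        if previous is None:
            current = list(row)
        else:
            # Take the value from the row above wherever this cell is empty.
            current = [prev if cell is None else cell
                       for cell, prev in zip(row, previous)]
        filled.append(current)
        previous = current
    return filled


# Hypothetical extracted table: model, form factor, interface, MTBF.
rows = [
    ["drive-1", "2.5 inch", "interface-x", "mtbf-x"],
    ["drive-2", None, None, None],        # everything merged with the row above
    ["drive-3", "M.2", None, None],       # new form factor, shared interface/MTBF
]

for row in fill_merged_cells(rows):
    print(row)
```

A genuinely empty cell is indistinguishable from a merged one in this representation, so a real extractor would need the merge ranges from the source file rather than guessing from `None`.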
What discipline deals with this problem of filling a database with data? Reading the descriptions of disciplines (processes) such as Data Mining, Data Wrangling and others does not help me understand which direction to look in. On the one hand, there is no need to predict anything or look for insights, which is what Data Mining is supposed to be about; on the other hand, the information is too poorly structured for Data Wrangling. Most likely I need some kind of tool that uses machine learning / neural networks (to improve the quality of extraction, especially from non-tabular blocks), but it's not clear how to search Google for the right tool or how to define the task for developers. The ideal answer would name a tool that solves this problem (if one exists), and if not, say what kind of developer profile to look for in order to build such an application.
Thank you.

3 answer(s)

sim3x, 2018-11-04
@sim3x

In the general case, this is solved by hiring staff who will write parsers for each manufacturer,
or by buying an API or the entire data set.
Parsing PDF / Excel is a task that is barely feasible,
especially since everything is already on the web:
https://www.samsung.com/ semiconductor/minisite/ssd...
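
A rough sketch of what "a parser for each manufacturer" can look like: every parser maps its own source layout onto the same fixed field list, and records from an additional document fill in whatever is still missing (nuance 4 in the question). Everything here is an assumption made for illustration: the field list `FIELDS`, the vendor names, the two toy input layouts and the helper names are hypothetical, not anyone's real format.

```python
from typing import Callable, Optional

# Hypothetical target schema: the finite list of database fields known in advance.
FIELDS = ("model", "form_factor", "interface", "mtbf")

Record = dict[str, Optional[str]]
PARSERS: dict[str, Callable[[str], Record]] = {}


def parser(vendor: str):
    """Register a parsing function under a vendor name."""
    def register(func: Callable[[str], Record]) -> Callable[[str], Record]:
        PARSERS[vendor] = func
        return func
    return register


def empty_record() -> Record:
    return {field: None for field in FIELDS}


@parser("vendor_a")
def parse_vendor_a(text: str) -> Record:
    # Assumed layout: one "Key: value" pair per line.
    record = empty_record()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip().lower().replace(" ", "_")
        if sep and key in record:
            record[key] = value.strip()
    return record


@parser("vendor_b")
def parse_vendor_b(text: str) -> Record:
    # Assumed layout: "key=value;key=value" with vendor-specific field names.
    aliases = {"name": "model", "ff": "form_factor", "if": "interface"}
    record = empty_record()
    for pair in text.split(";"):
        key, sep, value = pair.partition("=")
        field = aliases.get(key.strip().lower())
        if sep and field:
            record[field] = value.strip()
    return record


def parse(vendor: str, text: str) -> Record:
    return PARSERS[vendor](text)


def merge(base: Record, extra: Record) -> Record:
    """Fill fields that are missing in base from an additional document."""
    return {f: base[f] if base[f] is not None else extra[f] for f in FIELDS}


main_doc = parse("vendor_b", "name=drive-1; ff=M.2")
extra_doc = parse("vendor_a", "Model: other\nInterface: iface-x\nMTBF: mtbf-x")
print(merge(main_doc, extra_doc))
```

Whatever does the extraction underneath, keeping the schema and the merge step separate from the vendor-specific code means a changed document layout only touches one parser.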

Dimonchik, 2018-11-04
@dimonchik2013

Tomita parser
but in general there is no simple solution; as sim3x said, a set of parsers is usually written

YaTe, 2018-11-11
@YaTe

If I understand correctly, parsers are a solution from a "past life", that is, they could have been written 10 years ago. The whole point is to avoid writing rigidly algorithmic parsers by using new technologies (ML / neural networks / ...), especially since the data can change from document to document (composition, format) even for a single "vendor".
Any other ideas?