How can I automate searching through a large amount of information?

Automation

alekseyizmaylov, 2018-06-25 09:43:09

Good day, everyone.
Dear specialists, I need your advice.
I have a LARGE amount of information: regulatory documents in the field of construction.
How can I process this pile of information programmatically, in any programming language, so that I can quickly and easily find what I need instead of re-reading tons of norms every time?
I want to do this myself. I have little programming experience, but I'm ready to gnaw at the granite of science; I just need to know which direction to gnaw in))
How should I do it, and in what language? Simplicity and efficiency are welcome)

3 answers

DDDsa, 2018-06-25
@alekseyizmaylov

1. Determine the types and structure of the documents. Parsing needs something to hold on to: either keywords (for example, the first number after the phrase "Height:", or the entire line after the phrase "Task:") or the position of paragraphs and characters (for example, the last paragraph is always the description, or a list whose items start with - or * is always the list of materials). If the structure is arbitrary, there are two options:
- either save the entire text as is (but then complex search cannot be implemented);
- or process each document manually (time-consuming if there are many documents).
2. Choose the language you like best and look for libraries in it that handle the formats you need (DOC, PDF), or for workarounds such as converting the documents to a format that is more convenient to work with.
3. Choose a database and a library for working with it from your language. Create a schema (tables) that fits the task.
4. Start parsing the documents according to the structure defined in step 1. First take one document and write a parser for it, then run that parser on another document to see what needs to be changed, which conditions to add, and so on. The result should be a data set, such as an array of objects, where each object is one parsed document.
5. Save the resulting array of objects to the database, adjusting the schema along the way, because step 3 probably did not account for everything.
6. Strictly speaking, you can stop here: searching can be done with queries, using the database's own tools. But if you are interested, or if other people will use the data, you can write an interface. That is the next big task: choosing an interface and implementing it.
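To make steps 1, 4 and 5 more concrete, here is a minimal sketch in Python using only the standard library (`re` and `sqlite3`). Everything specific in it is an assumption made for illustration: the sample texts, the file names, the millimetre unit, the `documents` table and the helper names `parse_document`, `save_documents` and `find_by_min_height` are all hypothetical. Real documents would first have to be converted from DOC or PDF to plain text, which is not shown here.

```python
import re
import sqlite3

# Hypothetical sample texts standing in for documents already converted to plain text.
DOCS = {
    "norm_001.txt": "Title: Sample document A\nHeight: 900 mm\nTask: Describe requirement A.\n",
    "norm_002.txt": "Title: Sample document B\nHeight: 2100 mm\nTask: Describe requirement B.\n",
}

# Step 1: keyword rules. The first number after "Height:" and the whole line after "Task:".
HEIGHT_RE = re.compile(r"Height:\s*(\d+)")
TASK_RE = re.compile(r"^Task:\s*(.+)$", re.MULTILINE)


def parse_document(name, text):
    """Step 4: turn one document into a dict; fields that are not found become None."""
    height = HEIGHT_RE.search(text)
    task = TASK_RE.search(text)
    return {
        "name": name,
        "height_mm": int(height.group(1)) if height else None,
        "task": task.group(1).strip() if task else None,
        "full_text": text,
    }


def save_documents(conn, records):
    """Steps 3 and 5: create the (assumed) schema and store the parsed records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "name TEXT PRIMARY KEY, height_mm INTEGER, task TEXT, full_text TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO documents VALUES (:name, :height_mm, :task, :full_text)",
        records,
    )
    conn.commit()


def find_by_min_height(conn, min_height):
    """Step 6: search with an ordinary database query."""
    rows = conn.execute(
        "SELECT name FROM documents WHERE height_mm >= ? ORDER BY name",
        (min_height,),
    )
    return [row[0] for row in rows]


records = [parse_document(name, text) for name, text in DOCS.items()]
conn = sqlite3.connect(":memory:")  # in-memory database, only for the demo
save_documents(conn, records)
print(find_by_min_height(conn, 1000))  # prints ['norm_002.txt']
```

In practice the regular expressions are the part that keeps changing: each new document type usually means another rule or another condition, which is exactly the iteration described in step 4.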

Stalker_RED, 2018-06-25
@Stalker_RED

If you want to build a system from scratch for self-development, and the process itself and the experience are what matter, go for it.
If you need the result rather than the process, google ready-made systems such as archivist, for example.

nrgian, 2019-05-11
@nrgian

Any general-purpose programming language will do.
As for tools specific to documents, perhaps a DBMS with full-text search (FTS), for example SphinxSearch.
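To show roughly what a full-text search engine does internally, here is a toy inverted index in plain Python. This is not the SphinxSearch API and is nowhere near a replacement for it (no ranking, no word forms, no persistence); the sample texts and the names `tokenize`, `build_index` and `search` are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents that contain ALL words of the query."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


# Hypothetical sample texts; real input would be the extracted document text.
docs = {
    "doc1": "Railing height for stairs and balconies",
    "doc2": "Fire resistance of load-bearing walls",
    "doc3": "Height of door openings in residential buildings",
}
index = build_index(docs)
print(search(index, "height"))         # prints ['doc1', 'doc3']
print(search(index, "Height stairs"))  # prints ['doc1']
```

A real FTS engine adds relevance ranking, morphology and on-disk indexes on top of this idea, which is why a ready-made one is worth using once the number of documents grows.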