Extracting information from a large number of documents. How?

Timur Tuz, 2016-12-04 00:34:46

data mining

Hello! Here is the task: there are several thousand text documents of the same type that share common logical blocks (not to be confused with the document schema). I need to extract knowledge from these documents and reduce it to numbers. Simple tools like regular expressions are not enough; something more advanced is needed. I have never worked in this area and can't figure out which algorithms and tools could solve such a problem. I have realized that this is text mining, but beyond that it is not clear where to look.

4 answer(s)

al_gon, 2016-12-04
@al_gon

It is not entirely clear which numbers you want (or need) to convert the extracted information into.
In general, the task resembles the problems that NER (named-entity recognition) solves: https://en.wikipedia.org/wiki/Named-entity_recognition
Known tools:
https://en.wikipedia.org/wiki/OpenNLP
nlp.stanford.edu/software/CRF-NER.shtml
https://en.wikipedia.org/wiki/General_Architecture...
https://ru.wikipedia.org/wiki/UIMA
I would expect UIMA to be more than enough for you.

Dimonchik, 2016-12-04
@dimonchik2013

NLTK

xmoonlight, 2016-12-04
@xmoonlight

https://nlpub.ru/Mystem
Text processing with mystem in php

Alf162, 2016-12-08
@Alf162

It is worth looking at algorithms like word2vec (doc2vec, lda2vec, etc.). If you need something simpler, something like tf-idf will do. All of this is implemented in Python: sklearn, gensim.
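
To make the tf-idf suggestion concrete, here is a minimal sketch using only the Python standard library. It is not the sklearn or gensim implementation (sklearn's TfidfVectorizer, for example, smooths the idf and normalizes the vectors by default); it uses the plain textbook weighting tf * log(N / df). The function names and the sample documents are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(N / df), where df is the number of documents containing the term
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def top_terms(doc_weights, k=3):
    """Return the k highest-weighted terms of one document (ties broken alphabetically)."""
    ranked = sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical sample documents, for illustration only.
docs = [
    "The contract was signed in Moscow in 2015",
    "The contract was terminated in Kazan in 2016",
    "The invoice was paid in Moscow",
]
weights = tf_idf(docs)
```

Words that occur in every document (here "the", "was", "in") get a weight of zero, while words specific to one document rise to the top, which is a reasonable first step toward turning same-type documents into numbers. On several thousand real documents the library implementations mentioned above would likely be the more practical choice.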