data mining
Kroid, 2014-12-21 17:37:52

What is the best way to store data for further processing?

I've started playing with the analysis of language data, and a question came up: how is data usually stored before processing? Say there are a hundred gigabytes of text files with tab-separated data, and I don't yet know exactly what I need from them. I could parse them and load them into PostgreSQL or MongoDB, then pull huge samples from there (with a cursor?) and work on those. Or I could leave them as they are and, as needed, parse them with something like Hadoop.
In general, can anyone share how the workflow usually goes in this area? Maybe there are good articles on the topic?
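For the "leave it as is and parse on demand" option, the key is to stream the tab-separated files rather than load them into memory. A minimal sketch (the function name `stream_tsv` is illustrative, not from any library):

```python
import csv

def stream_tsv(path):
    """Yield rows from a tab-separated file one at a time,
    so files far larger than RAM can be processed."""
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            yield row
```

The same generator can feed a bulk loader for PostgreSQL or MongoDB later, so nothing about the raw files has to be decided up front.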


1 answer
Dmitry, 2014-12-22
@Gabriel_vs

Storing data in BigData sometimes borders on art. In general, here as everywhere, everything depends on the task at hand. In any case, working with and analyzing textual data will inevitably lead you to the inverted index (in fact, to several of them).
In short, you need to index the contents of the "raw data" (files, web pages, databases, etc.). For now, index it as is, without changing the data itself. If this really is BigData, you need to think about a distributed index and decide whether (and if so, how) to replicate it (though that is already a performance question).
For analysis you will also need an index of the same structure, with the one difference that the data stored in it must be normalized. At a minimum, apply a stemming algorithm to the tokens (words), or lemmatization if you want better quality.
Again, depending on your objectives (the direction of the analysis), you should think about thesauri to resolve synonymy of terms in your index. But that is already getting into deeper analysis; you will need a lot there.
If you gave an example of a specific task, I could write more specifically about tools, approaches, and methods.
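The "index as is" step above can be sketched in a few lines; an inverted index is just a map from each token to the documents it occurs in (the function name and the `docs` shape are illustrative):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the sorted list of document ids containing it.
    docs: dict of doc_id -> raw text, indexed as is with no normalization."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[token].add(doc_id)
    return {tok: sorted(ids) for tok, ids in index.items()}
```

Real systems add positions, frequencies, and on-disk postings lists, but the structure is the same.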
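To illustrate what normalization means for the second index, here is a deliberately crude suffix-stripping sketch; it is a stand-in for a real stemmer (e.g. the Porter algorithm) or a lemmatizer, not a usable one:

```python
def normalize(token):
    """Very crude normalization: lowercase plus stripping a few common
    English suffixes. A real stemmer or lemmatizer handles far more cases."""
    token = token.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token
```

The normalized index is then built the same way as the raw one, but over `normalize(token)` instead of `token`.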
Useful links:
1. For Information Retrieval/Data Analysis, read this:
Introduction to Information Retrieval, Manning
Processing of unstructured texts. Search, org...
2. Import / frameworks / indexing and search libraries:
Apache Solr
Apache Tika
3. Inverted index
PS: I would still like to hear about a specific task; then I could be more specific.
UPD: in some cases in BigData you need to manipulate a graph data structure. Accordingly, look toward an appropriate DBMS, such as neo4j. The main requirement for a DBMS in BigData is minimal functionality; otherwise everything will run extremely slowly on large data.
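As a toy example of the kind of graph structure a DBMS like neo4j stores natively, terms can be linked by co-occurrence within a document (the function name and input shape here are illustrative assumptions):

```python
from collections import defaultdict

def cooccurrence_graph(docs):
    """Build an undirected co-occurrence graph: tokens are nodes, and an
    edge connects tokens that appear together in the same document."""
    graph = defaultdict(set)
    for text in docs:
        tokens = set(text.split())
        for a in tokens:
            graph[a] |= tokens - {a}
    return graph
```

In a graph DBMS the same data becomes nodes and relationships that can be traversed directly, which is exactly what relational storage handles poorly at scale.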
