Which technologies should I choose for a data mining project?
Hello. I want to write a pet project using data mining technologies, but I seem to be a little behind the times in this area. I would appreciate advice on which technologies are best suited for it.
So, my task is:
1) There are several million text files in Russian, English, and Ukrainian, each containing a set of features in the form of a plain-text description. I found a solution for extracting them from English text, but nothing for Russian or Ukrainian.
2) The original data will be stored on the server as text files; the database will hold the prepared data: id, feature set, and a link to the original file.
3) The data will be processed by several data mining algorithms: decision trees (CART or C4.5), classification (kNN), clustering, and so on. The results will be delivered to the end user through a web UI or a REST API.
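To make item 3 a bit more concrete, here is a minimal sketch of the kNN classification step in plain Python. It assumes the features have already been turned into numeric vectors; the function name `knn_predict` and the toy data are made up for illustration, and a real project would more likely use a library implementation.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours.

    train: list of (feature_vector, label) pairs (hypothetical layout)
    query: feature vector with the same number of dimensions
    """
    # Sort training points by Euclidean distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Count the labels of those neighbours and return the most frequent one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data: two clusters labelled "a" and "b".
train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
```

Calling `knn_predict(train, (0.2, 0.1))` votes among the three closest points. Sorting the whole training set on every query is fine for a sketch but would be too slow for millions of documents.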
These are the choices I need to make:
1) DBMS: I'm thinking of using PostgreSQL or MySQL. MongoDB is also an option, but all my data is structured, so I'm not sure I need a NoSQL database.
2) A technology for finding features in the text. I haven't found anything suitable for Russian and Ukrainian, so it seems I will have to parse by keywords and then check the quality manually.
3) The data mining solution itself. I found some libraries, for example:
github.com/haifengl/smile
github.com/apache/mahout
www.cs.waikato.ac.nz/ml/weka
orange.biolab.si
But there is very little information online about their capabilities, which makes it hard to choose. As an alternative, I'm considering www.h2o.ai, but its excessive complexity puts me off.
In addition, I would like to use a single language for the entire backend, rather than one module in Java, another in Python, and so on.
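The keyword-based parsing mentioned in item 2 could start from something like the sketch below. The feature names and keywords in `FEATURE_KEYWORDS` are purely hypothetical placeholders, since the actual features are not described here; the only idea shown is one dictionary that maps each feature to its keywords in all three languages.

```python
import re

# Hypothetical dictionary: feature name -> keywords in English, Russian, Ukrainian.
FEATURE_KEYWORDS = {
    "balcony": {"balcony", "балкон"},
    "garage": {"garage", "гараж"},
}


def extract_features(text):
    """Return the set of feature names whose keywords occur in the text."""
    # \w matches Cyrillic letters too, because Python 3 patterns are Unicode-aware.
    tokens = set(re.findall(r"\w+", text.lower()))
    return {
        feature
        for feature, keywords in FEATURE_KEYWORDS.items()
        if tokens & keywords
    }
```

Exact matching like this misses inflected forms (for example "балконом"), so in practice the tokens and the keywords would probably both need to be stemmed or lemmatized first.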
I didn't quite understand your question.
Do you need features?
Use tf-idf.
Do you need stemming?
Use Snowball stemming.
Do you need to strip stop words?
Download a stop-word list and strip them.
------
Extracting features from text is fairly simple:
Remove stop words -> POS tagging -> Stemming -> Scoring (tf-idf)
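A rough sketch of that pipeline in plain Python, under several assumptions: the stop-word list is a tiny placeholder, `naive_stem` is a crude suffix stripper standing in for a real Snowball stemmer, and the POS tagging step is left out because it needs a trained model per language. The tf-idf variant used here is relative term frequency times `log(N / df)`; other weightings exist.

```python
import math
import re
from collections import Counter

# Placeholder stop-word list; a real project would load a full list per language.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "and", "with"}

# Crude stand-in for a Snowball stemmer: strips a few English suffixes only.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize, drop stop words, then stem what is left."""
    tokens = re.findall(r"\w+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def tf_idf(docs: list[str]) -> list[dict[str, float]]:
    """Return one {term: score} dict per document."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores
```

For example, `tf_idf(["heated house", "heating flat", "garden flat"])` scores "house" higher than "heat" in the first document, because "heat" also appears in the second one. A term that occurs in every document gets a score of zero.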