Is it necessary for each row of the example matrix to span the entire dictionary of possible words?
I am solving a typical textbook task: classification of objects, in my case goods. I chose the simplest algorithm, Naive Bayes.
php-ml.readthedocs.io
The categories are cars, services, clothes, etc. (without nested subcategories).
The algorithm of actions is as follows:
All words are converted into lemmas (base forms), and each lemma is then mapped to an identifier (ID).
After that, as I understand it, you need to build a matrix where each row is a product (in my case) and each column holds the "weight" of a word in the product name (its TF-IDF value). My question is about this part: from the textbook, as I understand it, each row of this matrix must be the size of the entire dictionary, so it would look something like this:
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.4,0,0,0,0,0,0,0.7,0,0,0,1.2,0,0,0]
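A minimal sketch of how such dictionary-sized dense rows could be built by hand. The toy corpus of lemmatized product names and the smoothed IDF variant are my assumptions for illustration, not from the question:

```python
import math

# Toy corpus: each "document" is a product name, already lemmatized.
docs = [
    ["car", "tire", "winter"],
    ["dress", "cotton", "summer"],
    ["car", "wash", "service"],
]

# Build the full dictionary: lemma -> column index.
vocab = {lemma: i for i, lemma in enumerate(sorted({w for d in docs for w in d}))}

def tf_idf_row(doc):
    """Dense row the size of the whole dictionary; most cells stay 0."""
    row = [0.0] * len(vocab)
    n_docs = len(docs)
    for lemma in set(doc):
        tf = doc.count(lemma) / len(doc)
        df = sum(1 for d in docs if lemma in d)  # documents containing the lemma
        idf = math.log(n_docs / df) + 1.0        # smoothed IDF variant
        row[vocab[lemma]] = tf * idf
    return row

matrix = [tf_idf_row(d) for d in docs]
```

As the question suspects, every row is as wide as the whole dictionary, and almost all cells are zero, which is exactly why the answers below suggest capping the vocabulary or using a sparse representation.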
The size of the dictionary can be limited by taking, for example, the top 5,000 words (sorted by "weight" first).
There are a number of heuristics that can also be applied; I recommend looking at scikit-learn's TfidfVectorizer implementation.
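A minimal sketch of that suggestion, assuming scikit-learn is available; the toy product names are illustrative, and `max_features` implements the "take the top N words" heuristic mentioned above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

product_names = [
    "winter car tire",
    "summer cotton dress",
    "car wash service",
]

# max_features keeps only the top-N terms ranked by corpus frequency,
# capping the dictionary size (e.g. 5000 in practice, as suggested above).
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(product_names)

# X is a scipy sparse matrix: only non-zero TF-IDF weights are stored.
print(X.shape)  # (number of products, size of the capped dictionary)
```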
There is such a thing as a sparse matrix: only the non-zero elements (together with their coordinates) are stored. For this you can use a regular dictionary (an associative array).
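A minimal sketch of that idea using a plain Python dict; the weights below are illustrative, not computed:

```python
# Dense row over the whole dictionary: mostly zeros.
dense = [0.0, 0.0, 0.4, 0.0, 0.0, 0.7, 0.0, 1.2]

# Sparse version: store only {column index: weight} for non-zero cells.
sparse = {i: v for i, v in enumerate(dense) if v != 0.0}

def restore(sparse_row, size):
    """Rebuild the dense row when an algorithm needs the full vector."""
    return [sparse_row.get(i, 0.0) for i in range(size)]
```

This keeps memory proportional to the number of words actually present in a product name rather than to the dictionary size.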