Does each row of the training matrix have to cover the entire dictionary of possible words?

Machine learning

zaartix, 2019-07-25 10:03:12

I am working through a typical textbook task: classifying objects, in my case goods. I chose the simplest algorithm, NaiveBayes, from the php-ml library (php-ml.readthedocs.io).
The categories are cars, services, clothes, etc. (without nested subcategories).
The procedure is as follows:
All words are converted to lemmas (the base forms of words), and each lemma is then replaced by its identifier (ID).
After that, as I understand it, you need to build a matrix in which each row is a product (in my case) and each column holds the "weight" of a word in the product name (its TF-IDF value). My question is about this part. As I understand the textbook, each row of this matrix must be as long as the entire dictionary, i.e. it will look something like this:

[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.4,0,0,0,0,0,0,0.7,0,0,0,1.2,0,0,0]

I.e. the dimensions of the matrix will be "the entire dictionary of possible words" by "the number of goods". On a test sample of 10 thousand products I got a dictionary of 15 thousand unique words, so the matrix comes out at 15k by 10k.
Is that how it should be? If so, I don't understand how the training base can be extended: if the size of the dictionary changes, the entire matrix has to be rebuilt, because its width has to change.
With a dataset of 15k by 10k, is it normal that training consumes 20GB of memory?

2 answer(s)

Danil, 2019-07-25
@DanilBaibak

The size of the dictionary can be limited by keeping only the top words, for example the top 5 thousand, after sorting them by "weight".
There are a number of heuristics that can also be applied:
I recommend looking at scikit-learn's TfidfVectorizer implementation.
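To make the dictionary cap concrete, here is a minimal sketch in plain Python. It assumes that "weight" means the total number of occurrences of a lemma across the whole sample, which is also how TfidfVectorizer orders terms when its `max_features` parameter is set. The function name `build_vocabulary` and the sample product names are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents, max_features):
    """Map the max_features most frequent lemmas to column indices."""
    # Count every occurrence of every lemma across all documents.
    counts = Counter(lemma for doc in documents for lemma in doc)
    # Sort by descending count, then by lemma so that ties break deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {lemma: index for index, (lemma, _) in enumerate(ranked[:max_features])}


# Hypothetical, already lemmatized product names.
documents = [
    ["winter", "tire", "car"],
    ["car", "wash", "service"],
    ["winter", "jacket"],
    ["car", "tire", "repair"],
]

vocabulary = build_vocabulary(documents, max_features=3)
print(vocabulary)
```

Words outside the kept set are simply ignored when the rows are built, so the matrix width stays fixed at `max_features` even when new products bring new words.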

Alexander Skusnov, 2019-07-25
@AlexSku

There is such a thing as a sparse matrix: only the non-zero elements are stored, together with their coordinates. You can use a regular dictionary (associative array) for this.
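A rough sketch of that storage scheme in plain Python: each product becomes a dictionary from lemma ID to TF-IDF weight, and zero weights are never stored. The weighting used here (relative term frequency times log(N / df)) is an assumed textbook variant; php-ml and scikit-learn may smooth or normalize differently. The name `tfidf_rows` and the lemma IDs are hypothetical, and whether a given classifier accepts rows in this form is a separate question.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Return one {lemma_id: weight} dict per document, keeping non-zero weights only."""
    n_docs = len(documents)
    # Document frequency: the number of documents in which each lemma ID occurs.
    doc_freq = Counter(lemma_id for doc in documents for lemma_id in set(doc))
    rows = []
    for doc in documents:
        row = {}
        for lemma_id, count in Counter(doc).items():
            weight = (count / len(doc)) * math.log(n_docs / doc_freq[lemma_id])
            if weight != 0.0:  # a lemma present in every document gets weight 0
                row[lemma_id] = weight
        rows.append(row)
    return rows


# Hypothetical product names, already converted to lemma IDs.
documents = [[17, 24, 28], [17, 5], [28, 28, 9]]
rows = tfidf_rows(documents)

stored = sum(len(row) for row in rows)
print(rows[0])
print("stored entries:", stored)
```

For scale, a dense 15k by 10k matrix has 150 million cells, about 1.2 GB at 8 bytes per number, before any per-element overhead of the language's array type. The sparse rows hold only a handful of entries per product name. A new word only adds a new key, so the row width never has to change, although the weights themselves still depend on the document counts and need recomputing when the sample grows.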