How to classify big data using Sklearn?

Astrohas, 2019-05-21 19:41:10

Python

So, I have a relatively large database, 50 GB in size, consisting of excerpts from 486,000 dissertations across 780 specialties.
For research purposes, I need to train a model on this data. Unfortunately, my resources are limited to a mobile processor, 16 GB of RAM (plus 16 GB of swap), and the finite lifetime of the universe.
I ran an analysis on a subset of 40,000 items (10% of the database, 4.5 GB) with SGDClassifier, and memory consumption was around 16-17 GB.
So I am asking the community for help with this.
The core logic of the code is as follows (the datasets have been cleaned of stop words and some junk):

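from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# texts: cleaned document strings; categories_ids: the matching specialty ids
text_clf = Pipeline([
    ('count', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(n_jobs=8)),
])
texts_train, texts_test, cat_train, cat_test = train_test_split(texts, categories_ids, test_size=0.2)
text_clf.fit(texts_train, cat_train)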

PS: Unfortunately, using other technologies is not an option. Only scikit-learn.

2 answer(s)

Danil, 2019-05-21
@Astrohas

A couple of clarifying questions:
To get started, you can plot a learning curve. It is possible that the current model does not need more data.
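
A rough sketch of the learning-curve check, using `learning_curve` from scikit-learn. The toy corpus, the two-class labels, and the use of `TfidfVectorizer` (in place of `CountVectorizer` plus `TfidfTransformer`) are assumptions for illustration only; the real texts and category ids would go in their place.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import learning_curve
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: two classes with disjoint vocabularies.
rng = np.random.RandomState(0)
vocab = {
    0: ["quantum", "particle", "field", "energy"],
    1: ["novel", "poetry", "metaphor", "narrative"],
}
labels = np.array([i % 2 for i in range(200)])
texts = [" ".join(rng.choice(vocab[y], size=6)) for y in labels]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier(random_state=0)),
])

# Fit the model on growing fractions of the training folds and
# record the cross-validated score at each size.
train_sizes, train_scores, test_scores = learning_curve(
    model, texts, labels,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

for n, score in zip(train_sizes, test_scores.mean(axis=1)):
    print(f"{n:4d} training samples: mean CV accuracy {score:.3f}")
```

If the validation score stops improving well before the largest training size, feeding in more of the 50 GB is unlikely to help this particular model.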

kova1ev, 2019-05-21
@kova1ev

What is the question, then? Either train the model in iterations: train on one chunk of data, save the model, take the next chunk. Or, as an option, train ten models on different parts of the data and make a prediction based on the results of those models.
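
Both options can be sketched with scikit-learn alone. This is a minimal illustration under several assumptions, not a drop-in solution: the toy corpus and the `iter_chunks` helper are hypothetical, and `HashingVectorizer` is swapped in for `CountVectorizer` plus `TfidfTransformer` because it is stateless and needs no pass over the full corpus (at the cost of dropping the IDF weighting). The majority vote assumes integer class ids 0..K-1.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

def iter_chunks(texts, labels, chunk_size):
    """Hypothetical helper: yield consecutive chunks (stand-in for reading from disk)."""
    for start in range(0, len(texts), chunk_size):
        stop = start + chunk_size
        yield texts[start:stop], labels[start:stop]

# Hypothetical toy corpus: two classes with disjoint vocabularies.
rng = np.random.RandomState(0)
vocab = {
    0: ["quantum", "particle", "field", "energy"],
    1: ["novel", "poetry", "metaphor", "narrative"],
}
labels = np.array([i % 2 for i in range(400)])
texts = [" ".join(rng.choice(vocab[y], size=6)) for y in labels]
train_texts, train_labels = texts[:300], labels[:300]
test_texts, test_labels = texts[300:], labels[300:]

# Stateless vectorizer: no vocabulary is kept in memory.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
all_classes = np.unique(labels)  # partial_fit needs the full class list up front
X_test = vectorizer.transform(test_texts)

# Option 1: one model, updated chunk by chunk with partial_fit.
clf = SGDClassifier(random_state=0)
for chunk_texts, chunk_labels in iter_chunks(train_texts, train_labels, 50):
    X_chunk = vectorizer.transform(chunk_texts)
    clf.partial_fit(X_chunk, chunk_labels, classes=all_classes)
incremental_accuracy = clf.score(X_test, test_labels)

# Option 2: one model per chunk, combined by majority vote.
models = [
    SGDClassifier(random_state=0).fit(vectorizer.transform(t), y)
    for t, y in iter_chunks(train_texts, train_labels, 100)
]
votes = np.stack([m.predict(X_test) for m in models])  # (n_models, n_test)
ensemble_pred = np.array([
    np.bincount(column, minlength=len(all_classes)).argmax()
    for column in votes.T
])
ensemble_accuracy = (ensemble_pred == test_labels).mean()

print(f"incremental: {incremental_accuracy:.3f}, ensemble: {ensemble_accuracy:.3f}")
```

With the real data, only one chunk has to be in memory at a time, and the model can be saved between chunks (for example with `joblib.dump`). With 780 specialties, a single chunk may not contain every class, which is why `classes=all_classes` is passed to `partial_fit`.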