Where and how can I assemble a corpus in different languages for a web page classifier?

Twindo, 2018-01-15 15:50:46

natural language processing

The task is to build a site classifier similar to Similarweb's (list of categories).
How can I assemble a corpus to train such a classifier in different languages? Which libraries and approaches are best suited for this? If anyone has already built such classifiers for production, please share your experience: architecture, algorithms, performance, technology stack, problems, pitfalls, etc.

1 answer(s)

Dimonchik, 2018-01-15
@dimonchik2013

Similarweb also takes user behavior / interests into account, perhaps even primarily (see how many sites it labels Unknown), and its accuracy is shaky as well.
In general, nobody has come up with a better corpus source than Wikipedia: the language is fairly lively (not literary) and far from academic.
Without labels, or without an already trained / labeled corpus, a model will of course not learn much, so as a rule such tasks go through the English version.
However, the task is solved not only through the corpus, but also through Open Graph, page structure, etc.
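To illustrate the Open Graph / page structure point, here is a minimal sketch (Python, standard library only) that pulls a few language-independent signals out of raw HTML: the `og:*` meta tags, the `<title>` text and the `lang` attribute of `<html>`. The class name `OpenGraphParser`, the function `extract_page_features` and the choice of which fields to keep are my own assumptions for illustration, not something Similarweb or any particular library does.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect og:* meta tags, the <title> text and the <html lang> value.

    Hypothetical helper: the set of collected fields is an assumption.
    """

    def __init__(self):
        super().__init__()
        self.features = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and attrs.get("lang"):
            self.features["lang"] = attrs["lang"]
        elif tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Open Graph tags look like <meta property="og:type" content="...">
            prop = attrs.get("property") or ""
            content = attrs.get("content")
            if prop.startswith("og:") and content:
                # Keep the first value if a property is repeated
                self.features.setdefault(prop, content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.features["title"] = self.features.get("title", "") + data


def extract_page_features(html):
    """Return a dict of simple structural features for one HTML page."""
    parser = OpenGraphParser()
    parser.feed(html)
    parser.close()
    if "title" in parser.features:
        parser.features["title"] = parser.features["title"].strip()
    return parser.features
```

Something like `extract_page_features(page_html)` could then feed values such as `og:type` or `og:site_name` into the classifier alongside the text features, which may help on pages in languages where the labeled corpus is thin.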