Where and how can I assemble a corpus in different languages for a web page classifier?
The task is to build a site classifier similar to Similarweb's (list of categories).
How can I assemble a corpus to train such a classifier in several languages? Which libraries and approaches work best for this? If anyone has already built such a classifier for production, please share your experience: architecture, algorithms, performance, technology stack, problems, pitfalls, and so on.
Similarweb also takes user behavior and interests into account, possibly even first and foremost (look at how many sites it leaves as Unknown), and its accuracy is not great either.
For corpora in general, nothing better than Wikipedia has come along: the language is fairly natural (not literary) and far from academic.
Without labels, or something already trained or labeled, a classifier will of course not learn much, so as a rule such tasks go through the English version.
That said, the task is solved not only with a corpus but also with Open Graph metadata, page structure, and so on.
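To illustrate the last point, here is a minimal sketch of pulling Open Graph tags and a few structural signals out of a page with Python's standard `html.parser`. The class name `PageSignals`, the helper `extract_signals`, and the choice of `<title>`, `<h1>` and `<h2>` as "page structure" are my own assumptions for the example, not a recommended feature set.

```python
from html.parser import HTMLParser


class PageSignals(HTMLParser):
    """Collect Open Graph <meta> tags, the <title>, and h1/h2 text (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.og = {}          # e.g. {"type": "article", "site_name": "..."}
        self.title = ""
        self.headings = []
        self._current = None  # tag whose text is being collected right now

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            prop = attrs.get("property") or ""
            if prop.startswith("og:") and attrs.get("content"):
                self.og[prop[3:]] = attrs["content"]
        elif tag in ("title", "h1", "h2"):
            self._current = tag
            if tag != "title":
                self.headings.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current in ("h1", "h2"):
            self.headings[-1] += data


def extract_signals(html):
    """Return the collected signals as a plain dict."""
    parser = PageSignals()
    parser.feed(html)
    parser.close()
    return {
        "og": parser.og,
        "title": parser.title.strip(),
        "headings": [h.strip() for h in parser.headings if h.strip()],
    }
```

Values such as `og:type` or `og:site_name` could then be added as extra features next to the page text. How much they help probably depends on how many of the target sites fill them in at all.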