Algorithm to cluster documents from multiple parts?

Python

Jaitl, 2016-09-05 10:36:12

Hey!
I need to cluster documents that consist of several parts: content, title, cities, etc.
Sample document model: Doc(content: String, Title: String, geo: array[String], persons: array[String], ...)
Text fields will be represented as vectors.
Ideally, each part could be given its own weight.
What clustering algorithm can be used? Are there implementations of similar algorithms in Python?

1 answer(s)

Roman Mirilaczvili, 2016-09-08
@2ord

fastText

... Facebook announced the open-sourcing of the fastText library, which provides tools for text classification using machine learning methods

(Note: classification, not clustering)
In classification, the set of classes is known in advance, and every classified element must be assigned to one of them.
Each document contains related data.
In machine learning, a "vector" should be understood, first of all, as a set of features that represent some piece of data.
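As a rough illustration of what "a set of features" can look like for the document model in the question, here is a minimal sketch that turns each part into bag-of-words counts and puts all parts into one vector, scaled by a per-part weight. The weights, the helper names (`tokens`, `build_vocab`, `doc_to_vector`) and the toy documents are all assumptions made up for this example; they come neither from the question nor from fastText.

```python
from collections import Counter

# Hypothetical per-part weights; tune them for your own data.
WEIGHTS = {"content": 1.0, "title": 2.0, "geo": 3.0}


def tokens(value):
    """Return lowercase tokens for a string or for a list of strings."""
    if isinstance(value, str):
        return value.lower().split()
    return [item.lower() for item in value]


def build_vocab(docs):
    """Assign a column index to every (part, token) pair seen in docs."""
    vocab = {}
    for doc in docs:
        for part in WEIGHTS:
            for token in tokens(doc.get(part, "")):
                vocab.setdefault((part, token), len(vocab))
    return vocab


def doc_to_vector(doc, vocab):
    """Build one weighted feature vector from all parts of a document.

    Features are keyed by (part, token), so the same word in the title
    and in the content ends up in two different columns.
    """
    vector = [0.0] * len(vocab)
    for part, weight in WEIGHTS.items():
        counts = Counter(tokens(doc.get(part, "")))
        for token, count in counts.items():
            index = vocab.get((part, token))
            if index is not None:
                vector[index] = weight * count
    return vector


# Toy documents, invented for the example.
docs = [
    {"content": "rain in the city", "title": "Rain", "geo": ["London"]},
    {"content": "sun in the city", "title": "Sun", "geo": ["Madrid"]},
]
vocab = build_vocab(docs)
vectors = [doc_to_vector(doc, vocab) for doc in docs]
```

Vectors built this way are plain lists of numbers, so they could be passed to any clustering algorithm that works on numeric vectors, k-means for example. A larger weight makes differences in that part count for more when distances between documents are computed.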
First, you need to normalize and filter the data. Text is raw data and is unsuitable for machine learning as it stands, because a machine, unlike a person, does not understand the meanings of words (and even a person usually understands no more than 2 different languages).
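A minimal sketch of what "normalize and filter" might mean for a text field, assuming that lowercasing, stripping punctuation and dropping a few very common words is enough. The `normalize` helper and the tiny stop-word list are hypothetical; a real project would likely use a fuller list and possibly stemming or lemmatization.

```python
import re

# A tiny illustrative stop-word list, not a complete one.
STOP_WORDS = {"a", "an", "the", "in", "of", "and", "is"}


def normalize(text):
    """Lowercase the text, keep only runs of word characters, drop stop words."""
    words = re.findall(r"\w+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


cleaned = normalize("The Rain in London is HEAVY, and the wind is cold!")
# cleaned == ["rain", "london", "heavy", "wind", "cold"]
```

The resulting token list is what would then be counted and turned into features.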