Is it difficult to sort messages/posts thematically in a thread?

T

Taras Labiak2017-03-31 21:35:38

data mining

Taras Labiak, 2017-03-31 21:35:38

Let's say there are a number of publics / groups / people (a number of walls) on VKontakte that are periodically updated. The task is to determine the subject of the message based on the flow of posts and some additional information about the wall (there is a fixed list of topics by which to classify) and determine whether the post is a job offer, an offer to complete a project or task, a purchase or a sale.
The question is whether it is difficult to carry out such a classification and what AI methods/algorithms can do this. It is necessary to show these messages to the end user, who, having chosen the topic of interest to him, sees only the posts corresponding to it. The probability of false classification by topic and usefulness (whether it is a suggestion of something) should be less than 50%.
How much more difficult would it be to make the error rate less than 10%? Is it difficult to make the algorithm learnable so that the probability of a false positive decreases over time?
Vkontakte is indicated as an example. Technical questions how to scan a large number and which publics not to consider

Reply

Answer the question

In order to leave comments, you need to log in

2 answer(s)

K

kotofey, 2017-03-31
@kotofey

What you write about is already quite implemented, for example, by these guys here - shikari.do
I won’t say how difficult it is, but if there is, then it’s quite possible and at first glance the error is much less than 10%.

A

Alexander Pozharsky, 2017-04-18
@alex4321

Probably irrelevant, but - the task is reduced to the classification (possibly - clustering) of texts?
If the former, it might be worth looking in the direction of abbyy smartclassifier (maybe new ready-made classifiers with support for the Russian language have already been added).
ps there were posts from https://habrahabr.ru/users/ServPonomarev/ . My backend implementation https://github.com/alex4321/w2v-cluster-distance-c... still worked on a small dataset (judging by his posts, the algorithm should work on large datasets). However, I certainly do not recommend using it :-)
z.s.2. as to "Is it hard to make the algorithm learnable so that the probability of a false positive decreases over time?" - in the case of similar to the above - it should not become a big complexity (of course - you will need to add an example to the dataset and retrain), in the case of ANN - you may need to change its configuration.