How to make a smart text filter by meaning?

Maxim, 2017-03-09 02:11:14

PHP

Task:
- There is a parser (in PHP) for VK posts about apartment rentals. It needs to determine whether a post was written by an agent or by the owner.
How the problem is solved at the moment (clumsily):
- With masks built from stop words. If the text contains a stop word (for example, "agency"), the post is treated as an agent's.
What is the problem:
- Naturally, this approach is very crude and sometimes filters posts incorrectly.
Question:
- Where can I find information, with examples, on implementing such scripts? I just need to understand at least the concept of how to do it in the most practical way.
For now, I'm thinking of doing it like this: the script will search not only for stop words but also for words that indicate the owner. I also want to assign points to all these masks. That is, we find all the stop words (they lower the text's overall score), we find all the words that indicate the owner (they raise the overall score), and the final score tells us how likely the post is to come from the owner of the home.
I couldn't find anything on the internet myself. Thanks in advance for any information on this topic!
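
The scoring idea described above could look roughly like the sketch below. It is written in Python for brevity (the same logic ports directly to PHP), and the marker phrases, weights, and function names are hypothetical placeholders, not a tested vocabulary.

```python
import re

# Hypothetical markers and weights; real ones would come from reviewing actual posts.
AGENT_MARKERS = {"agency": -3, "realtor": -3, "our office": -2}
OWNER_MARKERS = {"owner": 3, "no intermediaries": 3, "my apartment": 2}


def score_post(text: str) -> int:
    """Sum the weights of every marker phrase present in the text."""
    normalized = re.sub(r"\s+", " ", text.lower())
    score = 0
    for phrase, weight in {**AGENT_MARKERS, **OWNER_MARKERS}.items():
        # \b stops "owner" from matching inside a longer word such as "homeowners".
        if re.search(r"\b" + re.escape(phrase) + r"\b", normalized):
            score += weight
    return score


def classify(text: str) -> str:
    """Positive score: owner. Negative score: agent. Zero: not enough evidence."""
    score = score_post(text)
    if score > 0:
        return "owner"
    if score < 0:
        return "agent"
    return "unknown"
```

Each phrase is counted once per post here, so a post that repeats "agency" five times is not penalized five times; whether that is the right choice depends on the data.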

6 answer(s)

Vitaly, 2017-03-09
@vshvydky

The word analysis idea is good, but add analysis of the author's page to it. An agent will definitely have more than one listing on the wall of their page, which means they are not your target, so subtract points for that.

entermix, 2017-03-09
@entermix

Perhaps the Levenshtein distance will help?
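
One place where edit distance might help is catching misspelled or deliberately distorted marker words. The sketch below is in Python for illustration; `fuzzy_contains` and its `max_distance` threshold are hypothetical names and values. PHP has a built-in `levenshtein()` function, but it compares bytes, so multibyte (for example Cyrillic UTF-8) text would need extra care.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,          # delete char_a
                current[j - 1] + 1,       # insert char_b
                previous[j - 1] + cost,   # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def fuzzy_contains(text: str, marker: str, max_distance: int = 1) -> bool:
    """True if any word of the text is within max_distance edits of the marker."""
    words = re.findall(r"\w+", text.lower())
    return any(levenshtein(word, marker) <= max_distance for word in words)
```

With a threshold of 1, "agensy" would still match the marker "agency", while unrelated words would not. Short markers need a threshold of 0, otherwise almost any short word matches.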

Nwton, 2017-03-09
@Nwton

1) View hundreds of ads
2) Analyze each one and find out what exactly helped you determine the ad type and quality
3) Write a script that analyzes the right details
4) Test the script on a couple of hundred different ads. See exactly how the bot analyzes the data, follow each step, and refine the algorithm.
This is the general approach to building such things.

latteo, 2017-03-09
@latteo

You can't always identify an agent by phone number if the agent is posing as an owner.
For manual filtering I use:
- Phone analysis: as a rule, agencies have many apartments. About 90% of owners have a single ad; the other 10% may have ads for 2-3 apartments and, occasionally, ads selling small items.
- Image analysis: agencies are too lazy to bother, so the same pictures can appear in several ads, but alas, searching Google for them takes a long time.
- The general tone of the text plus intuition. Here the machine is still powerless, although you can try to analyze several thousand texts to identify patterns through services like https://habrahabr.ru/post/243705/
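
The phone check from the list above can be automated along these lines. This is a Python sketch under assumptions: each ad is a dict with a hypothetical `phone` key, numbers are compared by their last 10 digits (an assumption that suits Russian mobile numbers), and the threshold of 3 ads is taken from the "2-3 apartments" estimate.

```python
import re
from collections import Counter


def normalize_phone(raw: str) -> str:
    """Strip everything but digits and keep the last 10,
    so '+7 (900) 111-22-33' and '89001112233' compare equal."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:]


def suspicious_phones(ads, max_ads_per_owner=3):
    """Return the phones that appear in more ads than an owner would plausibly post."""
    counts = Counter(normalize_phone(ad["phone"]) for ad in ads)
    return {phone for phone, n in counts.items() if n > max_ads_per_owner}
```

The same counting could be applied to image hashes to spot pictures reused across ads, without going through an external image search.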
PS: In the modern world, agencies are an evil: an extra intermediary with huge demands whose minimal function is to bring two people together without the slightest check of who those people are.

Arseny Kravchenko, 2017-03-09
@Arseny_Info

1) Text tokenization and lemmatization.
2) Bag-of-words or TF-IDF vectorization.
3) Additional features based on the uniqueness of the phone number / address.
4) A simple linear model on top of these vectors.
For anyone who has worked on such tasks before, this is one or two evenings of work.
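
A dependency-free Python sketch of steps 1, 2, and 4 is shown below. It rests on several assumptions: lemmatization is skipped (tokens are only lowercased), the linear model is logistic regression trained with plain stochastic gradient descent, and the four training posts are invented toy data. In practice a library such as scikit-learn would replace most of this.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens (no lemmatization in this sketch)."""
    return re.findall(r"\w+", text.lower())


def fit_idf(docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n = len(docs)
    df = Counter(token for doc in docs for token in set(tokenize(doc)))
    return {token: math.log((1 + n) / (1 + count)) + 1 for token, count in df.items()}


def tfidf(text, idf):
    """Sparse {token: weight} vector, L2-normalized; unknown tokens are dropped."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}


def train_logreg(vectors, labels, epochs=200, lr=0.5):
    """Logistic regression fitted with stochastic gradient descent on log loss."""
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            z = bias + sum(weights[t] * w for t, w in vec.items())
            error = 1 / (1 + math.exp(-z)) - y  # d(log loss)/dz
            bias -= lr * error
            for t, w in vec.items():
                weights[t] -= lr * error * w
    return weights, bias


def owner_probability(text, idf, weights, bias):
    """Estimated probability that the post was written by an owner (label 1)."""
    vec = tfidf(text, idf)
    z = bias + sum(weights[t] * w for t, w in vec.items())
    return 1 / (1 + math.exp(-z))


# Invented toy data: 1 = owner, 0 = agent.
posts = [
    "I am the owner renting out my apartment",
    "Renting my own flat directly from owner",
    "Real estate agency offers apartments commission applies",
    "Our agency has many flats call the realtor",
]
labels = [1, 1, 0, 0]

idf = fit_idf(posts)
weights, bias = train_logreg([tfidf(p, idf) for p in posts], labels)
```

The features from step 3 (for example, how many ads share the same phone number) could be appended to each vector as extra keys before training.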

Renat Abyasov, 2017-03-16
@Abyasov

Sounds like you need to look into machine learning. It can't be described here in a nutshell, so I recommend reading up on the topic. But I would go this way:
1) Create a feature set for each ad. You can start with what you already have (judging by the description and the replies to comments). You probably shouldn't go straight to a "bag of words": there will be a lot of noise, and the dimensionality of the feature space will have to be reduced. I suspect that at the start you can limit yourself to indicators of whether marker words occur in the text. A list of such words can be compiled by hand. Also, I think features based on the same contacts appearing in several ads will be very important.
2) To train the model, you will need a sample of ads for which you know exactly who posted them. This is the most important part, and the most difficult. I suspect the sample will be small, which will restrict the choice of model (for example, you will work with decision trees and boosting over them).
3) The training sample will need to be preprocessed: categorical features (all sorts of identifiers, categories, topics) should be replaced with binary sets (one-hot encoding), numerical ones should be normalized, and the sample should be balanced.
4) Choose a metric for evaluating the quality of the algorithm (accuracy, recall, F-measure, ROC-AUC, or something else). This is a big separate topic, and the choice will depend on your business model.
5) Build several models and choose the most promising ones. Then tune the hyperparameters of those models. You may want to combine multiple models, but be careful with performance and overfitting.
6) Plug the resulting model into your service.
7) Look for new features, training data, and ideas, and improve the model. There is no limit to perfection here :)
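
For step 4, some of the listed metrics can be computed from a labeled test set as in the Python sketch below. The function name and the choice of "owner" as the positive class are assumptions, and precision is included alongside the metrics named above because the F-measure is defined in terms of it.

```python
def evaluate(y_true, y_pred, positive="owner"):
    """Accuracy, precision, recall, and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0   # how many flagged owners are real owners
    recall = tp / (tp + fn) if tp + fn else 0.0      # how many real owners were found
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

If showing an agent's post as an owner's is the costlier mistake, precision for the "owner" class is the number to watch; if losing real owners hurts more, recall matters more.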
What you have now can most likely be described, without loss, by an ordinary decision tree. So it shouldn't get any worse.