Algorithms
Gasoid, 2013-04-26 21:05:46

How do I correctly identify and display similar news on a site?

For each main news item on the site, I need to display a list of related news. How should such news be identified and searched for? By tags? Or by matching on the title?

3 answer(s)
Maxim Dyachenko, 2013-04-28
@Gasoid

Tags are good, but they are not always applicable: articles don't always have them, or don't have enough of them.
Titles of related articles can differ a lot, although the title still matters.
If the project is in PHP, the phpmorphy library is probably the better choice.
Alternatively, use some kind of stemmer, but then you can't show the resulting words to the user as tags, since stems are not always real words.
The first thing that comes to mind is an algorithm like this:
1 - Reduce the words of each text to their base form.
2 - Discard all stop words using a dictionary. You could also filter by the part of speech that phpmorphy reports, but IMHO that is overkill. It is easier to tune the dictionary by hand for your own needs.
3 - Run the text through synonym replacement (optional; the quality of the synonym dictionary depends on the subject area, and sometimes it is better to skip this step).
4 - Calculate the relevance of each word to the text. I would take the simplest "nausea" measure (the Russian SEO term for a word-frequency score). I would only add extra weight to words from the article title and, if you really want to, adjust the weights using the part of speech reported by phpmorphy (for example, give adjectives less weight than nouns).
5 - Select the top N keywords and attach them to each article. Choose the number of keywords based on your task, but from experience I think it will be between 5 and 10.
6 - Now the hardest part: the query. Here you will need to read up and experiment. IMHO it is worth calculating some kind of proximity "rating": either the number of matching words from the two top-N lists, or a weighted version of that number (depending on each word's position in the list or its weight in the text). Beyond that, everything depends heavily on the implementation, the ORM, and so on.
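
Steps 1 to 5 above could be sketched roughly as follows. This is only an illustration in Python, not the answerer's code: `normalize` is a placeholder where a real lemmatizer (phpmorphy in a PHP project) or stemmer would be called, `STOP_WORDS` is a tiny hypothetical dictionary, plain word counts stand in for the relevance measure, `title_boost` is an assumed value, and the optional synonym step is omitted.

```python
import re
from collections import Counter

# Hypothetical stop-word dictionary; in practice it is tuned by hand.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "for"}


def normalize(word):
    """Step 1 placeholder: a real system would return the word's base form."""
    return word.lower()


def tokenize(text):
    """Split text into normalized words."""
    return [normalize(w) for w in re.findall(r"\w+", text)]


def top_keywords(title, body, n=7, title_boost=3.0):
    """Return the n heaviest keywords of an article as {word: weight}."""
    weights = Counter()
    # Step 4: count occurrences in the body, skipping stop words (step 2).
    for word in tokenize(body):
        if word not in STOP_WORDS:
            weights[word] += 1.0
    # Words from the title get extra weight.
    for word in tokenize(title):
        if word not in STOP_WORDS:
            weights[word] += title_boost
    # Step 5: keep only the top n keywords.
    return dict(weights.most_common(n))
```

The resulting dictionary would be stored alongside each article, so that related articles can later be found by comparing keyword sets.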
P.S. Regarding morphological homonymy: in such cases I simply took the first variant that came up. Overall this had very little effect on the result, and resolving morphological homonymy is quite a task in its own right :)
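
The proximity "rating" from step 6 might look like the sketch below. It assumes each article already has a `{keyword: weight}` dictionary; the function names and the choice of scoring (sum of the smaller weight for each shared keyword) are illustrative assumptions, one of several reasonable weighted variants.

```python
def proximity(a, b):
    """Weighted overlap: for each shared keyword, add the smaller weight."""
    return sum(min(a[w], b[w]) for w in a.keys() & b.keys())


def related(target_id, keywords_by_id, limit=5):
    """Return ids of the articles closest to target_id, best match first."""
    target = keywords_by_id[target_id]
    scored = []
    for other_id, keywords in keywords_by_id.items():
        if other_id == target_id:
            continue
        score = proximity(target, keywords)
        if score > 0:  # ignore articles with no shared keywords
            scored.append((score, other_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other_id for _, other_id in scored[:limit]]
```

Replacing `min(a[w], b[w])` with `1` gives the unweighted variant, that is, the plain number of matching keywords.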

pav, 2013-04-27
@pav

I implemented something similar on Solr using a MoreLikeThis request, with the full text of the article set as the search field.
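
For illustration, a MoreLikeThis request through Solr's standard search handler could be built like this. The core name `news`, the unique key `id`, and the full-text field `text` are assumptions; the `mlt`, `mlt.fl`, `mlt.count`, `mlt.mintf` and `mlt.mindf` parameters belong to Solr's MoreLikeThis component, and the values here are only starting points to tune.

```python
from urllib.parse import urlencode


def more_like_this_url(base_url, doc_id, field="text", count=5):
    """Build a Solr query URL asking for documents similar to doc_id."""
    params = {
        "q": f"id:{doc_id}",   # the source article (assumed unique key "id")
        "mlt": "true",         # enable the MoreLikeThis component
        "mlt.fl": field,       # field(s) used to measure similarity
        "mlt.count": count,    # number of similar documents to return
        "mlt.mintf": 1,        # minimum term frequency in the source document
        "mlt.mindf": 1,        # minimum number of documents containing a term
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Example with an assumed local Solr core named "news":
url = more_like_this_url("http://localhost:8983/solr/news", "42")
```

The URL can then be fetched with any HTTP client; the similar documents come back in a separate section of the response next to the main result.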

EminH, 2013-04-30
@EminH

If you have the classic PHP + MySQL stack, then phpmorphy is just the thing.
Run the whole text through phpmorphy and store the result in a MySQL table column, having first created a FULLTEXT index on that column (the column must, naturally, be a text type such as VARCHAR; FULLTEXT indexes also accept CHAR and TEXT columns).
In the SQL query, use the MATCH (col1,col2,...) AGAINST (expr [search_modifier]) construct to determine the "similarity".
For example:

SELECT * FROM news WHERE MATCH (tags) AGAINST ('слово другое третье') > 20;

Here tags is that same column, and 20 is the minimum relevance (this value depends on your content).
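
As a rough sketch of how this might be wired up from application code, the helper below builds the index statement and a parameterized query that also sorts by relevance. It is written in Python for illustration only; the `id` column, the index name `ft_tags`, and the function name are assumptions, and the `%s` placeholders follow the style used by common Python MySQL drivers.

```python
# One-time setup: FULLTEXT index on the column holding the normalized words.
CREATE_INDEX_SQL = "ALTER TABLE news ADD FULLTEXT INDEX ft_tags (tags)"


def related_news_query(normalized_text, exclude_id, min_relevance=0.0, limit=5):
    """Return (sql, params) selecting the news most similar to the given text."""
    sql = (
        "SELECT id, MATCH (tags) AGAINST (%s) AS relevance "
        "FROM news "
        "WHERE id <> %s AND MATCH (tags) AGAINST (%s) > %s "
        "ORDER BY relevance DESC "
        "LIMIT %s"
    )
    # The text is passed twice: once for the score, once for the filter.
    params = (normalized_text, exclude_id, normalized_text, min_relevance, limit)
    return sql, params
```

Passing the text as a parameter instead of concatenating it into the SQL string avoids quoting problems, and excluding the current article's own id keeps it out of its own related list.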
