Determining the similarity of the subject matter of texts by tags

Algorithms

Dmitry Sergeev, 2013-09-05 12:11:48

A little background. When an article is added, punctuation and words shorter than 2 characters are removed from the text. The remaining words are then reduced to their normal form (singular, nominative case), and everything except nouns and Latin-script words is discarded. The occurrences of the remaining words are counted, which yields a set of automatic tags together with the number of times each one appears in the text. The tags are then inserted into two tables:

CREATE TABLE IF NOT EXISTS `tags` (
  `tag_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `content_type` enum('news','article') NOT NULL,
  `tag_name` varchar(120) NOT NULL,
  `tag_counter` int(10) unsigned NOT NULL DEFAULT '0' COMMENT 'Occurrences of the tag across all texts',
  PRIMARY KEY (`tag_id`),
  UNIQUE KEY `content_type` (`content_type`,`tag_name`),
  KEY `tag_counter` (`tag_counter`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=1;

and

CREATE TABLE IF NOT EXISTS `tagstat` (
  `tag_id` int(10) unsigned NOT NULL,
  `content_type` enum('news','article') NOT NULL,
  `content_id` int(10) unsigned NOT NULL,
  `tag_counter` int(10) unsigned NOT NULL DEFAULT '0' COMMENT 'Occurrences of the tag in this text',
  KEY `content_type` (`content_type`,`content_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
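
The preprocessing described above could be sketched roughly as follows. This is only an illustration, not the code actually used: `extract_tags` and `normalize` are hypothetical names, and `normalize` is a placeholder that merely lowercases, where a real pipeline would call a morphological analyzer to lemmatize each word and to drop everything that is not a noun.

```python
import re
from collections import Counter


def normalize(word: str) -> str:
    """Hypothetical placeholder for lemmatization.

    A real pipeline would reduce the word to its normal form and
    filter by part of speech; here we only lowercase it.
    """
    return word.lower()


def extract_tags(text: str, min_len: int = 2) -> Counter:
    """Return automatic tags mapped to their number of occurrences."""
    # Take runs of word characters, which also strips punctuation.
    words = re.findall(r"\w+", text)
    # Drop words shorter than min_len characters, then normalize the rest.
    kept = [normalize(w) for w in words if len(w) >= min_len]
    return Counter(kept)


tags = extract_tags("Cats chase mice. Mice fear cats, a lot!")
```

Each key of the resulting counter would correspond to a row in `tags`, and each (tag, count) pair for a given text to a row in `tagstat`.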

In the end, for each article we have its tags, the number of occurrences of each tag in that article, and the total number of occurrences of each tag across all texts. How would you find similar articles from this data?
I tried the Jaccard index (Jaccard similarity), but it results in a very slow query.
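
For reference, the Jaccard index of two tag sets is the size of their intersection divided by the size of their union. A minimal in-memory sketch, assuming the tags of each article have already been loaded into a dict (the names `jaccard`, `most_similar`, and the sample data are hypothetical): it builds an inverted index from tag to articles so that only articles sharing at least one tag with the target are scored, which is one common way to avoid comparing against every article.

```python
from collections import defaultdict


def jaccard(a: set, b: set) -> float:
    """Jaccard index: |A & B| / |A | B|, defined here as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(target_id, article_tags):
    """Rank other articles by Jaccard index against the target article."""
    # Inverted index: tag -> ids of articles that contain it.
    index = defaultdict(set)
    for content_id, tag_set in article_tags.items():
        for tag in tag_set:
            index[tag].add(content_id)

    # Only articles sharing at least one tag can have a non-zero score.
    candidates = set()
    for tag in article_tags[target_id]:
        candidates |= index[tag]
    candidates.discard(target_id)

    target = article_tags[target_id]
    scored = [(jaccard(target, article_tags[c]), c) for c in candidates]
    return sorted(scored, reverse=True)


# Hypothetical sample data: content_id -> set of tag names.
articles = {
    1: {"python", "algorithm", "text"},
    2: {"python", "text", "database"},
    3: {"football", "match"},
}
result = most_similar(1, articles)  # article 3 shares no tags, so it is never scored
```

This version ignores the per-article counters and treats tags as a plain set; using the counts would require a weighted variant of the index.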

1 answer(s)

moonsly, 2013-09-05
@moonsly

This is an odd way to determine text similarity. By storing only tag frequencies, you discard the order of words in sentences, so the result reflects how similar the word frequencies of two articles are rather than how similar the texts themselves are. Two articles with different content but the same tag-word frequencies (for example, articles on the same topic) will be “similar” according to your algorithm.
For this kind of task (determining text similarity), the shingles algorithm is used more often and works better: www.codeisart.ru/python-shingles-algorithm/
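
To illustrate the idea, here is a simplified sketch of word shingling. It is not the code from the linked article: the names `shingles` and `resemblance` and the window size of 3 are assumptions, and hashing of shingles is left out for brevity. Each text is turned into the set of its overlapping word sequences, and the sets are compared with the Jaccard index, so word order matters.

```python
def shingles(text: str, w: int = 3) -> set:
    """Return the set of overlapping w-word sequences in the text."""
    words = text.lower().split()
    # Texts shorter than w words produce an empty set.
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a: str, b: str, w: int = 3) -> float:
    """Jaccard index of the two shingle sets (0.0 if either set is empty)."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Same words with the same frequencies, but in a different order.
t1 = "the dog bit the man"
t2 = "the man bit the dog"

same_words = sorted(t1.split()) == sorted(t2.split())  # True
score = resemblance(t1, t2)  # 0.0: no 3-word shingle is shared
```

A frequency-only comparison would treat `t1` and `t2` as identical, while the shingle sets have nothing in common.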