Determining the similarity of the subject matter of texts by tags
Some background. When an article is added, punctuation and words shorter than 2 characters are removed from the text. The remaining words are then reduced to their normal form (singular, nominative case), and everything except nouns and Latin-script words is discarded. The occurrences of the remaining words are counted, which yields a set of automatic tags together with the number of times each one appears in the text. The tags are then inserted into two tables:
CREATE TABLE IF NOT EXISTS `tags` (
`tag_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`content_type` enum('news','article') NOT NULL,
`tag_name` varchar(120) NOT NULL,
`tag_counter` int(10) unsigned NOT NULL DEFAULT '0' COMMENT 'Number of tag in all texts',
PRIMARY KEY (`tag_id`),
UNIQUE KEY `content_type` (`content_type`,`tag_name`),
KEY `tag_counter` (`tag_counter`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1;
CREATE TABLE IF NOT EXISTS `tagstat` (
`tag_id` int(10) unsigned NOT NULL,
`content_type` enum('news','article') NOT NULL,
`content_id` int(10) unsigned NOT NULL,
`tag_counter` int(10) unsigned NOT NULL DEFAULT '0' COMMENT 'Number of tag in certain text',
KEY `content_type` (`content_type`,`content_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
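The preprocessing described before the table definitions could be sketched roughly as follows. This is only an illustration, not the asker's actual code: the function name `count_tags` and its `min_len` parameter are hypothetical, and the normalization and noun-filtering steps are left out because they need a morphological analyzer that the question does not name.

```python
import re
from collections import Counter

def count_tags(text, min_len=2):
    """Return candidate tags and their frequencies for one text.

    Hypothetical helper: reduction to normal form and the
    noun / Latin-word filter are NOT implemented here.
    """
    # \w+ matches runs of letters, digits and underscores in any script,
    # so punctuation is dropped as a side effect.
    words = re.findall(r"\w+", text.lower())
    # Remove words shorter than min_len characters.
    words = [w for w in words if len(w) >= min_len]
    # Count how many times each remaining word occurs.
    return Counter(words)

tags = count_tags("The cat sat. A cat is a cat!")
print(tags.most_common())
```

Each (word, count) pair would then correspond to one row in `tagstat` for the given `content_id`, with `tags.tag_counter` holding the total across all texts.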
A strange approach to determining text similarity. By storing only the frequencies of "tags", you throw away the order of words in sentences, so the result reflects how similar the word frequencies of the articles are rather than how similar the texts actually are. Two articles with different content but the same tag-word frequencies (for example, two articles on the same topic) will be "similar" according to your algorithm.
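A small sketch makes the objection concrete. The question does not say how the tag counts are compared, so cosine similarity over the frequency vectors is an assumption here, and `cosine_similarity` is a hypothetical helper. Two sentences with opposite meanings but the same words come out as near-identical:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two tag-frequency mappings.

    Assumed comparison measure; the original post does not specify one.
    """
    # Dot product over the tags that both texts share.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    # Product of the squared vector lengths.
    sq_a = sum(v * v for v in a.values())
    sq_b = sum(v * v for v in b.values())
    if sq_a == 0 or sq_b == 0:
        return 0.0
    return dot / math.sqrt(sq_a * sq_b)

first = Counter("dog bites man".split())
second = Counter("man bites dog".split())

# Word order is lost, so the two frequency vectors are identical
# and the score is (up to floating-point rounding) 1.0.
score = cosine_similarity(first, second)
print(score)
```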
For this task (determining text similarity), the shingles algorithm is more commonly used and more effective: www.codeisart.ru/python-shingles-algorithm/
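As a rough idea of what shingling looks like, here is a minimal sketch. It is not the code from the linked article: the function names, the shingle size of 2 words, and the use of the Jaccard coefficient for comparison are all assumptions made for illustration. Many implementations additionally hash each shingle to save memory, which is skipped here.

```python
def shingles(text, size=2):
    """Return the set of overlapping word sequences of length `size`."""
    words = text.lower().split()
    if len(words) < size:
        # Text too short for a full shingle: use whatever words there are.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Share of shingles common to both sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Same words, different order: no shingle is shared.
reordered = jaccard(shingles("dog bites man"), shingles("man bites dog"))

# Texts that differ only in the last word share two of four shingles.
partial = jaccard(shingles("a b c d"), shingles("a b c e"))

print(reordered, partial)
```

Because each shingle preserves the local word order, reordered text no longer counts as a match, which is exactly what the frequency-only approach misses.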