How do I optimize a query with GROUP BY on a string column in a large table?
MySQL. There is a news table with a lot of rows: about 70 thousand already, and it keeps growing.
The structure is as follows:
CREATE TABLE IF NOT EXISTS `news` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `id_section` int(11) NOT NULL,
  `title` varchar(250) NOT NULL,
  `description` text,
  `image` varchar(250) DEFAULT NULL,
  `url` varchar(250) NOT NULL,
  `timestamp` int(10) unsigned NOT NULL,
  `active` tinyint(1) unsigned DEFAULT '1',
  PRIMARY KEY (`id`),
  KEY `id_section` (`id_section`),
  KEY `timestamp` (`timestamp`),
  KEY `title` (`title`),
  KEY `active` (`active`),
  KEY `url` (`url`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=69653 ;
SELECT `news`.* FROM `news` WHERE (active = 1) GROUP BY `url` ORDER BY `timestamp` desc LIMIT 10 OFFSET 20
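For reference, the first thing worth checking is the query plan: prefixing the same statement with EXPLAIN shows whether MySQL resorts to a temporary table and a filesort for the GROUP BY on one column combined with ORDER BY on another (on a schema like this, it typically does):

EXPLAIN SELECT `news`.* FROM `news` WHERE (active = 1) GROUP BY `url` ORDER BY `timestamp` desc LIMIT 10 OFFSET 20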
You could compute a hash for each added news item (from the URL, or the whole text, or the title) and store it in a column declared UNIQUE; then the DBMS itself will filter out duplicates.
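A minimal sketch of that idea, assuming a new, hypothetical url_hash column, and assuming existing duplicates are cleaned out before the unique key is added:

-- Add the column, backfill it, then enforce uniqueness
-- (the last step fails while duplicate hashes still exist):
ALTER TABLE `news` ADD COLUMN `url_hash` char(32) NOT NULL DEFAULT '';
UPDATE `news` SET `url_hash` = MD5(`url`);
ALTER TABLE `news` ADD UNIQUE KEY `uniq_url_hash` (`url_hash`);

-- From then on, INSERT IGNORE silently drops rows whose hash already exists:
INSERT IGNORE INTO `news` (`id_section`, `title`, `url`, `url_hash`, `timestamp`)
VALUES (1, 'Some title', 'http://example.com/item',
        MD5('http://example.com/item'), UNIX_TIMESTAMP());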
Why not make a many-to-many relationship between the news and sections tables? Then the grouping would not be needed at all (a sketch follows below).
By the way, what does EXPLAIN produce for your query? Maybe the database just needs tuning, because 70,000 rows is trifling.
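A sketch of that junction-table idea; the news_sections table and its column names are hypothetical:

CREATE TABLE IF NOT EXISTS `news_sections` (
  `id_news` bigint(20) unsigned NOT NULL,
  `id_section` int(11) NOT NULL,
  PRIMARY KEY (`id_news`, `id_section`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

-- With sections moved out, the listing query needs no GROUP BY at all:
SELECT `news`.* FROM `news` WHERE (active = 1) ORDER BY `timestamp` DESC LIMIT 10 OFFSET 20;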
I thought about that; it turned out to be quite complicated.
The table holds a large amount of data, and new entries are added every 20 minutes. That means every 20 minutes, for each new record, we would have to run through the entire data set and work out whether that news item already exists. If it does, we take its id and record "this news also belongs to another section".
Am I understanding the idea correctly? Store only unique news, and move the duplication into an intermediate table. The problem is the resource-intensive detection of "duplicates": comparison by url, a variable-length string.
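For what it's worth, the table already has KEY `url`, so the per-record existence check is a single indexed lookup rather than a pass over the whole array; a sketch:

-- One point lookup per incoming item (served by the existing KEY `url`):
SELECT `id` FROM `news` WHERE `url` = 'http://example.com/item' LIMIT 1;
-- If a row comes back, attach the extra section to that id instead of inserting.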
Without GROUP BY, this query takes 0.7 seconds.
Try grouping like this: GROUP BY MD5(url)
And get rid of LIMIT/OFFSET by paging on the id instead, like this:
WHERE id > 20 AND id < 30
The example above is only suitable for contiguous ids (i.e. no gaps); a combined sketch follows.
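Taken literally, the two suggestions combine into something like this; note that the id range filters rows before grouping, so it only approximates a page of the grouped result:

SELECT `news`.* FROM `news`
WHERE (active = 1)
  AND id > 20 AND id < 30   -- replaces LIMIT 10 OFFSET 20; needs gap-free ids
GROUP BY MD5(`url`)
ORDER BY `timestamp` DESC;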
Query caching (http://habrahabr.ru/blogs/mysql/108418/) can be a temporary solution.
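For MySQL versions of that era (the query cache was removed in 8.0), you can check whether the cache is enabled and opt a statement in explicitly:

SHOW VARIABLES LIKE 'query_cache%';

SELECT SQL_CACHE `news`.* FROM `news` WHERE (active = 1)
GROUP BY `url` ORDER BY `timestamp` DESC LIMIT 10 OFFSET 20;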
And why was MyISAM chosen rather than InnoDB? InnoDB would avoid locking the whole table, and the table could be partitioned, for example into month-year blocks, which would shrink the volume of current data. I also would not store an integer timestamp, but, for example, a datetime.
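A sketch of both steps; the partition boundaries below are hypothetical, and MySQL requires the partitioning column to appear in every unique key, so the primary key has to be widened first:

-- Engine switch (rebuilds the table):
ALTER TABLE `news` ENGINE=InnoDB;

-- Widen the primary key so `timestamp` may be the partitioning column:
ALTER TABLE `news`
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (`id`, `timestamp`);

-- Range-partition on the integer timestamp; boundaries are example
-- unix timestamps for month starts:
ALTER TABLE `news`
  PARTITION BY RANGE (`timestamp`) (
    PARTITION p201201 VALUES LESS THAN (1328054400),  -- 2012-02-01
    PARTITION p201202 VALUES LESS THAN (1330560000),  -- 2012-03-01
    PARTITION pmax    VALUES LESS THAN MAXVALUE
  );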
Take the latest 20 entries out to Redis (or Memcached, plain memory, a file on disk) and have the frontend read from there; it could be a simple serialized array in memory, managed like a list with array_shift/array_push and so on.
And check for duplicates immediately before inserting; once every 20 minutes is not scary.
Where do the duplicates come from? By "automatic addition of news" do you mean grabbing from other sources, hence the duplicates?