How do I optimize a query with GROUP BY on a string column in a large table?
MySQL. There is a news table with a lot of rows: about 70 thousand already, and it keeps growing.
The structure is as follows:
CREATE TABLE IF NOT EXISTS `news` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `id_section` int(11) NOT NULL,
  `title` varchar(250) NOT NULL,
  `description` text,
  `image` varchar(250) DEFAULT NULL,
  `url` varchar(250) NOT NULL,
  `timestamp` int(10) unsigned NOT NULL,
  `active` tinyint(1) unsigned DEFAULT '1',
  PRIMARY KEY (`id`),
  KEY `id_section` (`id_section`),
  KEY `timestamp` (`timestamp`),
  KEY `title` (`title`),
  KEY `active` (`active`),
  KEY `url` (`url`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=69653 ;
SELECT `news`.* FROM `news` WHERE (active = 1) GROUP BY `url` ORDER BY `timestamp` desc LIMIT 10 OFFSET 20
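For reference, the first thing worth checking is the query plan: prefixing the same statement with EXPLAIN shows whether MySQL resorts to a temporary table and a filesort for the GROUP BY on one column combined with ORDER BY on another (on a schema like this, it typically does):

EXPLAIN SELECT `news`.* FROM `news` WHERE (active = 1) GROUP BY `url` ORDER BY `timestamp` desc LIMIT 10 OFFSET 20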
You could compute a hash for each added news item (from the URL, or the whole text, or the title) and store it in a column declared UNIQUE; then the DBMS itself will filter out duplicates.
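A minimal sketch of that idea, assuming a new, hypothetical url_hash column, and assuming existing duplicates are cleaned out before the unique key is added:

-- Add the column, backfill it, then enforce uniqueness
-- (the last step fails while duplicate hashes still exist):
ALTER TABLE `news` ADD COLUMN `url_hash` char(32) NOT NULL DEFAULT '';
UPDATE `news` SET `url_hash` = MD5(`url`);
ALTER TABLE `news` ADD UNIQUE KEY `uniq_url_hash` (`url_hash`);

-- From then on, INSERT IGNORE silently drops rows whose hash already exists:
INSERT IGNORE INTO `news` (`id_section`, `title`, `url`, `url_hash`, `timestamp`)
VALUES (1, 'Some title', 'http://example.com/item',
        MD5('http://example.com/item'), UNIX_TIMESTAMP());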
Why not make a many-to-many relationship between the news and sections tables? Then the grouping would not be needed at all (a sketch follows below).
By the way, what does EXPLAIN produce for your query? Maybe the database just needs tuning, because 70,000 rows is trifling.
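A sketch of that junction-table idea; the news_sections table and its column names are hypothetical:

CREATE TABLE IF NOT EXISTS `news_sections` (
  `id_news` bigint(20) unsigned NOT NULL,
  `id_section` int(11) NOT NULL,
  PRIMARY KEY (`id_news`, `id_section`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

-- With sections moved out, the listing query needs no GROUP BY at all:
SELECT `news`.* FROM `news` WHERE (active = 1) ORDER BY `timestamp` DESC LIMIT 10 OFFSET 20;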
I thought about that; it turned out to be quite complicated.
The table holds a large amount of data, and new entries are added every 20 minutes. That means every 20 minutes, for each new record, we would have to run through the entire data set and work out whether that news item already exists. If it does, we take its id and record "this news also belongs to another section".
Am I understanding the idea correctly? Store only unique news, and move the duplication into an intermediate table. The problem is the resource-intensive detection of "duplicates": comparison by url, a variable-length string.
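For what it's worth, the table already has KEY `url`, so the per-record existence check is a single indexed lookup rather than a pass over the whole array; a sketch:

-- One point lookup per incoming item (served by the existing KEY `url`):
SELECT `id` FROM `news` WHERE `url` = 'http://example.com/item' LIMIT 1;
-- If a row comes back, attach the extra section to that id instead of inserting.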
Without GROUP BY, this query takes 0.7 seconds.
Try grouping like this: GROUP BY MD5(url)
And get rid of LIMIT/OFFSET by paging on the id instead, like this:
WHERE id > 20 AND id < 30
The example above is only suitable for contiguous ids (i.e. no gaps); a combined sketch follows.
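Taken literally, the two suggestions combine into something like this; note that the id range filters rows before grouping, so it only approximates a page of the grouped result:

SELECT `news`.* FROM `news`
WHERE (active = 1)
  AND id > 20 AND id < 30   -- replaces LIMIT 10 OFFSET 20; needs gap-free ids
GROUP BY MD5(`url`)
ORDER BY `timestamp` DESC;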
Query caching (http://habrahabr.ru/blogs/mysql/108418/) can be a temporary solution.
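For MySQL versions of that era (the query cache was removed in 8.0), you can check whether the cache is enabled and opt a statement in explicitly:

SHOW VARIABLES LIKE 'query_cache%';

SELECT SQL_CACHE `news`.* FROM `news` WHERE (active = 1)
GROUP BY `url` ORDER BY `timestamp` DESC LIMIT 10 OFFSET 20;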
And why was MyISAM chosen rather than InnoDB? InnoDB would avoid locking the whole table, and the table could be partitioned, for example into month-year blocks, which would shrink the volume of current data. I also would not store an integer timestamp, but, for example, a datetime.
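A sketch of both steps; the partition boundaries below are hypothetical, and MySQL requires the partitioning column to appear in every unique key, so the primary key has to be widened first:

-- Engine switch (rebuilds the table):
ALTER TABLE `news` ENGINE=InnoDB;

-- Widen the primary key so `timestamp` may be the partitioning column:
ALTER TABLE `news`
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (`id`, `timestamp`);

-- Range-partition on the integer timestamp; boundaries are example
-- unix timestamps for month starts:
ALTER TABLE `news`
  PARTITION BY RANGE (`timestamp`) (
    PARTITION p201201 VALUES LESS THAN (1328054400),  -- 2012-02-01
    PARTITION p201202 VALUES LESS THAN (1330560000),  -- 2012-03-01
    PARTITION pmax    VALUES LESS THAN MAXVALUE
  );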
Take the latest 20 entries out to Redis (or Memcached, plain memory, a file on disk) and have the frontend read from there; it could be a simple serialized array in memory, managed like a list with array_shift/array_push and so on.
And check for duplicates immediately before inserting; once every 20 minutes is not scary.
Where do the duplicates come from? By "automatic addition of news" do you mean grabbing from other sources, hence the duplicates?