MySQL
sokolnikov, 2015-08-28 18:31:11

How can I restructure my tables or queries so that my SELECTs can use an index?

Brief description of what we have.
We collect some statistics of visits to websites. The table of visits, in a simplified form, looks like this:

Table: links
id | url                                         | domain
1  | https://www.youtube.com/watch?v=6Nu3ZVA8Gic | com.youtube.www
2  | https://www.youtube.com/watch?v=5ww70Xb5pm8 | com.youtube.www
3  | http://www.bbc.com/ukrainian/politics       | com.bbc.www
4  | http://bbc.com/ukrainian/business           | com.bbc

Why do we write the domain in reverse order? Because we also have a table with information on large sites, for example:
Table: sites
id | name      | domain      | description
1  | YouTube   | com.youtube | ...
2  | VKontakte | com.vk      | ...
3  | BBC       | com.bbc     | ...

This makes it easy to get visit statistics for an individual large site while still letting MySQL use the index. For example, to get the links on the BBC website (including any subdomains):
SELECT id, url FROM links 
WHERE domain = 'com.bbc' OR domain LIKE 'com.bbc.%'

Essence of the question.
Everything was fine until the tables grew to many millions of records (even then, the example above still runs quickly) and additional statistics-processing tasks began to appear.
For example, we need to select a certain number of links together with the corresponding site information. We do the following:
SELECT links.id, links.url, sites.id AS site_id, sites.description 
FROM links
LEFT JOIN sites ON links.domain = sites.domain 
             OR links.domain LIKE CONCAT(sites.domain, '.%')

And of course, because of the LIKE CONCAT in the JOIN, the index on links.domain is no longer used.
For a while, when there were not too many rows in either table, we computed the statistics slowly in background tasks. But now even background calculation is not an option: it takes too long and uses too many resources.
So I'm looking for advice: can I restructure the tables somehow? Or change the queries to force the use of indexes (USE INDEX and FORCE INDEX have no effect in my case)?
I would also appreciate advice on which engine suits my case better: MyISAM or InnoDB?

2 answer(s)
Oleg, 2015-08-28
@sokolnikov

To begin with, I will explain why the indexes do not work in your case, and cannot work.
> OR links.domain LIKE CONCAT(sites.domain, '.%')
CONCAT is a function, and you are comparing against the result of that function.
That is, to execute your query the server has to:
1. select all rows from links
2. compare each row against sites.domain or against the result of the function applied to sites.domain.
=> the function has to be evaluated again for every row.
That is a lot of work.
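
One hedged way to confirm this on your own data is to run the query through EXPLAIN. The statement below is only a sketch: it reuses the query from the question unchanged, and the actual plan depends on your MySQL version, your indexes, and the table statistics.

```sql
-- Ask MySQL for the execution plan instead of running the query.
EXPLAIN
SELECT links.id, links.url, sites.id AS site_id, sites.description
FROM links
LEFT JOIN sites ON links.domain = sites.domain
             OR links.domain LIKE CONCAT(sites.domain, '.%');
```

In the output, the `type` and `key` columns show, per table, how rows are accessed and which index (if any) was chosen; `type = ALL` means a full table scan.
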
What I would do:
1. Create a table of domains, containing:

id | main_id | domain
1  | 1       | com.youtube
2  | 1       | com.youtube.www
3  | 1       | com.youtube.subdomain

All other tables would then reference this key.
2. Then your query is reduced to:
SELECT links.id, links.url, sites.id AS site_id, sites.description 
FROM links
LEFT JOIN domains ON links.domainId = domains.id
LEFT JOIN sites ON sites.id = domains.main_id

(I only vaguely understand what you wanted from this query, so don't blame me :) )
In short, the main message is:
switch to integer keys.
PS: This is also called database normalization. Storing many identical strings is bad.
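
A minimal sketch of how the domains table and a one-off backfill might look. Everything not shown above is an assumption: the column types and lengths, the index names, a nullable main_id for domains that belong to no known site, and the domainId column on links (taken from the query above). The slow LIKE CONCAT match still runs, but only once and only over distinct domains rather than over every link.

```sql
-- Assumed schema; match the type of `domain` to links.domain.
CREATE TABLE domains (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
    main_id INT UNSIGNED NULL,              -- sites.id, NULL if no known site
    domain  VARCHAR(255) NOT NULL,          -- reversed form, e.g. 'com.bbc.www'
    PRIMARY KEY (id),
    UNIQUE KEY uq_domains_domain (domain),
    KEY idx_domains_main_id (main_id)
);

ALTER TABLE links
    ADD COLUMN domainId INT UNSIGNED NULL,
    ADD KEY idx_links_domainId (domainId);

-- 1. Register every distinct domain seen in links.
INSERT INTO domains (domain)
SELECT DISTINCT domain FROM links;

-- 2. Resolve each domain to its site once, using the original matching rule.
--    If one domain matches several sites, which one wins is not defined here.
UPDATE domains d
JOIN sites s
  ON d.domain = s.domain
  OR d.domain LIKE CONCAT(s.domain, '.%')
SET d.main_id = s.id;

-- 3. Point every link at its row in domains (plain equality, index-friendly).
UPDATE links l
JOIN domains d ON d.domain = l.domain
SET l.domainId = d.id;
```

New links would then need their domain looked up (or inserted) in domains at write time, so that the reporting query only ever joins on integer keys.
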

sim3x, 2015-08-28
@sim3x

In 3NF it would look like this:

domain_zone:
  parent = ForeignKey(domain_zone)
  name = Text

site_page:
  domain_zone = ForeignKey(domain_zone)
  url = URL

The bottom line is this: there is a root domain zone, "." (dot).
There is a com domain zone, and its parent is the root ".".
For the bbc.com site, the parent will be com.
In general, we do the same as DNS.
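
As a rough sketch, the model above could be written in MySQL as follows. The column types, key names, and the id value 42 are assumptions for illustration only. The subtree query uses WITH RECURSIVE, which requires MySQL 8.0 or later; on older versions the tree would have to be walked in application code or flattened into a lookup table.

```sql
-- One row per DNS label; the root zone "." is the only row with parent_id NULL.
CREATE TABLE domain_zone (
    id        INT UNSIGNED NOT NULL AUTO_INCREMENT,
    parent_id INT UNSIGNED NULL,
    name      VARCHAR(63) NOT NULL,         -- single label: 'com', 'bbc', 'www'
    PRIMARY KEY (id),
    UNIQUE KEY uq_zone_parent_name (parent_id, name),
    FOREIGN KEY (parent_id) REFERENCES domain_zone (id)
) ENGINE=InnoDB;                            -- foreign keys are enforced by InnoDB

CREATE TABLE site_page (
    id             INT UNSIGNED NOT NULL AUTO_INCREMENT,
    domain_zone_id INT UNSIGNED NOT NULL,
    url            VARCHAR(2048) NOT NULL,
    PRIMARY KEY (id),
    KEY idx_page_zone (domain_zone_id),
    FOREIGN KEY (domain_zone_id) REFERENCES domain_zone (id)
) ENGINE=InnoDB;

-- All pages under one zone and its subdomains.
-- 42 is a hypothetical id standing in for the bbc.com zone.
WITH RECURSIVE subtree AS (
    SELECT id FROM domain_zone WHERE id = 42
    UNION ALL
    SELECT z.id
    FROM domain_zone z
    JOIN subtree s ON z.parent_id = s.id
)
SELECT p.id, p.url
FROM site_page p
JOIN subtree s ON s.id = p.domain_zone_id;
```
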
