MySQL
IDDH, 2016-03-06 10:16:54

A ban on JOIN? How do I optimize a selection from a many-to-many relationship without a join when the condition uses parameters from both linked tables?

At many conferences, programmers from large companies say that JOIN is banned in their projects. I understand that on big data it is a costly operation, but how do you manage without it? Take a standard example: books and their authors in a many-to-many relationship, and assume that both tables have a large number of records. How do I write an optimized selection when the condition uses parameters from both tables (WHERE book.param = 1 AND author.param = 2)? The obvious idea is to use IN (ids), but one of the tables may return a large number of ids for its parameter, which in my opinion is not very good either (and I think there is some limit on the number of ids). What is the right way to do this?


3 answer(s)
Walt Disney, 2016-03-06
@ruFelix

They say this for several reasons:
1) If you use JOIN everywhere, then on a large project the query cache built into the database stops working for you: an INSERT or UPDATE to any one of the tables in the join invalidates the cached result of the whole query, and on large projects data changes arrive in a constant stream.
2) JOIN couples tables tightly at the data level, which rules out optimization at the level of the application architecture. When tables are not tied together by foreign keys and queries, any table can be moved to another database optimized for the required kinds of queries, or put behind a separate service / daemon written, for example, in C. In that case only one entity in the application has to be rewritten. If JOINs are allowed, it may turn out that everything has to be rewritten.
3) A popular way to handle heavy load / large data is sharding, i.e. spreading ranges of data across different servers: the first 10 million records live on one server, the second 10 million on another. A JOIN cannot be done in this case.
4) Normalization. A fully normalized database is the slowest (it needs a pile of JOINs, each of which is conceptually a filtered Cartesian product of two tables) but the most compact. A completely denormalized database is the fastest (everything is fetched with one simple query) but very bloated and unacceptably hard to work with.
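Points 2 and 3 imply that the application does the join itself, with one simple query per data store. Below is a minimal sketch of that idea, not a production recipe. Assumptions: Python's sqlite3 stands in for MySQL, two in-memory databases stand in for two separate servers, the link table name book_author and the function name books_by_params are hypothetical, and the sample rows are made up for illustration.

```python
import sqlite3

# Two independent in-memory databases stand in for two separate servers
# (assumption: books and the link table live on one, authors on the other).
books_db = sqlite3.connect(":memory:")
authors_db = sqlite3.connect(":memory:")

books_db.executescript("""
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, param INTEGER);
    CREATE TABLE book_author (book_id INTEGER, author_id INTEGER);
    INSERT INTO book VALUES (1, 'A', 1), (2, 'B', 1), (3, 'C', 0);
    INSERT INTO book_author VALUES (1, 10), (2, 11), (3, 10);
""")
authors_db.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT, param INTEGER);
    INSERT INTO author VALUES (10, 'X', 2), (11, 'Y', 3);
""")


def placeholders(n):
    """Return 'n' comma-separated '?' marks for a parameterized IN list."""
    return ",".join("?" * n)


def books_by_params(book_param, author_param):
    """Emulate WHERE book.param = ? AND author.param = ? without any JOIN."""
    # Query 1: filter authors on their own server.
    author_ids = [row[0] for row in authors_db.execute(
        "SELECT id FROM author WHERE param = ?", (author_param,))]
    if not author_ids:
        return []
    # Query 2: map the matching authors to book ids through the link table.
    book_ids = [row[0] for row in books_db.execute(
        "SELECT DISTINCT book_id FROM book_author "
        f"WHERE author_id IN ({placeholders(len(author_ids))})",
        author_ids)]
    if not book_ids:
        return []
    # Query 3: apply the book-side condition to the candidate ids.
    return books_db.execute(
        "SELECT id, title FROM book WHERE param = ? "
        f"AND id IN ({placeholders(len(book_ids))}) ORDER BY id",
        [book_param, *book_ids]).fetchall()


print(books_by_params(1, 2))
```

The cost the question worries about is visible here: the id lists travel through the application, so a real version would have to batch long lists. In MySQL the length of an IN list is bounded in practice by max_allowed_packet rather than by a fixed count.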
Your example is very abstract. Suppose all that is known is that there is a lot of data, that it is not known who will work with it or under what conditions, but that response speed matters and the number of requests will be large (for example, this is an API for which third parties will write applications and potentially advertise those applications on TV).
Then, for example, you could do this:
1) With a JOIN query, feed the data from these two tables into a sphinxsearch search index.
2) Send a request with the parameters book.param = 1 AND author.param = 2 to the Sphinx search index; it returns the primary key IDs of the matching entities.
3) Run SELECT * FROM t WHERE id IN (1,2,3..N)
This gives complex and heavy background indexing, but very fast online requests that consume only crumbs of server resources. The downsides: the architecture becomes much more complicated, so writing, debugging, and maintaining the code takes much longer, and far fewer people are able to do it, which in turn is a serious problem, but on a different level.
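The three steps above can be sketched as follows. This is only an illustration of the data flow, under loud assumptions: a plain Python dict stands in for the Sphinx index (no real Sphinx API is used), sqlite3 stands in for MySQL, and the names build_index, search, fetch_books, and book_author are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, param INTEGER);
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT, param INTEGER);
    CREATE TABLE book_author (book_id INTEGER, author_id INTEGER);
    INSERT INTO book VALUES (1, 'A', 1), (2, 'B', 1), (3, 'C', 0);
    INSERT INTO author VALUES (10, 'X', 2), (11, 'Y', 3);
    INSERT INTO book_author VALUES (1, 10), (2, 11), (3, 10);
""")


def build_index(conn):
    """Step 1 (background): one heavy JOIN flattens both tables into the index.

    The dict maps (book.param, author.param) to a set of book ids; it is a
    toy stand-in for a real search index.
    """
    index = {}
    query = """
        SELECT b.id, b.param, a.param
        FROM book b
        JOIN book_author ba ON ba.book_id = b.id
        JOIN author a ON a.id = ba.author_id
    """
    for book_id, book_param, author_param in conn.execute(query):
        index.setdefault((book_param, author_param), set()).add(book_id)
    return index


def search(index, book_param, author_param):
    """Step 2 (online): ask the index for matching primary keys only."""
    return sorted(index.get((book_param, author_param), ()))


def fetch_books(conn, ids):
    """Step 3 (online): plain primary-key lookup, no JOIN involved."""
    if not ids:
        return []
    marks = ",".join("?" * len(ids))
    return conn.execute(
        f"SELECT id, title FROM book WHERE id IN ({marks}) ORDER BY id",
        ids).fetchall()


index = build_index(db)
ids = search(index, 1, 2)
print(ids, fetch_books(db, ids))
```

The JOIN still exists, but it runs only in the background indexing job; online requests touch the index and a primary-key lookup. The index goes stale between rebuilds, which is part of the added complexity described above.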

doktr, 2016-03-06
@doktr

If the columns on which the JOIN is performed are indexed, the query will run much faster than without indexes, so each case has to be looked at individually. If there are no indexes, the execution plan will most likely be a full scan, and the total time will be proportional to the product of the row counts of the two joined tables: O(M*N).
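To make the O(M*N) estimate concrete, here is a toy model in pure Python that counts the work done by a join with and without an index. It is a sketch under assumptions, not how MySQL is implemented: lists of tuples stand in for tables, a dict stands in for the index (a real B-tree lookup costs O(log N) rather than O(1)), and the function names are hypothetical.

```python
def nested_loop_join(left, right):
    """Join without an index: every left row is compared with every right row."""
    comparisons = 0
    out = []
    for l_key, l_val in left:
        for r_key, r_val in right:  # full scan of the right table per left row
            comparisons += 1
            if l_key == r_key:
                out.append((l_val, r_val))
    return out, comparisons


def indexed_join(left, right):
    """Join with an index on the right table's join column."""
    index = {}
    for r_key, r_val in right:  # built once, like CREATE INDEX
        index.setdefault(r_key, []).append(r_val)
    lookups = 0
    out = []
    for l_key, l_val in left:
        lookups += 1  # one index lookup instead of a full scan
        for r_val in index.get(l_key, []):
            out.append((l_val, r_val))
    return out, lookups


authors = [(i, f"author{i}") for i in range(100)]      # M = 100 rows
links = [(i % 100, f"book{i}") for i in range(1000)]   # N = 1000 rows

rows_scan, comparisons = nested_loop_join(authors, links)
rows_index, lookups = indexed_join(authors, links)
print(comparisons, lookups, len(rows_scan) == len(rows_index))
```

Both functions return the same rows; only the amount of work differs: M*N comparisons for the scan versus M lookups once the index exists.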

OnYourLips, 2016-03-06
@OnYourLips

"here I understand that on big data this is a costly operation"
That is not the main reason. The main one is that the data may be distributed.
"How to do it more correctly?"
To decide what is right and what is not, you need specific conditions. Your example does not give any.
