MySQL
StanSemenoff, 2013-03-12 14:54:29

How do I properly select data without duplicates?

The first entity is articles. It doesn't matter how they are stored in the database.
The second entity is journals. There are no restrictions on how they are stored either.
The same article can be published in several journals at once (for example, in two).
The third entity is subscribers. One subscriber can read several journals at once.

For a given subscriber, how do I select all articles published in the journals that subscriber reads, ordered by publication date and without duplicates?

The easiest way, as I see it:

1. Make a table with articles:
posts
p_id, j1_id, j2_id, text, date

2. Make a table with subscriptions:
follows
f_id, u_id, j_id (u_id is the user id from some users table)

3. Make a selection:

select posts.* from posts inner join follows on (j_id = j1_id or j_id = j2_id) where u_id = 1 order by date desc

This query returns duplicates: an article whose two journals are both followed by the user matches twice. DISTINCT or GROUP BY would remove them, but that adds an extra operation (a sort) just to remove the duplicates.

UNION also works, but it relies on the same DISTINCT mechanism:

(select posts.* from posts inner join follows on j_id = j1_id where u_id = 1)
union
(select posts.* from posts inner join follows on j_id = j2_id where u_id = 1)
order by date desc
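
For illustration, here is a minimal runnable sketch of the two queries above. It assumes Python's sqlite3 module with an in-memory SQLite database standing in for MySQL, purely so it runs without a server; the sample rows are made up, and the parentheses around the UNION branches are dropped because SQLite does not accept them.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (an assumption, not the asker's setup).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (p_id INTEGER PRIMARY KEY, j1_id INTEGER, j2_id INTEGER,
                        text TEXT, date TEXT);
    CREATE TABLE follows (f_id INTEGER PRIMARY KEY, u_id INTEGER, j_id INTEGER);
    -- Hypothetical data: post 1 is in journals 10 and 20, post 2 only in 20.
    INSERT INTO posts VALUES (1, 10, 20, 'a', '2013-03-01'),
                             (2, 20, NULL, 'b', '2013-03-02'),
                             (3, 30, NULL, 'c', '2013-03-03');
    -- User 1 reads journals 10 and 20.
    INSERT INTO follows VALUES (1, 1, 10), (2, 1, 20);
""")

# The OR join: post 1 matches both followed journals, so it comes back twice.
join_ids = [r[0] for r in conn.execute("""
    SELECT posts.p_id FROM posts
    INNER JOIN follows ON (j_id = j1_id OR j_id = j2_id)
    WHERE u_id = 1
    ORDER BY date DESC, p_id
""")]

# The UNION variant: identical rows from the two branches are merged.
union_ids = [r[0] for r in conn.execute("""
    SELECT posts.* FROM posts INNER JOIN follows ON j_id = j1_id WHERE u_id = 1
    UNION
    SELECT posts.* FROM posts INNER JOIN follows ON j_id = j2_id WHERE u_id = 1
    ORDER BY date DESC
""")]

print(join_ids)   # post 1 appears twice
print(union_ids)  # each post appears once
```

On this sample data the join lists post 1 twice, while the UNION lists every post once.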

Maybe I chose the wrong storage structure here.

So the actual question is: can this be solved in a way that minimizes the query time on large data?

3 answer(s)
Alexey, 2013-03-13
@jinxal

The schema is wrong: you are adding a column to posts for every journal link. And if an article appears in 30 journals at once, will you create 30 columns?
In addition, combining selection conditions with the "or" operator can in some cases lead to a catastrophic drop in performance.
A correct schema:
posts: p_id, text, date
posts_rel_journals: p_id, j_id
journals: j_id
follows: u_id, j_id
Query:
select posts.* from posts where p_id in
(select posts_rel_journals.p_id
from posts_rel_journals
join follows on follows.u_id = 1 and follows.j_id = posts_rel_journals.j_id)
order by date desc
In the link tables, the keys must include both columns:
posts_rel_journals: primary_key (p_id, j_id)
follows: primary_key (u_id, j_id)
Don't be afraid of subqueries: they are expensive when used in the select list, but in the selection conditions (i.e. after where), when they are independent of the outer query (no references to its tables), they have little effect on performance (the optimizer will usually rewrite them into a better form anyway). Note that MySQL versions before 5.6 could still execute such an in (select ...) as a dependent subquery, so it is worth checking the plan with explain.
In this case the lookup uses only indexed columns, so the query will run quickly.
If you want to get rid of distinct (or in) on principle, you would have to store a separate copy of each article for every journal; then there is no problem with the selection (although the disk fills up; if there are few duplicates, that hardly matters).
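
A minimal runnable sketch of the schema and query from this answer, for anyone who wants to try it. It assumes Python's sqlite3 module with an in-memory SQLite database in place of MySQL (only so it runs without a server); the sample rows and the alias prj are made up, and an order by is included because the question asks for date ordering.

```python
import sqlite3

# SQLite stands in for MySQL (assumption); the table names follow the answer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (p_id INTEGER PRIMARY KEY, text TEXT, date TEXT);
    CREATE TABLE journals (j_id INTEGER PRIMARY KEY);
    -- Link tables with composite primary keys covering both columns.
    CREATE TABLE posts_rel_journals (p_id INTEGER, j_id INTEGER,
                                     PRIMARY KEY (p_id, j_id));
    CREATE TABLE follows (u_id INTEGER, j_id INTEGER,
                          PRIMARY KEY (u_id, j_id));
    -- Hypothetical data: post 1 is published in journals 10 and 20.
    INSERT INTO posts VALUES (1, 'a', '2013-03-01'),
                             (2, 'b', '2013-03-02'),
                             (3, 'c', '2013-03-03');
    INSERT INTO journals VALUES (10), (20), (30);
    INSERT INTO posts_rel_journals VALUES (1, 10), (1, 20), (2, 20), (3, 30);
    INSERT INTO follows VALUES (1, 10), (1, 20);
""")

# IN keeps or drops each post once, however many followed journals carry it.
feed_ids = [r[0] for r in conn.execute("""
    SELECT posts.p_id FROM posts
    WHERE p_id IN (SELECT prj.p_id
                   FROM posts_rel_journals AS prj
                   JOIN follows ON follows.u_id = ?
                               AND follows.j_id = prj.j_id)
    ORDER BY date DESC
""", (1,))]

print(feed_ids)
```

Post 1 is in two followed journals but is returned once, because the subquery only decides whether each row of posts is kept.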

rPman, 2013-03-12
@rPman

What exactly is slow about fetching a list with duplicate posts? If the speed of distinct does not suit you, do the deduplication yourself: to avoid transferring the articles themselves, first fetch only the list of ids, and then load the needed rows from posts by those ids.
You can also do this directly on the server side by putting the ids into a temporary (in-memory) table.
P.S. By the way, if the number of articles per query is relatively small (hundreds), you can do this with the query select * from posts where p_id in (....)
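
The two-step idea can be sketched roughly like this. The sketch assumes the asker's posts/follows tables, Python's sqlite3 module with in-memory SQLite in place of MySQL, and made-up sample rows; the names unique_ids and placeholders are illustrative only.

```python
import sqlite3

# SQLite stands in for MySQL (assumption); schema as in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (p_id INTEGER PRIMARY KEY, j1_id INTEGER, j2_id INTEGER,
                        text TEXT, date TEXT);
    CREATE TABLE follows (f_id INTEGER PRIMARY KEY, u_id INTEGER, j_id INTEGER);
    INSERT INTO posts VALUES (1, 10, 20, 'a', '2013-03-01'),
                             (2, 20, NULL, 'b', '2013-03-02'),
                             (3, 30, NULL, 'c', '2013-03-03');
    INSERT INTO follows VALUES (1, 1, 10), (2, 1, 20);
""")

# Step 1: fetch only the ids (cheap rows, may contain duplicates).
rows = conn.execute("""
    SELECT posts.p_id FROM posts
    INNER JOIN follows ON (j_id = j1_id OR j_id = j2_id)
    WHERE u_id = ?
""", (1,)).fetchall()

# Deduplicate on the application side, keeping first-seen order.
unique_ids = list(dict.fromkeys(r[0] for r in rows))

# Step 2: load the full rows by id; skip the query if there is nothing to load.
if unique_ids:
    placeholders = ",".join("?" * len(unique_ids))
    posts = conn.execute(
        f"SELECT p_id, text, date FROM posts "
        f"WHERE p_id IN ({placeholders}) ORDER BY date DESC",
        unique_ids,
    ).fetchall()
else:
    posts = []

print(posts)
```

Only integers cross the wire in the first step, so the duplicates cost little; the article bodies are read once each in the second step.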

Ruslan_Y, 2013-03-12
@Ruslan_Y

Wouldn't a simple option with exists or in (select ...) instead of inner join work?
select posts.* from posts where j1_id in (select j_id from follows where u_id = 1) or j2_id in (select j_id from follows where u_id = 1)
or
select posts.* from posts where exists (select 1 from follows where u_id = 1 and (j_id = j1_id or j_id = j2_id))
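
A runnable sketch of the exists variant, assuming the asker's posts/follows schema and comparing follows.j_id with the journal columns j1_id and j2_id of each post (not with the post id). SQLite via Python's sqlite3 module stands in for MySQL here, and the sample rows are made up.

```python
import sqlite3

# SQLite stands in for MySQL (assumption); schema as in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (p_id INTEGER PRIMARY KEY, j1_id INTEGER, j2_id INTEGER,
                        text TEXT, date TEXT);
    CREATE TABLE follows (f_id INTEGER PRIMARY KEY, u_id INTEGER, j_id INTEGER);
    INSERT INTO posts VALUES (1, 10, 20, 'a', '2013-03-01'),
                             (2, 20, NULL, 'b', '2013-03-02'),
                             (3, 30, NULL, 'c', '2013-03-03');
    INSERT INTO follows VALUES (1, 1, 10), (2, 1, 20);
""")

# EXISTS only tests each post for a matching subscription, so a post that
# matches two followed journals is still returned once.
exists_ids = [r[0] for r in conn.execute("""
    SELECT posts.p_id FROM posts
    WHERE EXISTS (SELECT 1 FROM follows
                  WHERE follows.u_id = ?
                    AND (follows.j_id = posts.j1_id
                         OR follows.j_id = posts.j2_id))
    ORDER BY date DESC
""", (1,))]

print(exists_ids)
```

Whether this is faster than join plus distinct on real MySQL data depends on the indexes and the server version, so it is worth comparing the plans with explain.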