How would you implement computing a list of the user's "favorite emoji"?

PostgreSQL

EchoStan, 2020-09-13 14:30:01

Hello, friends.
In a chat input, it's useful to have a line with the user's favorite emoji. It looks like this:

Here is how I have it implemented now:
0. To store the user's favorite emoji, we use a separate small table:

CREATE TABLE top_used_emojis (
    user_id BIGINT PRIMARY KEY,
    emojis TEXT[] NOT NULL
);

1. The user sends a new message to the chat.

2. The message is saved to the message table. Let it be chat_messages.

CREATE TABLE chat_messages (
    user_id BIGINT NOT NULL,
    text TEXT,
    ...
);

A simple index has been added to the table; in theory, it should speed up text searches.

CREATE INDEX chat_messages_text_index ON chat_messages (text);

3. In the application code, a simple search for matches is performed in order to determine if there are any emoji in the new message.

4. If there is no emoji, then do not touch anything.

5. If emoji are found, the application calls a plpgsql function that takes user_id as input, computes the 10 most common emoji in the user's message texts, and stores the result in the emojis field of the top_used_emojis table under the key user_id.

The function relies on an auxiliary table that stores the emoji known to us:

CREATE TABLE emojis (
    emoji TEXT PRIMARY KEY /* The emoji itself is stored here */
);

And here is the body of the function:

CREATE OR REPLACE FUNCTION updateTopUsedEmojis (BIGINT) RETURNS TEXT[] AS '
    DECLARE
        _user_id ALIAS FOR $1;
        query_result TEXT[];

    BEGIN

            WITH last_top_used AS (SELECT emoji, count(*)::INT AS count
                                   FROM chat_messages cm
                                            JOIN emojis e
                                                ON (cm.text LIKE ''%'' || e.emoji || ''%'')
                                   WHERE cm.user_id = _user_id
                                   GROUP BY e.emoji
                                   ORDER BY count DESC
                                   LIMIT 10)
            INSERT INTO top_used_emojis (user_id, emojis)
            VALUES ( _user_id,
                     (SELECT array_agg(emoji) FROM last_top_used)::TEXT[]
            )
            ON CONFLICT (user_id)
            DO UPDATE
            SET emojis = (SELECT array_agg(emoji) FROM last_top_used)::TEXT[]
            RETURNING emojis::TEXT[]
        INTO query_result;

        RETURN query_result;

    END;
' LANGUAGE plpgsql;

6. That's it. It works for small volumes.
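
The application-side check from step 3 could look roughly like this. It is only a sketch in Python, since the post does not say what language the application is written in. `KNOWN_EMOJIS` and `find_known_emojis` are hypothetical names: the first stands in for the contents of the emojis table, the second for the "simple search for matches".

```python
# Hypothetical stand-in for the rows of the `emojis` table.
KNOWN_EMOJIS = ("😀", "👍", "🔥")


def find_known_emojis(text, known=KNOWN_EMOJIS):
    """Return the set of known emoji that occur anywhere in `text`."""
    return {emoji for emoji in known if emoji in text}


def should_update_top(text, known=KNOWN_EMOJIS):
    """Steps 4 and 5: only trigger the recalculation when an emoji was found."""
    return bool(find_known_emojis(text, known))
```

With this, the database function is only called when `should_update_top(message_text)` is true. A plain substring test like this treats emoji that share a prefix (for example, skin-tone variants) as matches of the shorter one, so the known list would need care.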

Necessary clarifications

At some point we will, of course, run this procedure no more than once every N messages. Running it on every new message is too much.
"Favorites" and "most frequently encountered" are not exactly the same thing, but that is fine for us.
I still struggle with EXPLAIN: I seem to have learned to recognize a seq scan, but I cannot read much beyond that yet.
Estimated number of users: 1 million. Estimated average number of messages per user: 100.

Things that bother me

Concatenation in the FROM clause:
```
FROM chat_messages cm
                 JOIN emojis e
                        ON (cm.text LIKE '%' || e.emoji || '%')
```
Does the DML engine (or whatever sits in there) somehow cache the result of this concatenation?
Maybe it's worth pre-filling the emojis table with ready-made '%...%' patterns?
Is the LIKE operator the right choice here? Are there more efficient solutions?

3 answer(s)

Dmitry Belyaev, 2020-09-13
@EchoStan

I would add a field to the top_used_emojis table that stores the number of uses for each user/emoji pair, put a DESC index on that field, and make user_id and emoji together the primary key.
I would put triggers on the chat_messages table to recalculate the counts in top_used_emojis.
And I would read the result with a simple ORDER BY on the index with a LIMIT.
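
A minimal sketch of this counter-table idea, under several assumptions. It uses Python's built-in sqlite3 module with an in-memory database purely so the example runs on its own; it is not PostgreSQL. The column name `uses` and the helpers `make_db`, `record_use` and `top_emojis` are hypothetical. The increment is done from application code here as a stand-in for the trigger on chat_messages, and the index covers (user_id, uses DESC) rather than the counter alone, which is my variation.

```python
import sqlite3


def make_db():
    """Create an in-memory database with one counter row per (user, emoji)."""
    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE top_used_emojis (
            user_id INTEGER NOT NULL,
            emoji   TEXT    NOT NULL,
            uses    INTEGER NOT NULL,  -- hypothetical counter column
            PRIMARY KEY (user_id, emoji)
        )""")
    db.execute("""
        CREATE INDEX top_used_emojis_uses_idx
        ON top_used_emojis (user_id, uses DESC)""")
    return db


def record_use(db, user_id, emoji, n=1):
    """Upsert: insert the pair, or add n to its existing counter."""
    db.execute("""
        INSERT INTO top_used_emojis (user_id, emoji, uses)
        VALUES (?, ?, ?)
        ON CONFLICT (user_id, emoji)
        DO UPDATE SET uses = top_used_emojis.uses + excluded.uses
        """, (user_id, emoji, n))


def top_emojis(db, user_id, limit=10):
    """Read the favorites with a plain ORDER BY ... LIMIT."""
    rows = db.execute("""
        SELECT emoji FROM top_used_emojis
        WHERE user_id = ?
        ORDER BY uses DESC, emoji
        LIMIT ?""", (user_id, limit))
    return [row[0] for row in rows]
```

The upsert needs SQLite 3.24 or newer. The point of the design is that each new message costs one small counter update per emoji it contains, instead of rescanning all of the user's messages.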

antonwx, 2020-09-13
@antonwx

Let the user choose their favorite emoji. I'm fed up with your algorithms that know better than I do what I need.

vdem, 2020-09-13
@vdem

Isn't it better to store them in a single JSON field? For example, keep two keys: mostUsed (the most frequently used) and recentlyUsed (the most recently used ones, so that newly used emoji can make it into mostUsed based on this data). IMHO this kind of functionality does not need a separate table.
PS And of course, store the number of uses (or better, the frequency) for each emoji, so you can decide whether to add it to mostUsed or to drop it from there in favor of an emoji from recentlyUsed.
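
One way this could work, as a rough Python sketch. Only the key names mostUsed and recentlyUsed come from the answer; the promotion rule, the size limits and the helper names `new_state`, `record_use` and `favorites` are assumptions made for illustration.

```python
import json


def new_state():
    """Per-user document; each key maps emoji -> number of uses."""
    return {"mostUsed": {}, "recentlyUsed": {}}


def record_use(state, emoji, most_size=10, recent_size=10):
    """Count one use of `emoji` and promote it to mostUsed if it earns a slot."""
    most, recent = state["mostUsed"], state["recentlyUsed"]
    if emoji in most:
        most[emoji] += 1
        return state
    # Re-insert so that dict order tracks recency (oldest entry first).
    recent[emoji] = recent.pop(emoji, 0) + 1
    if len(most) < most_size:
        most[emoji] = recent.pop(emoji)
    else:
        weakest = min(most, key=most.get)
        if recent[emoji] > most[weakest]:
            # Swap: promote the candidate, demote the weakest favorite.
            count = recent.pop(emoji)
            recent[weakest] = most.pop(weakest)
            most[emoji] = count
    while len(recent) > recent_size:
        del recent[next(iter(recent))]  # drop the least recently used
    return state


def favorites(state):
    """Emoji from mostUsed, most frequent first."""
    most = state["mostUsed"]
    return sorted(most, key=most.get, reverse=True)


def to_json(state):
    """Serialize the document for storage in a single field."""
    return json.dumps(state, ensure_ascii=False)
```

The whole state is one small document per user, so it fits in a single field as suggested. The trade-off is that every update rewrites that document.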