WebDev, 2020-06-08 12:40:44

How to properly move analytics data to a separate storage?

There is a MySQL database that holds data on music tracks and the analytics for them.
There is a lot of analytics data, so at some point we decided to move the analytics table to ClickHouse.
This helped a lot: queries now run an order of magnitude faster.
But it created another problem: querying across the two tables is now hard. For example, I need to take all tracks created in January (a MySQL table) and pick the 30 most listened to among them (ClickHouse). To run such a query, you have to either select the track ids in MySQL and then substitute them into the ClickHouse query, or store a duplicate of the track table in ClickHouse. Both options are terrible.
In general, moving analytics to ClickHouse is a great thing, but what can be done about such inconveniences? How do you work with multiple stores for linked data?
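
For reference, the first option (select ids in one store, substitute them into a query against the other) might look roughly like the sketch below. It is only an illustration under assumptions: both stores are replaced by in-memory SQLite databases so that it runs anywhere, and the table and column names (`tracks`, `plays`, `created_at`, `track_id`) are hypothetical, not taken from the real schema.

```python
import sqlite3

# Stand-ins: both stores are in-memory SQLite databases so the sketch runs
# anywhere. In the real setup `mysql` would be a MySQL connection and
# `clickhouse` a ClickHouse client; table and column names are hypothetical.
mysql = sqlite3.connect(":memory:")
clickhouse = sqlite3.connect(":memory:")

mysql.execute("CREATE TABLE tracks (id INTEGER PRIMARY KEY, created_at TEXT)")
mysql.executemany(
    "INSERT INTO tracks VALUES (?, ?)",
    [(1, "2020-01-05"), (2, "2020-01-20"), (3, "2020-02-02")],
)

clickhouse.execute("CREATE TABLE plays (track_id INTEGER)")
clickhouse.executemany(
    "INSERT INTO plays VALUES (?)",
    [(1,), (2,), (2,), (3,), (3,), (3,)],
)


def top_tracks(start, end, limit=30):
    """Return (track_id, play_count) for tracks created in [start, end)."""
    # Step 1: the relational store decides which tracks qualify.
    ids = [row[0] for row in mysql.execute(
        "SELECT id FROM tracks WHERE created_at >= ? AND created_at < ?",
        (start, end),
    )]
    if not ids:
        return []
    # Step 2: the analytics store ranks only those ids.
    placeholders = ",".join("?" * len(ids))
    query = (
        "SELECT track_id, COUNT(*) AS play_count FROM plays "
        f"WHERE track_id IN ({placeholders}) "
        "GROUP BY track_id ORDER BY play_count DESC LIMIT ?"
    )
    return clickhouse.execute(query, (*ids, limit)).fetchall()


result = top_tracks("2020-01-01", "2020-02-01")
```

With a large number of matching tracks the id list gets long, so in practice it would probably have to be sent in chunks or loaded into a temporary table on the analytics side.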

3 answers
Ivan Shumov, 2020-06-08

Well, the problem is a conceptual misunderstanding of what analytics is and what a Warehouse or Data Lake is for.
First, let's define how analytics differs from metrics, aggregates, and reports.

  • Analytics is done by people, on an irregular basis; speed is not important for it.
  • Reports are generated automatically on a regular basis; periodicity is what matters for them.
  • Metrics are needed to measure something as a time series.
  • Aggregates are collections of data from different sources, regardless of other factors.

If we are talking about analytics proper, it should not touch live data at all. The data is moved into a separate Warehouse or Data Lake and analyzed as needed. The main tools are Power BI, Tableau, or even the notorious Excel.
If we are talking about reporting, the same rules apply as for analytics, so that reports do not load the live system.
If we are talking about metrics, a separate service is built for them, and dashboards, APIs, and the like are served from it.

Vitaly Karasik, 2020-06-08

"store a duplicate of the track table in ClickHouse"

I am in favor of this option: everything you need for analytics is stored in ClickHouse.

Roman Mirilaczvili, 2020-06-09

Recommender systems usually do not need to serve real-time data. So I propose another way of working with the data:
A background process receives metrics from the service API and temporarily stores them in MySQL, in an amount sufficient for batch sending to ClickHouse. Another process periodically queries ClickHouse and stores the resulting recommendations in MySQL. That way, all requests from the service API can be handled by querying MySQL alone.
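
The batching half of this scheme could be sketched as follows. This is a minimal illustration, not a finished implementation: the `BatchBuffer` class, the event shape and the batch size are all hypothetical, the temporary storage is a plain in-memory list instead of a MySQL table, and a Python list stands in for the ClickHouse insert.

```python
from typing import Callable, List, Tuple

Event = Tuple[int, str]  # (track_id, event timestamp); hypothetical shape


class BatchBuffer:
    """Collects events and hands them to `flush_fn` in batches."""

    def __init__(self, flush_fn: Callable[[List[Event]], None], batch_size: int):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.pending: List[Event] = []

    def add(self, event: Event) -> None:
        # Store the event; send a batch once enough have accumulated.
        self.pending.append(event)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Send whatever is pending as one batch; do nothing if empty.
        if not self.pending:
            return
        batch, self.pending = self.pending, []
        self.flush_fn(batch)


# Stand-in for the batch insert into ClickHouse: batches are just recorded.
sent_batches: List[List[Event]] = []
buffer = BatchBuffer(sent_batches.append, batch_size=3)

for i in range(7):
    buffer.add((i, "2020-06-09T00:00:00"))
buffer.flush()  # send the remainder that did not fill a whole batch
```

The second process (periodically querying ClickHouse and writing the results back to MySQL) would be a separate scheduled job and is not shown here.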
