Python
Fridary, 2020-01-05 23:43:38

What is the best way to store data from a site in ClickHouse?

I had a task to write my own system like Yandex Metrica: users put a JS snippet on their sites, and it records information about every visit for analytics and statistics. My choice fell on ClickHouse.
What is the most efficient way to build such an architecture? The plan is at least 100k records per day. My main sticking point is how best to get the data from a website to the server for further processing in ClickHouse. I see it like this:

  1. I set up a single server with nginx and ClickHouse on Ubuntu, to which data from the sites will be sent
  2. Sites send XMLHttpRequest (ajax POST) requests to this server
  3. A Python/PHP backend behind nginx accepts the requests and puts them into Redis
  4. A Python/PHP script run by cron flushes the data from Redis into ClickHouse every N seconds (a sketch of steps 3-4 follows the list)
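
A minimal sketch of steps 3-4 in Python, assuming a flask + redis + clickhouse-driver stack; the endpoint path, the "hits" Redis key, and the table/columns are illustrative assumptions, not something fixed by the question:

    # receiver.py - step 3: accept the ajax POST and buffer it in Redis
    import json
    import time

    import redis
    from flask import Flask, request

    app = Flask(__name__)
    r = redis.Redis(host="localhost", port=6379)

    @app.route("/hit", methods=["POST"])
    def hit():
        event = request.get_json(force=True)
        event["ts"] = int(time.time())
        r.rpush("hits", json.dumps(event))  # append to the buffer list
        return "", 204

    if __name__ == "__main__":
        app.run(port=8080)  # for a quick test; behind nginx use a proper WSGI server


    # flush.py - step 4: run periodically, drain the buffer, insert one batch
    import json

    import redis
    from clickhouse_driver import Client

    r = redis.Redis(host="localhost", port=6379)
    ch = Client("localhost")

    pipe = r.pipeline()          # MULTI/EXEC: read and clear atomically
    pipe.lrange("hits", 0, -1)
    pipe.delete("hits")
    raw, _ = pipe.execute()
    rows = [json.loads(x) for x in raw]
    if rows:
        # one INSERT for the whole accumulated batch
        ch.execute(
            "INSERT INTO hits (ts, url, referer) VALUES",
            [(e["ts"], e.get("url", ""), e.get("referer", "")) for e in rows],
        )

Here nginx proxies the ajax endpoint to the backend, and the scheduler simply runs flush.py on each tick.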

Will such a scheme work stably? What can be improved here?


2 answers
Dmitry Belyaev, 2020-01-06
@bingo347

The scheme will generally work, but a few nuances are worth planning for:

I set up a single server with nginx and ClickHouse on Ubuntu
Are you sure that a single machine running nginx, ClickHouse, Redis, and PHP/Python all at once will handle your load?
I would plan from the start for ClickHouse to move to a separate machine (or even two) sooner or later, for Redis + PHP/Python to live in pairs on many machines, and for nginx to balance the load from a separate machine.
There are two points here:
First, with a billion records ClickHouse may well spend a couple of minutes recalculating indexes. 1000 records in 10 inserts is certainly better than 1000 records in 1000 inserts... but still noticeably worse than 1000 records in 1 insert.
Second, at peak load you may run out of memory, and Redis will start spilling to disk, losing all the speed of in-memory operation. You will also run into the limit on active PHP workers and have to start balancing them. So one solution is the one already described above: each PHP/Python machine gets its own Redis. Beyond that, it depends on how lucky you are with the load.
Still, I would make the unloading from Redis to ClickHouse smarter than just dumping everything every N seconds. It is better to trigger the flush both on time (and then less often) and on the amount of data accumulated for writing; a sketch follows below.
One more thing: think upfront about how to partition the data in Redis by the flush interval it belongs to.
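
For illustration, a flush loop that fires on whichever comes first, time or buffer size. This is a sketch: the 5-second and 10,000-row thresholds, the "hits" key, and the table/columns are assumptions carried over from the sketch in the question:

    # flush loop: insert a batch when the buffer is big enough
    # or when enough time has passed, whichever happens first
    import json
    import time

    import redis
    from clickhouse_driver import Client

    r = redis.Redis()
    ch = Client("localhost")

    MAX_ROWS = 10_000   # flush early once this many rows are buffered
    MAX_WAIT = 5.0      # ...but never wait longer than this many seconds

    def flush():
        pipe = r.pipeline()          # MULTI/EXEC: read and clear atomically
        pipe.lrange("hits", 0, -1)
        pipe.delete("hits")
        raw, _ = pipe.execute()
        rows = [json.loads(x) for x in raw]
        if rows:
            ch.execute(
                "INSERT INTO hits (ts, url, referer) VALUES",
                [(e["ts"], e.get("url", ""), e.get("referer", "")) for e in rows],
            )

    while True:
        deadline = time.time() + MAX_WAIT
        # poll the buffer size until it fills up or the deadline passes
        while time.time() < deadline and r.llen("hits") < MAX_ROWS:
            time.sleep(0.2)
        flush()

Splitting the buffer into per-interval keys (e.g. hits:<interval start>) would additionally let several flush workers run without stepping on each other, which touches the last point above.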

Dimonchik, 2020-01-06
@dimonchik2013

https://www.youtube.com/watch?v=pbbcMcrQoXw
