big data
Sergey Protasov, 2019-01-15 17:20:28

How to query a large log file?

Hello, friends.
There is a company that sells banner views on the Internet.
Over its lifetime, the log has accumulated 300 million records. Each record consists of the following fields:

  • user id (hash code)
  • viewing date

Recently, 10 million entries have been added to the log file per day. The median user makes between 10 and 100 views over all time.
The task is to build a system that, in under 1 second, returns the number of unique users who viewed banners within a given date interval.
The maximum date interval is 3 years and the minimum is 1 day; today's date cannot be requested. The machine has unlimited disk space, 2 processor cores, and 4 GB of RAM.
What technologies could be used to implement this?


2 answer(s)
Sergey, 2019-01-15
@begemot_sun

The straightforward approach: https://ru.wikipedia.org/wiki/ClickHouse

Vladimir Olohtonov, 2019-01-15
@sgjurano

The problem sounds like a textbook case for a segment tree, since set union is an associative operation.
www.e-maxx-ru.1gb.ru/algo/segment_tree
I think you could implement it with the data stored on disk, keeping only the start/end offsets of the daily ranges in memory. Over 3 years there will be only about a thousand of them.
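
To make the idea concrete, here is a minimal in-memory sketch of a segment tree over days whose nodes hold sets of user ids. It is only an illustration under assumptions: the class name `DaySegmentTree`, the method `unique_users`, and the toy data are hypothetical, days are numbered from 0, and everything lives in RAM rather than on disk with in-memory offsets as suggested above.

```python
from typing import List, Set


class DaySegmentTree:
    """Segment tree over days; each node holds the set of users seen in its days."""

    def __init__(self, daily_users: List[Set[str]]) -> None:
        self.n = len(daily_users)
        # Iterative layout: leaves sit at indices n..2n-1,
        # and node i has children 2i and 2i+1.
        self.tree: List[Set[str]] = [set() for _ in range(2 * self.n)]
        for i, users in enumerate(daily_users):
            self.tree[self.n + i] = set(users)
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = self.tree[2 * i] | self.tree[2 * i + 1]

    def unique_users(self, first_day: int, last_day: int) -> int:
        """Count distinct users over the inclusive day range [first_day, last_day]."""
        if not 0 <= first_day <= last_day < self.n:
            raise ValueError("day range out of bounds")
        result: Set[str] = set()
        # Work on the half-open leaf range [lo, hi) and climb towards the root.
        lo, hi = first_day + self.n, last_day + self.n + 1
        while lo < hi:
            if lo & 1:  # lo is a right child: take it and step right
                result |= self.tree[lo]
                lo += 1
            if hi & 1:  # hi - 1 is a left child: take it
                hi -= 1
                result |= self.tree[hi]
            lo >>= 1
            hi >>= 1
        return len(result)


# Toy data (made up): five days of user ids.
days = [{"a", "b"}, {"b", "c"}, {"a"}, {"d"}, {"c", "e"}]
tree = DaySegmentTree(days)
print(tree.unique_users(0, 1))  # days 0..1
print(tree.unique_users(0, 4))  # all five days
```

A query merges at most roughly 2·log2(n) node sets, so about 20 sets for a thousand days. Whether merging sets of this size fits into 1 second and 4 GB of RAM is not shown by this sketch and would need to be measured on real data.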
