Which application stack for a highly loaded service to choose?

evnuh2015-02-20 01:13:10

Highload

evnuh, 2015-02-20 01:13:10

In general, we are sawing a service for collecting statistics and analytics. Will have to collect all sorts of metrics from sites. At the same time, track visitors, collect hits and store their sessions. The minimum required load is 2000rps, and so that you can immediately get 10k in the future on scaling alone :)
I would like to know from experienced people who used which stack of technologies, in particular, they are interested in:

What is the front? Obviously, you don’t want to cut your web server, but according to the benchmarks, nginx gives out 12000rps. There is yandex phantom, but I did not find anything on the benchmarks, also without documentation as such. I would like to immediately and with ssl, load balancing, http/2 :)
How to communicate with c++ workers? fcgi? Or maybe add a module to nginx that, for example, will communicate with workers via zeromq?
DB. Based on the current conditions, it is clear that it will be necessary to collect all sorts of different information, therefore the relational approach does not fit very well here. What is NoSQL for fast writing and rare reading? I dug up Tarantool, it seems to be good

In general, all that I myself found are excerpts without explanation of what caused such a choice, so I appeal to those who implemented it.

Answer the question

In order to leave comments, you need to log in

6 answer(s)

Dmitry, 2015-02-20
@evnuh

To collect statistics, it is very logical to use append only databases, whose write performance often plays a decisive role in the choice. Most likely, you, like many others, will not issue reports on the fly, but will generate them on demand for some time, and generate some of the most basic / popular ones ahead of time, and for you the sampling time will not be the most important criterion.
Disk space costs relatively little today, and even 20% overhead is acceptable for a project with such loads. It all depends on the format of the messages you want to receive and how you decide to store them.
As a database, you can look at Riak (with LevelDBas a backend) or another interesting append only key-value storage like a tarantula: sophia .
But in fact, the decisive factor here is not so much the database itself, but how the information gets into it and on which nodes it should be available. As for me, even the option with regular OS files and fsync () should not be discarded either.
Regarding the web server: without balancing, most likely, it will not be possible to process such a number of requests, although this very much depends on the nature of the requests themselves. It is interesting that you tested that nginx showed you such numbers on one node, most likely the return of one (pair) of pages, each of which ended up in the OS file cache due to frequent access and, accordingly, was given from memory. Here's a hint: reading and writing to memory happen at about the same speed, and nginx allows you to process requests using Lua. And there are already many options: redis pub / sub, pipes, shared memory, etc., maybe you even want to write a module for nginx in C.
Most likely, you will accept json of various variations, and there are 2 options: either write messages immediately to disk and then post-process, or parse the data and then write the results. I can’t advise here, you should be more aware of what is more logical at this stage. But keep in mind that each operation at the stage of processing a request from a client reduces your rps.
Another important point to consider here is that 12krps from one host != 12krps from 12k hosts. Each of the nginx connections will have to multiplex, which will also take time.

OnYourLips, 2015-02-20
@OnYourLips

Start small. With a service-oriented distributed architecture.
It's too early to think about iron.

DigitalSmile, 2015-02-20
@DigitalSmile

If the project is with a serious budget, then a piece of the data center, a load balancer in the form of a piece of iron from f5 ( https://f5.com/products/platforms/appliances ). In the software part, nginx is a proven solution (if you distribute it on 20 nodes with balancing, then the load will be 12000 * 20). What will happen behind the frontend completely depends on the architecture of your application (there may be sharding, another balancer, etc.).
If the budget does not allow, you can forget about 10000rps (you either need to hire very cool high-load programmers, or see the option above). Take any cloud service (Amazon, Jelastic, etc) and deploy a virtual structure there. The performance there will depend entirely on the cloud and your code.
As for the database, take your time to throw away relational databases. Check out the experience, for example here www.sarahmei.com/blog/2013/11/11/why-you-should-ne... Not every architecture is suitable for NoSQL.

Sergey, 2015-02-20
@begemot_sun

Use Erlang.
www.ostinelli.net/a-comparison-between-misultin-mo...
If you need anything I can help.
Your main problem will be the creation of aggregates in various sections based on the event log.

hiloader, 2015-02-20
@hiloader

nginx + cppcms.com/wikipp/en/page/main (FastCGI)

Moxa, 2015-02-21
@Moxa

Write in java, on my laptop jetty with tomcat gives 55k rps, netty - 90k, undertow - 130k, I'm now sawing my web server, I reached 210k rps