P
Puma Thailand, 2013-11-20 08:28:13
PHP

Is it worth using MongoDB and Node.js for a service similar to Yandex.Metrica or Google Analytics?

There is a project for analyzing site-visit data: it collects user data from page views (IP, country, city, browser, screen resolution, etc.) and then builds reports.
There is a lot of data.
Everything currently runs on Percona Server (InnoDB tables) + PHP.
When traffic to the sites spikes, Percona chokes on writes.
Report generation is also slow, especially over long time ranges. Queries and indexes have been optimized as far as possible; the data has even been denormalized (and in places re-normalized) for that.
So the question: does it make sense to move the data to MongoDB?
And does it make sense to move the endpoint that collects the statistics to Node.js?
I'd like to keep the reporting backend in PHP.

6 answers
A
Andrew, 2013-11-20
@kaasius

Node won't save you here, because the bottleneck is your database.
Mongo will only delay the inevitable.
And the inevitable is moving to a queue.
That is, here is what I would do: the machine that collects statistics (receives requests from the sites) writes all data to a queue, RabbitMQ for example, and does nothing else.
A second machine takes the data out of the queue and puts it into the database, doing whatever additional processing the data needs.
You can have several of the first and second machines, or run it all on one box. Either way it will work faster, because instead of 100 threads writing to the database there will be one, consolidating the data before it writes. A sketch of this layout is below.
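A minimal sketch of that two-machine layout, assuming Node.js with the amqplib client and RabbitMQ on localhost; the queue name "hits", the port, and saveToDb() are placeholders, not anything from the original setup:

// Machine 1 (collector): accept the hit, enqueue it, do nothing else.
const amqp = require('amqplib');
const http = require('http');

async function collector() {
  const conn = await amqp.connect('amqp://localhost');
  const ch = await conn.createChannel();
  await ch.assertQueue('hits', { durable: true });

  http.createServer((req, res) => {
    const hit = {
      ip: req.socket.remoteAddress,
      ua: req.headers['user-agent'],
      url: req.url,
      ts: Date.now(),
    };
    // Fire-and-forget into the queue; the response returns immediately.
    ch.sendToQueue('hits', Buffer.from(JSON.stringify(hit)), { persistent: true });
    res.end();
  }).listen(8080);
}

// Machine 2 (worker): the single writer that drains the queue into the DB.
async function worker() {
  const conn = await amqp.connect('amqp://localhost');
  const ch = await conn.createChannel();
  await ch.assertQueue('hits', { durable: true });
  ch.prefetch(100); // hold at most 100 unacknowledged messages

  ch.consume('hits', async (msg) => {
    const hit = JSON.parse(msg.content.toString());
    await saveToDb(hit); // placeholder for the actual INSERT / processing
    ch.ack(msg);
  });
}

async function saveToDb(hit) {
  // placeholder: consolidate and write into Percona here
}

collector(); // or worker(); run each on its own machine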

T
Timur Shemsedinov, 2013-11-20
@MarcusAurelius

Node can help if you implement preliminary consolidation of the data in RAM. Queues are the right idea too. But you also need to optimize the database structure: if inserting records is slower than the incoming stream, the queue will gradually start to choke, and then it only holds out as long as RAM lasts; a queue merely smooths out the load. And you still have to read from the database when generating reports, which in your case is also slow, so think about the database structure and optimize the indexes and execution plans. A sketch of in-RAM consolidation follows.
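What "preliminary consolidation in RAM" could look like, as a rough sketch: identical hits are counted in memory and flushed as one aggregate row, so thousands of raw events become a handful of writes. The key fields, the minute bucketing, and the 5-second interval are all illustrative choices, not the author's:

// Call track() for every incoming hit.
const buffer = new Map();

function track(hit) {
  // Bucket by minute plus the dimensions the reports group by.
  const minute = Math.floor(hit.ts / 60000);
  const key = [minute, hit.country, hit.browser, hit.url].join('|');
  buffer.set(key, (buffer.get(key) || 0) + 1);
}

setInterval(() => {
  if (buffer.size === 0) return;
  const rows = [];
  for (const [key, count] of buffer) {
    const [minute, country, browser, url] = key.split('|');
    rows.push([Number(minute) * 60000, country, browser, url, count]);
  }
  buffer.clear();
  flushToDb(rows); // one bulk write instead of thousands of single inserts
}, 5000);

function flushToDb(rows) {
  // placeholder: multi-row INSERT ... ON DUPLICATE KEY UPDATE cnt = cnt + VALUES(cnt)
}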

D
direct_inject, 2013-11-20
@direct_inject

MongoDB + sharding.
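For reference, turning sharding on in the mongo shell looks roughly like this; the database, collection, and shard key are made-up examples, and choosing the shard key well is the actual hard part (a timestamp-only key would funnel all inserts to one shard):

// Run in the mongo shell against a mongos router. Names are illustrative.
sh.enableSharding("stats");
sh.shardCollection("stats.hits", { siteId: 1, ts: 1 }); // compound key spreads inserts
sh.status(); // check how chunks are distributed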

B
BasilioCat, 2013-11-20
@BasilioCat

If writes are your problem, make writes faster. Mongo will save you only in the sense that writes can go to several servers/disks instead of one. You can get the same effect by replacing a slow spindle drive (you're on SATA, aren't you?) with an SSD, or two SSDs in a stripe, or four SSDs, one per heavy table =) And most likely you are simply writing too much data to the database. Write less =) Reading, aggregation, joins and so on should be done on a replica (or even several) of this database on another machine. In terms of IOPS it should be no worse than the master, i.e. also on SSDs. If the replicas get swamped, that won't affect real-time writes; the replicas will just start lagging behind the master.
And the banal advice: add RAM and tune the InnoDB buffer pool; it may simply be used inefficiently. A rough example is below.
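The my.cnf knobs in question look roughly like this; the sizes are purely illustrative and depend on the machine's RAM and the working set:

# my.cnf (illustrative values, not a recommendation for this particular setup)
[mysqld]
innodb_buffer_pool_size        = 12G  # most of the RAM on a dedicated DB box
innodb_log_file_size           = 1G   # a larger redo log absorbs write bursts
innodb_flush_log_at_trx_commit = 2    # relaxed durability, faster writes for stats-type data
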
It may make sense to move the collection endpoint to Node if no complex calculations happen when the counter fires and the input parameters are simply written to the database. But if PHP copes now, why change? Move it to a separate machine: an extra $100 a month ($200 for two machines) is much cheaper than paying a programmer to rewrite all this.

A
Alexey, 2013-11-20
@fuCtor

I collected statistics with RabbitMQ + MySQL + PHP. Without much effort it churned through several thousand messages per second in a single thread (PHP). So a Node + queue front end will at least take load off the rest and, as already written above, make the system easy to scale. Roughly how one consumer thread keeps up is sketched below.
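Why a single consumer thread can keep up: messages are grouped and written as one multi-row INSERT, so a thousand rows cost one round-trip instead of a thousand. A sketch with the "mysql" npm client; the table, columns, and batch size are invented for the example:

const mysql = require('mysql');
const db = mysql.createConnection({ host: 'localhost', user: 'stats', database: 'stats' });

const batch = [];

function onMessage(hit) {            // call this for every message taken off the queue
  batch.push([hit.ts, hit.ip, hit.url]);
  if (batch.length >= 1000) flush(); // size-based flush
}

function flush() {
  if (batch.length === 0) return;
  // One query (and one fsync) for the whole batch.
  db.query('INSERT INTO hits (ts, ip, url) VALUES ?', [batch.splice(0)]);
}

setInterval(flush, 1000);            // time-based flush for partial batches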

D
Dmitry Kaigorodov, 2014-08-12
@Kaigorodov

Yes, in the sense of Mongo or some other quasi-SQL store for big data. But a clean solution on a distributed database such as Cassandra is better. And if it's a case of "NoSQL but not quite big data", yet there is still a lot of data, then Riak.
