How to deal with large amounts of data?
Hello everyone)
I have never worked at a company that deals with large amounts of data.
There is a table with, I don't know exactly, probably a couple of tens of millions of rows; with indexes it weighs more than a dozen gigabytes, so even adding indexes is not easy, since each one eats a lot of disk space.
There is a grid that has to display all of this with filters on different fields, plus export with an unlimited number of rows, plus some filters have drop-down lists whose values also have to be built on the fly by querying the database.
All of this has to be pulled via an API from another service of ours.
So we end up making the following queries (roughly sketched below):
- fetching a page of data (10 records per page)
- fetching the total number of records to build the paginator: SELECT count(id)
- then a query is fired to fetch the data for export (this could be refactored so it runs only on export rather than on page load): all rows matching the current filters. For now I have capped it at 10,000 records, but really it should probably handle millions for statistics.
- a query for each drop-down list in the filters: SELECT DISTINCT field_name
- a query for filtering and sorting: SELECT * FROM some_table WHERE field_name LIKE '%value%'
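For clarity, this is roughly what that set of queries looks like; table and column names here are only illustrative placeholders, not the real schema:

-- page of data for the grid (10 rows per page; the ordering column is illustrative)
SELECT * FROM some_table ORDER BY id LIMIT 10 OFFSET 0;

-- total number of rows for the paginator
SELECT count(id) FROM some_table;

-- values for one of the drop-down filters (one such query per filter)
SELECT DISTINCT field_name FROM some_table;

-- filtering by a text field
SELECT * FROM some_table WHERE field_name LIKE '%value%';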
On page load, all of these queries are sent except the filter query, if no filters are set by default.
So the table with millions of rows gets queried a bunch of times. Right now filtering dies with a timeout, and it's not clear how to refactor it. Filtering by id is more or less fine, but filtering by other fields times out. There are many fields: various dates, GUIDs, project names, data from a json-type column, prices.
You can't put indexes on everything, especially since a single index can add 5-10 GB of disk usage.
For those who have worked with data volumes like this: how do you approach it so that everything runs smoothly?
The server is PostgreSQL.
I don't want to be rude, but the question is very incoherent and mixes real problems with ridiculous fantasies.
And the problem here is not ignorance of how to work with large databases, but an inability to work with a database at all.
Immediately forget the idea that "you can't index everything". Where an index is needed, it must be there, no options. That said, blindly sticking an index on every field that gets searched is also foolish: a query will typically use only one index per table, so separate indexes on the second and third fields will just sit there useless. You need to analyze the actual queries and, where appropriate, build composite indexes.
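A rough illustration of what is meant (the table and column names are hypothetical, chosen to match the fields the question mentions): one composite index built for an actual filter combination instead of separate indexes on every searched column.

-- hypothetical example: a composite index tailored to a real filter + sort combination
CREATE INDEX idx_some_table_project_created
    ON some_table (project_id, created_at);

-- a query like this can then use the index for both the filter and the ordering
SELECT *
FROM some_table
WHERE project_id = 42
  AND created_at >= '2023-01-01'
ORDER BY created_at
LIMIT 10;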
The kindergarten-level LIKE '%...%' query is a separate horror. Look at full-text search; better yet, avoid such searches altogether, or as a last resort use an external search service such as Elasticsearch. And please don't tell me that this LIKE of yours runs over a JSON field or a "comma-separated" one.
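If the LIKE '%...%' searches really cannot be avoided, PostgreSQL at least ships the pg_trgm contrib extension, whose trigram GIN index can serve such patterns; the column name below is an illustrative assumption:

-- pg_trgm is a standard contrib extension
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- a GIN trigram index lets LIKE/ILIKE '%value%' use the index instead of a full scan
CREATE INDEX idx_some_table_project_name_trgm
    ON some_table USING gin (project_name gin_trgm_ops);

SELECT * FROM some_table WHERE project_name ILIKE '%value%';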
But the real nightmare, of course, is the SELECT DISTINCT for the filters. That is an inability to design a database at the most basic level, a lack of understanding of the very first principles of relational databases and normalization. Start with those principles, and only then take on large volumes. Obviously, the fields you intend to run DISTINCT on should be separate reference tables, and the main table should store just their id, a 4-byte field.
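A minimal sketch of that idea, with hypothetical names: the values shown in a drop-down filter live in their own reference table, and the big table keeps only a small foreign key.

-- hypothetical reference table for one of the filter fields
CREATE TABLE projects (
    id   integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text NOT NULL UNIQUE
);

-- the big table keeps only a 4-byte foreign key
ALTER TABLE some_table
    ADD COLUMN project_id integer REFERENCES projects (id);

-- the drop-down is now a query over a tiny table, not a DISTINCT over millions of rows
SELECT id, name FROM projects ORDER BY name;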
It's not clear where the fantasies about multi-gigabyte indexes came from, by the way. Most fields in a properly designed database are no more than a dozen bytes, so an index is measured in tens or hundreds of megabytes, not "gigabytes".
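Rather than guessing, the actual sizes are easy to measure in PostgreSQL itself (the table name is a placeholder):

-- size of each index on the table
SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid))
FROM pg_stat_user_indexes
WHERE relname = 'some_table';

-- heap size vs. total size of all indexes
SELECT pg_size_pretty(pg_relation_size('some_table')),
       pg_size_pretty(pg_indexes_size('some_table'));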
In general, what would help here much more than abstract talk about large volumes is the specific query that times out, along with its EXPLAIN output.
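That is, something along these lines, run on the real query that times out (the columns here are just placeholders):

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM some_table
WHERE project_name LIKE '%value%'
ORDER BY created_at DESC
LIMIT 10;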
And the answer to the abstract question "how to work with large volumes" is very simple: the same way as with small ones. Relational databases were designed for large volumes from the start. You just need to be able to work with the database. Read about the relational model, normalization, indexes, and query optimization.
Specifically for the grid, look towards Elasticsearch / Sphinx. Not just for full-text search: all the fields used in the filters get pushed into the search index, and all selections go through the search service rather than through a direct query to the database.
70 GB is not a huge amount at all. People operate with terabytes and more. The main problem is not the size of the table but making sure it is not read entirely (full scan) when executing a query. And here is the main catch: a LIKE '%word%' condition alone requires looking at every row in any case, which means a full scan. Building regular indexes on that field is useless. There are various full-text indexes, but in the general case they also need to be set up properly to work acceptably. The right solution depends on the actual problem. For example, if these are keywords stored as a text string with spaces or other delimiters, they can be moved to a separate table, one keyword per row, and indexed there; full-text search would be overkill in that case.
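A sketch of that last idea, with hypothetical table and column names (it assumes the main table has a unique id column of type bigint): the delimited string is split into its own table and the individual words are indexed there.

-- hypothetical keyword table, one word per row, linked to the main table
CREATE TABLE some_table_keywords (
    row_id  bigint NOT NULL REFERENCES some_table (id),
    keyword text   NOT NULL
);

CREATE INDEX idx_some_table_keywords_keyword
    ON some_table_keywords (keyword);

-- an exact-word lookup now uses a plain B-tree index instead of LIKE '%word%'
SELECT t.*
FROM some_table t
JOIN some_table_keywords k ON k.row_id = t.id
WHERE k.keyword = 'value';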
It looks like you have grown to the point of needing a dedicated "database developer" position, that is, a person who knows the inner workings of the DBMS (PostgreSQL in your case) and how to optimize its performance.
Here, this is for you:
https://github.com/mkabilov/pg2ch
(and ClickHouse itself, in case that wasn't obvious)
And in general, a book:
https://dmkpress.com/catalog/computer/databases/97...
Although, ironically, of the columnar databases only HBase really makes it in there. Still, it will at least give you the idea that there is no universal database and that the database is chosen to fit the set of tasks.