How to deal with large amounts of data?
Hello everyone)
I have never worked at a company that deals with large amounts of data.
There is a table with, I don't know exactly, probably a couple of tens of millions of rows; with indexes it weighs more than a dozen gigabytes, so even adding indexes is not easy, since each one eats a lot of disk space.
There is a grid that has to display all of this with filters on different fields, plus export with an unlimited number of rows, plus some filters have drop-down lists whose values also have to be built on the fly by querying the database.
All of this has to be pulled via an API from another service of ours.
So we end up making the following queries (roughly sketched below):
- fetching a page of data (10 records per page)
- fetching the total number of records to build the paginator: SELECT count(id)
- then a query is fired to fetch the data for export (this could be refactored so it runs only on export rather than on page load): all rows matching the current filters. For now I have capped it at 10,000 records, but really it should probably handle millions for statistics.
- a query for each drop-down list in the filters: SELECT DISTINCT field_name
- a query for filtering and sorting: SELECT * FROM some_table WHERE field_name LIKE '%value%'
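For clarity, this is roughly what that set of queries looks like; table and column names here are only illustrative placeholders, not the real schema:

-- page of data for the grid (10 rows per page; the ordering column is illustrative)
SELECT * FROM some_table ORDER BY id LIMIT 10 OFFSET 0;

-- total number of rows for the paginator
SELECT count(id) FROM some_table;

-- values for one of the drop-down filters (one such query per filter)
SELECT DISTINCT field_name FROM some_table;

-- filtering by a text field
SELECT * FROM some_table WHERE field_name LIKE '%value%';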
On page load, all of these queries are sent except the filter query, if no filters are set by default.
So the table with millions of rows gets queried a bunch of times. Right now filtering dies with a timeout, and it's not clear how to refactor it. Filtering by id is more or less fine, but filtering by other fields times out. There are many fields: various dates, GUIDs, project names, data from a json-type column, prices.
You can't put indexes on everything, especially since a single index can add 5-10 GB of disk usage.
For those who have worked with data volumes like this: how do you approach it so that everything runs smoothly?
The server is PostgreSQL.
I don't want to be rude, but the question is very incoherent and mixes real problems with ridiculous fantasies.
And the problem here is not ignorance of how to work with large databases, but an inability to work with a database at all.
Immediately forget the idea that "you can't index everything". Where an index is needed, it must be there, no options. That said, blindly sticking an index on every field that gets searched is also foolish: a query will typically use only one index per table, so separate indexes on the second and third fields will just sit there useless. You need to analyze the actual queries and, where appropriate, build composite indexes.
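A rough illustration of what is meant (the table and column names are hypothetical, chosen to match the fields the question mentions): one composite index built for an actual filter combination instead of separate indexes on every searched column.

-- hypothetical example: a composite index tailored to a real filter + sort combination
CREATE INDEX idx_some_table_project_created
    ON some_table (project_id, created_at);

-- a query like this can then use the index for both the filter and the ordering
SELECT *
FROM some_table
WHERE project_id = 42
  AND created_at >= '2023-01-01'
ORDER BY created_at
LIMIT 10;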
The kindergarten-level LIKE '%...%' query is a separate horror. Look at full-text search; better yet, avoid such searches altogether, or as a last resort use an external search service such as Elasticsearch. And please don't tell me that this LIKE of yours runs over a JSON field or a "comma-separated" one.
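If the LIKE '%...%' searches really cannot be avoided, PostgreSQL at least ships the pg_trgm contrib extension, whose trigram GIN index can serve such patterns; the column name below is an illustrative assumption:

-- pg_trgm is a standard contrib extension
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- a GIN trigram index lets LIKE/ILIKE '%value%' use the index instead of a full scan
CREATE INDEX idx_some_table_project_name_trgm
    ON some_table USING gin (project_name gin_trgm_ops);

SELECT * FROM some_table WHERE project_name ILIKE '%value%';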
But the real nightmare, of course, is the SELECT DISTINCT for the filters. That is an inability to design a database at the most basic level, a lack of understanding of the very first principles of relational databases and normalization. Start with those principles, and only then take on large volumes. Obviously, the fields you intend to run DISTINCT on should be separate reference tables, and the main table should store just their id, a 4-byte field.
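A minimal sketch of that idea, with hypothetical names: the values shown in a drop-down filter live in their own reference table, and the big table keeps only a small foreign key.

-- hypothetical reference table for one of the filter fields
CREATE TABLE projects (
    id   integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text NOT NULL UNIQUE
);

-- the big table keeps only a 4-byte foreign key
ALTER TABLE some_table
    ADD COLUMN project_id integer REFERENCES projects (id);

-- the drop-down is now a query over a tiny table, not a DISTINCT over millions of rows
SELECT id, name FROM projects ORDER BY name;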
It's not clear where the fantasies about multi-gigabyte indexes came from, by the way. Most fields in a properly designed database are no more than a dozen bytes, so an index is measured in tens or hundreds of megabytes, not "gigabytes".
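Rather than guessing, the actual sizes are easy to measure in PostgreSQL itself (the table name is a placeholder):

-- size of each index on the table
SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid))
FROM pg_stat_user_indexes
WHERE relname = 'some_table';

-- heap size vs. total size of all indexes
SELECT pg_size_pretty(pg_relation_size('some_table')),
       pg_size_pretty(pg_indexes_size('some_table'));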
In general, what would help here much more than abstract talk about large volumes is the specific query that times out, along with its EXPLAIN output.
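That is, something along these lines, run on the real query that times out (the columns here are just placeholders):

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM some_table
WHERE project_name LIKE '%value%'
ORDER BY created_at DESC
LIMIT 10;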
And the answer to the abstract question "how to work with large volumes" is very simple: the same way as with small ones. Relational databases were designed for large volumes from the start. You just need to be able to work with the database. Read about the relational model, normalization, indexes, and query optimization.
Specifically for the grid, look towards Elasticsearch / Sphinx. Not just for full-text search: all the fields used in the filters get pushed into the search index, and all selections go through the search service rather than through a direct query to the database.
70 GB is not a huge amount at all. People operate with terabytes and more. The main problem is not the size of the table but making sure it is not read entirely (full scan) when executing a query. And here is the main catch: a LIKE '%word%' condition alone requires looking at every row in any case, which means a full scan. Building regular indexes on that field is useless. There are various full-text indexes, but in the general case they also need to be set up properly to work acceptably. The right solution depends on the actual problem. For example, if these are keywords stored as a text string with spaces or other delimiters, they can be moved to a separate table, one keyword per row, and indexed there; full-text search would be overkill in that case.
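A sketch of that last idea, with hypothetical table and column names (it assumes the main table has a unique id column of type bigint): the delimited string is split into its own table and the individual words are indexed there.

-- hypothetical keyword table, one word per row, linked to the main table
CREATE TABLE some_table_keywords (
    row_id  bigint NOT NULL REFERENCES some_table (id),
    keyword text   NOT NULL
);

CREATE INDEX idx_some_table_keywords_keyword
    ON some_table_keywords (keyword);

-- an exact-word lookup now uses a plain B-tree index instead of LIKE '%word%'
SELECT t.*
FROM some_table t
JOIN some_table_keywords k ON k.row_id = t.id
WHERE k.keyword = 'value';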
It looks like you have grown to the point of needing a dedicated "database developer" position, that is, a person who knows the inner workings of the DBMS (PostgreSQL in your case) and how to optimize its performance.
Here, this is for you:
https://github.com/mkabilov/pg2ch
(and ClickHouse itself, in case that wasn't obvious)
And in general, a book:
https://dmkpress.com/catalog/computer/databases/97...
Although, ironically, of the columnar databases only HBase really makes it in there. Still, it will at least give you the idea that there is no universal database and that the database is chosen to fit the set of tasks.