Django and tables with a very large amount of data?
There is a PostgreSQL database and a Django script that collects data into it.
The problem is that there is a LOT of data. The table that has to be worked with contains, by conservative estimates, hundreds of millions of rows, and it looks like it will grow into the billions.
All frequent queries - everything that could be cached - I have moved into a Redis cache.
The files for the parser are loaded entirely into memory and parsed there, to speed things up, since they come over a relatively slow connection.
The parsers use Celery to load the data in several parallel workers.
The problem is that already at tens of millions of rows, requests to the admin panel have started failing with timeouts.
The speed of data collection also leaves much to be desired: in a day and a half, tens of millions of rows have been collected, while hundreds of millions are needed.
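Roughly, the collection looks like this (a sketch with made-up names - parse_cdr_file, download_to_memory, parse_lines, save_records are all illustrative, not the real code):

    from celery import shared_task

    @shared_task
    def parse_cdr_file(file_url):
        # Download the whole file into memory over the slow link,
        # parse it there, then write the results to the database.
        raw = download_to_memory(file_url)   # hypothetical helper
        records = parse_lines(raw)           # hypothetical helper
        save_records(records)                # hypothetical helper

    # Files are fanned out to several workers:
    # for url in pending_files:
    #     parse_cdr_file.delay(url)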
What other approaches can be applied to speed up work with large amounts of data?
The structure of the table with the largest amount of data is:
class CDRData(models.Model):
    start_time_of_date = models.DateTimeField(null=True, blank=True)
    origination_source_number = models.ForeignKey(ANIData, null=True, blank=True, verbose_name='ANI')
    origination_destination_number = models.ForeignKey(DNISData, null=True, blank=True, verbose_name='DNIS')
    routing_digts = models.CharField(max_length=32, null=True, blank=True)
    origination_host = models.GenericIPAddressField(null=True, blank=True, verbose_name='TERMINATION_IP')
    termination_host = models.GenericIPAddressField(null=True, blank=True)
    termination_media_ip = models.ForeignKey(MediaIP, null=True, blank=True, verbose_name='TERMINATION_MEDIA_IP')
    egress_response = models.ForeignKey(EgressResponse, null=True, blank=True,
                                        related_name='d_egress_resp')
    orig_term_release = models.CharField(max_length=32, null=True, blank=True)
    egress_code = models.CharField(max_length=64, null=True, blank=True)
    pdd = models.IntegerField(null=True, blank=True)
    egress_call_duration = models.IntegerField(null=True, blank=True)
    cdr_file = models.ForeignKey('ParsedFile', null=True, blank=True)

    def __str__(self):
        return '{} - {}'.format(str(self.origination_host), str(self.termination_host))

    class Meta:
        verbose_name = 'CDR data'
        verbose_name_plural = 'CDR data'
Create many small identical tables for insertion (for example, one per hour). Remove all constraints from these tables (foreign keys, indexes, ...) and import the csv files into them directly with the database's own tools (COPY / LOAD DATA, if the database supports it).
At insertion time, lower the transaction isolation level to the minimum (MyISAM used to be ideal for such inserts precisely because of the lack of transactions).
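A minimal sketch of such a bulk load with psycopg2's copy_expert, assuming a constraint-free staging table named cdr_staging and a CSV file whose columns match it (connection parameters and file name are illustrative):

    import psycopg2

    # Adjust connection parameters, table name and file path to your setup.
    conn = psycopg2.connect(dbname='cdr', user='cdr', host='localhost')

    with conn, conn.cursor() as cur, open('/data/cdr_batch_00.csv') as f:
        # COPY bypasses the ORM and per-row INSERT overhead entirely.
        cur.copy_expert(
            "COPY cdr_staging FROM STDIN WITH (FORMAT csv, HEADER true)",
            f,
        )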
These tables can then be combined for selection either through a view or through the heavier machinery of partitioning.
If there are periods during the day (at night?) when the database is lightly used, you can run a heavy script that moves the data from these tables into the one big table and adds the indexes and keys.
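A rough sketch of such a night merge using Django's raw connection; the staging table names, the myapp_cdrdata target table (Django's default app_model naming) and the matching column order are all assumptions:

    from django.db import connection

    # Illustrative staging table names; in practice you would discover them dynamically.
    STAGING_TABLES = ['cdr_staging_00', 'cdr_staging_01']

    with connection.cursor() as cur:
        for table in STAGING_TABLES:
            # Move the rows into the big table, then drop the staging table.
            cur.execute('INSERT INTO myapp_cdrdata SELECT * FROM {}'.format(table))
            cur.execute('DROP TABLE {}'.format(table))
        # Build the index only after the bulk move, not before it.
        cur.execute(
            'CREATE INDEX IF NOT EXISTS cdrdata_start_time_idx '
            'ON myapp_cdrdata (start_time_of_date)'
        )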
It is desirable to do the insertion in a single transaction. If you go through the ORM, use bulk_create.
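For the ORM path, a minimal sketch with bulk_create (parsed_records and the chosen fields are made up for illustration):

    from django.db import transaction
    from myapp.models import CDRData  # adjust the import path

    rows = [
        CDRData(start_time_of_date=ts, pdd=pdd, egress_call_duration=duration)
        for ts, pdd, duration in parsed_records  # parsed_records: output of your parser
    ]

    with transaction.atomic():
        # One transaction, inserts sent in batches of 5000 rows
        # instead of one INSERT per object.
        CDRData.objects.bulk_create(rows, batch_size=5000)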
If none of this helps or none of it appeals to you, then upgrade the hardware (as much memory as possible) and tune the database settings, but that is already an act of desperation.
Note that it is better to write to the database with 3 connections in large portions than with 30 connections in small ones. Each connection is a separate transaction, and when a transaction commits, the database has to reconcile it with the other transactions currently in progress.
1. Data collection: dig towards queues, workers, and so on. Parallel collection will be faster - especially since, by the look of it, you are scanning the whole Internet.
2. Database optimization: colleagues have already written about it above, and it is very important. Add indexes on the fields you filter by (dates, for example) - you will immediately feel the speed-up on queries; see the sketch after this list. I would also look at the relations and data links: whether the selection can be simplified, whether something like lookup directories can be applied, and whether groups of records can be checked together - for example, by start_time_of_date and origination_source_number.
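A minimal sketch of what the indexing could look like on the model above (only the changed parts are shown; whether the composite index pays off depends on your actual queries):

    class CDRData(models.Model):
        # Index the field the admin and reports filter on most often.
        start_time_of_date = models.DateTimeField(null=True, blank=True, db_index=True)
        # ... the other fields stay as they are; ForeignKey columns
        # already get an index automatically.

        class Meta:
            verbose_name = 'CDR data'
            verbose_name_plural = 'CDR data'
            # Composite index for the date/ANI pair mentioned above
            # (index_together, since the model predates Meta.indexes).
            index_together = [('start_time_of_date', 'origination_source_number')]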