How to store data scraped from a site?
What is the right way to store a large volume of scraped data? For example, I have 5,000 links, and each one contains a table with 5 columns and 5,000 rows.
Each link needs to be re-parsed once every n days, and each new result must be saved without deleting the old one.
In other words, the data will add up to a lot over time.
What is the correct way to store all of it?
You probably need the same approach as in version control systems: each new result for a URL is recorded as an update (a commit) without overwriting the previous one, so the space used in the database grows only by the size of the change (the diff).
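For illustration, a minimal sketch of that diff idea, assuming each scrape result is serialized as CSV text and kept in SQLite; the schema, the table names, and the save_snapshot helper are all assumptions made up for this example, not anything from the answer:

```python
# Keep the newest snapshot of each URL in full; store older versions
# only as unified diffs. All identifiers here are illustrative.
import difflib
import sqlite3

conn = sqlite3.connect("scrapes.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS current (
        url        TEXT PRIMARY KEY,
        scraped_at TEXT,
        body       TEXT                -- latest full snapshot (CSV text)
    );
    CREATE TABLE IF NOT EXISTS history (
        url        TEXT,
        scraped_at TEXT,
        patch      TEXT                -- diff upgrading this version to the next
    );
""")

def save_snapshot(url: str, scraped_at: str, csv_text: str) -> None:
    """Store the first result in full; replace it on later scrapes,
    demoting the old version to a diff in the history table."""
    row = conn.execute(
        "SELECT scraped_at, body FROM current WHERE url = ?", (url,)
    ).fetchone()
    with conn:
        if row is not None:
            old_at, old_body = row
            patch = "".join(difflib.unified_diff(
                old_body.splitlines(keepends=True),
                csv_text.splitlines(keepends=True),
                fromfiledate=old_at, tofiledate=scraped_at))
            conn.execute("INSERT INTO history VALUES (?, ?, ?)",
                         (url, old_at, patch))
        conn.execute("INSERT OR REPLACE INTO current VALUES (?, ?, ?)",
                     (url, scraped_at, csv_text))
```

Restoring an old version would mean replaying the stored diffs backwards from the current snapshot (difflib produces diffs but does not apply them, so a small patch routine would be needed); the trade-off is slower historical reads in exchange for a large space saving whenever most rows are unchanged between scrapes.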
A quick estimate of the volume: 5,000 links * 5,000 rows = 25,000,000 rows per day if all the links are updated daily, and 25,000,000 * 365 = 9,125,000,000 rows per year.
On each update, archive all the records as they were before the update into a second table; that way, the second table holds the entire history of changes.
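A hedged sketch of that two-table current/archive scheme, again in Python with SQLite for brevity; the five columns c1..c5 and every identifier are placeholders standing in for the real schema:

```python
# Current data stays in one table; on each re-scrape the old rows for
# that URL are moved to the archive before the new rows go in.
import sqlite3

conn = sqlite3.connect("archive.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS current_rows (
        url TEXT, c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT, c5 TEXT,
        scraped_at TEXT
    );
    CREATE TABLE IF NOT EXISTS archive_rows (
        url TEXT, c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT, c5 TEXT,
        scraped_at TEXT, archived_at TEXT
    );
""")

def replace_snapshot(url, rows, scraped_at):
    """Archive the previous rows for this URL, then insert the new ones.
    `rows` is an iterable of 5-tuples, one per table row on the page."""
    with conn:  # one transaction: archive + delete + insert
        conn.execute("""
            INSERT INTO archive_rows
            SELECT url, c1, c2, c3, c4, c5, scraped_at, datetime('now')
            FROM current_rows WHERE url = ?
        """, (url,))
        conn.execute("DELETE FROM current_rows WHERE url = ?", (url,))
        conn.executemany(
            "INSERT INTO current_rows VALUES (?, ?, ?, ?, ?, ?, ?)",
            [(url, *r, scraped_at) for r in rows],
        )
```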
Accordingly, partition the archive table into roughly two-week ranges. That way the current data is served quickly; the archived data takes longer, but that is to be expected.
Data for past periods can then be backed up, keeping only the most recent change records online (for example, keep only six months of data in the archive).
If needed, everything can be pulled back out and recomputed, but in practice such enormous volumes take a very long time to process once exported, and given their age they are often no longer relevant; obviously nobody is going to run analytics on last year's links.
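SQLite has no native partitioning, so here is a sketch of the two-week range partitioning under the assumption that the archive lives in PostgreSQL (version 10+), accessed via psycopg2 against a running server; the connection string and all identifiers are illustrative:

```python
# Range-partition the archive by scrape timestamp so that queries on
# recent data touch only small partitions, and old partitions can be
# detached, dumped, and dropped to enforce the ~6-month retention.
import psycopg2

conn = psycopg2.connect("dbname=scrapes")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS archive_rows (
            url TEXT, c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT, c5 TEXT,
            scraped_at timestamptz NOT NULL
        ) PARTITION BY RANGE (scraped_at);
    """)
    # One partition per two-week window; create these on a schedule.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS archive_2024_w01_w02
        PARTITION OF archive_rows
        FOR VALUES FROM ('2024-01-01') TO ('2024-01-15');
    """)
```

Retention then becomes cheap: detach a partition older than six months with ALTER TABLE ... DETACH PARTITION, back it up with pg_dump, and drop it, all without touching the hot data.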