Python
Max Payne, 2019-05-28 20:49:36

How to work with PostgreSQL -> BigQuery?

So, there is a Python application that collects data into a PostgreSQL database. There is not much data (about a gigabyte, roughly a million records), and there would be no problem at all if some of that data could not go stale; that is, it needs to be updated from time to time. The client wants to use BigQuery for data analysis; apparently he has some other data that he wants to compare it with or work on together.
What is the best way to update data in the database and then in BigQuery?
I have a few options:
1. Python collects new data during the day and updates whatever needs updating; when it finishes, the data is exported from PostgreSQL to CSV or JSON, loaded into BigQuery through the google-cloud-bigquery library, and the freshly loaded table is renamed to replace the main one (a rough sketch of this load-and-replace flow is below).
An obvious disadvantage is the lack of transactions in BigQuery.
2. Python collects new data throughout the day and updates whatever needs updating. Once or twice a day the data is exported from PostgreSQL to CSV or JSON, and so on as in the previous option. The downsides are the same, plus one more: the export may coincide with the moment the script is running.
3. Python collects new data throughout the day and updates whatever needs updating. On a request to the API (roughly speaking, on a button click) the data is exported from PostgreSQL to CSV, and so on as before. The downsides are the same, plus the client's short-sightedness: he gets the ability to poke the button once a second, and nothing good will come of that.
If the data never needed updating, everything would be obvious, but BigQuery is not designed for updating data: only 1,000 such (UPDATE) statements can be run per day, and that is very little, because if even once we need to update more than a thousand rows out of a million (which is only 0.1%), there is no point in working with the data this way.
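For illustration, here is a minimal sketch of the export-and-replace flow from option 1. Everything named here is a placeholder (the DSN, my_table, my_project.my_dataset.my_table), and it assumes the destination table can simply be overwritten; in that case WRITE_TRUNCATE replaces the table contents atomically in a single load job, so the manual rename step disappears:

```python
# A minimal sketch of the export-and-replace flow; all names here
# (DSN, my_table, my_project.my_dataset.my_table) are placeholders.
import io

import psycopg2
from google.cloud import bigquery

# Dump the PostgreSQL table to CSV in memory via COPY.
conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN
buf = io.StringIO()
with conn.cursor() as cur:
    cur.copy_expert("COPY my_table TO STDOUT WITH CSV HEADER", buf)
conn.close()

# The BigQuery client expects a binary file object for load jobs.
data = io.BytesIO(buf.getvalue().encode("utf-8"))

# WRITE_TRUNCATE replaces the table contents atomically in a single
# load job, which sidesteps the load-then-rename step entirely.
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,      # skip the CSV HEADER row
    autodetect=True,          # or pass an explicit schema
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
job = client.load_table_from_file(
    data, "my_project.my_dataset.my_table", job_config=job_config
)
job.result()  # block until the load job finishes
```

Since the swap happens inside one load job, readers never see a half-loaded table, which softens the "no transactions" concern for this particular step.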
Maybe someone has already solved this problem? And what interesting libraries are there for working with BigQuery besides google-cloud-bigquery?


1 answer
Dimonchik, 2019-05-28
@dimonchik2013

If the data never needed updating, everything would be obvious, but BigQuery is not designed for updating data: only 1,000 such (UPDATE) statements can be run per day, and that is very little, because if even once we need to update more than a thousand rows out of a million (which is only 0.1%), there is no point in working with the data this way.

Look: if somewhere something is done exactly this way, always ask yourself why it is so.
And as soon as you understand, the horizons of your consciousness will expand and the world will sparkle with new colors.
BigQuery is not intended for UPDATE. Well, the dollar exchange rate was 65.25 just now, and a second later it is 65.26: what would you update here? The Finns became world champions the day before yesterday; what can be changed about that? A visitor came and a log line was written, then another one came, so what is there to update? Or did he not come after all?
In BigQuery you simply keep appending raw data and process it with aggregation, window functions and all that. That is how you structure the work; you do not need a copy of your SQL database.
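To illustrate that append-only pattern, a rough sketch follows; raw_events, record_id and updated_at are made-up names. Every change is appended as a new row, and the current state is recovered at query time with a window function instead of an UPDATE:

```python
# A sketch of the append-only approach: never UPDATE, just keep
# appending row versions and pick the latest one per key at query time.
# `raw_events`, `record_id` and `updated_at` are hypothetical names.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT * EXCEPT (rn)
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER (
                PARTITION BY record_id      -- business key
                ORDER BY updated_at DESC    -- newest version first
            ) AS rn
        FROM `my_project.my_dataset.raw_events`
    )
    WHERE rn = 1  -- keep only the latest version of each record
"""
for row in client.query(query).result():
    print(dict(row))
```

The same query can also be saved as a view, so analysts always see only the latest version of each record while the underlying table stays append-only.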
