How to make a Postgres database fault-tolerant?

U

User15822019-05-04 13:58:42

PostgreSQL

User1582, 2019-05-04 13:58:42

Good afternoon!
I am developing a web service.
The service was written in Java(Spring), front-end - Javascript, database Postgres 10.4
It's all spinning on Ubuntu 16.04
It is highly desirable that the service be available 24 hours 7 days a week.
The main purpose of the database is to store the data structure (which changes infrequently) and logging all operations
performed with the data (SQL database was chosen because it is convenient to work with data through a relational database, and the logs themselves also have a fairly branched structure, interconnected and you need to perform various tricky queries with JOIN, etc.).
Reliability and fault tolerance, if I understand correctly, will consist of several components:
- the reliability of hardware (including the Internet connection),
- the reliability of the database,
- the reliability of the service itself, which is running on the JVM.
In order not to create an unnecessarily large question, I would first like to understand how reliable the Postgres DBMS is.
In particular, I'm interested in how to properly ensure the reliability and non-failure operation of its work?
Planned events (to the best of my understanding of things):
- I plan to do a daily automatic backup of the database itself (using scripts at night),
Old backups will be deleted after a certain period of time.
- using some script or utility to monitor the availability of the database and, in case of availability problems, restart / start the DBMS itself.
Advise what else can be done to ensure the reliable operation of the database.
Perhaps there are some best-practices which operations, queries or their combinations are undesirable for the database and can lead to problems.

Reply

Answer the question

In order to leave comments, you need to log in

3 answer(s)

M

Melkij, 2019-05-04
@melkij

how to ensure the reliability and non-failure operation of its operation?

Use adequate hardware and do not interfere with the work of the base with playful pens.
Can you please explain how this is related in your opinion?
Backups are necessary, but how do they relate to high availability?
Simple question: why?
If the base has developed and could not even rise on its own, then the problem is capital and it is necessary to understand it in detail. By restarting in a circle, you can do something really bad rather than fix something.
Moreover, problems usually start from the fact that the developers have rolled out a new version of the application and it starts to do something strange. For example, they forgot to make an index for a 50GB tablet and put a query with a search on it on the main site. Restarting the database does not fix this, of course. And it only makes it more painful.
I won't impress pg_postmaster_start_time() many years ago - we periodically put minor updates on our databases. And I also highly recommend you upgrade to 10.7 or better next weekend immediately to 10.8.
And so, if you do not touch it, it will work for years. Monitoring and alerts from it, of course, are needed. Some kind of automation when an alert occurs - usually brings more headaches than it helps.
Most of the incidents with the unavailability of the service from the point of view of the application are about blocking levels. Anyone try to do create index instead of create index concurrently and hello waiting for a lock on the entire record in the table. Most alter table forms are here, but they will also block reading.

A

Anton Shvets, 2019-05-04
@Xuxicheta

master-slave replication to the second server with postgres
Here, for example
https://postgrespro.ru/docs/postgresql/11/high-ava...
https://eax.me/postgresql-replication/
https://blog .gita-dev.ru/post/nastrojka-potokovoj-...
https://habr.com/ru/company/pgdayrussia/blog/326996/

N

nrgian, 2019-05-04
@nrgian

No way.
You can only get closer.
Master-slave, for example, will help not to lose PostgreSQL data, but this does not mean that the system will correctly switch to full-fledged operation from the slave when the master fails. The decision to switch will have to be made manually.