- Denish Patel
- 1st March 2012
$ bucardo_ctl status my_slave
Sync name: my_sync
Current state: WAIT:22s (PID = 19500)
Source herd/database: slave_herd / master_herd
Target database: my_slave
Tables in sync: 318
Last good: 23s (time to run: 1m 21s)
Last good time: Feb 29, 2012 15:27:14 Target: my_slave
Ins/Upd/Del: 142 / 0 / 0
Last bad: 1h 45m 9s (time to run: 19m 57s)
Last bad time: Feb 29, 2012 13:42:29 Target: my_slave
Latest bad reason: MCP removing stale q entry
PID file: /var/run/bucardo/bucardo.ctl.sync.my_sync.pid
PID file created: Wed Feb 29 13:42:33 2012
Overdue time: 00:00:00
Expired time: 00:00:00
Stayalive: yes Kidsalive: yes
Rebuild index: 0 Do_listen: no
Ping: yes Makedelta: no
All the information provided by `bucardo_ctl status` command is important but most interesting thing to monitor is “Last good:”. Last good shows bucardo replication lag on slave server.
Trending in circonus:
Now, I have metric identified for the monitoring. Next step is to find the best way to put the metric into the monitoring tool. After some poking around the output and ways to monitor stuff, I decided to grab the SQL query from bucardo_ctl perl script and stick it into circonus monitoring. Most of the time spent for setting up this monitor was to grab right query from big perl script (bucardo_ctl) and mapping the metric required from the query. After that, here is the query that I plugged into Circonus in no-time.
‘bucardo_last_good’ , round(extract(epoch FROM now()-ended))
(SELECT * FROM bucardo.q WHERE sync = ‘my_sync’ AND cdate >= now() – interval ‘3 days’
SELECT * FROM freezer.master_q
WHERE sync = ‘my_sync’ AND cdate >= now() – interval ‘3 days’) AS foo
WHERE ended is NOT NULL AND aborted IS NULL
ORDER BY ended DESC LIMIT 1;
Alerting in circonus:
bucardo_ctl status shows Last good status to “unknown” if replication is broken.
Name Type State PID Last_good Time I/U/D Last_bad Time
my_sync| P |WAIT:35s|7620| unknown | | |36s |1m58s
In circonus, you could setup rules and relevant severity levels. The most important part is that if the query doesn’t return any row it should page(“unknown” condition) . Circonus provides rule for alerts in case of a metric is absent. Now, I am all set with alerts as well.
Yay! bucardo replication is under monitoring and trending without any hassle! Hopefully, this post will help you next time when you try to put bucardo replication lag under monitoring.
- Running VACUUM FULL in Non-blocking Mode
- Connection Scaling
- PSQL Helper: Managing Connections and Simplifying Queries
- Vacuum Those MVs!
- Working With Repmgr: Using Other 3rd Party Tools for Setting up a Standby
- Working with Amazon Aurora PostgreSQL: dag, standby rebooted again!
- Working with Amazon Aurora PostgreSQL: what happened to the stats?
- How to set application_name for psql command line utility?
- Is there a limit on number of partitions handled by Postgres?
- Postgres 11 partitioning
“Last good” only changes when replication occurs, so for a database that has infrequent changes, “Last good” could climb quite high and the replication could still be fine.
that’s a good point Jesse.
Leave a comment