monitor bucardo replication lag using circonus

  • Denish Patel
  • 1st March 2012
  • postgresql
I have been using Circonus for monitoring, trending and alerting on database metrics for quite a long time now. The Circonus interface makes the monitoring, trending and alerting setup painless, and you can see graphs flowing within minutes. Another good thing about Circonus is that you can monitor anything you can query from the database! This week, the task at hand was to find a way to monitor Bucardo replication lag. `bucardo_ctl status sync_name` provides very important information that you can rely on for trending and alerting purposes.

$ bucardo_ctl status my_slave
Sync name: my_sync
Current state: WAIT:22s (PID = 19500)
Type: pushdelta
Source herd/database: slave_herd / master_herd
Target database: my_slave
Tables in sync: 318
Last good: 23s (time to run: 1m 21s)
Last good time: Feb 29, 2012 15:27:14 Target: my_slave
Ins/Upd/Del: 142 / 0 / 0
Last bad: 1h 45m 9s (time to run: 19m 57s)
Last bad time: Feb 29, 2012 13:42:29 Target: my_slave
Latest bad reason: MCP removing stale q entry
PID file: /var/run/bucardo/bucardo.ctl.sync.my_sync.pid
PID file created: Wed Feb 29 13:42:33 2012
Status: active
Limitdbs: 0
Priority: 0
Checktime: none
Overdue time: 00:00:00
Expired time: 00:00:00
Stayalive: yes Kidsalive: yes
Rebuild index: 0 Do_listen: no
Ping: yes Makedelta: no
Onetimecopy: 0

All the information provided by the `bucardo_ctl status` command is important, but the most interesting thing to monitor is “Last good:”, which shows the Bucardo replication lag on the slave server.
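
If you just want to eyeball this value from the shell before wiring anything up, grepping the status output works (a quick check, assuming the output format shown above):

$ bucardo_ctl status my_slave | grep 'Last good:'
Last good: 23s (time to run: 1m 21s)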

Trending in Circonus:

Now I have the metric identified for monitoring. The next step is to find the best way to get the metric into the monitoring tool. After some poking around the output and the available ways to monitor it, I decided to grab the SQL query from the bucardo_ctl Perl script and stick it into Circonus. Most of the time spent setting up this monitor went into pulling the right query out of the big Perl script (bucardo_ctl) and mapping it to the metric I needed. After that, here is the query that I plugged into Circonus in no time.

SELECT
'bucardo_last_good', round(extract(epoch FROM now() - ended))
FROM
(SELECT * FROM bucardo.q WHERE sync = 'my_sync' AND cdate >= now() - interval '3 days'
UNION ALL
SELECT * FROM freezer.master_q
WHERE sync = 'my_sync' AND cdate >= now() - interval '3 days') AS foo
WHERE ended IS NOT NULL AND aborted IS NULL
ORDER BY ended DESC LIMIT 1;

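To sanity-check the query before plugging it into Circonus, you can run it by hand with psql (assuming the query above is saved as bucardo_last_good.sql and the Bucardo control database is named bucardo, as in a default install):

$ psql -d bucardo -At -f bucardo_last_good.sql
bucardo_last_good|23

The single row comes back as a metric name and its value in seconds, which is exactly what the check needs to graph.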

Alerting in Circonus:

`bucardo_ctl status` shows the “Last good” status as “unknown” if replication is broken:

Name Type State PID Last_good Time I/U/D Last_bad Time
===========+=====+========+====+=========+=====+=====+========+=====
my_sync| P |WAIT:35s|7620| unknown | | |36s |1m58s

In Circonus, you can set up rules and relevant severity levels. The most important part is that if the query doesn't return any rows, it should page (the “unknown” condition). Circonus provides a rule for alerting when a metric is absent. Now I am all set with alerts as well.
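
If you would rather alert on a value that is always present instead of relying on metric absence, a count-based variant of the same query is one option (a sketch, not from the original setup; it reuses the bucardo.q and freezer.master_q tables shown above, and the metric dropping to 0 would mean no successful runs in the window):

SELECT
'bucardo_good_runs_3d', count(*)
FROM
(SELECT * FROM bucardo.q WHERE sync = 'my_sync' AND cdate >= now() - interval '3 days'
UNION ALL
SELECT * FROM freezer.master_q
WHERE sync = 'my_sync' AND cdate >= now() - interval '3 days') AS foo
WHERE ended IS NOT NULL AND aborted IS NULL;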

Yay! Bucardo replication is under monitoring and trending without any hassle! Hopefully this post will help you the next time you need to put Bucardo replication lag under monitoring.

2 Comments

  1. “Last good” only changes when replication occurs, so for a database that has infrequent changes, “Last good” could climb quite high and the replication could still be fine.
