In recent weeks I have been massively overhauling the monitoring and alerting infrastructure. Most of the low-level box checks are easily handled by CloudWatch, and some of the more sophisticated trip wires can be handled by looking for patterns in our logs, collated by LogStash and exported to Loggly. In either case, the trip wires hand off to PagerDuty to do the actual alerting. This appeals to my preference for strong separation of concerns: LogStash/Loggly are good at collating logs, CloudWatch is good at triggering events off metrics, and PagerDuty knows how to navigate escalation paths and how to get a message to whichever poor benighted bastard (generally, and almost always, me) has to be woken at 1:00 AM.
One hole in the new scheme was a simple reachability test for some of our web endpoints. These are mostly simple enough that an HTTP 200 response is a reliable indicator that the service is working, so sophisticated monitoring is not needed (yet). I looked around at the various offerings akin to Pingdom, and wondered if there was a cheaper way of doing it. Half an hour with the (excellent) API documentation from PagerDuty, and I’ve got a series of tiny shell scripts being executed on a schedule via RunDeck.
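#!/bin/bash
# Fetch only the final HTTP status code (following redirects); curl prints 000 if the request fails.
# The 10-second cap is an arbitrary choice so that a hung endpoint cannot stall the job.
status=$(curl -sL --max-time 10 -o /dev/null -w "%{http_code}" "http://some.host.com/api/status")
if [ "$status" -ne 200 ]
then
    echo "Service not responding, raising PagerDuty alert"
    # Trigger an incident via the PagerDuty generic events endpoint.
    curl -H "Content-type: application/json" -X POST \
        -d '{
            "service_key": "66c69479d8b4a00c609245f656d443f1",
            "event_type": "trigger",
            "description": "Service on http://some.host.com/api/status is not responding with HTTP 200",
            "client": "Infra RunDeck",
            "client_url": "http://our.rundeck.com"
        }' https://events.pagerduty.com/generic/2010-04-15/create_event.json
fi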
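Since there is a series of these scripts, one per endpoint, the same check could be folded into a single parameterised script. What follows is only a rough sketch of that idea, not what is actually running in RunDeck: the function names (is_up, build_payload, check_endpoint), the PAGERDUTY_SERVICE_KEY environment variable and the 10-second timeout are all my own assumptions, and the payload builder does no JSON escaping, so it is only safe for plain URLs.

```shell
#!/bin/bash
# Sketch: one script that checks any number of endpoints.
# All function names and PAGERDUTY_SERVICE_KEY are hypothetical, chosen for illustration.

# Succeeds only when the status code is exactly 200, matching the check above.
is_up() {
  [ "$1" = "200" ]
}

# Print the PagerDuty trigger payload for a failing URL.
# No JSON escaping is done, so the URL must not contain quotes or backslashes.
build_payload() {
  local service_key="$1" url="$2"
  printf '{"service_key": "%s", "event_type": "trigger", "description": "Service on %s is not responding with HTTP 200", "client": "Infra RunDeck", "client_url": "http://our.rundeck.com"}' \
    "$service_key" "$url"
}

# Probe one URL and raise a PagerDuty alert if it does not return 200.
check_endpoint() {
  local service_key="$1" url="$2" code
  # curl prints 000 as the status code when the request fails or times out.
  code=$(curl -sL --max-time 10 -o /dev/null -w '%{http_code}' "$url")
  if ! is_up "$code"; then
    echo "Service on $url not responding (got $code), raising PagerDuty alert"
    curl -s -H "Content-type: application/json" -X POST \
      -d "$(build_payload "$service_key" "$url")" \
      https://events.pagerduty.com/generic/2010-04-15/create_event.json
  fi
}

# Check every URL given on the command line; with no arguments this does nothing.
for url in "$@"; do
  check_endpoint "${PAGERDUTY_SERVICE_KEY:?set PAGERDUTY_SERVICE_KEY first}" "$url"
done
```

Saved as, say, check.sh, it would be run as PAGERDUTY_SERVICE_KEY=... ./check.sh http://some.host.com/api/status, with further URLs appended as needed. Keeping the key in the environment rather than in the script also means it stays out of version control.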
This weekend I hope to replace the remaining staff with a series of cunning shell scripts. Meanwhile the above script saves us potentially hundreds of pounds a year in monitoring costs.