Creating CoreOS Services with Cross Node Dependency using etcd

When I was putting together an architecture for deploying PasTmon sensors across a CoreOS cluster for a previous blog post, PasTmon Passive Application Response Time Monitoring a CoreOS Cluster, I wanted the Fleet service units coded so that the pastmon-sensors would have pastmon-web as a cross-node dependency. The plan was for the sensors to start only once the web/database service had started, and this dependency needed to operate across all nodes in the cluster.

At first I thought I could achieve this using unit directives such as Requires and Wants, so I tried declaring the dependency with After and Requires in the sensor unit, simply following the examples shown in the CoreOS documentation.

The unit called pastmon-web-discovery@1.service is a sidekick unit that BindsTo the actual pastmon-web service, pastmon-web@%i.service, registering its hostname and database port in etcd.
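A minimal sketch of such a sidekick, modelled on the sidekick example in the CoreOS documentation, might look like the following. The etcd key /services/pastmon-web, the stored value format and the port 5432 are placeholders assumed for illustration, not necessarily what the real unit uses; the 60-second TTL and 45-second refresh are the figures used in this post.

```ini
[Unit]
Description=Announce pastmon-web in etcd (illustrative sketch)
# Stop this sidekick whenever the web unit it is bound to stops.
BindsTo=pastmon-web@%i.service
After=pastmon-web@%i.service

[Service]
# Re-register every 45 seconds with a 60-second TTL, so the key expires
# within a minute once the web service (and so this sidekick) is gone.
# Key name and value are hypothetical; %H expands to the node's hostname.
ExecStart=/bin/sh -c "while true; do etcdctl set /services/pastmon-web '%H:5432' --ttl 60; sleep 45; done"
# Remove the key straight away on a clean stop.
ExecStop=/usr/bin/etcdctl rm /services/pastmon-web

[X-Fleet]
# Ask fleet to schedule the sidekick on the same machine as the web unit.
MachineOf=pastmon-web@%i.service
```

Because the key only exists while the loop keeps refreshing it, its presence in etcd doubles as a cluster-wide "pastmon-web is up" flag that any node can query.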

Firing up pastmon-web with its sidekick, followed by the sensors across the rest of the nodes in the cluster, all worked fine. However, if the CoreOS cluster failed or was rebooted, the services came back up out of order and required manual intervention. It was clear that the [Unit] After and Requires directives only applied on the node where the unit was started, and not across the whole cluster.

Actually, this kind of made sense when I thought about it. The [Unit] directives are handled by systemd, which only knows about the units on its own node; the [X-Fleet] section is the only part of the unit that fleet uses for cluster-wide scheduling. At the time of writing, there does not appear to be any support in that section for cross-node unit dependencies (though I did find a few discussions around, and requests for, this feature in the CoreOS forums).

To resolve this, I realised I could leverage the existing etcd registration of the web service as a Pre-Start condition in the sensor units. The etcd key has a Time-To-Live (--ttl) of 60 seconds and is re-registered every 45 seconds for as long as the pastmon-web service it is bound to is running, so the key expires within a minute of the web service stopping.

So here is how I fixed the pastmon-sensor unit, using the etcd Pre-Start test.

The etcdctl get command fails with a non-zero return code if the key is not present. Because the ExecStartPre= line is written without the '-' prefix (= instead of =-), that failure is not ignored and the unit fails to start.

The unit is also set to restart automatically on failure, after a delay of 10 seconds, and to retry forever.
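Putting the pre-start test and the restart settings together, a minimal sketch of a sensor unit along these lines could look as follows. The etcd key /services/pastmon-web and the image name example/pastmon-sensor are hypothetical placeholders, fleet scheduling options are left out, and StartLimitInterval=0 is one assumed way of getting "retry forever" rather than necessarily the setting used in the real unit.

```ini
[Unit]
Description=PasTmon sensor gated on etcd registration (illustrative sketch)
After=docker.service
Requires=docker.service

[Service]
# Hypothetical key. etcdctl exits non-zero while pastmon-web is not
# registered; with "=" (not "=-") that failure aborts this start attempt.
ExecStartPre=/usr/bin/etcdctl get /services/pastmon-web
# Hypothetical image name, for illustration only.
ExecStart=/usr/bin/docker run --rm --name pastmon-sensor example/pastmon-sensor
# Try again 10 seconds after any failure, including a failed ExecStartPre.
Restart=on-failure
RestartSec=10
# Assumption: turn off systemd's start rate limiting so retries never stop.
StartLimitInterval=0
```

The effect is a polling loop driven by systemd itself: each node keeps retrying its sensor every 10 seconds until the web service's key appears in etcd, regardless of which node pastmon-web is running on.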

I tested these again, crashing and rebooting the cluster, and they restarted in the correct order every time. Perfect.

All of the code above is available in gbevan/pastmon on GitHub.
