2019-01-02 - Progress - Tony Finch
I'm now most of the way through the server upgrade part of the rename / renumbering project. This includes moving the servers from Ubuntu 14.04 "Trusty" to Debian 9 "Stretch", and renaming them according to the new plan.
Done:
- Live and test web servers, which were always Stretch, so they served as a first pass at getting the shared parts of the Ansible playbooks working
- Live and test primary DNS servers
- Live x 2 and test x 2 authoritative DNS servers
- One recursive server
To do:
- Three other recursive servers
- Live x 2 and test x 1 DHCP servers
Here are a few notes on how the project has gone so far.
- Operational practices
- Ansible - facts
- Ansible - role dependencies
- Ansible - network interface configuration
- Group write access and umask 002
- Safe primary DNS rebuilds
- Explicit -test in server names
- Keepalived
- Next steps
Operational practices
I ought to be better at writing a checklist of actions for processes
like this. I kept forgetting to do things like renaming host_vars
files and re-running the Ansible inventory build script with the new
name before firing off a reinstall.
It isn't a surprise (especially after the recent DHCP / monit cockup) that there has been rather too much divergence between the Ansible playbooks and the running servers (i.e. it was non-zero) and a hidden accumulation of small bootstrapping bugs. This is mainly due to sticking with an LTS OS release for too long, because there wasn't enough pressure to upgrade.
I'm planning to address this by sticking closer to Debian's release schedule. This means upgrading every other year, except that the 9 "Stretch" -> 10 "Buster" upgrade will be in one year to match up with Debian's schedule.
Ansible - facts
Ansible by default gathers "facts" about target servers when it starts: information about things like the OS, CPU, memory, networking.
When I started using Ansible this seemed like more complication than I needed. It was easier to hard-code things that never changed in my setup. And that remained true for several years.
But I have found a few uses this year:
- doh101 uses OS facts (Debian vs Ubuntu, version number) to automatically select the right OpenResty package
- Automatically installing the VMware tools when appropriate, without me explicitly stating which servers are VMs and which are hardware
- The network interfaces have different names in Trusty and Stretch, so it was most convenient to use facts to parameterize that part of the network configuration
One problem with the latter is that the playbooks do the wrong thing when I'm upgrading a server with a stale fact cache. I have a script for building (most of) my Ansible inventory, which also constructs /etc/hosts and ssh known_hosts files. It now also deletes the fact cache, which helps (especially when I am renaming the servers).
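As an illustration of the first item above, selecting a package from OS facts looks something like this (a minimal sketch; the package naming scheme is made up, not doh101's real logic):

```yaml
# Sketch only: pick a package name from gathered OS facts.
# The "openresty-<distro>-<release>" naming here is hypothetical.
- name: work out which OpenResty package this OS needs
  set_fact:
    openresty_pkg: "openresty-{{ ansible_distribution | lower }}-{{ ansible_distribution_release }}"

- name: install the selected OpenResty package
  apt:
    name: "{{ openresty_pkg }}"
    state: present
```

On Stretch this picks "openresty-debian-stretch", and on Trusty "openresty-ubuntu-trusty", with no per-host configuration.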
Ansible - role dependencies
Several times I have got the ordering in my playbooks wrong, by starting a daemon before I've installed all of its config files. This is usually benign, except when there is a mistake in the config file.
When you try to fix the config file in this situation, the rc script will abort when its config file check fails on the old broken file, causing Ansible to abort before it installs the fixed config file.
After the DHCP / monit cockup, I tried to use Ansible dependencies to describe that (e.g.) named installs a monit config fragment. But this led to prodding the daemon before installing the config file.

In the end I resolved this by putting the monit role after the named role in the top-level playbooks (so the order is correct), and invoking the reconfigure monit handler from the named role. This is a semi-implicit dependency: Ansible will fail if the playbook does not invoke both roles, which is good, but it doesn't use Ansible's explicit dependency support.
This works but it isn't very satisfactory.
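Concretely, the arrangement is something like this (a sketch; the play and file names are illustrative):

```yaml
# top-level playbook: the monit role must come after the named role,
# so monit is reconfigured only after named's fragment is in place
- hosts: dns
  roles:
    - named
    - monit
```

```yaml
# roles/named/tasks/main.yml (illustrative names): installing the
# fragment notifies a handler that lives in the monit role; Ansible
# fails if the play doesn't also include that role
- name: install monit config fragment for named
  copy:
    src: monit-named.conf
    dest: /etc/monit/conf.d/named
  notify: reconfigure monit
```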
Ansible - network interface configuration
One of the most troublesome parts of my Ansible setup (in terms of the amount of churn per line of code and bugs needing to be fixed) has been the interface_reconfig module. Debian's ifup / ifdown programs don't have a good way to check whether the current running state of the network interfaces matches what is specified in /etc/network/interfaces.

My interface_reconfig module parses /etc/network/interfaces and the output from ip addr show, and works out if it should bounce the network. For dynamic IP addresses (managed by keepalived on some of my servers) I put specially formatted comments in /etc/network/interfaces to tell interface_reconfig it should ignore them.

This is pretty simple in principle, but it has been awkward keeping up with changes in Ansible. The interface_reconfig module now supports diff mode and check mode, and it gets all its configuration from /etc/network/interfaces, so I don't have to repeat the list of network interfaces in the playbooks.
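To show the idea of the marker comments (the exact comment syntax below is my invention, not necessarily the module's real format):

```
# /etc/network/interfaces (sketch; addresses made up)
auto eno1
iface eno1 inet static
    address 192.0.2.10/24
    gateway 192.0.2.1
    # 192.0.2.53 floats under keepalived's control, so a
    # marker comment tells interface_reconfig to ignore it:
    # interface_reconfig-ignore: 192.0.2.53/24
```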
Group write access and umask 002
On our primary server, the ipreg home directory (where the DNS update machinery works) was set up to allow admins logging in as themselves to do things. I think I inherited this from the old ip-register VM [it was actually a Solaris Zone] which in turn came from /group/Internet on CUS.

There were a number of minor annoyances trying to maintain the right permissions, and in practice we weren't using the shared group setup: it's better to run the scripts on my workstation for ad-hoc stuff, and when running on the live primary server it's safer to log in as ipreg@ to avoid permissions screwups.
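For reference, the shared-group arrangement in question is the classic setgid directory plus umask 002 pattern, roughly:

```
# sketch: setgid bit so new files inherit the ipreg group, and
# umask 002 so they are created group-writable
umask 002
chgrp -R ipreg /home/ipreg
chmod g+ws /home/ipreg
```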
Safe primary DNS rebuilds
I do not have separate (floating) service IP addresses for most of my servers, so when upgrading the primary DNS server I can't upgrade the standby and swap which server is live and which is standby.
Instead, I adjusted the primary DNS server's packet filter configuration to isolate it from the other servers, so that they would never be able to see partially-empty zones while the rebuild is in progress. This worked quite nicely.
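The isolation amounts to something like this (an iptables sketch with made-up addresses; the post doesn't show the real ruleset):

```
# while the zones rebuild, refuse DNS queries and zone transfers
# from the secondaries so they can't see partially-empty zones
iptables  -I INPUT -s 192.0.2.0/24  -p udp --dport 53 -j REJECT
iptables  -I INPUT -s 192.0.2.0/24  -p tcp --dport 53 -j REJECT
ip6tables -I INPUT -s 2001:db8::/64 -p udp --dport 53 -j REJECT
ip6tables -I INPUT -s 2001:db8::/64 -p tcp --dport 53 -j REJECT
```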
Explicit -test in server names
The new naming scheme has explicit -test tags in the names of the servers that are not live. This follows our server naming guidelines, so that our colleagues elsewhere in the department can easily see how much they should care about something working or not.
It has turned out to work very nicely with a bit of Ansible inventory automation, so I no longer have to manually list which servers are live and which are test.
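The real grouping is done by my inventory build script, but the same idea can be sketched as a group_by task inside a play:

```yaml
# sketch: derive live/test groups from the hostname convention
# rather than maintaining the lists by hand
- hosts: all
  tasks:
    - name: put each host in a live or test group based on its name
      group_by:
        key: "{{ 'test' if '-test' in inventory_hostname else 'live' }}"
```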
However it doesn't fit well with floating service IP addresses. If my DNS server addressing becomes more dynamic then the -test names may go away.
Keepalived
The only package that caused significant headaches due to the upgrade was Keepalived.
My old keepalived-1.2 dual stack configuration did not work with keepalived-1.3, because 1.3 is stricter about VRRP protocol conformance, and my old dual stack config was inadvertently making use of some very odd behaviour by Keepalived.
I was previously trying to advertise all IPv4 and IPv6 addresses over VRRP, but VRRP is not supposed to be dual stack in that way. What Keepalived actually did was advertise 0.0.0.0 (sic!) in place of each IPv6 address. The new config uses virtual_ipaddress_excluded for the IPv6 addresses; they still follow the IPv4 addresses but now only the IPv4 addresses are advertised over VRRP. I previously didn't use virtual_ipaddress_excluded because the documentation doesn't explain what it does.
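The shape of the new config is roughly this (a sketch; the instance name and addresses are made up):

```
vrrp_instance recdns {
    interface eno1
    virtual_router_id 53
    priority 100
    # only these addresses appear in the VRRP advertisements
    virtual_ipaddress {
        192.0.2.53/24
    }
    # these move with the IPv4 addresses but are not advertised
    virtual_ipaddress_excluded {
        2001:db8::53/64
    }
}
```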
The new server naming scheme is more regular than the old one, which has allowed me to make some nice simplifications to the keepalived configuration. This meant I needed to update the health checker script, and I took the opportunity to rewrite it from shell to perl, so it is a bit less horrible.
Unfortunately, the way I am using the health checker script to implement dynamic server priorities isn't really how keepalived expects to be used. It logs to syslog every time one of the scripts exits non-zero, which happens several times each second on my servers. I've configured syslog to discard the useless noise, but it would be better if there was no noise in the first place.
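The filtering is a one-line discard rule, along the lines of this rsyslog sketch (the program name to match is illustrative):

```
# drop keepalived's per-check chatter before it reaches the logs
:programname, isequal, "Keepalived_vrrp" stop
```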
Next steps
After the upgrades are complete, I need to change the NS records in all of our zones and in their delegations, so that I can get rid of the old names.
This is going to require a bit of programming work to update my delegation maintenance scripts, which are rather stale and neglected. But this will also be a step towards automatic DNSSEC key rollovers.
I'm not sure what will happen after that. If we are ready for the IPv6 renumbering, I should try to get that out of the way reasonably swiftly. But I would like to work on porting the IP Register web front-end off Jackdaw.