YAWTBWAW
Yet Another Way To Break What's Already Working. That's what most of the "grid stuff" seems like to many of our new (and not-so-new) system administrators. Professional IT staff are instinctively risk-averse: wary of, perhaps even hostile to, new services being deployed in their datacentre. Most of our operations team are full-time permanent staff paid to make sure that stuff stays up; this is a Good Thing if you're working in a production environment, but it is not very conducive to rapid prototyping, testing and integration of new services. There is usually a very long lead time between a new technology or service appearing on our radar and its adoption in the production infrastructure. It's time to bring some order to the chaos.
Executable Infrastructure - part of SAGrid-2.0
This blog post will talk a bit about some of the technology, and the changes in methodology, that we will be adopting first in South Africa and eventually - hopefully - all across the Africa-Arabia Regional Operations Centre to tame operations of the grid. Clearly, this is still work in progress and will be documented properly in due course. For now, let's just get our ideas down on paper and talk about the work as we get it done. For those of you who were at the last SAGrid All Hands meeting, or read its output, this part of the infrastructure renewal project that we're calling "SAGrid-2.0" is the so-called executable infrastructure part. Before we talk about what that actually means, let's take a look at what it's being compared to - i.e., how we currently do things.
How not to operate a production infrastructure
I recently gave a talk at the e-Research Africa conference about "SAGrid-2.0"... which basically walked through the long list of ways we'd been doing things "wrong". During most of the training we'd been giving (and had been through ourselves), there was a nagging feeling that site admins just had to learn more and more tools, scripting languages, procedures, and so on. For example:
- There was no way to check whether a site had a properly declared configuration (say, at the YAIM site-info.def level)
- There was no way to reproduce issues that site admins might be having, or even those that Nagios would alert the operator to.
- Although an attempt was made (and is still ongoing) to provide explicit Standard Operating Procedures, as well as a bootstrapping method for developing new SOPs, it is still difficult to ensure that someone can execute these procedures without an expert understanding of each component or tool involved.
- Finally, it was impossible to ascertain which state or version a particular service at a given site was in - mainly because that concept just did not exist in our infrastructure.
During All-Hands 2013 we had a couple of talks from team members on how to address this, using Puppet and Ansible. Finally, a man was being taught to fish... If you'd like to know more, take a look at one of the fantastic videos from either of these projects on YouTube.
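To make that concrete, here's roughly what "teaching a man to fish" looks like in Ansible. This is a minimal sketch rather than our actual playbook - the inventory group, site name, template and paths are illustrative assumptions - but it shows the idea behind each of the gaps listed above: a site's configuration is declared in one place, re-running the playbook reproduces the declared state on demand, and running it in check mode reports drift without changing anything.

```yaml
# Hypothetical playbook - the group name, variables and paths below are
# illustrative assumptions, not our real layout.
---
- name: Declare and verify a grid site's configuration
  hosts: grid_site_nodes              # assumed inventory group
  become: yes
  vars:
    site_name: ZA-EXAMPLE             # made-up site name, consumed by the template
    yaim_conf_dir: /opt/glite/yaim/etc

  tasks:
    - name: Render site-info.def from the declared variables
      template:
        src: site-info.def.j2         # hypothetical Jinja2 template
        dest: "{{ yaim_conf_dir }}/site-info.def"
        owner: root
        mode: "0600"
```

Run it with `ansible-playbook site.yml --check --diff` and it tells you whether a site still matches its declared configuration; drop the `--check` and it converges the site to that state.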
Now what?
To the naive reader (not you, by the way - you're awesome), this may seem like just another slab of meat on the operator's plate to chew through and digest... How can you solve complexity by adding a further ingredient!? Well, we young padawans realised that this was actually a way to reduce complexity, by bringing some order to our existing methodology.
They can take our mangled, illegible code - but they'll never take our FREEDOM!
I'm personally pretty taken by Ansible - for its simplicity, and for the simple fact that a Puppet capability is being developed by the very, very capable people at ZA-WITS-CORE and ZA-UJ, amongst other places. It's always a good idea to have alternative ways to solve problems, especially problems as critical as maintaining a functional and reliable set of services. In the same way that applications are compiled on lots of different platforms and architectures, by lots of different compilers, we want to be able to "execute" our infrastructure with more than one set of tools. Plus, the whole philosophy of the grid pushes against any monolithic monoculture of software and tools.
Where are we going with this?
There's a realisation amongst SAGridOps that these orchestration tools mean that
infrastructure = code
Actually, it's not an exaggeration these days to approximate that to first order and just say "everything = code" - but that is a story for a different revolution. If infrastructure = code, then we can apply many of the methodologies of software development to the way we "code" our infrastructure (using Ansible, Puppet, etc.). You can keep the infrastructure in a version-controlled repository; you can collaborate to develop it there, using all the cool buzzword-sounding methodologies that have been developed for software engineering over the years. Infrastructure can be versioned; it can be promoted through tests from testing to production... and best of all, much of this can be automated.
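As a hedged sketch of what that pipeline could look like (the package, service and inventory names below are assumptions for illustration): a playbook pins a service to an exact version, the repository gets tagged so that exact infrastructure state is recorded, and the same playbook runs against a testing inventory before it is ever pointed at production.

```yaml
# Hypothetical versioned service definition - the package and service
# names are illustrative. Tagging the repository (e.g. "git tag v1.2.0")
# records exactly which infrastructure state is deployed where.
---
- name: Pin a grid middleware service to a declared version
  hosts: compute_elements             # assumed inventory group
  become: yes
  tasks:
    - name: Install an exact, declared version of the middleware
      yum:
        name: emi-cream-ce-1.2.0      # hypothetical package-version pin
        state: present

    - name: Ensure the service is running after convergence
      service:
        name: tomcat5                 # illustrative service name
        state: started
        enabled: yes
```

Promotion from testing to production is then just `ansible-playbook -i inventories/testing site.yml` followed, once the tests pass, by the same command against `inventories/production` - and both steps are easy to automate.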
Come on in, the water's great!
You'll be seeing a lot more Octocats - LOLcat's somewhat more productive cousin
This is where we are going... If you're up for the ride and want to have a good ol' time, you can join the team by signing up for the SAGridOps organisation on GitHub: https://github.com/SAGridOps. I'll be harping on about how awesome GitHub is in future posts; for now, suffice it to say that it's given us a way to open up SAGrid to the developers out there who want to help us keep building an awesome set of services for e-Science.