Tuesday 4 March 2014

Executing AAROC with Ansible

TL;DR : DevOps is helping us quickly deploy new sites in Africa. KENET and TERNET are blazing a trail. Ansible is pretty awesome, but can be improved.

We've almost finished the integration of Africa-Arabia ROC with EGI.eu. While we're waiting for the MoU to pass through the CSIR's approval chain, colleagues at KENET and TERNET have been preparing their sites for grid deployment. 

More grid for Africa

"The grid" as a cultural/scientific phenomenon may have peaked in most parts of the world, but it never even started here in Africa. The cause was not a lack of desire, scientific know-how, or important problems to address with computational infrastructures. These are certainly present throughout Africa, and we're finding more and more examples as we go. What was missing were the underlying support structures - the network, funding schemes, mobility, a shared goal; all of which are necessary for collaboration. 
These are having a bright, perhaps transient moment in our region right now, and we need to take advantage of the various support activities while they are in the limelight, since nothing guarantees that they will continue as such (alas, on the contrary). However, perhaps the most concrete development so far is not the computing paradigm, or the technology being used, but the methodology adopted (at least by our little collaboration - I don't want to speak for others who might be a bit more evolved than we are !). By insisting that the grid was complicated, and required dedicated site administrator training prior to any sites being installed, we were given a perfect excuse for delaying this work. 



What took you so long !?

We've spoken about this for about 3 years now, usually getting all excited about integration of the sites around the time Ubuntunet Connect rolls around... but so far it just hasn't come together. There are good and not-so-good reasons for this. Good reasons include :
  • lack of competent Certificate Authorities in the region
  • overhead in training local site admins in grid middleware
  • overhead in manually operating sites and ensuring availability
  • connectivity issues
These barriers have recently been drastically reduced, in some places if not all. The hardware available for scientific computing also varies widely, but we've got two very competent and capable NRENs in the form of KENET and TERNET to work with. We've built up a lot of trust over recent years working with their engineers on various projects, and it doesn't hurt that they're both members of the ei4Africa project. 
Finally, the CHAIN-REDS project has strongly supported this integration between EGI.eu and African NGIs (if you want to call them that), first in the case of NGI_ZA in South Africa and now in the case of AfricaArabia. Being integrated into EGI.eu's operational tools gives us a lot more power to understand what's wrong with our sites and fix them preemptively.

Ok, but gridding is still hard !

True, but we're a lot speedier now than when SAGrid was first starting out, for two reasons : 
  1. years of experience
  2. manual labour is for making olive oil, not ICT infrastructure
Yeah, we're DevOps now... This is awesome in so many ways. Let me count them.

1) Clear starting point

Using the GOCDB, the bare minimum for deploying a site is very clear. You need site information, a few IPs and, assuming that the hosts are running a supported OS, we're good to go. In the case of KENET's site deployment, this was done in a matter of 2 emails and a screenshot... we could have done it in one go, but the site admins didn't have access to the GOCDB (no IGTF cert).
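
To make that "bare minimum" concrete, here is a minimal Python sketch of the kind of check you could run on a site's details before starting. It is only an illustration: the field names (site_name, contact_email, hosts), the OS labels and the function missing_site_details are all hypothetical, and have nothing to do with the actual GOCDB schema or with our playbooks.

```python
import ipaddress

# Hypothetical field names for illustration; this is NOT the GOCDB schema.
REQUIRED_FIELDS = ("site_name", "contact_email", "hosts")
# Assumed labels for supported operating systems, for illustration only.
SUPPORTED_OS = {"sl5", "sl6", "centos6"}


def missing_site_details(site):
    """Return a list of problems with a site record; empty means good to go."""
    problems = []
    # Site-level information must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not site.get(field):
            problems.append("missing " + field)
    # Each host needs a parseable IP address and a supported OS.
    for host in site.get("hosts", []):
        name = host.get("name", "<unnamed>")
        try:
            ipaddress.ip_address(host.get("ip", ""))
        except ValueError:
            problems.append(name + ": invalid IP")
        if host.get("os") not in SUPPORTED_OS:
            problems.append(name + ": unsupported OS")
    return problems
```

An empty list means the record has everything this sketch asks for; anything else is the list of things to sort out in the next email.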

2) Less complexity

We've developed the playbooks necessary to get a site from OS to grid in a few simple steps. One credential, and no "follow this extremely complicated guide, which will obviously leave out some major detail or the special case that applies to you". Speaking of which...

3) Better understanding of everything that is needed to deploy a site

The great thing about DevOps in general is that everything=code. We can thus code for these special cases. What is more, since Ansible in particular is very easy to read, a new site admin can very quickly understand how a certain state was reached. 

4) Easier to collaborate

Everything's version-controlled and kept in a GitHub repo. Need I say more ? (probably, but let's leave that to a later edition).

5) Try before you buy

Before deploying services into a production environment, we can and do test them first in a development environment with machines provisioned in a private cloud. This helps a lot to speed up the development of the DevOps code, and ensures that moves from one release of the middleware, OS, etc. to the next are smooth and go as expected. 

East African sites

Right now we're working on deploying two sites which have been on our radar (as AfricaGrid) for a very long time, at the University of Nairobi and the Tanzanian Dar es Salaam Institute of Technology. There are small clusters there in active use by researchers, running mainly bioinformatics, mathematics and weather research applications. We've added them to the GOCDB already and the Ansible accounts have been provisioned by the local site admins. 

Dealing with details

Of course, this is a litmus test for our preaching about how DevOps is the future and everything is easy now. We ourselves forked the Ansible code from GRNET (and Wits is using CERN's Puppet modules) and adapted it to our local needs. This is almost as simple as changing a few variables, but we also need to make as few assumptions about the OS and installed packages as possible. At KENET for example, the machines came with no SELinux... you would assume that this is great, since the first thing the EMI docs tell you is to turn it off - but it actually breaks a lot of the orchestration code. 
The same thing goes for the firewall - we assume that iptables is installed, so if it's not we have to provision it and then configure it. Then, we've got the issue of previously-installed cluster management systems like ROCKS to deal with. This is the case at DIT, and was also the case at SAAO... by no means do we have the right as a federation to insist on specific software at a site, apart from the basic middleware needed to respect the interoperability standards, so we have to write playbooks and modules to configure services taking into account what's already there.
These special cases are highlighting how much we actually assume about the prior state of a cluster, and the power of DevOps to plan, test and ensure a smooth deployment.
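
One way to picture this is as a "preflight" step that turns what we observe on a host into an explicit list of things to do, instead of assuming them. The Python sketch below is only that, a sketch: the fact names (selinux_present, iptables_installed, cluster_manager) and the function plan_preflight are hypothetical, and it is not how our Ansible playbooks are actually written.

```python
def plan_preflight(facts):
    """Map observed host facts to an ordered list of preparation steps.

    `facts` is a plain dict with hypothetical keys; anything not reported
    is treated as absent, so we assume as little as possible.
    """
    steps = []
    # A host with no SELinux at all trips up tasks that expect to manage it.
    if not facts.get("selinux_present", False):
        steps.append("handle missing SELinux before SELinux-related tasks")
    # The firewall has to be provisioned before it can be configured.
    if not facts.get("iptables_installed", False):
        steps.append("install iptables")
    steps.append("configure iptables")
    # An existing cluster manager (e.g. ROCKS) stays; we work around it.
    manager = facts.get("cluster_manager")
    if manager:
        steps.append("adapt service configuration to " + manager)
    return steps
```

For a host that already matches our assumptions, the plan collapses to the single firewall configuration step; every special case simply adds a visible line to it.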

Stay tuned

These are the first sites outside of South Africa that we are deploying directly as a regional infrastructure (sites in North Africa are migrating middleware right now, but have always been part of a regional infrastructure). We want to do it right; we're going step by verified step, so that when we find new sites in Africa - and there are a huge number of them ! - we can integrate them in a flash. 

Stay tuned.