Tuesday 4 March 2014

Executing AAROC with Ansible

TL;DR : DevOps is helping us quickly deploy new sites in Africa. KENET and TERNET are blazing a trail. Ansible is pretty awesome, but can be improved.

We've almost finished the integration of Africa-Arabia ROC with EGI.eu. While we're waiting for the MoU to pass through the CSIR's approval chain, colleagues at KENET and TERNET have been preparing their sites for grid deployment. 

More grid for Africa

"The grid" as a cultural/scientific phenomenon may have peaked in most parts of the world, but it never even started here in Africa. Underlying this was not the lack of desire, scientific know-how, or important problems to address with computational infrastructures. These are certainly present throughout Africa, and we're finding more and more examples as we go. What was missing was the underlying support structures - the network, funding schemes, mobility, a shared goal; all of which are necessary for collaboration. 
These are having a bright, perhaps transient moment in our region right now, and we need to take advantage of the various support activities while they are in the limelight, since nothing guarantees that they will continue as such (alas, on the contrary). However, perhaps the most concrete development so far is not the computing paradigm, or the technology being used, but the methodology adopted (at least by our little collaboration - I don't want to speak for others who might be a bit more evolved than we are !). By insisting that the grid was complicated, and required dedicated site administrator training prior to any sites being installed, we gave ourselves a perfect excuse for delaying this work. 



What took you so long !?

We've spoken about this for about 3 years now, usually getting all excited about integration of the sites around the time Ubuntunet Connect rolls around... but so far it just hasn't come together. There are good and not-so-good reasons for this. Good reasons include :
  • lack of competent Certificate Authorities in the region
  • overhead in training local site admins in grid middleware
  • overhead in manually operating sites and ensuring availability
  • connectivity issues
These barriers have all been drastically reduced recently - in some places, not all. The distribution of hardware available for scientific computing also varies widely, but we've got two very competent and capable NRENs, KENET and TERNET, to work with. We've built up a lot of trust over recent years working with their engineers on various projects, and it doesn't hurt that they're both members of the ei4Africa project. 
Finally, the CHAIN-REDS project has strongly supported this integration between EGI.eu and African NGIs (if you want to call them that), first in the case of NGI_ZA in South Africa and now in the case of AfricaArabia. Being integrated into EGI.eu's operational tools gives us a lot more power to understand what's wrong with our sites and preemptively fix them.

Ok, but gridding is still hard !

True, we're a lot speedier now than when SAGrid was first starting out for two reasons : 
  1. years of experience
  2. manual labour is for making olive oil, not ICT infrastructure
Yeah, we're DevOps now... This is awesome in so many ways. Let me count them

1) Clear starting point

Using the GOCDB, the bare minimum for deploying a site is very clear. You need site information, a few IPs and, assuming that the hosts are running a supported OS, we're good to go. In the case of KENET's site deployment, this was done in a matter of two emails and a screenshot... we could have done it in one go, but the site admins didn't have access to the GOCDB (no IGTF cert).
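For the curious, that bare minimum translates almost directly into an Ansible variables file. The names below are made-up placeholders for illustration, not KENET's actual values:

```yaml
# Hypothetical group_vars file for a new site: roughly the handful
# of facts a site has to supply before any playbook can run.
site_name: KE-EXAMPLE              # as registered in the GOCDB
site_domain: grid.example.ac.ke
ce_host: ce01.grid.example.ac.ke
bdii_host: bdii.grid.example.ac.ke
worker_nodes:
  - wn01.grid.example.ac.ke
  - wn02.grid.example.ac.ke
```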

2) Less complexity

We've developed the playbooks necessary to get a site from OS to grid in a few simple steps. One credential, no "follow this extremely complicated guide, which will obviously leave out some major detail or special case that applies to you". Speaking of which...
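As a sketch of what "a few simple steps" means in practice - the role names here are illustrative, not the actual layout of our repo:

```yaml
# site.yml - one playbook takes the hosts from plain OS to grid.
# Run with a single credential: ansible-playbook -i site_inventory site.yml
- hosts: ce
  roles:
    - common        # repositories, CAs, fetch-crl, NTP
    - cream-ce
- hosts: wn
  roles:
    - common
    - worker-node
- hosts: bdii
  roles:
    - common
    - site-bdii
```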

3) Better understanding of everything that is needed to deploy a site

The great thing about DevOps in general is that everything=code. We can thus code for these special cases. What is more, since Ansible in particular is very easy to read, a new site admin can very quickly understand how a certain state was reached. 
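Because Ansible tasks are named YAML, the playbook reads as its own documentation. A fragment might look like this (package and service names are shown for illustration):

```yaml
# Each task states the desired end state in plain language, so a
# new site admin can see why the machine looks the way it does.
- name: Install the EMI worker node metapackage
  yum: name=emi-wn state=present

- name: Keep the CA revocation lists fresh
  service: name=fetch-crl-cron state=started enabled=yes
```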

4) Easier to collaborate

Everything's version-controlled and kept in a Github repo. Need I say more ? (probably, but let's leave that to a later edition).

5) Try before you buy

Before deploying services into a production environment, we can and do test them first in a development environment with machines provisioned in a private cloud. This helps a lot to speed up the development of the DevOps code, as well as ensure that moves from one release of middleware/OS etc to the next are smooth and go as expected. 

East African sites

Right now we're working on deploying two sites which have been on our radar (as AfricaGrid) for a very long time, at the University of Nairobi and the Dar es Salaam Institute of Technology in Tanzania. There are small clusters there in active use by researchers, running mainly bioinformatics, mathematics and weather research applications. We've added them to the GOCDB already and the Ansible accounts have been provisioned by the local site admins. 

Dealing with details

Of course, this is a litmus test for our preaching about how DevOps is the future and everything is easy now. We ourselves forked the Ansible code from GRNET (and Wits is using CERN's puppet modules), having adapted it to our local needs. This is almost as simple as changing a few variables, but we also need to make as few assumptions on the OS and installed packages as possible. At KENET for example, the machines came with no SELinux... you would assume that this is great, since the first thing the EMI docs tell you is to turn it off - but it actually breaks a lot of the orchestration code. 
The same thing goes for the firewall - we assume that iptables is installed, so if it's not we have to provision it and then configure it. Then, we've got the issue of previously-installed cluster management systems like ROCKS to deal with. This is the case at DIT, and was also the case at SAAO... by no means do we have the right as a federation to insist on specific software at a site, apart from the basic middleware needed to respect the interoperability standards, so we have to write playbooks and modules to configure services taking into account what's already there.
These special cases are highlighting how much we actually assume about the prior state of a cluster, and the power of DevOps to plan, test and ensure a smooth deployment.
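Coding for these special cases mostly means replacing assumptions with explicit checks. A hedged sketch of the idea (not our production playbook - the boolean being set is just an example):

```yaml
# Don't assume iptables or SELinux tooling are present - check,
# and only configure what actually exists on the host.
- name: Ensure iptables is installed before we try to configure it
  yum: name=iptables state=present

- name: Only touch SELinux booleans where SELinux is actually enabled
  command: setsebool -P httpd_can_network_connect 1
  when: ansible_selinux is defined and ansible_selinux.status == "enabled"
```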

Stay tuned

These are the first sites outside of South Africa that we are deploying directly as a regional infrastructure (sites in North Africa are migrating middleware right now, but have always been part of a regional infrastructure). We want to do it right; we're going step by verified step, so that when we find new sites in Africa - and there are a huge number of them ! - we can integrate them in a flash. 

Stay tuned.


Wednesday 26 February 2014

Towards the Africa-Arabia Regional Operations Centre.

We've made some progress towards the certification of the AfricaArabia Regional Operations Centre.


The AfricaArabia Regional Operations Centre (AAROC) is the container project which aims to coordinate the operations of grid, HPC and cloud sites in Africa and Arabia, essentially continuing and extending the work of EUMedGrid and SAGrid. This is essential to interoperability between infrastructures in our region and those in other parts of the world. It is done with EGI.eu as a trusted party in providing certain global services.

The ROC has existed as an informal collaboration since the days of the original CHAIN project. Indeed there are several other ROCs in the world (AsiaPacific, ROC_LA for Latin America, etc) which were originally conceived to federate the operations of grid infrastructure for the Worldwide LHC Computing Grid. Along with the evolution of EGEE into EGI, these were evolved into Operations Centres (OC) - the term now used to refer to self-contained infrastructures which need to interoperate, be they national or regional in scope. 

We've previously created an OC for South Africa (NGI_ZA) which we've successfully operated since 2011. We have now been mandated to extend this to the rest of the region, via the CHAIN-REDS project. 

Technical vs policy requirements

This work consists of two levels - technical and policy. The technical part is actually very easy - in fact it's been almost codified by EGI.eu through years of experience into an easy-to-follow procedure. However, before the technical work can start, a policy framework needs to be created and agreed upon. This is summarised in PROC 02 as follows:  
Political Validation
CASE 1. If an Operations Centre is already represented within the EGI Council and is ready to move from an EGEE ROC to an operational Operations Centre, we recommend that the Operations Centre political representative within the EGI Council notifies the EGI Chief Operations Officer that the respective Operations Centre is entering its validation cycle. At this point, technical validation can start.
CASE 2. If an Operations Centre is not represented within the EGI Council, and it is willing to be represented there, the Operations Centre needs to submit a request for admission to the Council. After the Operations Centre has been accepted by the Council, CASE 1 applies.
CASE 3. If a new Operations Centre is not represented within the EGI Council and is not interested in being part of it, but would still like to be a consumer of the EGI Global Services, then an MoU must be established with EGI. Once an MoU is in place technical validation can start.
Since Africa and Arabia are of course not represented in the EGI council, nor do we intend to be, we fall in Case 3, meaning we have to sign an MoU with EGI.eu. As previously mentioned, this work has been under way for quite a while now, due to the widely distributed nature of the collaboration.

Even at the national level, there's no single entity which formally represents the resource providers in a federated infrastructure, much less at a regional or continental level. In South Africa, we had the luxury of a flagbearing institute which could take operational responsibility for the interoperability and thus sign the MoU with EGI.eu, which specifies, amongst other things, the operating and service levels. 

We've done the simplest thing possible to implement AAROC as a point of interoperability between African e-Infrastructures and their European and other counterparts - the CSIR's Meraka Institute will once again sign the MoU with EGI.eu. This addresses the policy issue in a simple way, but does leave potential issues related to the internal cohesion of the ROC members. Better to have something in place than nothing at all...

What's the holdup ? 

Currently the MoU is going up the ranks of the CSIR routing procedure. It was previously vetted by the CSIR's legal department, with no issues found. The MoU has operational implications for SAGrid/SANREN, but beyond that no financial or IP implications, which makes things a bit easier. Nonetheless, since trust is at the base of technical collaboration in Africa, we need the MoU to have the full internal support and understanding of the Meraka Institute. We're currently awaiting final signature by the Meraka Acting Director. 

Ok, take me to the fun bits

The technical procedure is much simpler and is proceeding in parallel to the policy procedure. This essentially involves the creation of the AfricaArabia ROC as an operations centre in the various global services of EGI :
  1. GOCDB - the Global Operations Database which contains all the sites and respective service endpoints
  2. GGUS - the worldwide helpdesk
  3. Operations Portal - operational overview of the state of the infrastructure
  5. SAM - Service Availability Monitoring, which runs functional tests of the various grid services
  5. Accounting Portal - the central accounting portal

But first...

Of course, creating the OC is very simple, but it needs to be done in a trustworthy way. That's why there are a lot of prerequisites before embarking on the technical work. Here's a quick roundup of where we stand on those prerequisites.

  1. Make sure your NGI is able to fulfill RP OLA https://documents.egi.eu/secure/ShowDocument?docid=463
    1. The OLA has already been satisfied for NGI_ZA, so this is just a change of scope for us.
  2. Decide about the Operations Centre name. Name for European Operations Centres should start with "NGI_"
    1. We've chosen AfricaArabia
  3. Decide whether to use the Operations Centre's own help desk system or use GGUS directly. If the Operations Centre wants to set up their own system they need to provide an interface for interaction to GGUS with the local ticketing system and follow the recommendations available at https://ggus.eu/pages/ggus-docs/interfaces/docu_ggus_interfaces.php.
    1. We're using GGUS directly. We used to have a regional (xGUS) instance, but support has been dropped for this and we need to be able to easily escalate and route tickets. The easiest way to do this is to create a support unit in GGUS.
  4. Set contact points 
    1. a set of mailing lists for site directors, operations, security, etc have been created under the domain africa-grid.org
  5. All certified Operations Centre sites need to be under Nagios monitoring. 
    1. ours is already configured, but the region needs to be changed from NGI_ZA to AfricaArabia
  6. Fill the FAQ document for the Operations Centre - 
    1. see  https://wiki.egi.eu/wiki/GGUS:AfricaArabia_FAQ
  7. Staff in the Operations Centre that should be granted a management role
    1. Go ahead and request your role (select NGI AfricaArabia and role ROD)
  8. Staff in Operations Centre is familiar with Operational Procedures
  9. EGI adds people to relevant mailing lists.
The whole procedure is summarised, step by step, in PROC02. 

Since we have already added at least part of the operations team to the ops and dteam VOs, and our nagios instance is up - we just need to change the OC in it from NGI_ZA to AfricaArabia - this puts us on Step 6 of PROC02.

What's next ? 

The next step from our side is the validation of the nagios instance. This will take a few days, and in the meantime COD will be checking our information. Once that's done, we'll move the resource centres currently in NGI_ZA into AfricaArabia and decommission NGI_ZA.

Of course, this all assumes we can expect a speedy response from the CSIR's leadership and signature of the MoU !


Monday 6 January 2014

Renew, Refactor, Release, Repeat. Now you are Ready.

In this issue: What have we been up to from October to December ?; The end of year collapse (ie, Christmas for sysadmins); updates on our executable infrastructure; thoughts on collaboration; why it's called SAGrid-2.0; and happy new year !

Collapse 


TL;DR - the end of the academic year is hell, but then it's ok. for about a week.


One of the ironies of writing updates for an active community is that the more activity there is, the less time there is to write about it and keep all the interested parties informed. The last few months, from October to December 2013, were - at least for me - a blur of meetings, planning, conferences, and other activities, including some development work here and there for new services on the grid. 
It was indeed an epic mission, kicked off by the first (and ambitiously-titled) e-Research Africa conference, organised by ASAUDIT and well-attended by members of several Australasian and European e-research infrastructures. From our side, we gave a good series of presentations on the Regional Operations Centre and SAGrid-2.0. There was also a hefty contingent of presentations from the CHAIN-REDS project, presenting work done on the science gateways, semantic search of data and document repositories, and the interoperability of data and computing infrastructures. All of this has led to some interesting new links between the CHAIN-REDS project and the University of Cape Town.

Renewal. 


TL;DR - A new HPC forum is created. SAGrid-2.0 is taking shape; First Release 02/14


HPC Forum is created

Without further ado, a quick reminder that a forum for HPC system administrators was created at the e-Research Africa conference and held its first meeting at the CHPC conference in December in Cape Town. It's a place specifically for HPC sysadmins to get together and share their experience and issues, and is supported by the AfricaGrid Regional Operations Centre (via the CHAIN-REDS project for now). Anyone interested in joining can get hold of Peter van Heusden at SANBI. 

Raising our game

We've written about this before, but a major theme for 2014 is renewal. We started this project in 2009 to build a federated, distributed computing infrastructure for South African research communities, based on the paradigm of grid computing. That paradigm itself has seen fundamental changes - not only in the technology which is used to implement it, but also - and perhaps more importantly - in the community and business models that are being adopted to operate it.
Thanks to the closer collaboration with EGI.eu via the research infrastructure provider MoU's that the CSIR has signed with them, we are able to better understand and hence adopt best practice for distributed computing infrastructures. This also means that our inadequacies as an infrastructure and as resource-providing sites are put into stark relief, since we have to publish our availability and reliability (amongst other things, such as accounting data) to EGI.eu. We'll just have to overlook the fact that during December, as an infrastructure our A/R was close to 3%... ouch. Yes, people were on holiday, but no, that's not an excuse. We'll have to make great strides towards improving this over the next few weeks, especially at the site level - only 2 of 6 sites are fully functional and passing tests. This is entirely unacceptable and while all our site admins need to take some of the responsibility for this, the final onus is on the Regional Operator on Duty (ROD). Clearly this role is not yet mature and needs some work. Documentation, Training, Certification. 

Robots running on the stage


Wouldn't it be great if we could eliminate humans from our operations entirely ? I mean... just wind the grid up and watch it go ? That would be rad. No, I submit that that would be très rad. 
Avid readers (ha ha) will recall that we started some work on this back in 2013, forking the GRNET repo for executable infrastructure on github. This is a set of ansible playbooks and other code for deploying UMD or EMI middleware from the OS up. Of course, having a robot like Ansible is no good for you unless you know what to do with it. And to get any experience, you need a playground. This is where having a private cloud manager comes in very handy indeed, giving you the ability to easily define a network range which is only for development and/or staging purposes and to test your code out on that instead of touching the production machines. Textbook stuff, made easy - and indeed that's what we've done at Meraka with the OpenNebula installation there.
Although we're far from done in preparing a set of playbooks which can be run by any site admin on the grid at their site, we have set up all the necessary tools:
  • a distributed source code repo which any site can fork, modify and if desired contribute back into the main branch - https://www.github.com/AAROC/ansible-for-grid 
  • a safe development and staging environment for said code, where site admins can validate their updates and configurations without nuking anything of actual value... and if things go wrong, they can roll back to a previously tested and validated version
To be entirely honest, that last bit - the testing and validation - is missing at the moment. Indeed, the staging site will be receiving nagios probes, but this is somewhat asynchronous and clunky. What we want to do quite soon is include DevOps tests in the Jenkins CI service under deployment at UFS, so that any code commit to the Github repo automatically triggers a functional test of the relevant service. DevOps can then consult the Jenkins dashboard to know whether their code will nuke their site or not. 
Let's say you want to reconfigure your site to enable MPI or GPGPU jobs. Following the YAIM documentation, you modify the vanilla playbooks to enable this in the CE, then commit the code back to the repo. Jenkins will see this, and execute the functional test associated with that playbook, which in this case is "deploy the site, update the top-bdii, wait for the site to show up in the top-bdii, send a job to the site, get the output and check whether the job ran properly". If the job didn't run properly, or something died before that, Jenkins would tell you and you would probably have saved a bunch of time and had a far deeper chat about the nature of reality with that weird dude who's always bringing these things up over coffee. 
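The functional test itself could be a small playbook that Jenkins runs against the staging site. This is an illustrative sketch using the standard CREAM CE client commands; the host name and JDL file are placeholders, not our actual test code:

```yaml
# smoke-test.yml - submit a trivial job to the freshly deployed CE
# and fail (so Jenkins fails the build) if it never reaches DONE-OK.
- hosts: localhost
  tasks:
    - name: Submit a test job to the staging CE
      command: glite-ce-job-submit -a -r staging-ce.example.org:8443/cream-pbs-grid hello.jdl
      register: submit

    - name: Poll the job status until it completes
      command: glite-ce-job-status {{ submit.stdout_lines | last }}
      register: status
      until: "'DONE-OK' in status.stdout"
      retries: 30
      delay: 60
```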
Now, you tell me that would not be rad. I dare you ! 

Robots in the clouds, everywhere; trust them.


Also, you tell me how you would do that without having access to some kind of flexible computing infrastructure. For each new test, you need to start pretty much from scratch - having a private cloud which you can interact with programmatically (ie, via some kind of API, preferably a standards-based one like OCCI) makes your testing environment very powerful and scalable. 
However, there is also the issue of special hardware and re-using resources. Take the example above of testing a new site with some fancy GPGPUs: before deploying the site into the grid, we'd like to be sure that our deployment code is going to work and that jobs will run. To do this, we would like to use the same code to test and deploy it (this is one of the things the Ansible guys keep harping on about, and I'm starting to get it). Now, consider that the fancy new hardware is at one site (say, UCT) and the testing service is at another (UFS). What's a sysadmin to do - entirely replicate one or the other ? Wouldn't it make sense to share and re-use services in the same way we do with resources on the grid, by federating access to them and ensuring that they expose standard interfaces ? So, UCT can just write the tests for its fancy new kit without having to worry about the whole overhead of actually running the CI service. Now you're thinking with portals ! (sorry).

Release


It's one thing to imagine how these tools - when they're properly implemented - will make everyone's life easier, but it's another thing entirely to actually implement them. For one thing, how do we know when it's done ? How do we know that we can actually trust the code not to break our sites, and so on ? Continuous Integration via Jenkins is of course part of the answer to this - so when our Ansible and/or Puppet code (site admins - remember that we need your contributions to the github repo !) is indeed tested with Jenkins we'll be able to tick off a big TODO. 
However, how's a site somewhere to know that the code is ready for production ? Then there's the issue of what version of the code you're using to deploy your site; it's not as if we'll just hack away at it and then it'll be finished (god forbid!) - this is going to be a continuous (sorry) exercise. 
The answer is, of course, trivial - the code for deploying production sites will be tagged as such in a branch and released. It's kinda corny, I know, in the 2010's (what are we calling this decade again ? - please don't say the twenteenies.), but that's the idea behind calling it "SAGrid-2.0" - because the code that was used to deploy it will have a version number which you can refer to, instead of what we usually do now, which is go through a bunch of bash histories and try to figure out what the **** happened ! 

Happy New Year !


So, that's it from the coordinator. I'm looking forward to working with all of you during the year, implementing and improving the services we need to continue serving African and South African research communities. We'll have our first meeting before the end of the month. In the meantime, here's wishing a prosperous, efficient and productive year to all ! 



Tuesday 5 November 2013

ROC Shift Handover Report - October

Africa-Arabia Regional Operations Centre Situation Report and Shift Handover Report for October 2013

It's been a rough month with lots of work for us, and the end of the month always comes too quickly to prepare for it. We had the monthly shift handover meeting over Google+ last Friday the 1st of November where we discussed the progress made during the last month and the issues at hand at our sites for the next shifter on duty. 

Situation Report for SAGrid

As is customary, a short summary of the situation of our infrastructure is given here, followed by some comments and feedback on recent work.

Site readiness

Of our seven sites, only two are fully functional (ZA-UJ and ZA-WITS-CORE). Not surprisingly, these are also the only sites, along with ZA-MERAKA, which have passed EGI certification. While local services are running fine at the rest of the sites (ZA-UCT-ICTS, ZA-CHPC, ZA-UFS), they are failing various nagios tests due to local misconfigurations. In particular, there are strange issues at ZA-CHPC and ZA-UFS which Sakhile and Fanie respectively are still trying to resolve. 

Resource Infrastructure Provider Status

The core services provided by ZA-MERAKA for NGI_ZA are all fully functional and have achieved 100 % availability/reliability. However, we are still struggling to get the accounting records published from certified sites ZA-UJ and ZA-WITS-CORE to the regional instance at Meraka and from there to the central instance at the EGI central accounting portal. Uli is working on that. 

Situation Report for 1LS + TPM Activity

Likely due to the fact that there were several conferences this month, there was not much activity on the support side. The previous shifter on duty (DZ-eScience Grid) therefore had a pretty light shift. There are, however, ongoing issues, with tickets of varying priority open against all sites. This means we have not met the target specified in our OLA regarding ticket management. This was highlighted during the meeting and impressed upon the next shifter on duty (TERNET). 

This is TERNET's first shift and we'll be working hard to ensure that during this month we get their site(s) up and running. 


Updates to the ROC

It's a time of big change in SAGrid and for the ROC as a whole - we are still trying to become familiar with the EGI Operating Procedures and to adapt our legacy procedures to them. While, thanks to EMI, there is plenty of good-quality documentation for the middleware, there is still a lot of confusion regarding the applicability of various procedures and standards to be used in the production infrastructure, and many of the sites in Africa have legacy configurations which are affecting performance. It is the role of the Regional Operations Centre to bring order to this situation and to provide as accurate as possible an overview of the current and future status of the infrastructure, through monitors and planning. The input from sites is of course essential to this, as is the timeous and accurate response from their side to calls for updates, etc. 

We need to have a pretty serious re-design of the ROC website in the light of the work we've done and think carefully about how we want to expose the services run by it to the grid-ops community as well as the wider HPC, network and data infrastructures. 

Integration of new African Sites into the ROC

A renewed effort to finalise the MoU with EGI, which will allow the CSIR to represent all African and Arabian sites when interoperating with EGI, is underway. This is essentially the same MoU as was signed for the integration of NGI_ZA, with the difference that all sites in the region covered by the ROC would be able to be registered in the GOCDB. Meraka would continue to play the coordination role by ensuring that these sites adhere to the necessary procedures, while acting as a liaison between African and Arabian technical and scientific communities and their European counterparts, in the grid context.  
 

Quality, Robots and the Coding Public

If you've been following our work recently, you'll have heard about "SAGrid-2.0". If not, it's basically judicious employment of Jenkins, Ansible/Puppet/etc, CVMFS and Github. In the last two weeks, a lot of that has been coming together quite nicely, after a few days spent with colleagues in Bloemfontein at the ZA-UFS site. We've been hacking away at the Ansible playbooks in Github and these are going to be tested at Meraka before being released, while we're still calling for puppet modules in use by our site admins to be contributed to the SAGridOps repo in Github...

A Jenkins instance was installed and configured for our needs, and this was used to write a few basic tests for building our supported applications. This CI approach is proving to be very flexible indeed and we're starting to converge on a strategy for a highly automated quality assurance chain which gives freedom to application maintainers and developers to do their work without intervention from the Ops side. The goal is to provide an endpoint to the "public" (in this case - the coding public who are interested in using the grid to run their or their community's applications) to automatically run predetermined tests to see whether an application will build and run on a standard worker node. A few other environments are also being conceived beyond this boring old vanilla setup : 
  • the GPGPU-enabled worker node
  • the "HPC" worker node with infiniband, OMP, MPI, etc available
  • the "next-gen" worker node, which will have the latest OS and middleware (untested) installed
Of course, input as to what you want to have your code build and execute on is welcome. 

Once code has passed functional tests, it should be moved from the testing area to the staging area, where final checks are run before it moves into the production repository. Since this will be a CVMFS repository, that application will then suddenly be everywhere. Assuming, of course, that our site admins actually have the repo mounted !
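That last assumption is itself easy to encode: a couple of tasks can make sure every worker node has the client installed and the repository configured. The repository and proxy names below are placeholders for illustration:

```yaml
# Make "is the repo mounted ?" a non-question on worker nodes.
- name: Install the CernVM-FS client
  yum: name=cvmfs state=present

- name: Point the client at our application repository
  copy:
    dest: /etc/cvmfs/default.local
    content: |
      CVMFS_REPOSITORIES=apps.example.org
      CVMFS_HTTP_PROXY=http://squid.example.org:3128
```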

Upcoming meetings and conferences

There's plenty of other fun stuff coming up this month.

It's November : Keep on crunching !




Tuesday 15 October 2013

What's coming in 2014

YAWTBWAW

Yet Another Way To Break What's Already Working. That's what most of the "grid stuff" seems like to many of our new (and not-so-new) system administrators. The risk-averse instincts of professional IT staff make them quite wary of, perhaps even hostile to, new services being deployed in their datacentre. Most of our operations team are full-time permanent staff paid to make sure that stuff stays up; this is a Good Thing if you're working in a production environment, but it is not very conducive to rapid prototyping, testing and integration of new services. There is usually a very long lead time between a new technology or service appearing on our radar and its adoption in the production infrastructure. It's time to bring some order to the chaos.

Executable Infrastructure - part of SAGrid-2.0

This blog post will talk a bit about some of the technology and changes in methodology which we will be adopting first in South Africa and eventually - hopefully - all across the Africa-Arabia Regional Operations Centre to tame operations of the grid. Clearly, this is still work in progress and will be documented properly in due course. For now, let's just get our ideas down on paper and talk about the work as we get it done. For those of you reading who were at, or read the output of, the last SAGrid All Hands meeting, this part of the infrastructure renewal project that we're calling "SAGrid-2.0" is the so-called executable infrastructure part. Before we talk about what that actually means, let's just take a look at what it's being compared to - ie, how we currently do things. 

How not to operate a production infrastructure

I recently gave a talk at the e-Research Africa conference about "SAGrid-2.0"... which basically ran through the long list of ways we'd been doing things "wrong". During most of the training we'd been giving (and had been through ourselves), there was a nagging feeling that site admins just had to learn more and more tools, scripting languages, procedures, etc. For example : 
  1. There was no way to check whether a site had a properly declared configuration (say, at the YAIM site-info.def level)
  2. There was no way to reproduce issues that site admins might be having, or even that nagios would alert the operator to.
  3. Although there was an attempt made (which is still ongoing) to provide explicit Standard Operating Procedures, as well as a bootstrapping method to develop new SOPs, it is still difficult to ensure that someone can execute these procedures without an expert understanding of each component or tool involved. 
  4. Finally, it was impossible to ascertain in which state/version a particular service at a given site was - mainly because that concept just did not exist in our infrastructure.
During All-Hands 2013 we had a couple of talks from team members on how to address these issues, using Puppet and Ansible. Finally, a man was being taught to fish... If you'd like to know more, take a look at one of the fantastic videos by either of these projects on YouTube.
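To make the first point on that list concrete : with a tool like Ansible, a site's declared configuration becomes a short, checkable YAML file. The playbook below is only an illustrative sketch - the host group, variables and file path are hypothetical, not our actual layout :

```yaml
# site-config-check.yml -- hypothetical example, not our production playbook
- hosts: grid_sites
  vars:
    expected_site_name: ZA-EXAMPLE-SITE   # what the site *should* declare
  tasks:
    # Point 1 : verify the declared configuration (site-info.def) on each node
    - name: read SITE_NAME from YAIM site-info.def
      shell: grep '^SITE_NAME=' /opt/glite/yaim/etc/site-info.def
      register: declared_site
      changed_when: false

    - name: fail loudly if the declared name does not match
      fail:
        msg: "site-info.def declares {{ declared_site.stdout }} instead of {{ expected_site_name }}"
      when: expected_site_name not in declared_site.stdout
```

Because a playbook like this lives in version control and runs the same way everywhere, it also goes a long way towards point 4 : the state of a service at a site becomes whatever the last committed, applied playbook says it is.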

Now what ? 

To the naive reader (not you, by the way - you're awesome), this may seem like just another slab of meat on the operator's plate that they have to chew through and digest... How can you solve complexity by adding a further ingredient !? Well, we young padawans realised that this was actually a way to reduce complexity, by bringing some order to our existing methodology. 

They can take our mangled, illegible code - but they'll never take our FREEDOM! 

I'm personally pretty taken by Ansible - for its simplicity, and for the simple fact that a Puppet capability is already being developed by the very, very capable people at ZA-WITS-CORE and ZA-UJ, amongst other places. It's always a good idea to have alternative ways to solve problems, especially when they are as critical as maintaining a functional and reliable set of services. In the same way that applications are compiled on lots of different platforms and architectures, by lots of different compilers, we want to be able to "execute" our infrastructure with more than one set of tools. Plus, the whole philosophy of the grid pushes against any monolithic monoculture of software and tools. 

Where are we going with this ? 

There's a realisation amongst SAGridOps that these orchestration tools mean that 
infrastructure = code

Actually, it's not an exaggeration these days to approximate that to first order and just say "everything = code" - but that is a story for a different revolution. If infrastructure = code, then we can apply a lot of software development methodologies to the way we "code" our infrastructure (using Ansible, Puppet, etc.). You can keep the infrastructure in a version-controlled repository; you can collaborate to develop the infrastructure around this, using all the cool, buzzword-sounding methodologies that have been developed for software engineering over the years. Infrastructure can be versioned, and it can be passed through tests on its way from testing to production... and best of all, most of this can be automated to a large degree. 
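As a toy illustration of that idea (file and role names here are invented for the example), a versioned infrastructure repository starts as simply as :

```shell
# A toy illustration of infrastructure-as-code : the playbook lives in git,
# so every change to the infrastructure gets a history entry and can be
# reviewed, reverted and tested. File and role names are invented.
mkdir -p infra
cat > infra/site.yml <<'EOF'
# top-level playbook : which roles run on which hosts
- hosts: all
  roles:
    - common
EOF
git init -q infra
git -C infra add site.yml
git -C infra -c user.name=ops -c user.email=ops@example.org \
    commit -q -m "initial site playbook"
git -C infra log --oneline    # the infrastructure now has a version history
```

From there, the usual software-engineering machinery - branches, pull requests, tagged releases - applies directly to the infrastructure itself.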

Come on in, the water's great !

You'll be seeing a lot more Octocats - LOLcat's somewhat more productive cousin
This is where we are going... If you're up for the ride and want to have a good ol' time, you can join the team by signing up for the SAGridOps organisation on GitHub : https://github.com/SAGridOps. I'll be harping on about how awesome GitHub is in the future; for now, suffice to say it's given us a way to open up SAGrid to the developers out there who want to help us keep building an awesome set of services for e-Science. 




Situation Report - September

Operator : Situation Report

A main goal of this blog is to give the wider public a summary of what it is we do on a daily basis and what issues crop up in our daily lives. I like to think we run a tight ship, although we could probably work a bit on our efficiency. One thing we do almost every month is hold a ROC-wide meeting of the site admins, to discuss current and upcoming issues, as well as to perform the formal handover of the First-Level Support and Ticket Process Management (1LS/TPM) shift to the next support unit on duty.  
Hopefully, while writing this blog, we'll get some material together to update our somewhat decrepit website. It's not our fault - we're just always busy ! It will get fixed soon. 

What's happening on the grid ?

If only I had a rand/beer for every time I got asked that question ! It is indeed hard to know what is going on at each site and what the current issues are - and this is precisely the point of the GridOps meetings. Since the creation of the ROC, we've started including more and more sites on the African continent in our SitRep meetings, which started off on a weekly basis and were more of a group chat/support group than a real operations meeting. We've got things down almost to an automated procedure by now, so that we can meet once a month for under 30 minutes and exchange just the right amount of information. Actually, we have a draft procedure which describes what should be done, as well as a FAQ for those in the hot seat. For the really curious (and insomniac), the workflow is on the right.

Big steps

If you were in the meeting at the beginning of the month, you'll know that we finalised the integration with EGI.eu, but some sites are still undergoing certification. The SAGrid NGI monitor (based on Nagios) has been keeping an eye on all of the services we publish in the GOCDB, and our sites are showing up in the operations portal. Although this may sound somewhat boring, it took about two years of work to get our sites and infrastructure up to par. Thanks again to everyone who worked on this ! The main issue now is to ensure that we maintain our commitment to the OLA.

How are we doing ? 

The Operational Level Agreement (OLA) agreed to by our sites forms the basis for the certification procedure whereby they are included in the production infrastructure. It sounds hardcore, but actually it's just a way of agreeing on what our sites will be capable of, and it's pretty reasonable. More about these metrics in subsequent blogs, hopefully, but suffice to say for now that we're meeting one very important target - response time to issues : 



Monday 14 October 2013

Order, order

Order, order !

Well, it had to start sometime... five years after we started isn't too late, right ? The South African National Grid was started way back in 2008 during a quick meeting at iThemba LABS, where several directors of IT and research groups were present. The idea was simple : let's put the spare, underutilised, badly coordinated kit we have at the universities to work, by integrating it into a national compute grid. Back in those days, grid was still spelt with capital letters, as if it were some strange alien beast which we had to bow down to and praise. We didn't have a network, and almost everyone thought that you needed a Ph.D. in physics to use it. 
Nevertheless, we built it. UCT, iThemba, UFS, NWU, UJ and Wits put up their hands, and Meraka opened a position to coordinate activities. We began training the system administrators of the sites in 2009 with a workshop at UCT, and thus began the wave which we're still riding to this day. I've been trying to convince other institutes to join, and thanks to the EPIKH programme, we ran quite a few training workshops in South Africa from 2009 to 2012. During this time, we forged a strong bond amongst the guys who would become the SAGrid Operations Team, or SAGridOps. This blog is dedicated to, and written by, them : the dudes in the trenches who make the grid work.

Take a deep breath and check your compass

Our points of reference have changed dramatically over the last few years, and the time has come to put some order into our house here in SA. Not only have we seen the rise of cloud computing; by now we have probably even figured out what to do with it. In my humble opinion, it has gone from a perceived threat to the lifeboat that will save us all. More about that in a different blog...
We started off with a clear idea of who our peers and support network were. The EGEE project was both our support staff and our middleware provider. Choosing gLite at that time was the natural option, though perhaps that issue too would be an interesting one to delve into sometime. The European Commission had funded a massive exchange programme, EPIKH, which I somehow managed to get the CSIR into as a partner, and which, in my opinion, provided us the one thing we couldn't do on our own - train ourselves. We had a few use cases, but as time went on these grew less and less certain, which turned out to be one of the main weaknesses of our "come one, come all" approach. 
Things have changed for the better, and we need to adjust our sails to take advantage of this. We have a functioning Regional Operations Centre (although it needs some work) and a resource and infrastructure sharing MoU with EGI.eu. We have a new upstream middleware provider (although that, too, is now in a state of flux) and a far better idea of who we can and should serve in South Africa. The somewhat gung-ho attitude of just doing whatever is necessary at the time is not going to get us where we need to be in this new, far more ordered environment. 

Shosholoza !

Perhaps the most satisfying development we've had in the last few months is the realisation by almost all the players that we need to work together. I stand accused of blatantly unjustified optimism, perhaps, but it's looking like people from all sides of the e-Science equation are understanding that a federation of resource and infrastructure providers is the key and that a single, all-serving project or institute can't solve our complex problems. We've had such great successes with SANReN / TENET (which together form the South African NREN) and the CHPC (which has the biggest and most powerful computer on the continent) that I'm really looking forward to seeing what we can do as a team.

It's dangerous out there, take this

Research is a fun place to work, specifically because it's confusing, exciting, challenging and constantly changing. We'll always be putting out fires with angry cats and telling our stories over beers; hopefully with a few less bearded faces than we currently have, if you know what I mean. We're never going to be able to fully automate our operations, or provide exactly the right training and documentation for our team to be perfect. It's a constant learning curve - but at least that curve can be integrated, hopefully to something that converges (sorry - math joke). We are working on some cool stuff with our friends all over the world to alleviate the load on our humans. This blog will, hopefully, tell the tale of how we learned to stop putting out fires with cats and to trust the robots and our rational side instead.