Tuesday, 15 October 2013

What's coming in 2014

YAWTBWAW

Yet Another Way To Break What's Already Working. That's what most of the "grid stuff" seems like to many of our new (and not so new) system administrators. Professional IT staff are risk-averse by instinct, and quite wary of - perhaps even hostile to - new services being deployed in their datacentre. Most of our operations team are full-time permanent staff paid to make sure that stuff stays up; this is a Good Thing if you're working in a production environment, but it is not very conducive to rapid prototyping, testing and integration of new services. There is usually a very long lead time between a new technology or service appearing on our radar and its adoption in the production infrastructure. It's time to bring some order to the chaos.

Executable Infrastructure - part of SAGrid-2.0

This blog post will talk a bit about some of the technology and changes in methodology that we will be adopting, first in South Africa and eventually - hopefully - all across the Africa-Arabia Regional Operations Centre, to tame operations of the grid. Clearly, this is still work in progress and will be documented properly in due course. For now, let's just get our ideas down on paper and talk about the work as we get it done. For those of you who were at the last SAGrid All Hands meeting, or read its output, this is the "executable infrastructure" part of the infrastructure renewal project that we're calling "SAGrid-2.0". Before we talk about what that actually means, let's take a look at what it's being compared to - i.e. how we currently do things.

How not to operate a production infrastructure

I recently gave a talk at the e-Research Africa conference about "SAGrid-2.0", which basically walked through the long list of ways we'd been doing things "wrong". During most of the training we'd been giving, and had been through ourselves, there was a nagging feeling that site admins just had to learn more and more tools, scripting languages, procedures and so on. For example:
  1. There was no way to check whether a site had a properly declared configuration (say, at the YAIM site-info.def level).
  2. There was no way to reproduce issues that site admins might be having, or even issues that Nagios would alert the operator to.
  3. Although an attempt was made (and is still ongoing) to provide explicit Standard Operating Procedures, as well as a bootstrapping method to develop new SOPs, it is still difficult to ensure that someone can execute these procedures without an expert understanding of each component or tool involved.
  4. Finally, it was impossible to ascertain which state or version a particular service at a given site was in - mainly because that concept just did not exist in our infrastructure.
During All-Hands 2013 we had a couple of talks from team members on how to address these issues, using Puppet and Ansible. Finally, a man was being taught to fish... If you'd like to know more, take a look at one of the fantastic videos by either of these projects on YouTube.
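To make the idea of a "declared configuration" a little more concrete, here is a minimal Ansible playbook sketch. Everything in it - the host group, site name, file path and service - is an illustrative assumption rather than our actual production layout; the real roles will live in the SAGridOps repositories as they are written.

---
# site.yml - a minimal sketch of a declared site configuration.
# Group names, paths and values below are illustrative assumptions only.
- hosts: grid_site_services
  vars:
    site_name: ZA-EXAMPLE                          # hypothetical site name, used inside the template
    yaim_site_info: /root/siteinfo/site-info.def   # assumed location of the YAIM config
  tasks:
    - name: render site-info.def from a version-controlled template
      template: src=templates/site-info.def.j2 dest={{ yaim_site_info }} owner=root mode=0600

    - name: ensure the information system (BDII) is running
      service: name=bdii state=started enabled=yes

Running something like "ansible-playbook site.yml" against a site should then either confirm that it already matches the declared state or bring it into line, which goes a long way towards answering points 1 and 4 in the list above.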

Now what?

To the naive reader (not you, by the way - you're awesome), this may seem like just another slab of meat on the operator's plate that they have to chew through and digest... How can you solve complexity by adding a further ingredient!? Well, we young padawans realised that this was actually a way to reduce complexity, by bringing some order to our existing methodology.

They can take our mangled, illegible code - but they'll never take our FREEDOM! 

I'm personally pretty taken with Ansible - for its simplicity, and for the simple fact that a Puppet capability is already being developed by the very, very capable people at ZA-WITS-CORE and ZA-UJ, amongst other places. It's always a good idea to have alternative ways to solve problems, especially when the problems are as critical as maintaining a functional and reliable set of services. In the same way that applications are compiled on lots of different platforms and architectures, by lots of different compilers, we want to be able to "execute" our infrastructure with more than one set of tools. Plus, the whole philosophy of the grid pushes against any monolithic monoculture of software and tools.

Where are we going with this?

There's a realisation amongst SAGridOps that these orchestration tools mean that 
infrastructure = code

Actually, it's not an exaggeration these days to approximate that to first order and just say "everything = code" - but that is a story for a different revolution. If infrastructure = code, then we can apply a lot of software development methodologies to the way we "code" our infrastructure (using Ansible, Puppet, etc.). You can keep the infrastructure in a version-controlled repository; you can collaborate to develop the infrastructure around this, using all the cool buzzword-sounding methodologies that have been developed for software engineering over the years. Infrastructure can be versioned, and it can be promoted through tests from testing to production... and best of all, most of this can be automated to a large degree.
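As a small, hedged illustration of what versioned infrastructure could look like in practice (file names, variable names and version strings below are hypothetical), per-environment variable files can live in the same Git repository as the playbooks, so that promoting a change from testing to production is just a reviewed commit or merge:

# group_vars/testing.yml - hypothetical values, applied to the "testing" host group
middleware_release: "EMI-3"
site_info_template: "templates/site-info.def.j2"
enable_experimental_services: true

# group_vars/production.yml - production lags testing until a change is verified
middleware_release: "EMI-2"
site_info_template: "templates/site-info.def.j2"
enable_experimental_services: false

The same playbook then runs against both the testing and production host groups; the only difference between the environments is the declared, version-controlled data, and the Git history records exactly which state every site is supposed to be in.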

Come on in, the water's great!

You'll be seeing a lot more Octocats - LOLcat's somewhat more productive cousin.
This is where we are going... If you're up for the ride and want to have a good ol' time, you can join the team by signing up for the SAGridOps organisation on GitHub: https://github.com/SAGridOps. I'll be harping on about how awesome GitHub is in the future; for now, suffice it to say that it's given us a way to open up SAGrid to the developers out there who want to help us keep building an awesome set of services for e-Science.




Situation Report - September

Operator: Situation Report

A main goal of this blog is to get a summary out to the wider public of what it is we do on a daily basis and what issues crop up in our daily lives. I like to think we run a tight ship, although we could probably work a bit on our efficiency. One thing we do almost every month is hold a ROC-wide meeting of the site admins, to discuss current and upcoming issues, as well as to perform the formal handover of the First-Level Support and Ticket Process Management (1LS/TPM) shift to the next support unit on duty.
Hopefully, while writing this blog, we'll get some material together to update our somewhat decrepit website. It's not our fault - we're just always busy! It will get fixed soon.

What's happening on the grid?

If only I had a rand/beer for every time I got asked that question! It is indeed hard to know what is going on at each site and what the current issues are - and this is precisely the point of the GridOps meetings. Since the creation of the ROC, we've started including more and more sites on the African continent in our SitRep meetings, which started off on a weekly basis and were more of a group chat/support group than a real operations meeting. We've got things down almost to an automated procedure by now, so that we can meet once a month for under 30 minutes and exchange just the right amount of information. In fact, we have a draft procedure which describes what should be done, as well as a FAQ for those in the hot seat. For the really curious (and insomniac), the workflow is on the right.

Big steps

If you were in the meeting at the beginning of the month, you'll know that we finalised the integration with EGI.eu, although some sites are still undergoing certification. The SAGrid NGI monitor (based on Nagios) has been keeping an eye on all of the services we publish in the GOCDB, and our sites are showing up in the operations portal. Although this may sound somewhat boring, it took about two years of work to get our sites and infrastructure up to par. Thanks again to everyone who worked on this! The main issue now is to ensure that we maintain our commitment to the OLA.

How are we doing?

The Operating Level Agreement that our sites have signed up to forms the basis for the certification procedure by which they are included in the production infrastructure. It sounds hardcore, but actually it's just a way of agreeing on what our sites will be capable of, and it's pretty reasonable. More about these metrics in subsequent blog posts, hopefully, but suffice it to say for now that we're meeting one very important target: response time to issues.



Monday, 14 October 2013

Order, order

Order, order!

Well, it had to start sometime... five years after we started isn't too late, right? The South African National Grid was started way back in 2008, during a quick meeting at iThemba LABS where several directors of IT and research groups were present. The idea was simple: let's put the spare, underutilised, badly coordinated kit we have at the universities to work, by integrating it into a national compute grid. Back in those days, grid was still spelt with capital letters, as if it were some strange alien beast which we had to bow down to and praise. We didn't have a network, and almost everyone thought that you needed a Ph.D. in physics to use it.
Nevertheless, we built it. UCT, iThemba, UFS, NWU, UJ and Wits put up their hands, and Meraka opened a position to coordinate activities. We began training the sites' system administrators in 2009 with a workshop at UCT, and thus began the wave we're still riding to this day. I've been trying to convince other institutes to join, and thanks to the EPIKH programme, we ran quite a few training workshops in South Africa from 2009 to 2012. During this time, we forged a strong bond amongst the guys who would become the SAGrid Operations Team, or SAGridOps. This blog is dedicated to and written by them, the dudes in the trenches who make the grid work.

Take a deep breath and check your compass

Our points of reference have changed dramatically over the last few years, and the time has come to put some order into our house here in SA. Not only have we seen the rise of cloud computing; we have probably figured out by now what to do with it. In my humble opinion, it has gone from a perceived threat to the lifeboat that will save us all. More about that in a different blog post...
We started off with a clear idea of who our peers and support network were. The EGEE project was both our support network and our middleware provider. Choosing gLite at that time was the natural option, but perhaps that too would be an interesting issue to delve into sometime. The European Commission had funded a massive exchange programme, EPIKH, to which I somehow managed to get the CSIR added as a partner; in my opinion, it provided the one thing we couldn't do on our own - train ourselves. We had a few use cases, but as time went on these grew less and less certain, and that turned out to be one of the main weaknesses of our "come one, come all" approach.
Things have changed for the better, and we need to adjust our sails to take advantage of this. We have a functioning Regional Operations Centre (although it needs some work) and a resource-infrastructure-sharing MoU with EGI.eu. We have a new upstream middleware provider (although that, too, is now in a state of flux) and a far better idea of whom we can and should serve in South Africa. The somewhat gung-ho attitude of just doing whatever is necessary at the time is not going to get us where we need to be in this new, far more ordered environment.

Shosholoza!

Perhaps the most satisfying development of the last few months is the realisation by almost all the players that we need to work together. I stand accused of blatantly unjustified optimism, perhaps, but it looks like people from all sides of the e-Science equation are coming to understand that a federation of resource and infrastructure providers is the key, and that a single, all-serving project or institute can't solve our complex problems. We've had such great successes with SANReN / TENET (which together form the South African NREN) and the CHPC (which has the biggest and most powerful computer on the continent) that I'm really looking forward to seeing what we can do as a team.

It's dangerous out there, take this

Research is a fun place to work, precisely because it's confusing, exciting, challenging and constantly changing. We'll always be putting out fires with angry cats and telling our stories over beers; hopefully with a few fewer bearded faces than we currently have, if you know what I mean. We're never going to be able to automate our operations and provide the right training and documentation to the point where our team is perfect. It's a constant learning curve - but at least that curve can be integrated, hopefully to something that converges (sorry - math joke). We are working on some cool stuff with our friends all over the world to lighten the load on our humans. This blog will, hopefully, tell the tale of how we learned to stop putting out fires with cats and learned to trust the robots and our rational side instead.