Monday 14 October 2013

Order, order


Well, it had to start sometime... five years after we started isn't too late, right? The South African National Grid was started way back in 2008 during a quick meeting at iThemba L.A.B.S., where several directors of IT and research groups were present. The idea was simple: let's put the spare, underutilised, badly coordinated kit we have at the universities to work by integrating it into a national compute grid. Back in those days, grid was still spelt with capital letters, as if it were some strange alien beast which we had to bow down to and praise. We didn't have a network and almost everyone thought that you needed a Ph.D. in physics to use it.
Nevertheless, we built it. UCT, iThemba, UFS, NWU, UJ and Wits put up their hands and Meraka opened a position to coordinate activities. We began training the system administrators of the sites in 2009 with a workshop at UCT and thus began the wave which we're still on to this day. I've been trying to convince other institutes to join, and thanks to the EPIKH programme, we ran quite a few training workshops in South Africa from 2009 to 2012. During this time, we forged a strong bond amongst the guys who would become the SAGrid Operations Team, or SAGridOps. This blog is dedicated to and written by them, the dudes in the trenches who make the grid work.

Take a deep breath and check your compass

Our points of reference have changed dramatically over the last few years and the time has come to put some order into our house here in SA. Not only have we seen the rise of cloud computing, we have probably figured out by now what to do with it. In my humble opinion, it has gone from a perceived threat to the lifeboat that will save us all. More about that in a different blog...
We started off with a clear idea of who our peers and support network were. The EGEE project was both our support staff and our middleware provider. Choosing gLite at that time was the natural option, but perhaps that issue too would be an interesting one to delve into sometime. The European Commission had funded a massive exchange programme, EPIKH, which I somehow managed to get the CSIR into as a partner. In my opinion, it provided us with the one thing we couldn't do on our own - train ourselves. We had a few use cases, but as time went on these grew less and less certain and turned out to be one of the main weaknesses of our "come one, come all" approach.
Things have changed for the better and we need to adjust our sails to take advantage of this. We have a functioning Regional Operations Centre (although it needs some work) and a resource infrastructure sharing MoU with EGI.eu. We have a new upstream middleware provider (although that, too, is now in a state of flux) and a far better idea of who we can and should serve in South Africa. The somewhat gung-ho attitude of just doing whatever is necessary at the time is not going to get us where we need to be in this new, far more ordered environment.

Shosholoza !

Perhaps the most satisfying development we've had in the last few months is the realisation by almost all the players that we need to work together. I stand accused of blatantly unjustified optimism, perhaps, but it's looking like people from all sides of the e-Science equation are understanding that a federation of resource and infrastructure providers is the key and that a single, all-serving project or institute can't solve our complex problems. We've had such great successes with SANReN / TENET (which together form the South African NREN) and the CHPC (which has the biggest and most powerful computer on the continent) that I'm really looking forward to seeing what we can do as a team.

It's dangerous out there, take this

Research is a fun place to work, specifically because it's confusing, exciting, challenging and changing constantly. We'll always be putting out fires with angry cats and telling our stories over beers; hopefully with a few less bearded faces than we currently have, if you know what I mean. We're never going to be able to automate all of our operations and provide all the training and documentation that we need for our team to be perfect. It's a constant learning curve - but at least that curve can be integrated, hopefully to something that converges (sorry - math joke). We are working on some cool stuff with our friends all over the world to alleviate the load on our humans. This blog will, hopefully, tell the tale of how we learned to stop putting out fires with cats and to trust the robots and our rational side instead.

