Tuesday 5 November 2013

ROC Shift Handover Report - October

Africa-Arabia Regional Operations Centre Situation Report and Shift Handover Report for October 2013

It's been a rough month with lots of work for us, and the end of the month always comes too quickly to prepare for it. We had the monthly shift handover meeting over Google+ last Friday, the 1st of November, where we discussed the progress made during the last month and briefed the next shifter on duty on the issues at hand at our sites. 

Situation Report for SAGrid

As is customary, a short summary of the situation of our infrastructure is given here, followed by some comments and feedback on recent work.

Site readiness

Of our seven sites, only two are fully functional (ZA-UJ and ZA-WITS-CORE). Not surprisingly, these, along with ZA-MERAKA, are also the only sites to have passed EGI certification. While local services are running fine at the rest of the sites (ZA-UCT-ICTS, ZA-CHPC, ZA-UFS), they are failing various nagios tests due to local misconfigurations. In particular, there are strange issues at ZA-CHPC and ZA-UFS which Sakhile and Fanie, respectively, are still trying to resolve. 

Resource Infrastructure Provider Status

The core services provided by ZA-MERAKA for NGI_ZA are all fully functional and have achieved 100 % availability/reliability. However, we are still struggling to get the accounting records published from the certified sites ZA-UJ and ZA-WITS-CORE to the regional instance at Meraka, and from there to the central instance at the EGI accounting portal. Uli is working on that. 

Situation Report for 1LS + TPM Activity

Likely due to the fact that there were several conferences this month, there was not much activity on the support side. The previous shifter on duty (DZ-eScience Grid) therefore had a pretty light shift. There are, however, ongoing issues, with tickets of varying priority open against all sites. This means we have not met the target specified in our OLA regarding ticket management. This was highlighted during the meeting and impressed upon the next shifter on duty (TERNET). 

This is TERNET's first shift and we'll be working hard to ensure that during this month we get their site(s) up and running. 


Updates to the ROC

It's a time of big change in SAGrid and for the ROC as a whole - we are still trying to become familiar with the EGI Operating Procedures and to adapt our legacy procedures to them. While, thanks to EMI, there is plenty of good-quality documentation for the middleware, there is still a lot of confusion regarding the applicability of the various procedures and standards to be used in the production infrastructure, and many of the sites in Africa have legacy configurations which are affecting performance. It is the role of the Regional Operations Centre to bring order to this situation and to provide as accurate as possible an overview of the current and future status of the infrastructure, through monitors and planning. The input from sites is of course essential to this, as is the timeous and accurate response from their side to calls for updates, etc. 

We need to have a pretty serious re-design of the ROC website in the light of the work we've done and think carefully about how we want to expose the services run by it to the grid-ops community as well as the wider HPC, network and data infrastructures. 

Integration of new African Sites into the ROC

A renewed focus to finalise the MoU with EGI that will allow the CSIR to represent all African and Arabian sites when interoperating with EGI is underway. This is essentially the same MoU as was signed for integration of NGI_ZA, with the difference that all sites in the region covered by the ROC would be able to be registered in the GOCDB. Meraka would continue to play the coordination role by ensuring that these sites adhere to the necessary procedures, while acting as a liaison between African and Arabian technical and scientific communities and their European counterparts, in the grid context.  
 

Quality, Robots and the Coding Public

If you've been following our work recently, you'll have heard about "SAGrid-2.0". If not, it's basically judicious employment of Jenkins, Ansible/Puppet/etc, CVMFS and GitHub. In the last two weeks, a lot of that has been coming together quite nicely, after a few days spent with colleagues in Bloemfontein at the ZA-UFS site. We've been hacking away at the Ansible playbooks on GitHub and these are going to be tested at Meraka before being released, while we're still calling for puppet modules in use by our site admins to be contributed to the SAGridOps repo on GitHub...
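For the curious, a playbook is just a YAML description of the state a machine should be in. The snippet below is a minimal sketch of the flavour of thing ours contain - the host group, package choices and layout here are illustrative assumptions, not lifted from the actual SAGridOps repo :

```yaml
# Illustrative sketch only - not an actual SAGridOps playbook.
---
- name: Configure a grid worker node
  hosts: worker_nodes          # assumed inventory group name
  tasks:
    - name: Install the EMI worker node metapackage
      yum: name=emi-wn state=present

    - name: Install the CVMFS client for application delivery
      yum: name=cvmfs state=present
```

The nice property is idempotence - running the same playbook twice leaves the node in the same declared state, which is exactly what we want when converging sites.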

A Jenkins instance was installed and configured for our needs, and this was used to write a few basic tests for building our supported applications. This CI approach is proving to be very flexible indeed and we're starting to converge on a strategy for a highly automated quality assurance chain which gives freedom to application maintainers and developers to do their work without intervention from the Ops side. The goal is to provide an endpoint to the "public" (in this case - the coding public who are interested in using the grid to run their or their community's applications) to automatically run predetermined tests to see whether an application will build and run on a standard worker node. A few other environments are also being conceived beyond this boring old vanilla setup : 
  • the GPGPU-enabled worker node
  • the "HPC" worker node with infiniband, OMP, MPI, etc available
  • the "next-gen" worker node, which will have the latest OS and middleware (untested) installed
Of course, input as to what you want to have your code build and execute on is welcome. 
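The shape of such a test is simple enough to sketch in plain shell. This is an illustrative stand-in for what a Jenkins job would run, not our actual job definitions - the function name and the demo build command are made up for the example :

```shell
#!/bin/sh
# Illustrative stand-in for a per-application Jenkins build test on a vanilla
# worker node. Function name and demo command are hypothetical, not the real
# SAGrid job definitions.

run_build_test() {
    # $1 = application name, remaining args = the build command to run
    app="$1"; shift
    workdir=$(mktemp -d)
    # Run the build in a clean scratch directory, capturing its output
    if ( cd "$workdir" && "$@" >build.log 2>&1 ); then
        echo "PASS: $app"
        rc=0
    else
        echo "FAIL: $app"
        rc=1
    fi
    rm -rf "$workdir"     # throw the scratch area away afterwards
    return $rc
}

# Demo : 'true' stands in for a real './configure && make' style build
run_build_test demo-app true
```

A Jenkins job per application/environment pair would call something like this, with the real build incantation in place of the placeholder command.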

Once code has passed functional tests, it should be moved from the testing area to the staging area, where final checks are run before it moves into the production repository. Since this will be a CVMFS repository, that application will then suddenly be everywhere. Assuming, of course, that our site admins actually have the repo mounted !
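In shell terms, the promotion step amounts to something like the sketch below - the area names and the "final check" (here just a non-empty artefact) are assumptions for the example, since the real checks run against the CVMFS release workflow :

```shell
#!/bin/sh
# Illustrative sketch of promoting an application from the testing area to
# staging and on to production. Area names and the "final check" (a non-empty
# artefact) are assumptions for the example, not the real CVMFS tooling.

promote() {
    # $1 = artefact name, $2 = source area, $3 = destination area
    app="$1"; src="$2"; dst="$3"
    # Final check before the move : refuse to promote a missing/empty artefact
    if [ ! -s "$src/$app" ]; then
        echo "REFUSED: $app not found in source area"
        return 1
    fi
    mv "$src/$app" "$dst/$app"
    echo "PROMOTED: $app"
}

# Demo with throwaway directories standing in for the real repository areas
testing=$(mktemp -d); staging=$(mktemp -d); production=$(mktemp -d)
echo "application payload" > "$testing/myapp-1.0.tar.gz"
promote myapp-1.0.tar.gz "$testing" "$staging"
promote myapp-1.0.tar.gz "$staging" "$production"
```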

Upcoming meetings and conferences

There's plenty of other fun stuff coming up this month :

It's November : Keep on crunching !




Tuesday 15 October 2013

What's coming in 2014

YAWTBWAW

Yet Another Way To Break What's Already Working. That's what most of the "grid stuff" seems like to many of our new (and not so new) system administrators. The risk-averse instincts of professional IT staff are quite wary of, perhaps even hostile to, new services to be deployed in their datacentre. Most of our operations team are full-time permanent staff paid to make sure that stuff stays up; this is a Good Thing if you're working in a production environment, but is not very conducive to rapid prototyping, testing and integration of new services. There is usually a very long lead time between a new technology or service appearing on our radar, and its adoption in the production infrastructure. It's time to bring some order to chaos.

Executable Infrastructure - part of SAGrid-2.0

This blog post will talk a bit about some of the technology and changes in methodology which we will be adopting first in South Africa and eventually - hopefully - all across the Africa-Arabia Regional Operations Centre to tame operations of the grid. Clearly, this is still work in progress and will be documented properly in due course. For now, let's just get our ideas down on paper and talk about the work as we get it done. For those of you reading who were at or read the output of the last SAGrid All Hands meeting, this part of the infrastructure renewal project that we're calling "SAGrid-2.0" is the so-called executable infrastructure part. Before we talk about what that actually means, let's just take a look at what it's being compared to - i.e., how we currently do things. 

How not to operate a production infrastructure

I recently gave a talk at the e-Research Africa conference about "SAGrid-2.0"... which basically talked about the long list of ways we'd been doing things "wrong". During most of the training we'd been giving and had been through ourselves, there was a nagging feeling that site admins just had to learn more and more tools, scripting languages, procedures, etc. For example : 
  1. There was no way to check whether a site had a properly declared configuration (say, at the YAIM site-info.def level)
  2. There was no way to reproduce issues that site admins might be having, or even those that nagios would alert the operator to.
  3. Although there was an attempt made (which is still ongoing) to provide explicit Standard Operating Procedures, as well as a bootstrapping method to develop new SOPs, it is still difficult to ensure that someone can execute these procedures without an expert understanding of each component or tool involved. 
  4. Finally, it was impossible to ascertain in which state/version a particular service at a given site was - mainly because that concept just did not exist in our infrastructure.
During All-Hands 2013 we had a couple of talks from team members on how to address these issues, using Puppet and Ansible. Finally, a man was being taught to fish... If you'd like to know more, take a look at one of the fantastic videos by either of these projects on YouTube.
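To make point 1 concrete : even a trivial check would have helped. Here's a sketch in plain shell - the required-variable list is a small illustrative subset, not everything YAIM actually needs :

```shell
#!/bin/sh
# Sketch of the missing "declared configuration" check : verify that a YAIM
# site-info.def defines a set of required variables. The variable list is an
# illustrative subset, not everything YAIM actually requires.

check_site_info() {
    # $1 = path to site-info.def, remaining args = required variable names
    file="$1"; shift
    missing=0
    for var in "$@"; do
        if ! grep -q "^${var}=" "$file"; then
            echo "MISSING: $var"
            missing=1
        fi
    done
    return $missing
}

# Demo against a throwaway site-info.def
conf=$(mktemp)
cat > "$conf" <<'EOF'
SITE_NAME=ZA-EXAMPLE
CE_HOST=ce.example.ac.za
EOF
check_site_info "$conf" SITE_NAME CE_HOST BDII_HOST || echo "site-info.def is incomplete"
```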







Now what ? 

To the naive reader (not you, by the way - you're awesome), this may seem like just another slab of meat on the operator's plate that they have to chew through and digest... How can you solve complexity by adding a further ingredient !? Well, we young padawans realised that this was actually a way to reduce complexity, by bringing some order to our existing methodology. 

They can take our mangled, illegible code - but they'll never take our FREEDOM! 

I'm personally pretty taken by Ansible - for its simplicity, and because a Puppet capability is being developed by the very, very capable people at ZA-WITS-CORE and ZA-UJ, amongst other places. It's always a good idea to have alternative ways to solve problems, especially when they are as critical as maintaining a functional and reliable set of services. In the same way that applications are compiled on lots of different platforms and architectures, by lots of different compilers, we want to be able to "execute" our infrastructure with more than one set of tools. Plus, the whole philosophy of the grid pushes against any monolithic monoculture of software and tools. 

Where are we going with this ? 

There's a realisation amongst SAGridOps that these orchestration tools mean that 
infrastructure = code

Actually, it's not an exaggeration these days to approximate that to first order and just say "everything = code" - but that is a story for a different revolution. If infrastructure = code, then we can apply a lot of development methodologies to the way we "code" our infrastructure (using Ansible, Puppet, etc). You can keep the infrastructure in a version-controlled repository; you can collaborate to develop the infrastructure around this, using all the cool buzzword-sounding methodologies that have been developed for software engineering over the years. Infrastructure can be versioned, and it can be passed through tests from testing to production... and best of all, most of this can be automated to a large degree. 
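Concretely, "infrastructure in a version-controlled repository" looks something like this - the repo layout and tag naming are illustrative, not our actual conventions :

```shell
#!/bin/sh
# Sketch of "infrastructure = code" : the site configuration lives in a
# version-controlled repository and released states get tags. Repo layout and
# tag names are illustrative, not our actual conventions.

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "ops@example.org"   # placeholder identity for the demo
git config user.name  "SAGrid Ops"

mkdir -p playbooks
echo "# site playbook placeholder" > playbooks/site.yml
git add playbooks/site.yml
git commit -qm "Add initial site playbook"

# Tag the state that is currently rolled out to sites
git tag production-2013.11
git tag                                    # lists released versions
```

Rolling a site back then becomes a checkout of a known-good tag rather than an archaeology exercise.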

Come on in, the water's great !

You'll be seeing a lot more Octocats - LOLcat's somewhat more productive cousin
This is where we are going... If you're up for the ride and want to have a good ol' time, you can join the team by signing up for the SAGridOps organisation on GitHub : https://github.com/SAGridOps. I'll be harping on about how awesome GitHub is in the future; for now, suffice to say it's given us a way to open up SAGrid to the developers out there who want to help us keep building an awesome set of services for e-Science. 




Situation Report - September

Operator : Situation Report

A main goal of this blog is to get a summary out to the wider public of what it is we do on a daily basis and what issues we have in our daily lives. I like to think we run a tight ship, although we could probably work a bit on improving our efficiency. One thing we do almost every month is have a ROC-wide meeting of the site admins, to discuss upcoming and current issues, as well as perform the formal handover of the First-Level Support and Ticket Process Management (1LS/TPM) shift to the next support unit on duty.  
Hopefully, while writing this blog, we'll get some material together to update our somewhat decrepit website. It's not our fault - we're just always busy ! It will get fixed soon. 

What's happening on the grid ?

If only I had a rand/beer for every time I got asked that question ! It is indeed hard to know what is going on at each site and what the current issues are - and this is precisely the point of the GridOps meetings. Since the creation of the ROC, we've started including more and more sites on the African continent in our SitRep meetings, which started off on a weekly basis and were more of a group chat/support group than a real operations meeting. We've got things down almost to an automated procedure by now, so that we can meet once a month for under 30 minutes and exchange just the right amount of information. Actually, we have a draft procedure which describes what should be done, as well as a FAQ for those in the hot seat. For the really curious (and insomniac), the workflow is on the right.

Big steps

If you were in the meeting at the beginning of the month, you'll know that we finalised the integration with EGI.eu, but some sites are still undergoing certification. The SAGrid NGI monitor (based on nagios) has been keeping an eye on all of the services we publish in the GOCDB, and our sites are showing up in the operations portal. Although this may sound somewhat boring, it took about two years of work to get our sites and infrastructure up to par. Thanks again to everyone who worked on this ! The main issue now is to ensure that we maintain the commitment to our OLA.

How are we doing ? 

The Operating Level Agreement agreed to by our sites forms the basis for the certification procedure whereby they are included in the production infrastructure. It sounds hardcore, but actually it's just a way of agreeing on what our sites will be capable of, and it's pretty reasonable. More about these metrics in subsequent blogs, hopefully, but suffice to say for now that we're meeting one very important target - response time to issues : 



Monday 14 October 2013

Order, order

Order, order !

Well, it had to start sometime... five years after we started isn't too late, right ? The South African National Grid was started way back in 2008 during a quick meeting at iThemba LABS, where several directors of IT and research groups were present. The idea was simple : let's put the spare, underutilised, badly coordinated kit we have at the universities to work, by integrating it into a national compute grid. Back in those days, grid was still spelt with capital letters, as if it were some strange alien beast which we had to bow down to and praise. We didn't have a network and almost everyone thought that you needed a Ph.D. in physics to use it. 
Nevertheless, we built it. UCT, iThemba, UFS, NWU, UJ and Wits put up their hands and Meraka made a position open to coordinate activities. We began training the system administrators of the sites in 2009 with a workshop at UCT and thus began the wave which we're still on to this day. I've been trying to convince other institutes to join, and thanks to the EPIKH programme, we ran quite a few training workshops in South Africa from 2009 to 2012. During this time, we forged a strong bond amongst the guys that would become the SAGrid Operations Team, or SAGridOps. This blog is dedicated to and written by them, the dudes in the trenches who make the grid work.

Take a deep breath and check your compass

Our points of reference have changed dramatically over the last few years and it has come time to put some order into our house here in SA. Not only have we seen the rise of cloud computing, we have probably figured out by now what to do with it. In my humble opinion, it has gone from a perceived threat to the lifeboat that will save us all. More about that in a different blog...
We started off with a clear idea of who our peers and support network were. The EGEE project was both our support staff and our middleware provider. Choosing gLite at that time was the natural option, but perhaps that issue too would be an interesting one to delve into sometime. The European Commission had funded a massive exchange programme, EPIKH, which I somehow managed to get the CSIR into as a partner, and which, in my opinion, provided us the one thing we couldn't do on our own - train ourselves. We had a few use cases, but as time went on these grew less and less certain, which turned out to be one of the main weaknesses of our "come one, come all" approach. 
Things have changed for the better and we need to adjust our sails to take advantage of this. We have a functioning Regional Operations Centre (although it needs some work) and a resource infrastructure sharing MoU with EGI.eu. We have a new upstream middleware provider (although that, too, is now in a state of flux) and a far better idea of who we can and should serve in South Africa. The somewhat gung-ho attitude of just doing whatever is necessary at the time is not going to get us where we need to be in this new, far more ordered environment. 

Shosholoza !

Perhaps the most satisfying development we've had in the last few months is the realisation by almost all the players that we need to work together. I stand accused of blatantly unjustified optimism, perhaps, but it's looking like people from all sides of the e-Science equation are understanding that a federation of resource and infrastructure providers is the key and that a single, all-serving project or institute can't solve our complex problems. We've had such great successes with SANReN / TENET (which together form the South African NREN) and the CHPC (which has the biggest and most powerful computer on the continent) that I'm really looking forward to seeing what we can do as a team.

It's dangerous out there, take this

Research is a fun place to work, specifically because it's confusing, exciting, challenging and constantly changing. We'll always be putting out fires with angry cats and telling our stories over beers; hopefully with a few less bearded faces than we currently have, if you know what I mean. We're never going to be able to fully automate our operations or provide exactly the right training and documentation for our team to be perfect. It's a constant learning curve - but at least that curve can be integrated, hopefully to something that converges (sorry - math joke). We are working on some cool stuff with our friends all over the world to alleviate the load on our humans. This blog will, hopefully, tell the tale of how we learned to stop putting out fires with cats and learned to trust the robots and our rational side instead.