Tuesday 15 October 2013

Situation Report - September

Operator : Situation Report

A main goal of this blog is to give the wider public a summary of what we do on a daily basis and what issues we face along the way. I like to think we run a tight ship, although we could probably work a bit to improve our efficiency. One thing we do almost every month is hold a ROC-wide meeting of the site admins to discuss upcoming and current issues, as well as perform the formal handover of the First-Level Support and Ticket Process Management (1LS/TPM) shift to the next support unit on duty.
Hopefully, while writing this blog, we'll get some material together to update our somewhat decrepit website. It's not our fault - we're just always busy ! It will get a fix soon.

What's happening on the grid ?

If only I had a rand/beer for every time I got asked that question ! It is indeed hard to know what is going on at each site and what the current issues are - and this is precisely the point of the GridOps meetings. Since the creation of the ROC, we've started including more and more sites from elsewhere on the African continent in our SitRep meetings. These started off on a weekly basis and were more of a group chat/support group than a real operations meeting. By now we've got things down to an almost automated procedure, so that we can meet once a month for under 30 minutes and exchange just the right amount of information. We even have a draft procedure which describes what should be done, as well as a FAQ for those in the hot seat. For the really curious (and insomniac), the workflow is on the right.

Big steps

If you were in the meeting at the beginning of the month, you'll know that we finalised the integration with EGI.eu, although some sites are still undergoing certification. The SAGrid NGI monitor (based on Nagios) has been keeping an eye on all of the services we publish in the GOCDB, and our sites are showing up in the operations portal. Although this may sound somewhat boring, it took about 2 years of work to get our sites and infrastructure up to par. Thanks again to everyone who worked on this ! The main issue now is to ensure that we maintain the commitment to our OLA.

How are we doing ? 

The Operating Level Agreement agreed to by our sites forms the basis for the certification procedure, whereby they are included in the production infrastructure. It sounds hardcore, but it's really just a way of agreeing on what our sites will be capable of, and it's pretty reasonable. More about these metrics in subsequent posts, hopefully, but suffice it to say for now that we're meeting one very important target - response time to issues :


