Wednesday 26 February 2014

Towards the Africa-Arabia Regional Operations Centre.

We've made some progress towards the certification of the AfricaArabia Regional Operations Centre.


The AfricaArabia Regional Operations Centre (AAROC) is the container project which aims to coordinate the operations of grid, HPC and cloud sites in Africa and Arabia, essentially continuing and extending the work of EUMedGrid and SAGrid. This is essential to interoperability between infrastructures in our region and those in other parts of the world. It is done with EGI.eu as a trusted party in providing certain global services.

The ROC exists as an informal collaboration since the days of the original CHAIN project. Indeed there are several other ROCs in the world (AsiaPacific, ROC_LA for Latin America, etc) which were originally conceived to federate the operations of grid infrastructure for the worldwide LHC Compute Grid. Along with the evolution of EGEE into EGI, these were evolved into Operations Centres (OC) - the term now used to refer to self-contained infrastructures which need to interoperate, be they national or regional in scope. 

We've previously created an OC for South Africa (NGI_ZA) which we've successfully operated since 2011. We now have been mandated to extend this to the rest of the region, via the CHAIN-REDS project. 

Technical vs policy requirements

This work consists of two levels - technical and policy. The technical part is actually very easy - in fact it's been almost codified by EGI.eu through years of experience into an easy-to-follow procedure. However, before the technical work can start, a policy framework needs to be created and agreed upon. This is summarised in PROC 02 as follows:  
Political ValidationCASE 1. If an Operations Centre is already represented within the EGI Council and is ready to move from an EGEE ROC to an operational Operations Centre, we recommend that the Operations Centre political representative within the EGI Council notifies the EGI Chief Operations Officer that the respective Operations Centre is entering its validation cycle. At this point, technical validation can start.
CASE 2. If an Operations Centre is not represented within the EGI Council, and it is willing to be represented there, the Operations Centre needs to submit a request for admission to the Council. After the Operations Centre has been accepted by the Council, CASE 1 applies.
CASE 3. If a new Operations Centre is not represented within the EGI Council and is not interested in being part of it, but would still like to be a consumer of the EGI Global Services, then an MoU must be established with EGI. Once an MoU is in place technical validation can start.
Since Africa and Arabia are of course not represented in the EGI council, nor do we intend to be, we fall in Case 3, meaning we have to sign an MoU with EGI.eu. As previously mentioned, this work has been under way for quite a while now, due to the widely distributed nature of the collaboration.

Even at the national level, there's no single entity which formally represents the resource providers in a federated infrastructure, much less at a regional or continental level. In South Africa, we had the luxury of a flagbearing institute which could take operational responsibility the interoperability and thus sign the MoU with EGI.eu which specifies, amongst others, the operating and service levels. 

We've done the simplest thing possible to implement AAROC as a point of interoperability between African e-Infrastructures and their European and other counterparts - the CSIR's Meraka Institute will once again sign the MoU with EGI.eu. This addresses the policy issue in a simple way, but does leave potential issues related to the internal cohesion of the ROC members. Better to have something in place than nothing at all...

What's the holdup ? 

Currently the MoU is going up the ranks of the CSIR routing procedure. It has been vetted by the CSIR's legal department previously, with no issues found. The MoU has operational implications for SAGrid/SANREN, but beyond that no financial or IP implications, which makes things a bit easier. Nonetheless, since collaboration is at the base of technical collaboration in Africa we need the MoU to undergo the full internal support and understanding of the Meraka Institute. We're currently awaiting final signature by the Meraka Acting Director. 

Ok, take me to the fun bits

The technical procedure is much simpler and is proceding in parallel to the policy procedure. This essentially involves the creation of the AfricaArabia ROC as an operations centre in the various global services of EGI 
  1. GOCDB - the Global Operations Database which contains all the sites and respective service endpoints
  2. GGUS - the worldwide helpdesk
  3. Operations Portal - operational overview of the state of the infrastructure
  4. SAM - Service Availability and Monitoring and functional tests of various grid services
  5. Accounting Portal - the central accounting portal

But first...

Of course, creating the OC is very simple, but it needs to be done in a trustworthy way. That's why there are a lot of prerequisites before embarking on the technical work. Here's a quick roundup of where we stand on those prerequisites.

  1. Make sure your NGI is able to fulfill RP OLA https://documents.egi.eu/secure/ShowDocument?docid=463
    1. The OLA has been already satisfied for NGI_ZA, so this is just a change of scope for us.
  2. Decide about the Operations Centre name. Name for European Operations Centres should start with "NGI_"
    1. We've chosen AfricaArabia
  3. Decide whether to use the Operations Centre's own help desk system or use GGUS directly. If the Operations Centre wants to set up their own system they need to provide an interface for interaction to GGUS with the local ticketing system and follow the recommendations available at https://ggus.eu/pages/ggus-docs/interfaces/docu_ggus_interfaces.php.
    1. We're using GGUS directly. We used to have a regional (xGUS) instance, but support has been dropped for this and we need to be able to easily escalate and route tickets. The easiest way to do this is to create a support unit in GGUS.
  4. Set contact points 
    1. a set of mailing lists for site directors, operations, security, etc have been created under the domain africa-grid.org
  5. All certified Operations Centre sites need to be under Nagios monitoring. 
    1. ours is already configured, but the region needs to be changed from NGI_ZA to AfricaArabia
  6. Fill the FAQ document for the Operations Centre - 
    1. see  https://wiki.egi.eu/wiki/GGUS:AfricaArabia_FAQ
  7. Staff in the Operations Centre that should be granted a management role
    1. Go ahead and request your role (select NGI AfricaArabia and role ROD
  8. Staff in Operations Centre is familiar with Operational Procedures
  9. EGI adds people to relevant mailing lists..
The whole procedure is summarised in the table below : 

Since we have already added at least part of the operations team to the ops and dteam VO, and our nagios instance is up - we just need to change the OC in it from NGI_ZA to AfricaArabia - this puts us on Step 6 of PROC02

What's next ? 

The next step from our side is the validation of the nagios instance. This will take a few days, and in the meantime COD will be checking our information. Once that's done, we'll move the resource centres currently in NGI_ZA into AfricaArabia and decommission NGI_ZA.

Of course, this all assumes we can expect a speedy response from the CSIR's leadership and signature of the MoU !