Tuesday 5 November 2013

ROC Shift Handover Report - October

Africa-Arabia Regional Operations Centre Situation Report and Shift Handover Report for October 2013

It's been a rough month with lots of work for us, and the end of the month always comes too quickly to prepare for it. We had the monthly shift handover meeting over Google+ last Friday the 1st of November where we discussed the progress made during the last month and the issues at hand at our sites for the next shifter on duty. 

Situation Report for SAGrid

As is customary, a short summary of the state of our infrastructure is given here, followed by some comments and feedback on recent work.

Site readiness

Of our seven sites, only two are fully functional (ZA-UJ and ZA-WITS-CORE). Not surprisingly, these are also the only sites, along with ZA-MERAKA, which have passed EGI certification. While local services are running fine at the rest of the sites (ZA-UCT-ICTS, ZA-CHPC, ZA-UFS), they are failing various Nagios tests due to local misconfigurations. In particular, there are strange issues at ZA-CHPC and ZA-UFS, which Sakhile and Fanie respectively are still trying to resolve.

Resource Infrastructure Provider Status

The core services provided by ZA-MERAKA for NGI_ZA are all fully functional and have achieved 100 % availability/reliability. However, we are still struggling to get the accounting records published from certified sites ZA-UJ and ZA-WITS-CORE to the regional instance at Meraka and from there to the central instance at the EGI central accounting portal. Uli is working on that. 

Situation Report for 1LS + TPM Activity

Probably because there were several conferences this month, there was not much activity on the support side. The previous shifter on duty (DZ-eScience Grid) therefore had a pretty light shift. There are, however, ongoing issues, with tickets of varying priority open against all sites. This means we have not met the target specified in our OLA regarding ticket management. This was highlighted during the meeting and impressed upon the next shifter on duty (TERNET).

This is TERNET's first shift and we'll be working hard to ensure that during this month we get their site(s) up and running. 


Updates to the ROC

It's a time of big change in SAGrid and for the ROC as a whole - we are still trying to become familiar with the EGI Operating Procedures and to adapt our legacy procedures to them. While, thanks to EMI, there is plenty of good quality documentation for the middleware, there is still a lot of confusion regarding which procedures and standards apply to the production infrastructure, and many of the sites in Africa have legacy configurations which are affecting performance. It is the role of the Regional Operations Centre to bring order to this situation and to provide as accurate an overview as possible of the current and future status of the infrastructure, through monitors and planning. The input from sites is of course essential to this, as is a timeous and accurate response from their side to calls for updates, etc.

We need to have a pretty serious re-design of the ROC website in the light of the work we've done and think carefully about how we want to expose the services run by it to the grid-ops community as well as the wider HPC, network and data infrastructures. 

Integration of new African Sites into the ROC

There is a renewed focus on finalising the MoU with EGI that will allow the CSIR to represent all African and Arabian sites when interoperating with EGI. This is essentially the same MoU as was signed for the integration of NGI_ZA, with the difference that all sites in the region covered by the ROC could be registered in the GOCDB. Meraka would continue to play the coordination role by ensuring that these sites adhere to the necessary procedures, while acting as a liaison between African and Arabian technical and scientific communities and their European counterparts, in the grid context.
 

Quality, Robots and the Coding Public

If you've been following our work recently, you'll have heard about "SAGrid-2.0". If not, it's basically judicious employment of Jenkins, Ansible/Puppet/etc, CVMFS and GitHub. In the last two weeks, a lot of that has been coming together quite nicely, after a few days spent with colleagues in Bloemfontein at the ZA-UFS site. We've been hacking away at the Ansible playbooks on GitHub, and these are going to be tested at Meraka before being released. Meanwhile, we're still calling for the Puppet modules in use by our site admins to be contributed to the SAGridOps repo on GitHub...

A Jenkins instance was installed and configured for our needs, and this was used to write a few basic tests for building our supported applications. This CI approach is proving to be very flexible indeed and we're starting to converge on a strategy for a highly automated quality assurance chain which gives freedom to application maintainers and developers to do their work without intervention from the Ops side. The goal is to provide an endpoint to the "public" (in this case - the coding public who are interested in using the grid to run their or their community's applications) to automatically run predetermined tests to see whether an application will build and run on a standard worker node. A few other environments are also being conceived beyond this boring old vanilla setup : 
  • the GPGPU-enabled worker node
  • the "HPC" worker node with infiniband, OMP, MPI, etc available
  • the "next-gen" worker node, which will have the latest OS and middleware (untested) installed
Of course, input as to what you want to have your code build and execute on is welcome. 
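To make the idea a bit more concrete, here is a rough sketch in Python of what such a test matrix could look like. It is not our actual Jenkins configuration: the environment labels, the step names and the `runner` callable (which would hand a step off to a worker node of the right flavour) are all placeholders invented for illustration.

```python
from collections import namedtuple

# Hypothetical labels for the worker-node flavours described above.
ENVIRONMENTS = ["vanilla", "gpgpu", "hpc", "next-gen"]

# Hypothetical test steps: first build the application, then run it.
STEPS = ["build", "run"]

Result = namedtuple("Result", ["environment", "step", "passed"])


def check_application(runner, environments=ENVIRONMENTS, steps=STEPS):
    """Run every step in every environment and collect the outcomes.

    `runner` is any callable taking (environment, step) and returning
    True on success. How it reaches a real worker node is left open.
    """
    results = []
    for env in environments:
        for step in steps:
            passed = bool(runner(env, step))
            results.append(Result(env, step, passed))
            if not passed:
                # No point trying to run something that did not build.
                break
    return results


def summarise(results):
    """Map each environment to True only if all its recorded steps passed."""
    summary = {}
    for r in results:
        summary[r.environment] = summary.get(r.environment, True) and r.passed
    return summary
```

A maintainer would then only need to look at the summary to see on which kinds of worker node their application is good to go.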

Once code has passed functional tests, it should be moved from the testing area to the staging area, where final checks are run before it moves into the production repository. Since this will be a CVMFS repository, that application will then suddenly be everywhere. Assuming, of course, that our site admins actually have the repo mounted !
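The promotion path described above is simple enough to sketch in a few lines of Python. Again, this is only an illustration under our own assumptions: the stage names follow the text, but the `promote` and `is_repo_mounted` helpers are made up for this post, and the repository name you pass in would be whatever we end up calling ours. The only thing taken as given is that CVMFS repositories show up under /cvmfs/ on a client.

```python
import os

# Stages an application passes through, in order, as described above.
STAGES = ["testing", "staging", "production"]


def promote(current_stage, checks_passed):
    """Return the stage an application should be in after its checks.

    An application only moves forward one stage at a time, and only if
    the checks for its current stage passed. Production is the last stop.
    """
    if current_stage not in STAGES:
        raise ValueError("unknown stage: %r" % (current_stage,))
    index = STAGES.index(current_stage)
    if not checks_passed or index == len(STAGES) - 1:
        return current_stage
    return STAGES[index + 1]


def is_repo_mounted(repo_name, root="/cvmfs"):
    """Rough check that a repository directory is visible on this node.

    `repo_name` is a placeholder; with autofs, simply looking at the
    path is usually what triggers the mount in the first place.
    """
    return os.path.isdir(os.path.join(root, repo_name))
```

The second helper is the sort of thing a site admin could drop into a local sanity check, so that "suddenly everywhere" really does include their site.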

Upcoming meetings and conferences

There's plenty of other fun stuff coming up this month :

It's November : Keep on crunching !



