Hello, Operator - What is Your Emergency ?: 2014

Tuesday, 4 March 2014

Executing AAROC with Ansible

TL;DR : DevOps is helping us quickly deploy new sites in Africa. KENET and TERNET are blazing a trail. Ansible is pretty awesome, but can be improved.

We've almost finished the integration of Africa-Arabia ROC with EGI.eu. While we're waiting for the MoU to pass through the CSIR's approval chain, colleagues at KENET and TERNET have been preparing their sites for grid deployment.

More grid for Africa

"The grid" as a cultural/scientific phenomenon may have peaked in most parts of the world, but it never even started here in Africa. Underlying this was not the lack of desire, scientific know-how, or important problems to address with computational infrastructures. These are certainly present throughout Africa, and we're finding more and more examples as we go. What was missing was the underlying support structures - the network, funding schemes, mobility, a shared goal; all of which are necessary for collaboration.

These are having a bright, perhaps transient moment in our region right now, and we need to take advantage of the various support activities while they are in the limelight, since nothing guarantees that they will continue as such (alas, on the contrary). However, perhaps the most concrete development so far is not the computing paradigm, or the technology being used, but the methodology adopted (at least by our little collaboration - I don't want to speak for others who might be a bit more evolved than we are !). By insisting that the grid was complicated, and required dedicated site administrator training prior to any sites being installed, we were given a perfect excuse for delaying this work.

What took you so long !?

We've spoken about this for about 3 years now, usually getting all excited about integration of the sites around the time Ubuntunet Connect rolls around... but so far it just hasn't come together. There are good and not-so-good reasons for this. Good reasons include :

lack of competent Certificate Authorities in the region
overhead in training local site admins in grid middleware
overhead in manually operating sites and ensuring availability
connectivity issues

These barriers have all been drastically reduced recently, in some places if not all. The distribution of hardware available to do scientific computing also varies widely, but we've got two very competent and capable NRENs in the form of KENET and TERNET to work with. We've built up a lot of trust over recent years working with their engineers on various projects, and it doesn't hurt that they're both members of the ei4Africa project.

Finally, the CHAIN-REDS project has strongly supported this integration between EGI.eu and African NGIs (if you want to call them that), first in the case of NGI_ZA in South Africa and now in the case of AfricaArabia. Being integrated into EGI.eu's operational tools gives us a lot more power to understand what's wrong with our sites and preemptively fix them.

Ok, but gridding is still hard !

True, but we're a lot speedier now than when SAGrid was first starting out, for two reasons :

years of experience
manual labour is for making olive oil, not ICT infrastructure

Yeah, we're DevOps now... This is awesome in so many ways. Let me count them

1) Clear starting point

Using the GOCDB, the bare minimum for deploying a site is very clear. You need site information, a few IPs and, assuming that the hosts are running a supported OS, we're good to go. In the case of KENET's site deployment, this was done in a matter of 2 emails and a screenshot... we could have done it in one go, but the site admins didn't have access to the GOCDB (no IGTF cert).
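To make that "bare minimum" concrete, here is a minimal sketch of the kind of check we could run on a site request before touching anything. It is not part of our playbooks or of the GOCDB: the `SiteRequest` fields, the `missing_items` helper and the set of supported OS names passed in are all hypothetical, and the addresses come from the documentation range.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class SiteRequest:
    """The bare minimum a site sends us (field names are hypothetical)."""
    name: str
    contact_email: str
    host_ips: list
    os_name: str


def missing_items(req, supported_os):
    """Return a list of problems; an empty list means we are good to go."""
    problems = []
    if not req.name.strip():
        problems.append("site name is empty")
    if "@" not in req.contact_email:
        problems.append("contact email looks invalid")
    if not req.host_ips:
        problems.append("no host IPs given")
    for ip in req.host_ips:
        try:
            ipaddress.ip_address(ip)  # raises ValueError on malformed input
        except ValueError:
            problems.append(f"not a valid IP address: {ip}")
    if req.os_name not in supported_os:
        problems.append(f"unsupported OS: {req.os_name}")
    return problems


# Placeholder values only; the supported OS list depends on the middleware release.
request = SiteRequest("EXAMPLE-SITE", "admin@example.org",
                      ["192.0.2.10", "192.0.2.11"], "example-os")
print(missing_items(request, supported_os={"example-os"}))
```

An empty list back means the request has everything needed to start; anything else is a ready-made reply for the next email.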

2) Less complexity

We've developed the playbooks necessary to get a site from OS to grid in a few simple steps. One credential, no "follow this extremely complicated guide, which will obviously leave out some major details, or special case that applies to you". Speaking of which...

3) Better understanding of everything that is needed to deploy a site

The great thing about DevOps in general is that everything=code. We can thus code for these special cases. What is more, since Ansible in particular is very easy to read, a new site admin can very quickly understand how a certain state was reached.

4) Easier to collaborate

Everything's version-controlled and kept in a Github repo. Need I say more ? (probably, but let's leave that to a later edition).

5) Try before you buy

Before deploying services into a production environment, we can and do test them first in a development environment with machines provisioned in a private cloud. This helps a lot to speed up the development of the DevOps code, as well as ensure that moves from one release of middleware/OS etc to the next are smooth and go as expected.

East African sites

Right now we're working on deploying two sites which have been on our radar (as AfricaGrid) for a very long time, at the University of Nairobi and the Tanzanian Dar es Salaam Institute of Technology. There are small clusters there in active use by researchers, running mainly bioinformatics, mathematics and weather research applications. We've added them to the GOCDB already and the Ansible accounts have been provisioned by the local site admins.

Dealing with details

Of course, this is a litmus test for our preaching about how DevOps is the future and everything is easy now. We ourselves forked the Ansible code from GRNET (and Wits is using CERN's puppet modules), having adapted it to our local needs. This is almost as simple as changing a few variables, but we also need to make as few assumptions on the OS and installed packages as possible. At KENET for example, the machines came with no SELinux... you would assume that this is great, since the first thing the EMI docs tell you is to turn it off - but it actually breaks a lot of the orchestration code.

The same thing goes for the firewall - we assume that iptables is installed, so if it's not we have to provision it and then configure it. Then, we've got the issue of previously-installed cluster management systems like ROCKS to deal with. This is the case at DIT, and was also the case at SAAO... by no means do we have the right as a federation to insist on specific software at a site, apart from the basic middleware needed to respect the interoperability standards, so we have to write playbooks and modules to configure services taking into account what's already there.

These special cases are highlighting how much we actually assume about the prior state of a cluster, and the power of DevOps to plan, test and ensure a smooth deployment.
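One way to stop assuming is to ask the machine first. The sketch below is not taken from our playbooks; it only illustrates a pre-flight check that looks for the commands the deployment code might take for granted. The `ASSUMED_COMMANDS` mapping is an assumption made for illustration (`getenforce` for SELinux, `iptables` for the firewall), and `preflight` and `to_provision` are hypothetical helper names.

```python
import shutil

# Components the deployment code might silently assume, and a command whose
# presence on PATH suggests the component is installed. Illustrative only.
ASSUMED_COMMANDS = {
    "selinux": "getenforce",
    "firewall": "iptables",
}


def preflight(which=shutil.which, assumed=ASSUMED_COMMANDS):
    """Map each assumed component to True if its command was found on PATH."""
    return {name: which(cmd) is not None for name, cmd in assumed.items()}


def to_provision(report):
    """List the components that have to be installed before configuring them."""
    return sorted(name for name, present in report.items() if not present)


report = preflight()
print("needs provisioning:", to_provision(report))
```

Passing `which` in as a parameter keeps the check testable on a laptop that looks nothing like the target cluster.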

Stay tuned

These are the first sites outside of South Africa that we are deploying directly as a regional infrastructure (sites in North Africa are migrating middleware right now, but have always been part of a regional infrastructure). We want to do it right; we're going step by verified step, so that when we find new sites in Africa - and there are a huge number of them ! - we can integrate them in a flash.

Stay tuned.

Wednesday, 26 February 2014

Towards the Africa-Arabia Regional Operations Centre.

We've made some progress towards the certification of the AfricaArabia Regional Operations Centre.

The AfricaArabia Regional Operations Centre (AAROC) is the container project which aims to coordinate the operations of grid, HPC and cloud sites in Africa and Arabia, essentially continuing and extending the work of EUMedGrid and SAGrid. This is essential to interoperability between infrastructures in our region and those in other parts of the world. It is done with EGI.eu as a trusted party in providing certain global services.

The ROC has existed as an informal collaboration since the days of the original CHAIN project. Indeed there are several other ROCs in the world (AsiaPacific, ROC_LA for Latin America, etc) which were originally conceived to federate the operations of grid infrastructure for the Worldwide LHC Computing Grid. Along with the evolution of EGEE into EGI, these evolved into Operations Centres (OC) - the term now used to refer to self-contained infrastructures which need to interoperate, be they national or regional in scope.

We've previously created an OC for South Africa (NGI_ZA) which we've successfully operated since 2011. We now have been mandated to extend this to the rest of the region, via the CHAIN-REDS project.

Technical vs policy requirements

This work consists of two levels - technical and policy. The technical part is actually very easy - in fact it's been almost codified by EGI.eu through years of experience into an easy-to-follow procedure. However, before the technical work can start, a policy framework needs to be created and agreed upon. This is summarised in PROC 02 as follows:

Political Validation

CASE 1. If an Operations Centre is already represented within the EGI Council and is ready to move from an EGEE ROC to an operational Operations Centre, we recommend that the Operations Centre political representative within the EGI Council notifies the EGI Chief Operations Officer that the respective Operations Centre is entering its validation cycle. At this point, technical validation can start.
CASE 2. If an Operations Centre is not represented within the EGI Council, and it is willing to be represented there, the Operations Centre needs to submit a request for admission to the Council. After the Operations Centre has been accepted by the Council, CASE 1 applies.
CASE 3. If a new Operations Centre is not represented within the EGI Council and is not interested in being part of it, but would still like to be a consumer of the EGI Global Services, then an MoU must be established with EGI. Once an MoU is in place technical validation can start.

Since Africa and Arabia are of course not represented in the EGI council, nor do we intend to be, we fall in Case 3, meaning we have to sign an MoU with EGI.eu. As previously mentioned, this work has been under way for quite a while now, due to the widely distributed nature of the collaboration.
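The three cases boil down to two yes/no questions, which can be sketched as a tiny decision function. This is only our reading of the text quoted above, not anything published by EGI.eu; the function name, argument names and the wording of `NEXT_STEP` are ours.

```python
def political_validation_case(represented_in_council, wants_representation):
    """Return which of the three cases quoted above applies (1, 2 or 3)."""
    if represented_in_council:
        return 1
    if wants_representation:
        return 2
    return 3


# What has to happen before technical validation can start, per case.
NEXT_STEP = {
    1: "notify the EGI Chief Operations Officer",
    2: "request admission to the Council, then follow case 1",
    3: "establish an MoU with EGI",
}

# Africa-Arabia: not represented in the Council, and not planning to be.
case = political_validation_case(False, False)
print(case, "-", NEXT_STEP[case])
```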

Even at the national level, there's no single entity which formally represents the resource providers in a federated infrastructure, much less at a regional or continental level. In South Africa, we had the luxury of a flagbearing institute which could take operational responsibility for the interoperability and thus sign the MoU with EGI.eu, which specifies, amongst others, the operating and service levels.

We've done the simplest thing possible to implement AAROC as a point of interoperability between African e-Infrastructures and their European and other counterparts - the CSIR's Meraka Institute will once again sign the MoU with EGI.eu. This addresses the policy issue in a simple way, but does leave potential issues related to the internal cohesion of the ROC members. Better to have something in place than nothing at all...

What's the holdup ?

Currently the MoU is going up the ranks of the CSIR routing procedure. It has been vetted by the CSIR's legal department previously, with no issues found. The MoU has operational implications for SAGrid/SANREN, but beyond that no financial or IP implications, which makes things a bit easier. Nonetheless, since collaboration is at the base of technical collaboration in Africa, we need the MoU to have the full internal support and understanding of the Meraka Institute. We're currently awaiting final signature by the Meraka Acting Director.

Ok, take me to the fun bits

The technical procedure is much simpler and is proceeding in parallel to the policy procedure. This essentially involves the creation of the AfricaArabia ROC as an operations centre in the various global services of EGI :

GOCDB - the Global Operations Database which contains all the sites and respective service endpoints
GGUS - the worldwide helpdesk
Operations Portal - operational overview of the state of the infrastructure
SAM - Service Availability and Monitoring and functional tests of various grid services
Accounting Portal - the central accounting portal

But first...

Of course, creating the OC is very simple, but it needs to be done in a trustworthy way. That's why there are a lot of prerequisites before embarking on the technical work. Here's a quick roundup of where we stand on those prerequisites.

Make sure your NGI is able to fulfill RP OLA https://documents.egi.eu/secure/ShowDocument?docid=463

The OLA has been already satisfied for NGI_ZA, so this is just a change of scope for us.

Decide about the Operations Centre name. Name for European Operations Centres should start with "NGI_"

We've chosen AfricaArabia

Decide whether to use the Operations Centre's own help desk system or use GGUS directly. If the Operations Centre wants to set up their own system they need to provide an interface for interaction to GGUS with the local ticketing system and follow the recommendations available at https://ggus.eu/pages/ggus-docs/interfaces/docu_ggus_interfaces.php.

We're using GGUS directly. We used to have a regional (xGUS) instance, but support has been dropped for this and we need to be able to easily escalate and route tickets. The easiest way to do this is to create a support unit in GGUS.

Set contact points -

a set of mailing lists for site directors, operations, security, etc have been created under the domain africa-grid.org

All certified Operations Centre sites need to be under Nagios monitoring.

ours is already configured, but the region needs to be changed from NGI_ZA to AfricaArabia

Fill the FAQ document for the Operations Centre -

see https://wiki.egi.eu/wiki/GGUS:AfricaArabia_FAQ

Staff in the Operations Centre that should be granted a management role

Go ahead and request your role (select NGI AfricaArabia and role ROD)

Staff in Operations Centre is familiar with Operational Procedures
EGI adds people to relevant mailing lists.

The whole procedure is summarised in the table below :

Since we have already added at least part of the operations team to the ops and dteam VO, and our nagios instance is up - we just need to change the OC in it from NGI_ZA to AfricaArabia - this puts us on Step 6 of PROC02

What's next ?

The next step from our side is the validation of the nagios instance. This will take a few days, and in the meantime COD will be checking our information. Once that's done, we'll move the resource centres currently in NGI_ZA into AfricaArabia and decommission NGI_ZA.

Of course, this all assumes we can expect a speedy response from the CSIR's leadership and signature of the MoU !

Monday, 6 January 2014

Renew, Refactor, Release, Repeat. Now you are Ready.

In this issue: What have we been up to from October to December ?; The end of year collapse (ie, Christmas for sysadmins); updates on our executable infrastructure; thoughts on collaboration; why it's called SAGrid-2.0; and happy new year !

Collapse

TL;DR - the end of the academic year is hell, but then it's ok. for about a week.

One of the ironies of writing updates for an active community is that the more activity there is, the less time there is to write about it and keep all the interested parties informed. The last few months, from October to December 2013, were - at least for me - a blur of meetings, planning, conferences, and other activities, including some development work here and there for new services on the grid.

It was indeed an epic mission, kicked off by the first (and ambitiously-titled) e-Research Africa conference, organised by ASAUDIT and well-attended by members of several Australasian and European e-research infrastructures. From our side, we gave a good series of presentations on the Regional Operations Centre and SAGrid-2.0. There was also a hefty contingent of presentations from the CHAIN-REDS project, presenting work done on the science gateways, semantic search of data and document repositories, and the interoperability of data and computing infrastructures. All of this has led to some interesting new links between the CHAIN-REDS project and the University of Cape Town.

Renewal.

TL;DR - A new HPC forum is created. SAGrid-2.0 is taking shape; First Release 02/14

HPC Forum is created

Without further ado, a quick reminder that a forum for HPC system administrators has been created at the e-Research Africa conference and held its first meeting at the CHPC conference in December in Cape Town. It's a place specifically for HPC sysadmins to get together and share their experience and issues, and is supported by the AfricaGrid Regional Operations Centre (via the CHAIN-REDS project for now). Anyone interested in joining can get hold of Peter van Heusden at SANBI.

Raising our game

We've written about this before, but a major theme for 2014 is renewal. We started this project in 2009 to build a federated, distributed computing infrastructure for South African research communities, based on the paradigm of grid computing. That paradigm itself has seen fundamental changes - not only in the technology which is used to implement it, but also - and perhaps more importantly - in the community and business models that are being adopted to operate it.
Thanks to the closer collaboration with EGI.eu via the research infrastructure provider MoU's that the CSIR has signed with them, we are able to better understand and hence adopt best practice for distributed computing infrastructures. This also means that our inadequacies as an infrastructure and as resource-providing sites are put into stark relief, since we have to publish our availability and reliability (amongst other things, such as accounting data) to EGI.eu. We'll just have to overlook the fact that during December, as an infrastructure our A/R was close to 3%... ouch. Yes, people were on holiday, but no, that's not an excuse. We'll have to make great strides towards improving this over the next few weeks, especially at the site level - only 2 of 6 sites are fully functional and passing tests. This is entirely unacceptable and while all our site admins need to take some of the responsibility for this, the final onus is on the Regional Operator on Duty (ROD). Clearly this role is not yet mature and needs some work. Documentation, Training, Certification.
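For anyone wondering what goes into an A/R figure, here is a rough sketch. It assumes the commonly used definitions (availability is time up over the time for which the status is known; reliability additionally excludes scheduled downtime), so check the monitoring documentation for the exact formulas applied to our sites. The hours below are made up to reproduce a figure of about 3%; they are not our actual monitoring data.

```python
def availability(up_hours, total_hours, unknown_hours=0.0):
    """Fraction of the known time during which the service was up."""
    known = total_hours - unknown_hours
    if known <= 0:
        raise ValueError("no known time in the reporting period")
    return up_hours / known


def reliability(up_hours, total_hours, scheduled_downtime_hours=0.0,
                unknown_hours=0.0):
    """Like availability, but scheduled downtime is not held against the site."""
    expected = total_hours - scheduled_downtime_hours - unknown_hours
    if expected <= 0:
        raise ValueError("no expected uptime in the reporting period")
    return up_hours / expected


# Illustrative numbers only: a 31-day month with the service up 3% of the time.
HOURS_IN_DECEMBER = 31 * 24
up = 0.03 * HOURS_IN_DECEMBER
print(f"availability: {availability(up, HOURS_IN_DECEMBER):.1%}")
print(f"reliability with 6 days declared downtime: "
      f"{reliability(up, HOURS_IN_DECEMBER, scheduled_downtime_hours=144):.1%}")
```

Under these definitions, declaring downtime before going on holiday would lift the reliability figure, but it does nothing for availability.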

Robots' running on the stage

Wouldn't it be great if we could eliminate humans from our operations entirely ? I mean... just wind the grid up and watch it go ? That would be rad. No, I submit that that would be très rad.

Avid readers (ha ha) will recall that we started some work on this back in 2013, forking the GRNET repo for executable infrastructure on github. This is a set of ansible playbooks and other code for deploying UMD or EMI middleware from the OS up. Of course, having a robot like Ansible is no good for you unless you know what to do with it. And to get any experience, you need a playground. This is where having a private cloud manager comes in very handy indeed, giving you the ability to easily define a network range which is only for development and/or staging purposes and testing your code out on that instead of touching the production machines. Textbook stuff, made easy - and indeed that's what we've done at Meraka with the OpenNebula installation there.

Although we're far from done in preparing a set of playbooks which can be run by any site admin on the grid at their site, we have set up all the necessary tools:

a distributed source code repo which any site can fork, modify and if desired contribute back into the main branch - https://www.github.com/AAROC/ansible-for-grid
a safe development and staging environment for said code, where site admins can validate their updates and configurations without nuking anything of actual value... and if things go wrong, they can roll back to a previously tested and validated version

To be entirely honest, that last bit - the testing and validation - is missing at the moment. Indeed, the staging site will be receiving nagios probes, but this is somewhat asynchronous and clunky. What we want to do quite soon is include DevOps tests in the Jenkins CI service under deployment at UFS so that any code commits to the Github repo automatically trigger a functional test of the relevant service. DevOps can then consult the Jenkins dashboard to know whether their code will nuke their site or not.

Let's say you want to reconfigure your site to enable MPI or GPGPU jobs. Following the YAIM documentation, you modify the vanilla playbooks to enable this in the CE, then commit the code back to the repo. Jenkins will see this, and execute the functional test associated with that playbook, which in this case is "deploy the site, update the top-bdii, wait for the site to show up in the top-bdii, send a job to the site, get the output and check whether the job ran properly". If the job didn't run properly, or something died before that, Jenkins would tell you and you would probably have saved a bunch of time and had a far deeper chat about the nature of reality with that weird dude who's always bringing these things up over coffee.
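The functional test described above is really just an ordered list of steps where the first failure stops everything. A minimal sketch of that idea follows; it is not our Jenkins job, and it talks to neither Ansible nor the top-bdii. `run_functional_test` and `wait_until` are hypothetical helpers, and the steps in the example are stubs standing in for the real actions.

```python
import time


def wait_until(predicate, timeout_s, interval_s=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate until it returns True or timeout_s seconds have passed."""
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def run_functional_test(steps):
    """Run (name, callable) steps in order and stop at the first failure.

    A step fails if it returns a falsy value or raises an exception.
    Returns (passed, name_of_failed_step_or_None).
    """
    for name, step in steps:
        try:
            ok = step()
        except Exception:
            ok = False
        if not ok:
            return False, name
    return True, None


# Stub steps standing in for the real actions (all hypothetical).
steps = [
    ("deploy the site", lambda: True),
    ("update the top-bdii", lambda: True),
    ("wait for the site in the top-bdii",
     lambda: wait_until(lambda: True, timeout_s=5)),
    ("send a job", lambda: True),
    ("check the job output", lambda: False),  # pretend the job went wrong
]
passed, failed_at = run_functional_test(steps)
print("passed" if passed else f"failed at: {failed_at}")
```

Knowing which step died is most of the value: "the site never showed up in the top-bdii" is a very different afternoon from "the job ran but produced garbage".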

Now, you tell me that would not be rad. I dare you !

Robots in the clouds, everywhere; trust them.

Also, you tell me how you would do that without having access to some kind of flexible computing infrastructure. For each new test, you need to start pretty much from scratch - having a private cloud which you can interact with programmatically (ie, via some kind of API, preferably a standards-based one like OCCI) - makes your testing environment very powerful and scalable.

However, there is also the issue of special hardware and re-using resources. Take the example above of testing a new site with some fancy GPGPUs: before deploying the site into the grid, we'd like to be sure that our deployment code is going to work and that jobs will run. To do this, we would like to use the same code to test and deploy it (this is one of the things the Ansible guys keep harping on and I'm starting to get it). Now, consider that that fancy new hardware is at one site (say, UCT) and the testing service is at another (UFS). What's a sysadmin to do - entirely replicate one or the other ? Wouldn't it make sense to share and re-use services in the same way we do with resources on the grid, by federating the access to them and ensuring that they expose standard interfaces ? So, UCT can just write the tests for its fancy new kit without having to worry about the whole overhead of actually running the CI service. Now you're thinking with portals ! (sorry).

Release

It's one thing to imagine how these tools - when they're properly implemented - will make everyone's life easier, but it's another thing entirely to actually implement them. For one thing, how do we know when it's done ? How do we know that we can actually trust the code to not break our sites and so on ? Continuous Integration via Jenkins is of course part of the answer to this - so when our Ansible and/or Puppet code (site admins - remember that we need your contributions to the github repo !) is indeed tested with Jenkins we'll be able to tick off a big TODO.

However, how's a site somewhere to know that the code is ready for production ? Then there's the issue of what version of the code you're using to deploy your site; it's not as if we'll just hack away at it and then it'll be finished (god forbid!) - this is going to be a continuous (sorry) exercise.

The answer is, of course, trivial - the code for deploying production sites will be tagged as such in a branch and released. It's kinda corny, I know, in the 2010's (what are we calling this decade again ? - please don't say the twenteenies.), but that's the idea behind calling it "SAGrid-2.0" - because the code that was used to deploy it will have a version number which you can refer to, instead of what we usually do now which is go through a bunch of bash histories and try figure out what the **** happened !
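Once releases are tagged, "which version did we deploy with ?" becomes a question a script can answer. Here is a small sketch that picks the newest release out of a list of tag names. The tag format (`SAGrid-` followed by dotted numbers) is an assumption made for the example, not a naming scheme we have settled on, and `latest_release` is a hypothetical helper.

```python
import re


def latest_release(tags, prefix="SAGrid-"):
    """Return the tag with the highest dotted version number, or None.

    Tags that do not look like '<prefix><number>[.<number>...]' are ignored.
    """
    pattern = re.compile(re.escape(prefix) + r"(\d+(?:\.\d+)*)$")
    releases = []
    for tag in tags:
        match = pattern.match(tag)
        if match:
            # Compare versions numerically, so that 2.10 sorts after 2.9.
            version = tuple(int(part) for part in match.group(1).split("."))
            releases.append((version, tag))
    if not releases:
        return None
    return max(releases)[1]


# Made-up tag names, for illustration only.
print(latest_release(["SAGrid-1.9", "SAGrid-2.0", "SAGrid-2.10", "wip-branch"]))
```

Comparing the version as a tuple of integers rather than as a string is the one detail worth getting right, otherwise 2.9 would sort after 2.10.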

Happy New Year !

So, that's it from the coordinator. I'm looking forward to working with all of you during the year, implementing and improving the services we need to continue serving African and South African research communities. We'll have our first meeting before the end of the month. In the meantime, here's wishing a prosperous, efficient and productive year to all !