Monday 6 January 2014

Renew, Refactor, Release, Repeat. Now you are Ready.

In this issue: what have we been up to from October to December?; the end-of-year collapse (i.e., Christmas for sysadmins); updates on our executable infrastructure; thoughts on collaboration; why it's called SAGrid-2.0; and a happy new year!

Collapse 


TL;DR - the end of the academic year is hell, but then it's OK. For about a week.


One of the ironies of writing updates for an active community is that the more activity there is, the less time there is to write about it and keep all the interested parties informed. The last few months, from October to December 2013, were - at least for me - a blur of meetings, planning, conferences, and other activities, including some development work here and there for new services on the grid. 
It was indeed an epic mission, kicked off by the first (and ambitiously titled) e-Research Africa conference, organised by ASAUDIT and well attended by members of several Australasian and European e-research infrastructures. From our side, we gave a good series of presentations on the Regional Operations Centre and SAGrid-2.0. There was also a hefty contingent of presentations from the CHAIN-REDS project, covering work done on science gateways, semantic search of data and document repositories, and the interoperability of data and computing infrastructures. All of this has led to some interesting new links between the CHAIN-REDS project and the University of Cape Town.

Renewal


TL;DR - a new HPC forum has been created; SAGrid-2.0 is taking shape; first release in February 2014.


HPC Forum is created

Without further ado, a quick reminder that a forum for HPC system administrators was created at the e-Research Africa conference and held its first meeting at the CHPC conference in Cape Town in December. It's a place specifically for HPC sysadmins to get together and share their experience and issues, and is supported by the AfricaGrid Regional Operations Centre (via the CHAIN-REDS project for now). Anyone interested in joining can get hold of Peter van Heusden at SANBI. 

Raising our game

We've written about this before, but a major theme for 2014 is renewal. We started this project in 2009 to build a federated, distributed computing infrastructure for South African research communities, based on the paradigm of grid computing. That paradigm itself has seen fundamental changes - not only in the technology used to implement it, but also - and perhaps more importantly - in the community and business models being adopted to operate it.
Thanks to the closer collaboration with EGI.eu via the research infrastructure provider MoUs that the CSIR has signed with them, we are able to better understand and hence adopt best practice for distributed computing infrastructures. This also means that our inadequacies as an infrastructure and as resource-providing sites are put into stark relief, since we have to publish our availability and reliability (amongst other things, such as accounting data) to EGI.eu.

We'll just have to overlook the fact that during December our A/R as an infrastructure was close to 3%... ouch. Yes, people were on holiday, but no, that's not an excuse. We'll have to make great strides towards improving this over the next few weeks, especially at the site level - only 2 of 6 sites are fully functional and passing tests. This is entirely unacceptable, and while all our site admins need to take some of the responsibility for it, the final onus is on the Regional Operator on Duty (ROD). Clearly this role is not yet mature and needs some work. Documentation, Training, Certification. 

Robots running on the stage


Wouldn't it be great if we could eliminate humans from our operations entirely? I mean... just wind the grid up and watch it go? That would be rad. No, I submit that that would be très rad. 
Avid readers (ha ha) will recall that we started some work on this back in 2013, forking the GRNET repo for executable infrastructure on GitHub. This is a set of Ansible playbooks and other code for deploying UMD or EMI middleware from the OS up. Of course, having a robot like Ansible is no good for you unless you know what to do with it. And to get any experience, you need a playground. This is where having a private cloud manager comes in very handy indeed, giving you the ability to easily define a network range reserved for development and staging, and to test your code out on that instead of touching the production machines. Textbook stuff, made easy - and indeed that's what we've done at Meraka with the OpenNebula installation there.
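To give a flavour of what such a playbook looks like, here is a minimal sketch of the kind of play that takes a staging node from a bare OS to a configured middleware service. Treat the host group, package names and file paths below as illustrative placeholders rather than the actual contents of the AAROC repo, and note that the play assumes the EMI package repositories are already configured on the node:

    ---
    # Illustrative sketch only: deploy and configure a CREAM CE on a staging node.
    # The host group, package names and paths are placeholders.
    - name: Deploy a CREAM computing element on the staging network
      hosts: staging_creamce
      remote_user: root
      tasks:
        - name: Install the CE metapackage and YAIM
          yum: name={{ item }} state=present
          with_items:
            - emi-cream-ce
            - glite-yaim-core

        - name: Make sure the YAIM configuration directory exists
          file: path=/root/siteinfo state=directory

        - name: Push the site configuration file
          template: src=site-info.def.j2 dest=/root/siteinfo/site-info.def

        - name: Run YAIM to configure the node
          command: /opt/glite/yaim/bin/yaim -c -s /root/siteinfo/site-info.def -n creamCE

Run against a staging-only inventory (e.g. ansible-playbook -i staging-hosts deploy-ce.yml), the same play can be pointed at production hosts once it has been validated.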
Although we're far from done in preparing a set of playbooks which any site admin on the grid can run at their own site, we have set up all the necessary tools:
  • a distributed source code repo which any site can fork, modify and if desired contribute back into the main branch - https://www.github.com/AAROC/ansible-for-grid 
  • a safe development and staging environment for said code, where site admins can validate their updates and configurations without nuking anything of actual value... and if things go wrong, they can roll back to a previously tested and validated version
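To make that last bullet a little more concrete, here is a hedged sketch of what a roll-back could look like in practice: the deployment code itself lives in a checkout on an admin node, and rolling back simply means checking out the last tag that was known to work before re-running the playbooks. The host group, destination path and tag name are hypothetical:

    ---
    # Illustrative sketch only: pin the deployment code to a known-good tag.
    # The host group, destination path and tag name are hypothetical.
    - name: Roll the staging playbooks back to the last validated version
      hosts: staging_admin
      tasks:
        - name: Check out the previously tested and validated tag
          git: repo=https://github.com/AAROC/ansible-for-grid.git
               dest=/opt/ansible-for-grid
               version=staging-validated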
To be entirely honest, that last bit - the testing and validation - is missing at the moment. Indeed, the staging site will be receiving Nagios probes soon, but this is somewhat asynchronous and clunky. What we want to do quite soon is include DevOps tests in the Jenkins CI service under deployment at UFS, so that any code commit to the GitHub repo automatically triggers a functional test of the relevant service. DevOps can then consult the Jenkins dashboard to know whether their code will nuke their site or not. 
Let's say you want to reconfigure your site to enable MPI or GPGPU jobs. Following the YAIM documentation, you modify the vanilla playbooks to enable this on the CE, then commit the code back to the repo. Jenkins will see this and execute the functional test associated with that playbook, which in this case is "deploy the site, update the top-BDII, wait for the site to show up in the top-BDII, send a job to the site, get the output and check whether the job ran properly". If the job didn't run properly, or something died before that, Jenkins would tell you; you would probably have saved a bunch of time, and could have a far deeper chat about the nature of reality with that weird dude who's always bringing these things up over coffee. 
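For the record, here is a rough sketch of what that functional test could look like as a playbook, run from a gLite user interface with a valid proxy once the deployment play has finished. The host group, hostnames, JDL file and paths are placeholders, and the retry counts are plucked out of thin air:

    ---
    # Illustrative sketch only: check that a freshly (re)deployed CE is visible
    # in the top-BDII and can actually run a job. Names and paths are placeholders.
    - name: Functional test for a newly deployed CREAM CE
      hosts: staging_ui
      tasks:
        - name: Wait for the CE to show up in the top-BDII
          shell: ldapsearch -x -LLL -H ldap://top-bdii.example.org:2170 -b o=grid '(GlueCEUniqueID=*ce.staging.example.org*)' GlueCEUniqueID
          register: bdii_check
          until: bdii_check.stdout.find('GlueCEUniqueID') != -1
          retries: 30
          delay: 60

        - name: Submit a simple test job via the WMS
          shell: glite-wms-job-submit -a -o /tmp/test-jobid hello.jdl chdir=/opt/site-tests

        - name: Wait for the job to finish successfully
          shell: glite-wms-job-status --noint -i /tmp/test-jobid
          register: job_status
          until: job_status.stdout.find('Done (Success)') != -1
          retries: 60
          delay: 60

        - name: Retrieve the job output for inspection
          shell: glite-wms-job-output --noint -i /tmp/test-jobid --dir /tmp/test-output

If any of these steps fails or times out, the Jenkins job goes red and nobody's production site gets hurt.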
Now, you tell me that would not be rad. I dare you! 

Robots in the clouds, everywhere; trust them.


Also, you tell me how you would do that without having access to some kind of flexible computing infrastructure. For each new test you need to start pretty much from scratch, so having a private cloud which you can interact with programmatically (i.e., via some kind of API, preferably a standards-based one like OCCI) makes your testing environment very powerful and scalable. 
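To make "programmatically" slightly more concrete, here is a minimal, hedged sketch of the kind of query the testing machinery could make against an OCCI endpoint to see which compute resources the private cloud currently knows about. The endpoint URL, port and credentials are placeholders - every cloud middleware handles authentication a bit differently - and in practice you would more likely reach for a proper OCCI client than raw curl:

    ---
    # Illustrative sketch only: list the compute resources known to the private
    # cloud via its OCCI endpoint. URL, port and credentials are placeholders.
    - name: Query the OCCI endpoint of the private cloud
      hosts: localhost
      connection: local
      vars:
        occi_user: "ci-robot"        # placeholder credentials
        occi_pass: "changeme"
      tasks:
        - name: List the compute resources currently defined in the cloud
          shell: "curl -s -k -u {{ occi_user }}:{{ occi_pass }} -H 'Accept: text/plain' https://cloud.example.org:11443/compute/"
          register: compute_list

        - name: Show what came back
          debug: var=compute_list.stdout_lines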
However, there is also the issue of special hardware and re-using resources. Take the example above of testing a new site with some fancy GPGPUs: before deploying the site into the grid, we'd like to be sure that our deployment code is going to work and that jobs will run. To do this, we would like to use the same code to test and to deploy it (this is one of the things the Ansible guys keep harping on about, and I'm starting to get it). Now, consider that this fancy new hardware is at one site (say, UCT) and the testing service is at another (UFS). What's a sysadmin to do - entirely replicate one or the other? Wouldn't it make sense to share and re-use services in the same way we do with resources on the grid, by federating access to them and ensuring that they expose standard interfaces? That way, UCT can just write the tests for its fancy new kit without having to worry about the whole overhead of actually running the CI service. Now you're thinking with portals! (sorry).
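As a purely hypothetical illustration of how that division of labour could work: UCT drops a small GPGPU sanity check into the shared repo, and the Jenkins instance at UFS simply runs it against whatever worker nodes the deployment playbooks have just built. The group name and the expected GPU count are invented for the example:

    ---
    # Illustrative sketch only: a site-specific GPGPU check that lives in the
    # shared repo but is written by the site that owns the hardware.
    - name: UCT GPGPU worker-node sanity check
      hosts: uct_gpu_workers
      tasks:
        - name: Ask the NVIDIA driver which cards it can see
          command: nvidia-smi -L
          register: gpu_list

        - name: Fail if fewer GPUs are visible than the site expects
          fail: msg="Expected at least 2 GPUs, found {{ gpu_list.stdout_lines | length }}"
          when: gpu_list.stdout_lines | length < 2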

Release


It's one thing to imagine how these tools - when they're properly implemented - will make everyone's life easier, but it's another thing entirely to actually implement them. For one thing, how do we know when it's done? How do we know that we can actually trust the code not to break our sites, and so on? Continuous Integration via Jenkins is of course part of the answer to this - so when our Ansible and/or Puppet code (site admins - remember that we need your contributions to the GitHub repo!) is indeed tested with Jenkins, we'll be able to tick off a big TODO. 
However, how is a site somewhere to know that the code is ready for production? Then there's the issue of which version of the code you're using to deploy your site; it's not as if we'll just hack away at it and then it'll be finished (god forbid!) - this is going to be a continuous (sorry) exercise. 
The answer is, of course, trivial - the code for deploying production sites will be tagged as such in a branch and released. It's kinda corny, I know, in the 2010s (what are we calling this decade again? Please don't say the twenteenies.), but that's the idea behind calling it "SAGrid-2.0": the code that was used to deploy it will have a version number which you can refer to, instead of what we usually do now, which is to go through a bunch of bash histories and try to figure out what the **** happened! 
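As a hedged sketch of what that could mean day-to-day: production runs happen from a checkout of a tagged release rather than from whatever happens to be in someone's home directory, and the tag gets recorded on every node it touches, so "which version deployed this site?" has a one-line answer. The tag name and file location below are invented for the example:

    ---
    # Illustrative sketch only: deploy production from a tagged release and leave
    # a record of that version on every node. Tag name and paths are invented.
    - name: Deploy production sites from a tagged release
      hosts: production
      vars:
        sagrid_release: "SAGrid-2.0.0"    # hypothetical release tag
      tasks:
        - name: Check out the tagged deployment code on the control host
          git: repo=https://github.com/AAROC/ansible-for-grid.git
               dest=/opt/ansible-for-grid
               version={{ sagrid_release }}
          delegate_to: localhost

        - name: Record which release configured this node
          copy: content="{{ sagrid_release }}" dest=/etc/sagrid-release

No more archaeology in .bash_history; cat /etc/sagrid-release tells you exactly which tag built the machine.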

Happy New Year!


So, that's it from the coordinator. I'm looking forward to working with all of you during the year, implementing and improving the services we need to continue serving African and South African research communities. We'll have our first meeting before the end of the month. In the meantime, here's wishing a prosperous, efficient and productive year to all!