DoC Private Cloud: Staff Working Group Open Meeting: 25th April 2012
A wider group of DoC staff met in room 418 on 25th April 2012 to further discuss the proposed DoC private cloud. What follows are notes made by LDK and DCW.
SUSAN introduced the meeting; PJM chaired it.
Attendees: Steve Muggleton (SHM), Emil Lupu (ECL1), Peter Pietzuch (PRP), Iain Phillips (ICCP), Alex Wolf (ALW), Abbas Edalat (AE), Cristian Cadar (CRISTIC), Tristan Allwood (TORA), Alistair Donaldson (AFD), Nobuko Yoshida (YOSHIDA), John Darlington (JD), Tony Field (AJF), Sophia Drossopoulou (SCD), Jeremy Cohen (JHC02), Brian Fuchs (BFUCHS), Jeff Kramer (JK), Paul Kelly (PHJK), Philippa Gardner (PG), Petr Hosek (PH1310), Berc Rustem (BR), Natasa Przulj (NATASA), Francesca Toni (FT), Daniel Rueckert (DR), Julie McCann (JAMM), Anne O'Neill (AON), Duncan White (DCW), Giuseppe Mazza (GMAZZA), Lloyd Kamara (LDK).
PJM recapped the concept: DoC private 'Infrastructure as a service' cloud.
- web front-end for users to create VMs.
- configure memory/cpu/storage, and choose which CSG-supported OS to install (e.g. Ubuntu Linux or Windows).
- eventually, users will be able to install their own custom OS images.
- fast, stable storage system storing VM images and related data.
- accounting system for recording (and eventually limiting) resource usage.
- guaranteed/reserved resources for VMs.
- users can access DoC home directories and shared (e.g. research) volumes from VMs.
- users can be root on VMs and install packages.
- eventually we should be able to monitor power on compute nodes (servers running VMs), and VMs should be able to access GPUs, FPGAs etc. on some hardware - but not at first.
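To make the feature list concrete: a VM of the kind described (chosen memory/cpu, a CSG-supported OS image held on shared storage) might be realised under a hypervisor such as KVM via a libvirt domain definition along these lines. This is only an illustrative sketch - the VM name, image path and bridge name are invented, and no decision on virtualisation software had been taken at this meeting.

```xml
<!-- Hypothetical libvirt domain for a cloud VM; all names illustrative -->
<domain type='kvm'>
  <name>researcher-vm1</name>
  <memory unit='GiB'>4</memory>     <!-- memory chosen via the web front-end -->
  <vcpu>2</vcpu>                    <!-- likewise the CPU allocation -->
  <os>
    <type arch='x86_64'>hvm</type>
  </os>
  <devices>
    <!-- VM image stored on the fast shared storage system -->
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/vol/cloud/images/researcher-vm1.qcow2'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <!-- bridged network, so the VM appears like any other DoC server -->
    <interface type='bridge'>
      <source bridge='br0'/>
    </interface>
  </devices>
</domain>
```

The web front-end would generate and submit something of this shape on the user's behalf; the accounting system could then read the same definitions to record resource usage.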
- Partially driven by economics: EPSRC will now fund only 50% of equipment grants ("vanity clusters") over £10K; DoC has to fund the remainder. NATASA: only EPSRC? The EU still funds 100% of large servers on grants. PJM: yes, for now. Later, who knows.
- ALW: What is the view on security and firewalls to the outside world? We might want to control access from outside. PJM: Tricky because of ICT's position; however, ICT may accept this. Could parts of the cloud be "internally separated from the rest of DoC"? If so, there might be issues with accessing user home directories. DCW post-meeting comment: surely the default is that each VM has the same security constraints as a real server in DoC, with any variation from that negotiated.
- SHM: His group usually buys dedicated high-power servers to run a particular task. How will the cloud compare? PJM: We envisage being able to reserve/'pin' resources onto particular hardware for periods of time (e.g. a few months). Also, you're not forced to join the cloud if you can get the money without the Dept's 50% contribution.
- PHJK: ICT have a very large HPC cluster, which DoC should use more. It has a batch-queuing system called PBS Pro; his group bought 4 servers to add to the College HPC, which are managed by the ICT HPC team. Very successful. SUSAN: Tried ICT HPC: fast but not flexible - it needed a lot of effort to install required software - but she agreed we should use it more, given all the money put into it. DCW added: that's £1 million per year of HPC funding for the last 3 years, and the next 3 years too. SUSAN: we cannot and should not compete with that scale! PHJK: Some use cases are better served by College HPC than by the cloud. PJM: agreed, both are useful.
- SUSAN: Concerned about the fragility of the existing DoC server purchase/install cycle: inflexible, takes too long, and can leave useful servers hanging around with old OSes. Some people don't upgrade because they don't know how, or are afraid to, not because they can't.
- ALW: The cloud is more flexible, but there are still issues with supporting VMs with old OSes. PJM: Agreed. The cloud should half-solve the problem: by detaching the old OS from the hardware, it can allow a research group (or researcher) to create a VM a year, each with the latest CSG-supported Linux (Ubuntu 10.04 one year, 11.04 the next, for example), then test their software on the new OS. When no software still requires the old OS, the old VM can (and should) be decommissioned. CSG might also need to work with researchers to retire very old, insecure OSes, as they do at present.
- JHC02: Policies will be needed to quiesce VMs when not in use (while not suspending long-running jobs), and also to prevent general accumulation of VMs over time. PJM: Agreed - we will need policies, and the accounting system will help with this.
- PHJK: Will VM images be backed up? PJM: Perhaps - not to the same extent as the general storage space; this technical question will be looked at later. DCW post-meeting comment: CSG have never backed up VM images; perhaps there are cases (a custom hand-crafted install, for instance) where we should, but generally it's not needed.
- ALW: Might this demotivate people from speccing equipment on their own proposals? PJM, SUSAN: It should increase motivation. PHJK: getting that 50% contribution from the Dept would be a big motivation!
- PHJK: We aren't talking about VMs on desktops, are we? PJM: No - at least not yet. We're virtualising servers.
- DR: Storage is very important too - e.g. someone wants 10TB extra plus backups. Rather than having to buy lots of hardware, it would be lovely if we could virtualise storage and allocate it on demand, without caring so much about the hardware that realises it.
- PHJK: Storage hardware is cheap; the added costs come from management, backups etc. ECL1: Can rent storage from ICT. DR: not at 10TB scale and high performance, plus ICT's prices are quite expensive. DCW added: even when all of PHJK's extra costs are included, ICT's prices are still expensive.
- JD: Supports this idea; let's also think of the bigger picture - linking with EU projects, platforms and research. SUSAN: yes, interesting cloud research will be possible.
- PG: Knows very little about clouds, but wants the local implementation done well. SUSAN: agreed, we want a reliable production system with more effective server usage and fewer 'fragile old' servers. Aside from SUSAN: if you have an underused RA who is interested in sysadmin, perhaps they could help upgrade some old cluster OSes?
- PJM: In summary, is anyone dead-against the cloud? There was no response.
- PJM: OK, the plan is a six-month first phase: hire an extra member of CSG and build the cloud. Second phase: if successful, research groups can choose to add new servers into the cloud, or potentially even some older ones, with the extra member of CSG becoming permanent. The rest of CSG will skill up on clouds too.
- PHJK: What proportion of people with dedicated research servers might need dedicated hardware (to do performance measurement etc.)? Not many responses. DCW added: we will still need 'raw metal' on occasion; no problem with that. PJM added: there is a small overhead (perhaps 5%) of running a VM on the cloud as opposed to bare metal.
- PHJK: Can a VM pinned to a server use that server's local hard drive? PJM: Not easily. DCW: maybe a spare chunk of local disk could be accessed directly from one VM - like direct access to a GPU or FPGA?
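For what DCW's suggestion might look like in practice: if libvirt/KVM were the chosen platform, a spare local partition could be handed directly to one pinned VM as a block device, bypassing the shared image storage. Again a hedged sketch only - the device and partition names are invented, and nothing here was decided at the meeting.

```xml
<!-- Hypothetical: passing a spare local partition through to one pinned VM.
     /dev/sdb1 is an illustrative name for an otherwise-unused local partition. -->
<disk type='block' device='disk'>
  <driver name='qemu' type='raw'/>
  <source dev='/dev/sdb1'/>
  <target dev='vdb' bus='virtio'/>
</disk>
```

The VM would see this as an ordinary second disk, with near-local performance; the obvious caveat is that the VM then cannot migrate off that server, which fits the 'pinned' case being discussed.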
- CRISTIC: What happens next, in more detail?
SUSAN: The Dept will put up a chunk of money in the next month or two, and CSG will purchase fast hardware and fast disk storage. Commodity hardware and open-source software are preferred, but NetApp is also being considered. More technical discussion is needed to decide what to buy, what software to use etc.
- SCD: What happens in a year - will all services be in the cloud? PJM: We'll see. Unlikely for all services; maybe some.
- SCD and PG: Extra support load for CSG? PJM: Some, but note that CSG already does virtualisation (a VMware ESX cluster, and Windows virtualisation on some new desktops).
- JK: This seems to be the way things are going, so it would make sense to go along; it seems generally good for the department. People getting equipment outside this scheme can still do so. Good idea!
- PG: Aldo couldn't be here, but asked if there can be slides and/or minutes of this meeting. PJM: Will look at this.
- PJM: In summary, major issues from this meeting are security and performance.
- ECL1: What are the risks? Proliferation of VMs; management - who do we contact? Will it affect desktop procurement? SUSAN: it won't affect desktop procurement at all - yet. Also, if it does not work out, we re-use the hardware: a somewhat risk-free initial investment, all from Dept funding. The biggest risk might be being too far ahead of the curve and finding nothing works reliably!
- PJM: In later meetings we will consider technical issues: cloud implementations, VM implementations etc. A working group (with PRP, WJK and DR on it) will be looking at this; you can contact them if you wish - we want departmental consensus.