DoC Computing Support Group


DoC Private Cloud: 2012 - 2013

Project Goal

In early summer 2012, CSG were tasked with building a DoC Infrastructure-as-a-Service (IaaS) private cloud, much like Amazon EC2 ("Elastic Compute Cloud"): a secure and convenient web interface through which DoC users can specify and create VMs and associated storage, have operating systems installed on them automatically, and deploy them.

The main goal is to virtualise most research servers, decoupling the OS image from the hardware for greater flexibility and sharing (amortising) the cost of each machine. One driver of this is EPSRC's decision to fund, in future, only 50% of any hardware bid over £10K, with the Department expected to pay the remaining 50%.

This project was approved by the Executive Committee and by two open meetings of academic staff. Peter McBrien (PJM) led the project and laid out two stages:

  1. a 6-month phase in which CSG (advised by an academic working group) would design and build a prototype cloud, recruiting a "Cloud Manager" to join CSG, possibly for 6 months in the first instance. The Department would spend a significant sum building the prototype cloud, perhaps in the £100-200K range.
  2. assuming the prototype cloud is successful, it would move into production and the "Cloud Manager" post would become permanent. Researchers would then be encouraged to add research-funded hardware to the cloud, in return for some form of preferential treatment on "their" hardware. All members of CSG are keen to gain cloud-related skills from the "Cloud Manager".

Most crucially, the Department decided to make a substantial initial investment - and it had to be spent before the end of July 2012. All kit was ordered, delivered and paid for before 31st July 2012; nearly £300K (inc. VAT) was spent on the project. The Cloud Manager, Thomas Joseph, was appointed about a year later, in July 2013, and has proceeded rapidly to construct the first iteration of the DoC Cloud.

The Problem We're Trying to Solve

At present, research groups buy clusters when they have money and CSG set them up: installing the currently supported Linux or Windows release (the CSG-supported Linux release changes each year), optionally configuring storage and fileserver nodes, arranging tape backups of important data, adding special software, etc.

Then the servers age: after the first year the OS becomes essentially frozen, apart from minor security updates. It is often difficult to persuade researchers that we should reinstall their fileservers, webservers and compute nodes, so the machines become "fragile" and eventually a security risk.

Sometimes it's hard to retire them when the hardware becomes more than 4-5 years old, because of the "fragile" software setup on them.

A second problem is that these clusters are often only accessible by members of the specific research group that bought them, so the resource may not be fully utilised.

Instead, the idea is to set up a private cloud: researchers add hardware to the cloud's core resources and then create a VM for each cluster node, perhaps tied (1-1 at first) to their own hardware. The creation process should automatically install either a CSG-supported operating system (the historically supported Linux and Windows versions) or a non-CSG-supported "standalone" operating system on the new VM. Researchers work as before on each node - but each node is now encapsulated inside a VM.

Later, these VMs could share resources - for example when the group does not need 100% of its hardware, or when newer, more powerful hardware is purchased and the VMs are migrated onto it.

We would also gain the flexibility to create short-term VMs for specific "run this software on 16 nodes" experiments. A fleet of such short-term VMs might be created today, run for a couple of days, and then be destroyed at the end of the experiment.
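
To make the fleet idea concrete, here is a minimal sketch of how such an experiment might be driven programmatically, assuming the cloud ends up exposing a CloudStack-style API (the Summer 2013 access details below suggest Apache CloudStack) and using the third-party Python "cs" client library purely for illustration. The endpoint, keys, service offering, template and zone IDs are all placeholders, not confirmed DoC values.

    # Illustrative sketch only: the endpoint, keys and IDs below are placeholders,
    # and the third-party "cs" library is just one convenient CloudStack client.
    from cs import CloudStack

    cloud = CloudStack(endpoint="https://cloudstack.doc.ic.ac.uk/client/api",
                       key="YOUR_API_KEY", secret="YOUR_SECRET_KEY")

    # Spin up a fleet of 16 identical worker VMs for a short experiment.
    fleet = []
    for i in range(16):
        vm = cloud.deployVirtualMachine(
            serviceofferingid="OFFERING_ID",   # placeholder: CPU/RAM "size" of each node
            templateid="TEMPLATE_ID",          # placeholder: a CSG-supported OS image
            zoneid="ZONE_ID",                  # placeholder: the DoC zone
            name="experiment-%02d" % i)
        fleet.append(vm["id"])

    # ... run the experiment for a couple of days, then tear the fleet down.
    for vm_id in fleet:
        cloud.destroyVirtualMachine(id=vm_id)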

We could even give every DoC user (students and staff!) their very own VM when they join, with full root/admin access - or at least the ability to create one when they first need it (lazy evaluation:-)).

Open Staff cloud meetings

In April 2012, the discussion was opened out to all interested staff, and (so far) two open staff cloud meetings have been held. Here are some notes taken by DCW and LDK of the discussions at both meetings.

Open Staff Meeting 1 - April 3rd 2012

Open Staff Meeting 2 - April 25th 2012

Summer 2013: Cloud Access URL

The end-user interface for the DoC Private Cloud is now available to departmental users at cloudstack.doc.ic.ac.uk/client. Please use your normal college username and password for authentication; the domain should be set to imperial.
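
Beyond the web interface, a CloudStack installation can also be driven programmatically. The sketch below uses CloudStack's standard HMAC-SHA1 request-signing scheme against what we assume is the default API endpoint path (/client/api), and assumes you have generated an API key and secret key from your account page in the web interface; neither detail has been confirmed for the DoC installation.

    # Sketch of calling the cloud's API directly; the endpoint path and the
    # availability of API keys on the DoC installation are assumptions.
    import base64
    import hashlib
    import hmac
    import urllib.parse

    import requests

    ENDPOINT = "https://cloudstack.doc.ic.ac.uk/client/api"   # assumed default path
    API_KEY = "YOUR_API_KEY"          # generated from the web UI (assumption)
    SECRET_KEY = "YOUR_SECRET_KEY"

    def signed_request(command, **params):
        """Issue a CloudStack API call signed with HMAC-SHA1 (the standard scheme)."""
        params.update({"command": command, "apiKey": API_KEY, "response": "json"})
        # Sort the parameters, lowercase the whole query string, HMAC-SHA1 it
        # with the secret key, and send the base64 digest as the signature.
        query = "&".join("%s=%s" % (k, urllib.parse.quote(str(params[k]), safe=""))
                         for k in sorted(params))
        digest = hmac.new(SECRET_KEY.encode(), query.lower().encode(),
                          hashlib.sha1).digest()
        params["signature"] = base64.b64encode(digest).decode()
        return requests.get(ENDPOINT, params=params).json()

    # For example, list the virtual machines visible to your account.
    print(signed_request("listVirtualMachines"))

In practice a client library or tool such as Apache CloudStack's CloudMonkey CLI wraps this same signing logic, so nobody needs to write it by hand.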

Summer 2012: Cloud Hardware we bought

Here is the hardware we bought for the cloud. More can be added later (e.g. by research groups opting in):

  • 4 x Dell PowerEdge C6220 compute servers. This is a very dense compute server, with four independent nodes in a two-unit chassis. Each node contains two Intel Xeon E5-2690 8-core 2.9GHz processors (32 threads with hyper-threading), 128GB of RAM and two 1TB hard drives.

  • 2 x IBM System x3750 M4. Each server has four Intel Xeon E5-4650 8-core 2.7GHz processors (64 threads with hyper-threading), 512GB of RAM, two 300GB hard drives and twelve 1TB hard drives.

  • 4 x Dell PowerEdge R720. Each server has two Intel Xeon E5-2640 six-core 2.5GHz processors (24 threads with hyper-threading), 64GB of RAM, two 300GB hard drives and twenty-four 1TB hard drives.

  • 1 x NetApp F2240A-2 dual-controller filer and disk shelf; raw storage capacity 60TB.

  • 4 x Extreme Summit X670 10GbE switches; these form 2 pairs of switches, one pair in the DoC machine room (Huxley) and the other pair to be installed in the ICT machine room (MechEng).

We identified two types of server for the DoC private cloud - compute nodes and storage nodes:

  • A compute node contains a large number of CPUs/cores. Its primary role in the cloud is computation (virtual machine hosting, distributed computing and the like). The Dell C6220s and IBM x3750s mentioned above are both variants of compute node.

  • A storage node contains a large number of locally attached disks providing a chunk of fault-tolerant storage. Its primary role in the cloud is to provide storage (for VM images and associated research filesystems). The Dell R720s and the NetApp are both storage-heavy nodes.

We envisage that multiple compute nodes and multiple storage nodes would be needed. Here are our old notes:

Hardware Investigations

Software Investigations

CSG have been familiarising themselves with various open-source cloud and storage software systems that might implement some or all of the required IaaS cloud services, and have performed initial investigations of a few of them. While the Cloud Manager will of course be responsible for designing and building the cloud, existing members of CSG are keen to map the terrain, find out where the dragons are lurking, and provide an existence proof - reducing the risk that, having bought the hardware, no software could be found to build the desired cloud.

Here are our notes:

Software Investigations

 
 
