DoC Computing Support Group


Differences between revisions 1 and 36 (spanning 35 versions)
Revision 1 as of 2012-04-03 12:52:36
Size: 1634
Editor: dcw
Comment:
Revision 36 as of 2012-05-15 11:23:04
Size: 7375
Editor: dcw
Comment:
Line 1: Line 1:
= Wiki page for notes on Jan-April 2012 DoC private cloud discussions =
## page was renamed from internal/project/privatecloud
= DoC Private Cloud: 2012 =
Line 3: Line 4:
== Intro ==
== Project Goal ==
Line 5: Line 6:
Sometime in early 2012, Susan told DCW that DoC were thinking of hiring
someone (Jeremy Cohen) for 6 months into CSG, specifically tasked with
building a DoC private cloud [definition unclear].
Build a DoC '''Infrastructure-as-a-service private cloud, very like Amazon EC2''' ("Elastic Compute Cloud"), which presents a ''secure and convenient
web interface'' that enables users of DoC to ''specify and create VMs and associated storage, automatically install OSes on them and deploy them''.
Line 9: Line 9:
She explained the core idea was "virtualisation even for research clusters",
i.e. research groups currently buy clusters when they have money, CSG set
them up, install "linux du jour" on them, configure fileservers (if part of
cluster), tape backups (if part), processing node special software etc.
The main goal is to virtualize most research servers, decoupling the OS image from the hardware for greater flexibility and sharing (amortizing) the
costs of each machine. One driver of this is EPSRC deciding to provide only 50% of any hardware bid over £10K in future, with the Dept
expected to pay the remaining 50%.
Line 14: Line 13:
Then the servers age, the OS is essentially frozen (it's often difficult to
persuade researchers that we should reinstall their fileservers, webservers
and compute nodes). They become "fragile". Sometimes it's hard to even
retire them on schedule (4/5/6 years or whatever). Usually, these clusters
are only accessible by members of that research group so the resource may
not be fully utilised.
This project will definitely proceed, having been approved by Executive Committee and by two open meetings of Academic staff.
Line 21: Line 15:
Susan's vision: setup a private cloud, researchers add hardware to that
Peter McBrien (PJM) is leading the project, and has laid out two stages:

 1. a 6 month phase in which CSG (advised by an academic working group) will design and build a prototype cloud, recruiting a "Cloud Manager" person to join CSG, possibly for 6 months in the first instance. The Department will spend some significant amount of money to build the prototype cloud, perhaps in the £100-200K range.

 2. assuming the prototype cloud is successful, it will move into production and the "Cloud Manager" will become permanent. Researchers would then be encouraged to add research-funded hardware to the cloud and be given some form of preferential treatment on "their hardware". All members of CSG are keen to gain cloud-related skills from the "Cloud Manager". (By the way, the Cloud Manager will do non-cloud-related systems administration too).

Most crucially: (despite not knowing the exact set of services to provide, let alone how to implement them, or having yet appointed a Cloud Manager person)
the Department's substantial initial investment '''must be fully spent before the end of July'''. This means that all kit must be ordered, delivered and paid for before the 31st of July.
With the Olympics making deliveries difficult, this means that everything must have been ordered by 1st July.

Re: amount to spend, Anne O'Neill (AON) and Peter McBrien (PJM) suggested CSG prepare possible collections of commodity hardware, on the
assumption that either £100K, £150K or £200K (inc VAT) would be spent.

== The Problem We're Trying to Solve ==

At present, research groups buy clusters when they have money. CSG set
them up, install the current supported Linux or Windows release on them
(the CSG supported Linux release currently changes each year), optionally
configuring storage and fileserver nodes, arranging tape backups of important
data, adding special software etc.

Important note: CSG do '''not currently back up physical server disk images'''.
Instead, we attempt to determine from the owners what data is IMPORTANT and
back that up.

Then the servers age: after the first year the OS becomes essentially frozen
apart from minor security updates. It's often difficult to persuade researchers
that we should reinstall their fileservers, webservers and compute nodes.
They become "fragile", and eventually a security risk.

Sometimes it's hard to retire them when the hardware becomes more than 4-5 years
old, because of the "fragile" software setup on them.

A second problem is that these clusters are often only accessible by members of the
specific research group that bought them, so the resource may not be fully utilised.

Instead, the idea is to set up a private cloud, researchers add hardware to that
Line 23: Line 53:
tied (1-1 at first) to their own hardware, CSG install that virtual cluster
node (VCN)'s OS, researchers work as before - but each node's encapsulated
inside a VM. Later, these VMs could share resources - when the group don't
need 100% resources, or new more powerful hardware is purchased.
tied (1-1 at first) to their own hardware, the creation process should
automatically install a CSG-supported operating system (historically
supported Linuxes and Windows versions) on the new VM. Researchers work as before on each VM -
but each node is encapsulated inside a VM.
Line 28: Line 58:
Suppose, for instance, the group needed N nodes x 100% of underlying VM host
x M months [and then less thereafter].
Later, these VMs could share resources - when the group don't need 100% resources, or new more
powerful hardware is purchased and the VM migrated to it.
Line 31: Line 61:
Susan also added "and it should just scale, manage itself magically."
We would also gain the flexibility to create short-term VMs for specific "run this software on 16 nodes"
experiments. Such short-term VMs would then be automatically destroyed.

We could even give every DoC user (students and staff!) their very own
VM when they join, with full root/admin access - or at least the
ability to create one when they first need it (lazy evaluation:-)).

== Open Staff cloud meetings ==

In April 2012, the discussion was opened out to all interested staff, and (so far) two open staff cloud meetings
have been held. Here are some notes taken by DCW and LDK of the discussions at both meetings.

[[project/privatecloud/meeting-2012-04-03|Open Staff Meeting 1 - April 3rd 2012]]

[[project/privatecloud/meeting-2012-04-25|Open Staff Meeting 2 - April 25th 2012]]

== Cloud Services and Software Investigations ==

CSG have been performing initial investigations of hardware to buy, and possible software that
might help with the project.

== Cloud Hardware ==

Susan expressed a preference for open source software running on commodity hardware.
A strong advantage of this approach is that commodity hardware could be repurposed, even if the
cloud project failed completely, or got scaled down to use a tiny proportion of the hardware.

However, Peter Pietzuch recommended that we also investigate commercial scalable storage systems -
specifically NetApp "scalable NAS" units - which his colleagues at the University of Cambridge
Computing Laboratory have used for years. CSG have discussed NetApps with colleagues in ICT
and with Martyn Johnson at the Cambridge Computing Lab, and are investigating these and attempting to
get possible prices for (perhaps) suitable models.

Here are the results of our hardware investigations:

[[project/privatecloud/hardware|Hardware Investigations]]

Lloyd Kamara (LDK) has done a lot of work on getting quotes for possible commodity-hardware
Compute and Storage nodes (conventional Dell/HP/IBM rack mount servers with, respectively,
as many cores and as many disks as possible per unit cost). His results are here:

[[project/privatecloud/hardware/commodity-compute|Commodity Hardware: Compute Nodes]]

[[project/privatecloud/hardware/commodity-storage|Commodity Hardware: Storage]]

and Giuseppe and Duncan have been getting NetApp information and quotes, information coming here
in the next hour or so:

[[project/privatecloud/hardware/netapp-storage|NetApp: Storage (alternative)]]

== Software Investigations ==

CSG have been familiarising themselves with various possible open source cloud or storage software
systems that might be able to implement some/all of the required IaaS cloud services, and performing
some initial investigations of a few of them. While the Cloud Manager will of course be responsible
for designing and building the cloud, existing members of CSG are concerned to '''map the terrain'''
to find out where the dragons are lurking and to provide an '''existence proof''' to reduce the risk
that after buying the hardware, no software can be added to build the desired cloud.

Here are our notes:

[[project/privatecloud/investigations|Software Investigations]]

DoC Private Cloud: 2012

Project Goal

Build a DoC Infrastructure-as-a-service private cloud, very like Amazon EC2 ("Elastic Compute Cloud"), which presents a secure and convenient web interface that enables users of DoC to specify and create VMs and associated storage, automatically install OSes on them and deploy them.

The main goal is to virtualize most research servers, decoupling the OS image from the hardware for greater flexibility and sharing (amortizing) the costs of each machine. One driver of this is EPSRC deciding to provide only 50% of any hardware bid over £10K in future, with the Dept expected to pay the remaining 50%.

This project will definitely proceed, having been approved by Executive Committee and by two open meetings of Academic staff.

Peter McBrien (PJM) is leading the project, and has laid out two stages:

  1. a 6 month phase in which CSG (advised by an academic working group) will design and build a prototype cloud, recruiting a "Cloud Manager" person to join CSG, possibly for 6 months in the first instance. The Department will spend some significant amount of money to build the prototype cloud, perhaps in the £100-200K range.
  2. assuming the prototype cloud is successful, it will move into production and the "Cloud Manager" will become permanent. Researchers would then be encouraged to add research-funded hardware to the cloud and be given some form of preferential treatment on "their hardware". All members of CSG are keen to gain cloud-related skills from the "Cloud Manager". (By the way, the Cloud Manager will do non-cloud-related systems administration too).

Most crucially: (despite not knowing the exact set of services to provide, let alone how to implement them, or having yet appointed a Cloud Manager person) the Department's substantial initial investment must be fully spent before the end of July. This means that all kit must be ordered, delivered and paid for before the 31st of July. With the Olympics making deliveries difficult, this means that everything must have been ordered by 1st July.

Re: amount to spend, Anne O'Neill (AON) and Peter McBrien (PJM) suggested CSG prepare possible collections of commodity hardware, on the assumption that either £100K, £150K or £200K (inc VAT) would be spent.

The Problem We're Trying to Solve

At present, research groups buy clusters when they have money. CSG set them up, install the current supported Linux or Windows release on them (the CSG supported Linux release currently changes each year), optionally configuring storage and fileserver nodes, arranging tape backups of important data, adding special software etc.

Important note: CSG do not currently back up physical server disk images. Instead, we attempt to determine from the owners what data is IMPORTANT and back that up.

Then the servers age: after the first year the OS becomes essentially frozen apart from minor security updates. It's often difficult to persuade researchers that we should reinstall their fileservers, webservers and compute nodes. They become "fragile", and eventually a security risk.

Sometimes it's hard to retire them when the hardware becomes more than 4-5 years old, because of the "fragile" software setup on them.

A second problem is that these clusters are often only accessible by members of the specific research group that bought them, so the resource may not be fully utilised.

Instead, the idea is to set up a private cloud. Researchers add hardware to that cloud's core resources, then create a VM for each cluster node, perhaps tied (1-1 at first) to their own hardware. The creation process should automatically install a CSG-supported operating system (historically supported Linuxes and Windows versions) on the new VM. Researchers work as before, but each node is encapsulated inside a VM.

Later, these VMs could share resources - when the group don't need 100% resources, or new more powerful hardware is purchased and the VM migrated to it.

We would also gain the flexibility to create short-term VMs for specific "run this software on 16 nodes" experiments. Such short-term VMs would then be automatically destroyed.
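
As a rough illustration of how "automatically destroyed" might work, the sketch below models each short-term VM as a lease with a fixed lifetime, and computes which VMs a periodic clean-up job would remove. It is only a sketch under stated assumptions: the `VMLease` record, the `vms_to_destroy` function, the owner name and the 48-hour lifetime are all hypothetical, and no particular cloud software or API is implied.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class VMLease:
    """Hypothetical record for one VM; not taken from any real cloud API."""
    name: str
    owner: str
    created: datetime
    lifetime: Optional[timedelta]  # None means a long-lived VM with no expiry

    def expired(self, now: datetime) -> bool:
        # A VM with no lifetime never expires; otherwise compare against its deadline.
        return self.lifetime is not None and now >= self.created + self.lifetime


def vms_to_destroy(leases: List[VMLease], now: datetime) -> List[str]:
    """Return the names of short-term VMs whose lease has run out."""
    return [lease.name for lease in leases if lease.expired(now)]


# Assumed scenario: a 16-node experiment with a 48-hour lease,
# alongside one long-lived VM that must never be cleaned up.
start = datetime(2012, 6, 1, 9, 0)
leases = [
    VMLease(f"exp-node{i:02d}", "researcher1", start, timedelta(hours=48))
    for i in range(16)
]
leases.append(VMLease("group-fileserver", "researcher1", start, None))

if __name__ == "__main__":
    for name in vms_to_destroy(leases, start + timedelta(hours=49)):
        print("would destroy", name)
```

Keeping the expiry decision in a pure function, separate from whatever actually tears a VM down, would make the policy easy to check before any real cloud software is chosen.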

We could even give every DoC user (students and staff!) their very own VM when they join, with full root/admin access - or at least the ability to create one when they first need it (lazy evaluation:-)).

Open Staff cloud meetings

In April 2012, the discussion was opened out to all interested staff, and (so far) two open staff cloud meetings have been held. Here are some notes taken by DCW and LDK of the discussions at both meetings.

Open Staff Meeting 1 - April 3rd 2012

Open Staff Meeting 2 - April 25th 2012

Cloud Services and Software Investigations

CSG have been performing initial investigations of hardware to buy, and possible software that might help with the project.

Cloud Hardware

Susan expressed a preference for open source software running on commodity hardware. A strong advantage of this approach is that commodity hardware could be repurposed, even if the cloud project failed completely, or got scaled down to use a tiny proportion of the hardware.

However, Peter Pietzuch recommended that we also investigate commercial scalable storage systems - specifically NetApp "scalable NAS" units - which his colleagues at the University of Cambridge Computing Laboratory have used for years. CSG have discussed NetApps with colleagues in ICT and with Martyn Johnson at the Cambridge Computing Lab, and are investigating these and attempting to get possible prices for (perhaps) suitable models.

Here are the results of our hardware investigations:

Hardware Investigations

Lloyd Kamara (LDK) has done a lot of work on getting quotes for possible commodity-hardware Compute and Storage nodes (conventional Dell/HP/IBM rack mount servers with, respectively, as many cores and as many disks as possible per unit cost). His results are here:

Commodity Hardware: Compute Nodes

Commodity Hardware: Storage

and Giuseppe and Duncan have been getting NetApp information and quotes, information coming here in the next hour or so:

NetApp: Storage (alternative)
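
The "as many cores as possible per unit cost" criterion for commodity compute nodes, combined with the £100K, £150K and £200K budget assumptions, can be sketched as a small calculation. Everything in the example is an assumption made for illustration: the `Quote` record, the model names, prices and core counts are invented placeholders, not real quotes, and buying only the single best-value model is a deliberate simplification.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Quote:
    """Hypothetical quote for one rack-mount compute node."""
    model: str      # placeholder name, not a real product
    price_gbp: int  # invented price, assumed to include VAT
    cores: int


def cores_per_thousand_pounds(quote: Quote) -> float:
    """Value metric: cores obtained per £1000 spent."""
    return quote.cores / (quote.price_gbp / 1000)


def rank_quotes(quotes: List[Quote]) -> List[Quote]:
    """Order quotes from best to worst value."""
    return sorted(quotes, key=cores_per_thousand_pounds, reverse=True)


def fill_budget(quotes: List[Quote], budget_gbp: int) -> Tuple[str, int, int]:
    """Buy as many units of the best-value model as the budget allows.

    Returns (model, units, total cores).
    """
    best = rank_quotes(quotes)[0]
    units = budget_gbp // best.price_gbp
    return best.model, units, units * best.cores


# Invented example figures, for illustration only.
quotes = [
    Quote("node-A", 4000, 16),
    Quote("node-B", 6000, 32),
    Quote("node-C", 10000, 48),
]

if __name__ == "__main__":
    for budget in (100_000, 150_000, 200_000):
        print(budget, fill_budget(quotes, budget))
```

A real comparison would also need to weigh memory, disks, networking, power and rack space, and would set aside part of each budget for storage nodes.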

Software Investigations

CSG have been familiarising themselves with various possible open source cloud or storage software systems that might be able to implement some/all of the required IaaS cloud services, and performing some initial investigations of a few of them. While the Cloud Manager will of course be responsible for designing and building the cloud, existing members of CSG are concerned to map the terrain to find out where the dragons are lurking and to provide an existence proof to reduce the risk that after buying the hardware, no software can be added to build the desired cloud.

Here are our notes:

Software Investigations

 
 

project/privatecloud (last edited 2013-11-13 19:27:43 by dcw)