DoC Private Cloud: 2012 - 2013
Project Goal
Build a DoC Infrastructure-as-a-service private cloud, very like Amazon EC2 ("Elastic Compute Cloud"), presenting a secure and convenient web interface through which users of DoC can specify and create VMs and associated storage, automatically install OSes on them and deploy them.
The main goal is to virtualize most research servers, decoupling the OS image from the hardware for greater flexibility and sharing (amortizing) the costs of each machine. One driver of this is EPSRC's decision to provide only 50% of any hardware bid over £10K in future, with the Dept expected to pay the remaining 50%.
This project will definitely proceed, having been approved by Executive Committee and by two open meetings of Academic staff.
Peter McBrien (PJM) is leading the project, and has laid out two stages:
- a 6 month phase in which CSG (advised by an academic working group) will design and build a prototype cloud, recruiting a "Cloud Manager" person to join CSG, possibly for 6 months in the first instance. The Department will spend some significant amount of money to build the prototype cloud, perhaps in the £100-200K range.
- assuming the prototype cloud is successful, it will move into production and the "Cloud Manager" will become permanent. Researchers would then be encouraged to add research-funded hardware to the cloud and would be given some form of preferential treatment on "their hardware". All members of CSG are keen to gain cloud-related skills from the "Cloud Manager". (By the way, the Cloud Manager will do non-cloud-related systems administration too.)
Most crucially, the Department's substantial initial investment must be fully spent before the end of July, even though we do not yet know the exact set of services to provide, let alone how to implement them, and have not yet appointed a Cloud Manager. All kit must therefore be ordered, delivered and paid for before the 31st of July. With the Olympics making deliveries difficult, everything must have been ordered by 1st July.
Re: amount to spend, Anne O'Neill (AON) and Peter McBrien (PJM) suggested that CSG prepare possible collections of commodity hardware, on the assumption that £100K, £150K or £200K (inc VAT) would be spent.
The Problem We're Trying to Solve
At present, research groups buy clusters when they have money. CSG set them up and install the currently supported Linux or Windows release on them (the CSG-supported Linux release currently changes each year), optionally configuring storage and fileserver nodes, arranging tape backups of important data, adding special software etc.
Then the servers age. After the first year the OS becomes essentially frozen apart from minor security updates. It's often difficult to persuade researchers that we should reinstall their fileservers, webservers and compute nodes. They become "fragile", and eventually a security risk.
Sometimes it's hard to retire them when the hardware becomes more than 4-5 years old, because of the "fragile" software setup on them.
A second problem is that these clusters are often only accessible by members of the specific research group that bought them, so the resource may not be fully utilised.
Instead, the idea is to set up a private cloud. Researchers add hardware to that cloud's core resources, then create a VM for each cluster node, perhaps tied (1-1 at first) to their own hardware. The creation process should automatically install a CSG-supported operating system (historically supported Linuxes and Windows versions) on the new VM. Researchers work as before on each VM - but each node is encapsulated inside a VM.
Later, these VMs could share resources when the group don't need 100% of them, or be migrated to new, more powerful hardware when it is purchased.
We would also gain the flexibility to create short-term VMs for specific "run this software on 16 nodes" experiments. Such short-term VMs would then be automatically destroyed.
We could even give every DoC user (students and staff!) their very own VM when they join, with full root/admin access - or at least the ability to create one when they first need it (lazy evaluation:-)).
Open Staff cloud meetings
In April 2012, the discussion was opened out to all interested staff, and (so far) two open staff cloud meetings have been held. Here are some notes taken by DCW and LDK of the discussions at both meetings.
Open Staff Meeting 1 - April 3rd 2012
Open Staff Meeting 2 - April 25th 2012
Cloud Access URL
The end-user interface for the DoC Private Cloud is now available for departmental users via cloudstack.doc.ic.ac.uk/client. Please use your normal college user-name and password for authentication; the domain should be imperial.
Cloud Hardware we bought
Susan expressed a preference for open source software running on commodity hardware. A strong advantage of this approach is that commodity hardware could be repurposed, even if the cloud project failed completely, or got scaled down to use a tiny proportion of the hardware.
Peter Pietzuch recommended that we also investigate commercial scalable storage systems - specifically NetApp "scalable NAS" units - which his colleagues at the University of Cambridge Computing Laboratory have used for years. CSG discussed NetApps with colleagues in ICT and Martyn Johnson at the Cambridge Computing Lab and then bought one.
Here is the hardware we have bought for the cloud. More can be added later (eg. by research groups opting in):
- 4 x Dell PowerEdge C6220 compute servers. This is a very dense compute server, with four independent nodes in a two unit chassis. Each node contains two Intel Xeon E5-2690 8-core 2.9GHz processors (32 threads with hyper-threading), 128GB of RAM and two 1TB hard drives.
- 2 x IBM System x3750 M4. Each server has four Intel Xeon E5-4650 8-core 2.7GHz processors (64 threads with hyper-threading), 512GB of RAM, two 300GB hard drives and twelve 1TB hard drives.
- 4 x Dell PowerEdge R720. Each server has two Intel Xeon E5-2640 six-core 2.5GHz processors (24 threads with hyper-threading), 64GB of RAM, two 300GB hard drives and 24 1TB hard drives.
- 1 x NetApp FAS2240A-2 dual-controller Filer and disk-shelf; raw storage capacity 60TB.
- 4 x Extreme Summit X670 10GbE switches.
Here are our old notes: Hardware Investigations
Software Investigations
CSG have been familiarising themselves with various open source cloud and storage software systems that might be able to implement some or all of the required IaaS cloud services, and performing some initial investigations of a few of them. While the Cloud Manager will of course be responsible for designing and building the cloud, existing members of CSG are concerned to map the terrain to find out where the dragons are lurking, and to provide an existence proof to reduce the risk that, after buying the hardware, no software can be added to build the desired cloud.
Here are our notes: Software Investigations