The following is a post from the folks at Cycle Computing, who are doing some mind-blowing work with Chef in HPC environments on Amazon Web Services. They were kind enough to let us cross post it here on the Opscode blog.
Why Baking Your Cluster AMI Limits the Menu: DevOps for HPC clusters
You may have read our last blog post about Tanuki, the 10,000-core HPC supercomputer we built to predict protein-protein interactions. We're back to tell you a little about how we provisioned the 1,250 c1.xlarge instances that made up Tanuki. In fact, it's the same technology that builds all of our CycleCloud instances, whether you select a single-node stand-alone Grid Engine cluster or a super kaiju Condor cluster like Tanuki. But before we get into how we do things today, let's talk about where we've been and what we've learned.
Pre-Built Custom Images: Basic Cloud Cluster “Hello World”
It seems everyone's first foray into building HPC clusters in a public virtual cloud (like Amazon's EC2) involves baking a specialized image (AMI) complete with all the tools required to handle the workload. The most basic architecture includes software for pulling work from a queue or central storage location (e.g. Amazon's SQS or S3), running the work, and pushing the results back. If you're feeling especially clever, you may even use a scheduler like Condor, SGE, or Torque.
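As a rough sketch (not code from any of our actual images), a worker process baked into such an AMI might loop along the following lines; the queue URL, the run_job helper, and the use of the aws-sdk-sqs Ruby gem are all illustrative assumptions:
———————–
#!/usr/bin/env ruby
# Illustrative worker loop: pull a task from SQS, run it, delete it on success.
require 'aws-sdk-sqs'                      # assumes the aws-sdk-sqs gem is installed

QUEUE_URL = ENV['WORK_QUEUE_URL']          # hypothetical queue configured at bake time
sqs = Aws::SQS::Client.new(region: ENV.fetch('AWS_REGION', 'us-east-1'))

# Hypothetical stand-in for "running the work"; a real image would invoke the
# application and push results back to S3 or another store.
def run_job(spec)
  system(spec)
end

loop do
  resp = sqs.receive_message(queue_url: QUEUE_URL,
                             max_number_of_messages: 1,
                             wait_time_seconds: 20)   # long polling
  resp.messages.each do |msg|
    next unless run_job(msg.body)
    # Only delete the message once the work has actually succeeded
    sqs.delete_message(queue_url: QUEUE_URL, receipt_handle: msg.receipt_handle)
  end
end
———————–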
This first cluster comes up fast, but like all first attempts, it probably has some bugs. Maybe you need to fix libraries to support your application, add an encrypted file system, or tweak your scheduler configuration. Whatever the case, at some point you'll need to make changes. If you've got just one cluster with a handful of nodes, making these changes manually is doable, but it's a pain. Alternatively, you can make your changes, bake new images, and restart your cluster with the new images. That's a slow development cycle; it's painful, but automated build and test scripts can make it less so.
With a little time and some perseverance you can get yourself to a point where you've got a couple of clusters up and working hard for you. Maybe you've got a few different image types for different kinds of workloads, with a different scheduler configuration for each cluster. Maybe you've even created images for different operating systems. But then there's a critical OS security update, or you realize you need to change the encryption keys on all your nodes. What do you do?
Figure 1: Explosion of AMI versions
Baked AMIs create problems for everything but the most trivial of use cases: How long will it take you to make new versions of your images and replace the old with the new? How do you keep track of what image versions are affected? Wouldn’t it be easier if we could just change all these instances on the fly?
Absolutely! And you can make that happen with automated configuration management.
Base Images Configured with Chef
Having learned from many of the nuances of building AMIs, managing configuration, and scaling clusters over the years, we have chosen Chef, the open-source project from Opscode, to configure our CycleCloud cluster images. We maintain a small number of base images and layer cluster- and role-specific software and configuration on top of them using Chef cookbooks. We find the ability to describe the configuration of an instance with a simple Ruby domain-specific language invaluable. Why?
- Cookbooks are easy to write, read and understand
- Cookbooks are version-controlled in our git repository
- Cookbooks can be modified and tested without baking a new image
Chef allows us to make changes to running clusters. We can patch operating systems, change scheduler configuration, grow file systems and more just by changing a few lines of code and waiting for the scheduled Chef client run.
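To give a flavor of what that looks like (a minimal sketch, not one of our production cookbooks; the package, template, and service names are hypothetical, and the string notification syntax assumes a reasonably recent Chef client), a recipe that installs and configures a scheduler client on every node might read:
———————–
# Illustrative recipe: install the scheduler, drop its config, keep it running
package "condor"

template "/etc/condor/condor_config.local" do
  source "condor_config.local.erb"   # hypothetical template shipped in the cookbook
  owner  "root"
  group  "root"
  mode   "0644"
  # Pick up configuration changes on the next chef-client run
  notifies :restart, "service[condor]"
end

service "condor" do
  action [:enable, :start]
end
———————–
Change the template (or an attribute it references), commit, and every node converges to the new configuration on its next scheduled run; no new AMI required.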
Figure 2: The DevOps update procedure is much more efficient
So the “get coffee” part is cool! But baked AMIs, for anything but trivial usage, waste your time, your users' time, and $$$. DevOps configuration methodologies are a key part of good cloud management.
Scaling Chef
In a typical datacenter environment, the life cycle of a machine can range from weeks to years. Bringing new hardware online is generally staggered, giving a configuration management system such as Chef room to breathe; it only does heavy lifting when performing the initial configuration of a brand-new machine. In the world of cloud computing, the life cycle tends to run from minutes to weeks, and launches tend to be stacked right on top of each other: requests arrive one after another in rapid succession, putting immediate and significant load on the Chef infrastructure.
When we started regularly launching clusters that contained hundreds of machines (thousands of cores) instead of tens of machines (hundreds of cores) – our 4,000-core cluster Oni is a good example – it became immediately apparent that we needed to work hard to scale our Chef infrastructure. We had to be able to handle the spike of activity as bare-bones EC2 instances came online as naked as the day they were born. Once these systems have converged within Chef, the configuration management load drops significantly, both because there are fewer changes to make and because we have the nodes poll the Chef server less frequently.
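The knobs involved are the standard chef-client settings; as an illustration (the values are made up, not our production numbers), a converged node's /etc/chef/client.rb might contain:
———————–
# /etc/chef/client.rb (illustrative values only)
chef_server_url "http://devchef.cyclecomputing.com:4000"
interval        1800   # seconds between scheduled chef-client runs
splay           300    # random extra delay so nodes don't all hit the server at once
———————–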
We couldn't publish a blog post without including some technical details, so here's how we initially scaled our Chef servers to handle the Tanuki load. We took inspiration from one of the only blog posts we could find on the subject: Joshua SS Miller's Chef Tuning Part 1. The following tips are for CentOS 5.5.
First, start with a larger EC2 instance, one with quite a few cores to handle the additional application server processes. Then, turn off your chef server and edit /etc/sysconfig/chef-server. Change the OPTIONS line to start as many processes as you wish. In this example, we are only starting two.
———————–
# Configuration file for the chef-server service
#CONFIG=/etc/chef/server.rb
#PIDFILE=/var/run/chef/server.pid
#LOCKFILE=/var/lock/subsys/chef-server
#LOGFILE=/var/log/chef/server.log
PORT=4005
#ENVIRONMENT=production
#ADAPTER=thin
#CHILDPIDFILES=/var/run/chef/server.%s.pid
#SERVER_USER=chef
#SERVER_GROUP=chef
# Any additional chef-server options.
OPTIONS="-c 2"
———————–
Now start chef-server back up and verify that you have the right number of merb worker processes.
———————–
-bash-3.2# ps -ef | grep merb
root 7285 1 0 May11 ? 00:00:02 merb : chef-server (api) : spawner (ports 4005)
chef 7287 7285 0 May11 ? 00:11:46 merb : chef-server (api) : worker (port 4005)
chef 7288 7285 0 May11 ? 00:11:38 merb : chef-server (api) : worker (port 4006)
root 12859 12830 0 09:39 pts/4 00:00:00 grep merb
———————–
You will now proxy chef-server using Apache. Install Apache, then create a new configuration file in /etc/httpd/conf.d called chef_server.conf. Add these lines:
———————–
Listen 4000
ServerName devchef.cyclecomputing.com
DocumentRoot /usr/lib64/ruby/gems/1.8/gems/chef-server-api-0.9.12/public/
<Proxy balancer://chef_server>
BalancerMember http://127.0.0.1:4005
BalancerMember http://127.0.0.1:4006
</Proxy>
LogLevel info
ErrorLog /var/log/httpd/chef_server-error.log
CustomLog /var/log/httpd/chef_server-access.log combined
RewriteEngine On
RewriteCond %{DOCUMENT_ROOT}/%{REQUEST_FILENAME} !-f
RewriteRule ^/(.*)$ balancer://chef_server%{REQUEST_URI} [P,QSA,L]
———————–
You will need to add more BalancerMember entries if you set /etc/sysconfig/chef-server to start more than two processes. You may also have to modify the DocumentRoot to point to the location and version of your chef-server-api gem.
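For example (illustrative, not our production settings, and assuming the workers keep binding sequential ports above the PORT=4005 base as they did above), running four workers would mean OPTIONS="-c 4" in /etc/sysconfig/chef-server and a matching balancer pool:
———————–
<Proxy balancer://chef_server>
BalancerMember http://127.0.0.1:4005
BalancerMember http://127.0.0.1:4006
BalancerMember http://127.0.0.1:4007
BalancerMember http://127.0.0.1:4008
</Proxy>
———————–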
Reload apache with /etc/init.d/httpd reload and test with a knife command.
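Any read-only knife command makes a reasonable smoke test; for example, assuming your knife.rb already points chef_server_url at this host on port 4000:
———————–
knife client list
knife node list
———————–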
What’s Next
Overall, we were very pleased with the performance of CycleCloud as we began to launch these “mega scale” clusters within cloud environments. However, there are still improvements to be made. In particular, we are always looking to maximize the amount of useful compute work that gets done for the compute time our clients have paid for. We want clusters of arbitrary scale to launch as quickly as possible, and we want the configuration of the cluster to proceed just as quickly. Stay tuned for our architectural solutions to these challenges. Rest assured that Cycle will not stop driving scalability by generational leaps while making the process of launching a cluster easier and quicker than ever.
As always, if you have questions, or comments, or also love big clusters, please contact us.