In the first part of this series
we got the Chef Automate
Pilot
container stack up and running on ECS. Now let’s make it survive termination of
any container or EC2 instance without losing data by adding AWS RDS, EFS and
Elasticsearch. A story told in 3 git commits:
We know that Chef Automate stores almost all of its state in PostgreSQL, but is
also a Git server and stores repositories on disk. We need to get all of that
data out of the disposable container volumes and into highly available
persistence stores.
For starters, I’ve borrowed a chunk of code from another template I
own
to add the AWS::RDS::DBInstance
resource and friends — that was the easy part.
The trickier part is extracting the postgresql
and postgresql-data
containers from
the stack.
Once you remove those two from the ContainerDefinitions, you have to take care
of two things:

1. Binds that expected postgresql to be there, such as the rabbitmq container's
   --bind database:postgresql.default argument passed to the Habitat Supervisor,
   which have to be removed.
2. The database connection settings, which now have to be passed in as
   environment variables:

```yaml
Environment:
  - Name: HAB_WORKFLOW_SERVER
    Value: !Sub |
      sql_user = "${DBUser}"
      sql_password = "${DBPassword}"

      [postgresql]
      vip = "${DBPostgres.Endpoint.Address}"
      port = 5432
```
This is where the magic of Cloudformation combines with the magic of Habitat — I
can easily pass information about the RDS instance (hostname, username,
password) in to Habitat’s runtime configuration and it does the right thing with
that. As you can see, CloudFormation’s new-ish YAML
format
makes variable interpolation delightful in multi-line strings, especially
compared to the JSON format.
Now you may wonder: how are you supposed to know what configuration to pass in
to a particular Habitat package? The awesome Habitat depot site has the
answer
(scroll down to the Configuration section to see all the variables in Habitat's
TOML config format). The Habitat docs
describe the methodology for passing in runtime configuration via environment
variables, although I was never able to get the JSON format to work reliably
(Habitat auto-detects the format of the variable).
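To make that concrete, here is a rough sketch of handing TOML config to a
Habitat-supervised service through an environment variable named after the
package; the image name and values are placeholders, not what the Pilot stack
actually uses:

```bash
# Sketch: pass runtime config to the workflow-server package as TOML.
# The Supervisor reads HAB_<PACKAGENAME> (uppercased, dashes become underscores)
# at startup. Image name and values below are placeholders.
docker run -d \
  -e HAB_WORKFLOW_SERVER='
sql_user = "automate"
sql_password = "supersecret"

[postgresql]
vip = "mydb.abc123.us-east-1.rds.amazonaws.com"
port = 5432
' \
  myorigin/workflow-server
```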
Removing the Habitat
bindings and switching to
environment variables worked fine for all of the services in the stack except
the new notifications
service. I didn’t realize when making this first commit
that notifications
didn’t need to talk to Postgres at all, so I passed in the
environment variables and then had to “fake it out” into skipping the
startup-time bind wait by passing in a phony bind: --bind database:rabbitmq.default
(friends, don't try this at home, it's a bad idea and won't work).
Later on I realized that the notifications configuration file was missing an
important bit of code to make the binding conditional, and that it used pkg_binds
instead of pkg_binds_optional in the plan. That led to this PR back to
automate:
```diff
From 4fb2f312c60cd10f57c08051d4243d3e62b1fbfb Mon Sep 17 00:00:00 2001
From: Irving Popovetsky <irving@chef.io>
Date: Fri, 18 Aug 2017 13:05:55 -0700
Subject: [PATCH] Fix notifications Habitat binds so that they can be optional
 and remove an unneeded one for database

---
diff --git a/notifications/habitat/config/env b/notifications/habitat/config/env
index faa23c642..19d70da22 100644
--- a/notifications/habitat/config/env
+++ b/notifications/habitat/config/env
@@ -1,9 +1,20 @@
+{{#if bind.elasticsearch}}
+  export ELASTICSEARCH_URL={{bind.elasticsearch.first.sys.ip}}
+{{else}}
+  export ELASTICSEARCH_URL={{cfg.elasticsearch_host}}
+{{/if}}
+
+{{#if bind.rabbitmq}}
+  export RABBITMQ_HOST={{bind.rabbitmq.first.sys.ip}}
+{{else}}
+  export RABBITMQ_HOST={{cfg.rabbitmq_host}}
+{{/if}}
+
+
 export HOME="{{pkg.svc_var_path}}"
-export RABBITMQ_HOST={{bind.rabbitmq.first.sys.ip}}
 export RABBITMQ_VHOST={{cfg.rabbitmq.vhost}}
 export RABBITMQ_USER={{cfg.rabbitmq.user}}
 export RABBITMQ_PASSWORD={{cfg.rabbitmq.password}}
-export ELASTICSEARCH_URL={{bind.elasticsearch.first.sys.ip}}
 export AUTOMATE_FQDN={{cfg.automate.fqdn}}
 export PORT="{{cfg.port}}"
 export REPLACE_OS_VARS="true"
diff --git a/notifications/habitat/plan.sh b/notifications/habitat/plan.sh
index a64ae9889..c339c2adb 100644
--- a/notifications/habitat/plan.sh
+++ b/notifications/habitat/plan.sh
@@ -11,8 +11,7 @@ pkg_deps=(
 pkg_build_deps=(
   core/make
 )
-pkg_binds=(
-  [database]="port"
+pkg_binds_optional=(
   [elasticsearch]="http-port"
   [rabbitmq]="port"
 )
```
While waiting for that to get accepted I built some of my own Docker
containers with those changes, which
Habitat makes easy with the hab pkg export docker
command. That way I could
quickly test those changes in my stack and get fast feedback.
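In case you haven't used it, the flow looks roughly like this; the origin name
is a placeholder and the build step happens inside the studio:

```bash
# Build the patched notifications package in a Habitat studio, then turn the
# result into a Docker image. "myorigin" is a placeholder origin name.
hab studio enter
build notifications/habitat            # run at the studio prompt
hab pkg export docker myorigin/notifications
exit
```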
AWS EFS provides a way of persisting data across container restarts, and even
concurrent access to files across containers (if you can do that safely), and
it's good old-fashioned NFS. Except it isn't, because EFS provides multi-AZ
availability, and NFS 4.1 isn't exactly old fashioned: it brings parallel
access (pNFS), file locking and significant performance improvements.
You still shouldn't use it to host your database files,
but for low-intensity IO it is totally fine and buys us a ton of flexibility.
As various AWS
articles
demonstrate, there are a few slightly awkward things about integrating EFS into
your ECS cluster, like needing an AWS::EFS::MountTarget for each subnet you
operate in. Once that's done, you can now tell ECS exactly where to put those
volumes on the host (hint: onto the EFS mount) like so:
```yaml
Volumes:
  - Name: maintenance
    Host:
      SourcePath: !Sub /mnt/efs/${AWS::StackName}/maintenance
```
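For that SourcePath to resolve, each ECS container instance needs the EFS
filesystem mounted at /mnt/efs before tasks start. A minimal sketch of what
that looks like in the instance bootstrap (UserData), with a placeholder
filesystem ID and region:

```bash
# Mount the EFS filesystem on the ECS instance so /mnt/efs/... paths exist.
# fs-12345678 and us-east-1 are placeholders for your own values.
yum install -y nfs-utils
mkdir -p /mnt/efs
mount -t nfs4 -o nfsvers=4.1 \
  fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/efs
```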
It took some digging around to realize exactly which containers' data I should
be mounting on EFS (I still don't really know what that maintenance
volume does),
but in later commits I start putting the Habitat data
volume for key
containers there. So let's move on to that!
Okay so sometimes I get a bit punchy with my commit messages. Just like that
Death Star, we’re hilariously not fully operational yet :D
What we’re doing here is replacing the elasticsearch
container with AWS’s ES
(Elasticsearch) service, which is where all that sweet visualization and
reporting data is going to go. Now it would be super cool if ES had a simple
access control scheme like RDS (VPC SecurityGroups) but no, they just had to be
different!
AWS ES controls access by IAM roles, and isn’t integrated with VPC at
all
(your traffic goes to a public IP). Each request to ES must be signed the same
way that AWS API requests are
signed.
Fortunately my team and I had already appropriated a useful bit of code for
that: the aws-signing-proxy
(credit to Chris Lunsford for the original code).
All I needed to do was habitize that, which was super easy when using another
Go-based application as an example.
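I won't reproduce my exact plan here, but a minimal sketch of what a Habitat
plan for a Go binary like this looks like; the origin, version and import path
are placeholders, not the real ones:

```bash
# plan.sh -- rough sketch of packaging aws-signing-proxy with Habitat.
pkg_name=aws-signing-proxy
pkg_origin=myorigin          # placeholder origin
pkg_version="0.1.0"
pkg_bin_dirs=(bin)
pkg_build_deps=(core/go core/git)
# Export the listen port under the key "http-port" (see below for why);
# assumes a `port` setting in default.toml.
pkg_exports=(
  [http-port]=port
)

do_build() {
  export GOPATH="${HAB_CACHE_SRC_PATH}/${pkg_name}-${pkg_version}/go"
  go get github.com/example/aws-signing-proxy   # placeholder import path
}

do_install() {
  install -D "${GOPATH}/bin/aws-signing-proxy" "${pkg_prefix}/bin/aws-signing-proxy"
}
```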
One thing I realized was that you need to export http-port
instead of port
(in the plan.sh) just like the Habitat elasticsearch
package
— that way binds that previously depended on elasticsearch
could now depend on
the aws-signing-proxy
service as a drop-in replacement.
Watching the container logs in Cloudwatch Logs, I noticed that other services
were taking a while to start up because they were re-creating various files at
init time. For example, the automate-nginx
container had a bit of code like
this:
```bash
# Generate a private key if one does not exist.
cert_file="{{pkg.svc_data_path}}/cert"
key_file="{{pkg.svc_data_path}}/key"

if [[ ! -f "$cert_file" ]]; then
  openssl req \
    -newkey rsa:2040 -nodes -keyout "$key_file" \
    -x509 -days 3650 -out "$cert_file" \
    -subj "/C=US/O=Chef Software/OU=Chef Delivery/CN=#{{cfg.server_name}}"
  chmod 600 "$cert_file" "$key_file"
fi
```
Thanks to some very forward-thinking developers, we can skip the expensive
openssl key generation step on subsequent starts by mounting the Habitat data
folder on EFS!
Habitat is already smart enough to instruct docker to mount the
/hab/svc/servicename/data directory as a separate volume, as I found by digging
around via docker inspect. So instructing docker to mount that on EFS made
sense for most of the containers.
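If you want to see that for yourself, something like this against a running
container shows the data directory as its own mount (the container name is just
a placeholder):

```bash
# List the mounts of a running Automate container; the /hab/svc/<service>/data
# path shows up as its own volume. "workflow-server" is a placeholder name.
docker inspect --format '{{ json .Mounts }}' workflow-server
```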
One debatable service was rabbitmq
. I chose not to mount this on EFS because I
was concerned about the performance impact of doing so, particularly in high
was concerned about the performance impact of doing so, particularly in
high-scale scenarios (which is our ultimate goal, after all!). Also, in our
experience RabbitMQ tends to be greatly impacted by slow disks when handling
large durable queues, so let's give it the fastest disk we can.
Now we have a Chef Automate container stack that can survive a wide variety of
faults up to and including instance termination — and can recover with all of
its data in just a couple of minutes.
In the next post, I’ll start working on multi-host operations for even better
availability as well as options to scale-out.