In a way, building IT infrastructure isn’t particularly exciting. Want to turn a Linux server into a MySQL database server? Enter some commands to install MySQL. Congratulations, you have successfully used some data (yum -y install mysql-server) to change the generic model (the out-of-the-box operating system). Much of our infrastructure is built the same way.
Chef takes this idea to its logical conclusion by formalizing the relationship between data (attributes, or properties of a machine) and the model (recipes and cookbooks). One of the most important things for Chefs to contemplate, then, is where to put their data and how to model it. Your data model changes the syntax and structure of your recipe code, and vice versa. It’s important to consider not only the design of attribute data structures, but also when and where to use data bags.
When Chef was first invented, there were only node attributes. In fact, there was only one level of node attribute: what we know now as “set” or “normal”. As more data structures got added to Chef (roles and environments being chief among them), attribute precedence and merge order were invented to resolve conflicts, and “default” and “override” levels were added for even more flexibility.
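To make the idea of precedence and merge order concrete, here is a minimal sketch in plain Ruby. It is a deliberately simplified model, not Chef’s actual implementation: it assumes only the three levels named above, and the helper names (deep_merge, resolve) and the MySQL attribute values are hypothetical.

```ruby
# Simplified model of attribute precedence; NOT Chef's real implementation.
# Levels are listed from lowest to highest precedence.
PRECEDENCE = %i[default normal override].freeze

# Recursively merge two hashes; on a conflict the value from `higher` wins.
def deep_merge(lower, higher)
  lower.merge(higher) do |_key, low, high|
    low.is_a?(Hash) && high.is_a?(Hash) ? deep_merge(low, high) : high
  end
end

# Collapse the per-level hashes into the single view a recipe would read.
def resolve(levels)
  PRECEDENCE.reduce({}) do |merged, level|
    deep_merge(merged, levels.fetch(level, {}))
  end
end

# Hypothetical attribute values set at each level.
levels = {
  default:  { 'mysql' => { 'port' => 3306, 'bind_address' => '127.0.0.1' } },
  normal:   { 'mysql' => { 'bind_address' => '0.0.0.0' } },
  override: { 'mysql' => { 'port' => 3307 } }
}

resolved = resolve(levels)
puts resolved.inspect
```

In this sketch the “override” port and the “normal” bind address both survive, because conflicts are resolved key by key rather than by replacing the whole hash.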
It soon became clear that a higher-order, global data structure was needed. Hence, data bags were born. If you’re a Dungeons & Dragons player, you’ll notice that the name is a humorous take on “bag of holding”.
Data bags are generally used to hold global information (“data bag items”) pertinent to your infrastructure that are not properties of the nodes themselves. In the majority of scenarios, you will continue to model most of your infrastructure using node attributes. Here are a few guidelines for whether a data bag item should be used to represent a piece of data:
If none of these conditions is true, implement the configuration as an attribute.
One of the strengths of the Chef server is that its API is well-documented, open, and easy to integrate with. Experienced customers and open-source users alike have written plugins and add-ons to the Chef client to enable it to do things that our software engineers never even thought of.
However, it is often undesirable for an external system to write to a node attribute, for the simple reason that it would need to write to one of three objects: a cookbook, a role, or an environment. A mistake could have far-ranging side effects beyond just the intended change. On the other hand, modifying a data bag item is a small, self-contained operation. The programming API is also far easier to use.
Here’s an example of a custom Ruby script that uses Chef as a library to update application release information for an app “foo”:
    require 'net/http'
    require 'chef/rest'
    require 'chef/config'
    require 'chef/data_bag'
    require 'chef/data_bag_item'

    bagname = 'myapps'
    appname = 'foo'
    version = '1.0.0'

    # Use the same config as knife uses
    Chef::Config.from_file(File.join(ENV['HOME'], '.chef', 'knife.rb'))

    # Load data bag item, or create it if it doesn't exist yet
    begin
      item = Chef::DataBagItem.load(bagname, appname)
    rescue Net::HTTPServerException => e
      if e.response.code == "404" then
        puts("INFO: Creating a new data bag item")
        item = Chef::DataBagItem.new
        item.data_bag(bagname)
        item['id'] = appname
      else
        puts("ERROR: Received an HTTPException of type " + e.response.code)
        raise
      end
    end

    item['version'] = version
    item.save
Many customers use this approach to implement a continuous delivery pipeline. Successful completion of a pipeline stage (e.g. “passed unit tests”) might trigger a corresponding data bag update (e.g. “update QA’s data bag item with the build #”). Next, the Chef recipe handling application deployment will pick up the change and deploy the new version of the application. This can happen either asynchronously (next time Chef Client runs) or synchronously (by using “knife ssh” or even a Push Job as another pipeline action).
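The recipe side of that pipeline boils down to comparing the version recorded in the data bag item with what is already installed. The following is a minimal sketch of that decision in plain Ruby, not a complete deployment recipe. The method name deploy_needed? and the sample versions are hypothetical, and a Hash stands in for the item; inside a real recipe the item could be loaded with data_bag_item('myapps', 'foo').

```ruby
# Decide whether a deploy is needed by comparing the version recorded in the
# data bag item with the version currently installed on the node.
# `item` is anything Hash-like; a plain Hash is used here as a stand-in.
def deploy_needed?(item, installed_version)
  desired = item['version']
  return false if desired.nil? # nothing has been published yet

  desired != installed_version
end

# Stand-in for the 'foo' item in the 'myapps' bag after a pipeline update.
item = { 'id' => 'foo', 'version' => '1.0.1' }

puts deploy_needed?(item, '1.0.0') # node is behind the published version
puts deploy_needed?(item, '1.0.1') # node already runs the published version
```

Keeping the comparison in one small method makes it easy to reuse whether the run is triggered asynchronously or pushed synchronously from the pipeline.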
To summarize, data bag items provide a way to store data that is not directly associated with any particular node in the infrastructure. Data bags are also searchable: the name of the index is the name of the bag, so don’t name a bag “role” or “node” or it will never be found!
Keep data bag items small. The data is transmitted from the server to the client on every Chef run, so you don’t want an 8 KB data bag item being queried by 1,000 machines every 15 minutes: that’s 32 MB/hour of JSON on the network!
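The arithmetic behind that figure is easy to rerun for your own fleet. Here is a small sketch, assuming every node fetches the item once per run and using decimal units (1 MB = 1000 KB) to match the estimate above; the method name hourly_transfer_kb is hypothetical.

```ruby
# Estimate how much data bag JSON crosses the network per hour, assuming
# every node fetches the item exactly once per Chef run.
def hourly_transfer_kb(item_kb:, nodes:, interval_minutes:)
  runs_per_hour = 60.0 / interval_minutes
  item_kb * nodes * runs_per_hour
end

# The scenario from the text: an 8 KB item, 1000 nodes, a run every 15 minutes.
kb = hourly_transfer_kb(item_kb: 8, nodes: 1000, interval_minutes: 15)
puts "#{kb / 1000} MB/hour"
```

Halving the item size or doubling the run interval halves the total, which is why trimming large items pays off quickly at scale.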
Finally, if in doubt, store data as a node attribute, until you find a need to convert it to a data bag item. You can always refactor your code.