The hand-out from the C(OS)2

Ad hoc solutions to site configuration lead only to job security for the perpetrators. A better plan is repeatable, practical and structured engineering for every data center resource. We "manage" processes with highly variable inputs (e.g., food, worker output, children, and weather); we engineer bridges, roads, machines, and data centers. This is not an "all or nothing" proposal; adopt only the engineering processes and policies that work for you. In this presentation I'll explain both the conceptual process and an example implementation (or two if we have time). All the tools mentioned and most of the policies I used are available for your use after the talk. Linux RPMs will be available for the code.

The same engineering practices are used at Indiana University, Purdue, FedEx, Facebook, Google and other organizations.

Layer 5 -- site policy, long-term planning, organizational goals

Every action taken on behalf of the organization should be in alignment with that structure's goals; this is where we set those goals.

Site policy answers questions like these:

The policy we set is a method to come to new and better solutions, not a set of limits on the potential of the organization. When any policy keeps us from meeting another policy, we have to push the conflict up to layer 5 to get them back in alignment. Usually policy is set by management or consensus amongst peers.

Layer 4 -- running platforms implementing site policy

The data center, the test clusters, the sandbox hosts, and the workstations all implement site policy as running instances. This includes the routers, switches, disk arrays, PDUs, and even paper documents.

Without a policy to compare our work to, we will be lost in a sea of possible options. So we need a signature to describe each type of entity: a blue-print for a webserver, for the switch it is connected to, for the routers, and the firewalls, and the other parts that make the site work. That extends to desktops, printers, backups, spare parts and every little thing.
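
As a minimal sketch of what a signature could look like, assume each type of entity is described by the set of packages it must carry, and that a running instance is audited against that set. The names here (`SIGNATURES`, `audit`, and the package names) are hypothetical and chosen only for illustration; they are not taken from the tools in this talk.

```python
# Hypothetical signatures: each entity type maps to the packages it must carry.
SIGNATURES = {
    "webserver": {"httpd", "openssl", "site-certs"},
    "firewall": {"pf-rules", "openssl"},
}


def audit(kind, installed):
    """Compare an instance's installed packages to the signature for its type.

    Returns (missing, extra): packages the signature requires but the
    instance lacks, and packages the instance carries outside the signature.
    """
    wanted = SIGNATURES[kind]
    have = set(installed)
    return sorted(wanted - have), sorted(have - wanted)


if __name__ == "__main__":
    missing, extra = audit("webserver", ["httpd", "openssl", "games"])
    print("missing:", missing)
    print("extra:", extra)
```

An instance that matches its signature returns two empty lists; anything else is a difference to fix on the host or to push back up as a policy question.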

Layer 3 -- package release, dependencies, and upgrade paths

Packages assure that a set of related products (below) all agree. So protocols are in-sync, documentation matches code, and function follows requirements as they were when the package was built.

Without packages, code, documentation, and protocols tend to drift, which causes both proximal and distal errors. See my structural complexity notes. It is quite possible to build workable structures using only layer 2 products; I've had little success doing that without some layer 3 policy that requires testing at layer 3, even without packaging at layer 3.
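
To make the drift concrete, here is a small sketch that assumes a package is recorded as a manifest mapping each product name to the version it was built and tested with. The function name `drifted` and the manifest layout are assumptions for illustration, not the format of any real packaging tool.

```python
def drifted(manifest, installed):
    """Return the products whose installed version differs from the manifest.

    manifest:  product name -> version the package was built with
    installed: product name -> version found on the host
    The result maps each drifted product to (wanted, found); a product
    missing from the host is reported with found set to None.
    """
    return {
        name: (wanted, installed.get(name))
        for name, wanted in manifest.items()
        if installed.get(name) != wanted
    }


if __name__ == "__main__":
    manifest = {"daemon": "2.0", "manual": "2.0", "protocol": "7"}
    host = {"daemon": "2.0", "manual": "1.9"}
    for name, (wanted, found) in sorted(drifted(manifest, host).items()):
        print(name, "wanted", wanted, "found", found)
```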

Layer 2 -- products by version, matching documents, and plans

Products (programs, books, papers, related configuration files) all require a set of files to reproduce results. At the minimum we need a recipe file to drive the creation and installation of the files, and a recipe in that file to reset the configuration automation to a ready state. That is in addition to the payload, which may require many files to build.

So a product is usually represented by a directory (aka folder) in a hierarchy that mirrors the installation location of the product. For example, the source to /bin/test might be kept under /usr/src/bin/test/, wherein we might find files like test.c and a make recipe file like Makefile. The recipe file at least knows how to build and install the application and the documentation.
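
The mapping from an installed file to its product directory can be sketched as below. This assumes the /usr/src root from the example above; `SRC_ROOT` and `source_dir` are hypothetical names for illustration.

```python
from pathlib import PurePosixPath

# Assumed source root, taken from the /bin/test example above.
SRC_ROOT = PurePosixPath("/usr/src")


def source_dir(installed_path):
    """Map an installed file to the product directory that mirrors its location."""
    path = PurePosixPath(installed_path)
    if not path.is_absolute():
        raise ValueError("expected an absolute installation path")
    # Drop the leading "/" so the path can be re-rooted under SRC_ROOT.
    return SRC_ROOT / path.relative_to("/")


if __name__ == "__main__":
    print(source_dir("/bin/test"))
```

Because the hierarchy follows the installation location, anyone who finds a broken file on a host knows where to look for its source and recipe file.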

Layer 1 -- files by milestones with revision history

At the bottom we need to keep track of changes to every file that builds the structures above. To do that we need to:

Note that these attributes are passed up to the layers above by virtue of the elements above all being built from these files. The transitive nature also applies to the other layers: packages make the signature of a host less verbose, as more features and files are represented by a package than by a product.

Finger files -- ad hoc changes

This layer is what everyone does without configuration engineering: using their fingers to fix the configuration of every file each time it needs an update, usually as things break. It is not listed in the figure below.

Everyone uses fingers (or voice) to input raw files. That's not what we mean: we mean that using those files without any controls leads to errors and a lack of reproducible results. What Bob typed in for /etc/resolv.conf last year he might not remember next year. So we put all configuration files in revision control.
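
One payoff of keeping the file under revision control is that a hand edit becomes detectable. The sketch below assumes we record a digest of the last checked-in revision and compare it with the live file; `digest` and `changed_by_hand` are hypothetical helpers, not part of any revision control system, and the addresses are documentation examples.

```python
import hashlib


def digest(text):
    """Return the SHA-256 hex digest of a configuration file's contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_by_hand(recorded_digest, live_text):
    """True when the live file no longer matches the revision on record."""
    return digest(live_text) != recorded_digest


if __name__ == "__main__":
    recorded = digest("nameserver 192.0.2.1\n")
    live = "nameserver 192.0.2.53\n"  # what somebody typed in later
    print("hand edit detected:", changed_by_hand(recorded, live))
```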

The discussion

I used some examples to highlight each of the layers and what made them work. Then we went over some examples that show how each level helps recover from errors before they create distal issues that are hard to trace back.

We talked about the example of running out of gas: the empty tank is a proximal cause, but the full credit card is the distal cause. Note that the proximal cause is easier to see; the distal cause might show other issues (like not being able to eat either).

We talked about collected data being actionable: even power readings might be used to note a failed disk or power supply -- which is exactly as actionable as knowing which PSU or disk failed. So maybe we don't need to SNMP sample every disk in the frame?
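
As a rough sketch of that idea, assume each frame has a known baseline power draw and that a reading far enough from the baseline is worth a visit. The function name `power_anomaly`, the 10% tolerance, and the wattages are assumptions for illustration, not measured values.

```python
def power_anomaly(baseline_watts, reading_watts, tolerance=0.10):
    """Flag a frame whose draw deviates from its baseline by more than tolerance.

    tolerance is a fraction of the baseline; 0.10 is an assumed default.
    """
    if baseline_watts <= 0:
        raise ValueError("baseline must be positive")
    return abs(reading_watts - baseline_watts) / baseline_watts > tolerance


if __name__ == "__main__":
    # Hypothetical frame with a 1000 W baseline.
    for reading in (960, 850):
        print(reading, "W flagged:", power_anomaly(1000, reading))
```

Either way somebody has to walk to the frame, so one reading per frame may be as actionable as a sample per disk.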

We talked about why I use a change log to record all the updates and changes at every level: the revision control logs per file, the product TODO files, the request queue entries, and an operational log of changes to each instance. These all bring information closer to the people and processes that need it.

The other pages we looked over at the end

See the error page that I put up for that part. Then we looked at the master source implementation page a bit -- but that one is a whole additional presentation. We also looked at the manual page for hxmd(8l) (*), mostly to show the redo feature that brings remote errors back to the server that asked for the update (making a distal error more proximal and actionable).

There are other manual pages I had up, but we didn't really use them. They are all linked from the master source document, and we might have a talk about those someday, if requested. First we should do the structural complexity talk.

I promised a link to John C. Doyle and his truly awesome work, like lecture 3 page 63, 85-118, 127-167 if you are into it. Or the index of his lectures: read them all if you are really into it.

He is a very smart man, by any metric. Check out the number of papers he has credits on, and in how many fields of study!

As always

"The point of these teachings is to control your own mind: use only as directed"

Follow-up links

Missing a 2 person task rule? Here is a $464M reason to have one.

(*) It is really cool that my manual page CGI allows a link to any section.

$Id: index.html,v 1.3 2013/10/23 16:19:21 ksb Exp $