“The Visible Ops Handbook” and MFCF

Introduction

The Visible Ops Handbook, by Kevin Behr, Gene Kim, and George Spafford, is a small but high-density handbook describing what is necessary to achieve and ensure the efficient and reliable operation of an IT department.

Most IT departments are in an obviously bad state, something that can easily be verified by considering the percentage of time spent on firefighting and handling emergency problems. This book provides a step-by-step procedure for turning such a department around.

The Testimonials and Foreword describe how wonderful this book is. The Introduction similarly praises itself, but also gives some background as to why it is needed and some observations as to what good IT departments (and other organizations) all have in common.

Several appendices provide a glossary, audit hints, references, and reprints of a few related articles.

But it's the 40 pages of actual content that make this book worthwhile.

The four chapters don't say how to organize things, leaving that to the ITIL standard. Instead they provide three sequential steps that correct systemic problems and bring the organization in line with standard practices, and a fourth step that retains these changes and sustains the improvement.

Phase 1: Stop things from getting worse
Phase 2: Identify and document everything
Phase 3: Simplify and automate everything
Phase 4: Enable continuous improvement

Each of the first three phases can take a month or a year to complete, depending upon how bad the initial state is, and how well management and staff can work to achieve each goal. Those who don't see the potential benefits of this approach need to understand that brakes allow cars to go faster.

Phase 1

Often, a significant amount of work within IT is self-inflicted. Nearly every service interruption is caused by a planned change with unplanned consequences, and most of the resulting recovery time is taken up in determining what that change was. That's worth repeating: most of the time it takes to recover from a problem is taken up by determining what change caused the problem. Determining what changed recently, who made the change, and why can be very time-consuming, often much more difficult than what it eventually takes to correct the problem itself.

The first step is to stabilize the patient by implementing a strict “no one changes anything” policy, starting immediately, and rigidly enforcing it. This will almost guarantee continuing reliable service. It will also eliminate any possibility of improvement, but that comes later.

The next step is to allow changes, but only via a well-documented process. Any change must be fully documented (e.g. what the change is, its purpose, who wants it, who will do it, consequences of failure, backout plans, who approved it). Even routine changes that generally don't need specific approval should all be recorded in a well-known place.

The next time something goes wrong, determining what change caused it should be much easier and the recovery much faster.
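As a rough illustration of why such a log shortens recovery, here is a minimal Python sketch. The field names simply mirror the items listed above, and both ChangeRecord and changes_before are hypothetical names of my own, not anything the book prescribes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeRecord:
    """One logged change. Field names are illustrative assumptions."""
    when: datetime
    target: str           # host or service that was changed
    description: str      # what the change is
    purpose: str
    requested_by: str
    implemented_by: str
    failure_impact: str   # consequences of failure
    backout_plan: str
    approved_by: str


def changes_before(log, incident_time, target, window=timedelta(days=7)):
    """Return changes to `target` made within `window` before the incident, newest first."""
    start = incident_time - window
    hits = [c for c in log
            if c.target == target and start <= c.when <= incident_time]
    return sorted(hits, key=lambda c: c.when, reverse=True)
```

Given a complete log, the first question after an outage ("what changed on this service in the last week?") becomes a single lookup rather than an investigation. The seven-day default window is an arbitrary choice for the sketch.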

The success of this phase requires an enforced zero-tolerance policy for undocumented changes.

Phase 2

We now have a relatively stable environment, with all changes thoroughly documented, so it's time to look at everything we have and document it. In particular we need to identify the equipment, services, etc. that cause the most unscheduled work, and those whose loss would have the severest impact.

This phase involves identifying and cataloguing everything. That there is so much to document is itself an indication that things are not as they should be. Hardware, software, and procedures can easily multiply into many different versions, and each variation adds more work in terms of maintenance, replacement, documentation, and user knowledge.

It's important to include staff as part of everything, by documenting exactly who is responsible for what. Such a list must include every possible responsibility; otherwise too many things will end up having to go to top management.

Similarly, all procedures must be documented.

Phase 3

We have a relatively stable environment, with all changes approved and logged. Everything is completely catalogued and documented, and everyone knows the procedures that must be followed. So now we can begin to make it simpler and more reliable.

The number of variations that were identified in Phase 2 must be reduced. For instance, buying cheaper PC Intel boards can result in requiring dozens of different boot images to match the peculiarities of each batch of equipment purchased. On the other hand, while buying dual-boot Mac Intel machines with consistent and predictable hardware configurations is initially more expensive, it can save a lot of time, effort, and money in the long run.

Similarly, having a dozen identical servers is good, but it's very easy later on to add a special service to one, and then another special service to another, and so on. Identical servers should remain identical. Special services should be on their own dedicated servers (or better yet, they should be avoided).
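One way to notice when supposedly identical servers have quietly diverged is to compare a fingerprint of each one's configuration. The following is only a sketch under my own assumptions: each server's configuration is represented as an in-memory mapping of path to file contents, and fingerprint and drifted are hypothetical helper names.

```python
import hashlib
from collections import Counter


def fingerprint(files):
    """Hash a {path: content} mapping into one digest, independent of dict order."""
    h = hashlib.sha256()
    for path in sorted(files):
        h.update(path.encode())
        h.update(b"\0")
        h.update(files[path].encode())
        h.update(b"\0")
    return h.hexdigest()


def drifted(servers):
    """Return names of servers whose fingerprint differs from the most common one."""
    prints = {name: fingerprint(files) for name, files in servers.items()}
    baseline, _ = Counter(prints.values()).most_common(1)[0]
    return sorted(name for name, p in prints.items() if p != baseline)
```

Treating the most common fingerprint as the baseline is a simplification; in practice the baseline would more likely be the documented build from which the servers are supposed to be created.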

We must also reduce the complexity of each item. Since everything is documented, in theory we should be able to throw away any piece of equipment or installed software and rebuild it from scratch, ending up with something that is identical. In the past this hasn't been possible, but we must make it so now. This is particularly important for both the troublesome and the critical items identified in Phase 2.

Much of this step involves creating libraries of common knowledge/hardware/software that can be used to rebuild multiple items. General purpose designs, configured for specific purposes, are far better than specific purpose designs.

The days when someone could make a quick change to something to make it work better, to fix a bug, or whatever, are over. If someone else rebuilding that item from scratch wouldn't know to make that same change, then it isn't documented right.

Phase 4

We now have a stable environment where almost nothing goes wrong (because nothing is done without careful planning through well-defined procedures), where recovery time from those rare problems is minimal (because all changes are readily determinable), and where in the worst case the service can quickly be recreated from scratch. This is exactly where we should be. All that remains is to make it even better.

The best way to improve something is to have an objective measure of how good it is, and to work on improving that measure. Unfortunately, in many cases once the measurement system is known it's easy to work towards improving the measurement itself without improving what it was intended to measure.

So instead of attempting to measure actual results, it's better to measure the attributes of the ongoing activities known to contribute to the overall results.

For instance, what percentage of our time is spent on planned activity and how much on emergencies and firefighting? And within those emergencies and firefighting, what are their general causes? IT departments typically spend over two thirds of their resources on unplanned activity, but following these four phases can reduce it to less than a quarter. Looking at it in terms of productivity, that means increasing the department's useful work from less than 33% to over 75%, equivalent to increasing the work force by more than 125%.
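The arithmetic behind that last figure is easy to check. This small Python sketch uses a function name of my own choosing and takes the boundary values of one third and three quarters as the planned share of time before and after.

```python
def workforce_gain(planned_before, planned_after):
    """Fractional increase in useful work when the planned share of time rises."""
    return planned_after / planned_before - 1


# Boundary case: exactly 1/3 planned before, exactly 3/4 planned after.
gain = workforce_gain(1 / 3, 0.75)
print(f"{gain:.0%}")  # prints 125%
```

Since the starting share is actually below one third and the final share above three quarters, the real gain is larger than this boundary value, which is why the equivalent increase is "more than 125%".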

That result does sound somewhat fantastic, but within only a few days of reading the book I've already become very aware of how serious this problem is within MFCF, and can easily believe it.

Such measurements might also indicate the need for additional staff duties (if not actual additional staff). If there is a common factor that generates a significant amount of unplanned activity and that activity is often of the same type, perhaps that should be turned into a scheduled and budgeted item. (Did someone mention providing support for the Dean's office?)

Many other useful measurements for improvement are also given.

How this book doesn't relate to MFCF

This book is filled with excellent advice, almost enough to justify the many pages of self-praise that pad it out. And MFCF certainly experiences most of the problems for which this book offers cures. Unfortunately, it is aimed at much larger organizations than MFCF.

For instance, it talks about several levels of management, committees, designers, implementers, users, etc., and in many cases within MFCF all those people would be the same person.

And MFCF is a service organization; we don't have a bottom line that shows a profit at the end of the year. Hiring two people to increase profits by several hundred thousand dollars a year is easy to justify; hiring someone simply to improve service isn't.

How this book does relate to MFCF

Even so, there are many concepts that MFCF could benefit from.

Virtually every page offers valuable information, much of which could be applied even to small organizations, though perhaps not as easily nor with as effective results.

The most critical factor though seems to be the ability of management to monitor and enforce policies, and to rapidly respond to requests. If MFCF's management could operate at this level, we could significantly benefit from the suggestions in this book.

Interestingly, MFCF used to follow many of these suggested principles instinctively. We were required, whenever we made a change, to record it in the uw.mfcf.software newsgroup. Similarly, all configuration files required the use of RCS to record exactly what changed and why. These are two of the most important steps in implementing Phase 1, but unfortunately they are no longer as common in practice as they used to be.

Problems within MFCF

Applying the Visible Ops concepts might prove difficult.

MFCF has a long history of what the authors call a “cowboy mentality”. In our golden age, that was actually a good thing, though I don't think anyone would argue that it is now. But the practice is long established, and it's difficult for some of us to not simply fix a problem when we see it.

And at the other end of the organization, we have a long history of weak management. Again, that used to be a good thing, with the Director (a part-time position) simply setting overall direction, saying “this is what I want” and then having faith that it will happen without any concern about how. But there's no way that can work now.

Five or six years ago, when Bill ran almost everything by himself, it almost worked. Certainly since then we have not had good management: originally because the managers themselves lacked the necessary management knowledge and experience, and later, with Tom, because staff didn't know how to handle being managed.

In order to make things work, we need a lot more management oversight. Definitely not the micromanaging that started when John was Director, but oversight in the sense of continually monitoring overall statistics and trends and immediately putting things back on track. While management should be flexible with respect to how individual staff do their work, it needs to be far more anal retentive about the overall running of the department.

At the moment, we don't have the management skills, personalities, or numbers to run things the way we should. For example, Xxxi and Ray would be great at the anal retentive part, but I think we lack more than a few of the other required attributes for management. And while Tom does have the management background, he's only one person.

For example, we currently have 44 RT items that are past their due date, an obvious cause for concern. Similarly, our inventory database contains nearly a thousand unknown items and dozens of discrepancies with DNS. If over the last few years staff still hasn't cleaned up such problems and management hasn't done anything about them, how can we hope to accomplish the vastly greater tasks suggested by this book?
