Since every critical facility is unique, many site-specific practices are required for successful operation. A leadership team may borrow general concepts from their experience with other facilities, but ultimately must develop a number of important processes independently. Without a common template, it is easy to overlook practices which will help ensure the performance record is optimal over the facility’s life.

Ten essential practices for critical operation success are outlined below, as a means of measuring your operation to identify needed enhancements.



Coverage should be dictated by a clear uptime objective for the data center. If downtime events can threaten the company’s existence, a minimum of 10 facilities operators is required to ensure at least two people are present at all times. Incidents occur with equal frequency, no matter the hour or day. A two-person presence ensures work activities and emergency responses can be conducted safely and effectively.
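The minimum of 10 operators can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the 40-hour work week and the 85% availability factor (an allowance for vacation, illness, and training) are assumptions rather than figures from this article, and `operators_needed` is a hypothetical helper name.

```python
import math


def operators_needed(people_on_site, hours_covered_per_week,
                     hours_per_operator_week=40.0, availability=0.85):
    """Estimate the headcount needed to keep `people_on_site` present
    for `hours_covered_per_week` hours each week.

    hours_per_operator_week and availability are assumed values;
    replace them with your own scheduling and leave figures.
    """
    # Total person-hours that must be staffed each week.
    person_hours = people_on_site * hours_covered_per_week
    # Hours one operator can realistically cover after leave and training.
    effective_hours = hours_per_operator_week * availability
    # Round up: a fractional operator still means one more hire.
    return math.ceil(person_hours / effective_hours)


# Two people present around the clock: 24 hours x 7 days = 168 hours.
print(operators_needed(2, 168))  # 10
```

Under these assumptions, two people on site around the clock works out to 10 operators. Adjust the inputs to match your own leave and overtime policies.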

If the company can survive periodic interruptions to data processing, a single weekday shift with five to seven operators may suffice, at half the cost of the continuous shift model. The trade-off will be a delay in response to incidents during evenings and weekends. Some avoidable cooling and generator system downtime will occur because operators are unable to intervene in time. Recovery from electrical system events will take more time.



Virtually every new critical facility project incorporates commissioning to validate that each system performs properly and that all systems interact as expected. Though this is typically a very busy part of the project, facilities operators should participate as observers. The facilities manager should predetermine which operators will be responsible for each infrastructure system once the building begins operation. No more than two operators should attend a specific system’s commissioning activity. This will minimize distractions to those conducting the activity and ensure the observers stay focused.



The phrase “site-specific” is critical to success for operations team training programs. Sending team members to seminars or factory training will be insufficient because system configuration and installation are unique to each building. Several practices should be included during new construction, and two significant processes should recur each year the facility is in operation. A single team member should be assigned to develop and manage the training program.


New construction

  • The design engineering team (Engineer of Record [EOR]) should be engaged to present a half-day session on the design concepts for the facility. Included should be system overview descriptions, an explanation of how all systems should interact, when the operator should not intervene (when something behaves unexpectedly), and when the operator should step in. This information will provide the basis of informed decision making in the future, when an event inevitably occurs for which no one thought to develop a procedure.

  • Infrastructure equipment suppliers are generally contracted to provide training to the operations team at the end of the construction project. It is important to schedule no more than two training sessions in a week to avoid information overload. This timing must be planned at the project’s onset and understood by the general contractor.

  • Commissioning agent training is arranged by some building owners. This should typically follow EOR and equipment supplier training. A focus on how systems behave under various loads is beneficial, as the new facility will initially see only light loads.


Ongoing (repetitive)

  • Each facility’s operations team has a simple opportunity to confirm each operator is equally experienced with isolating a system prior to maintenance or repair, as well as with restoring a system to service after the activity. Scheduled preventive maintenance (PM) events involving these system transfer procedures should be methodically utilized. A different operator should serve in the “hands-on” role each time that activity occurs on the calendar. This will necessitate two operators swapping shifts on the date of the PM, or that overtime be paid. The operator with the most expertise for a given system should serve as the trainer — guiding the hands-on operator through the process while following a written procedure.

  • Emergency response training may be accomplished by identifying all conceivable emergency events and scheduling monthly training sessions to address them. A one- to two-hour classroom session each month will typically allow your team to cover every potential event on your list over the course of a year.
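The hands-on rotation described for scheduled PM events can be planned ahead of time. The following is a minimal sketch, assuming the system expert always acts as trainer and the remaining operators take turns in the hands-on role. The function name, the operator names, and the quarterly PM labels are all hypothetical.

```python
from itertools import cycle


def assign_hands_on(pm_dates, operators, trainer):
    """Pair each PM date with the next operator in rotation.

    Returns a list of (pm_date, hands_on_operator, trainer) tuples.
    Assumption: the trainer never takes the hands-on role.
    """
    trainees = [op for op in operators if op != trainer]
    if not trainees:
        raise ValueError("at least one operator besides the trainer is required")
    rotation = cycle(trainees)  # repeats the trainee list indefinitely
    return [(date, next(rotation), trainer) for date in pm_dates]


# Hypothetical quarterly PM on one system, with "Avery" as the system expert.
schedule = assign_hands_on(
    ["Q1", "Q2", "Q3", "Q4"], ["Avery", "Blake", "Casey"], trainer="Avery"
)
for pm_date, hands_on, guide in schedule:
    print(pm_date, hands_on, "guided by", guide)
```

Publishing a schedule like this well in advance makes it easier to arrange the shift swaps or overtime the rotation requires.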



Practice time is by far the most frequently omitted process, yet it provides the only opportunity the facilities operations team will have to become fully confident with the most critical activities they will need to perform. Several astute management teams have successfully justified one to two months of dedicated time following the completion of construction and commissioning for this vital activity. By using emergency response and system transfer procedures and working in pairs, just as they will on active work shifts, operators will experience each critical event scenario individually. The experience is invaluable and closely analogous to pilots training in simulators before flying an airplane for the first time. We expect practice time to be scheduled increasingly as part of new construction projects, much as the commissioning process became an essential part of virtually every new critical facility’s schedule years ago.



A number of management teams have developed orientation programs for their facilities operations team. Frequently included are detailed schedules for completing training with more experienced team members. On average, new employees are not assigned to a regular shift with their own responsibilities until they have completed at least six months of “shadowing” individuals on various shifts. Both peers and supervisors are involved in reviewing and testing each new employee’s knowledge at scheduled intervals.

To accompany the shadow program, many organizations provide their employees an orientation manual. This will often include company and department mission, expected behavior, impact of downtime, an escalation list, an emergency call list for vendors, infrastructure systems descriptions, building one-line drawings, safety policies, and data center rules. 



Over 4,000 incident reports submitted by data center managers over 14 years revealed the majority of downtime is consistently caused by people working within computer rooms. The best countermeasure is to minimize the number of people authorized to enter the space and provide them clearly written definitions of task ownership. Defining separately which activities IT is responsible for and which facilities owns is crucial. Documentation should specifically identify where the facilities team completes their portion of power cable installations and where IT is involved in connecting power to the computer equipment, assuming both groups are involved.

The written agreement should identify how individuals from each group cooperatively plan the location of computer equipment within the room. Many management teams will also delineate expected involvement in strategy meetings, expected timing for updates to the other group when working an incident, and expected submittal of written incident summaries after an event.



This simple practice will eliminate many potential interruptions to computer operation, but it requires an investment of time from the management team and consistent backing from senior executives. Requiring each individual who sets foot in the data center to review and sign a thorough list of data center rules in the presence of a manager greatly reduces the chance that a lack of awareness will cause an interruption to the operation. Something as simple as a spilled cup of coffee has caused pain for numerous organizations. Rules should be reviewed with everyone who will ever enter, from entry-level employees to the CEO, and with every contractor or consultant, no matter how brief their visit will be. Even escorted tour groups should be required to comply.



This process requires perhaps the greatest investment in time and resources of the practices cited in this article. It is also perhaps the most significant. Initial focus should be on the two critical categories of procedures. Critical facilities need emergency response procedures which address every anticipated scenario. For most, this will require between 50 and 100 documents.

Of equal importance are system transfer procedures, which detail the steps to safely isolate a system or component prior to maintenance or repair (or to restore the same equipment to service after the PM or repair activity). Between 70 and 100 documents are commonly needed to address all infrastructure systems.

Preventive maintenance procedures, procedures for one-time upgrades or modifications, and administrative procedures should all be given a lower priority than the first two categories when document development is scheduled.

A single team member should be assigned ownership of the critical procedures program. This eliminates the potential for overlap, multiple formats, inconsistent phrasing, and other consequences that occur when more than one person is involved.  

Facility managers may contract procedures development to their design engineers or commissioning agents. If contracted as part of a new facility project, it is possible to receive completed procedures within two to three months of the completion of commissioning. It is important that each procedure be tested for clarity with your least knowledgeable operator for a given infrastructure system. This will permit you to identify where slight edits or additional details are needed before you consider the document ready for use.

Alternatively, or if you are well past construction, you may choose to have your team’s procedures owner develop the documents with the help of system experts. Experts may be design engineers, commissioning agents, manufacturers’ service technicians, and sometimes very experienced internal team members. Unless your procedures owner has no other tasks, this approach will entail a much longer schedule. Expect this effort to require more than a year, if you can allocate the procedures owner 30 or more dedicated hours each month. With help from an operations consultant, the timeframe is typically three to six months.

A consistent format should include descriptions of the procedure’s objective; associated risks; number of operators required; tools and safety equipment needed; notifications before and after; photographs with arrows to identify specific controls; check boxes for each step; and escalation plans in the event of a problem.
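One way to enforce a consistent format is to treat each procedure as a structured record and check it for empty sections before release. The sketch below is one possible layout, not a prescribed standard. The class names, field names, and the `missing_sections` helper are hypothetical and simply mirror the elements listed above.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ProcedureStep:
    instruction: str
    photo_ref: str = ""   # hypothetical pointer to an annotated photograph
    done: bool = False    # the check box for this step


@dataclass
class Procedure:
    title: str
    objective: str
    risks: List[str]
    operators_required: int
    tools_and_safety_equipment: List[str]
    notifications_before: List[str]
    notifications_after: List[str]
    steps: List[ProcedureStep]
    escalation_plan: str


def missing_sections(proc: Procedure) -> List[str]:
    """Return the names of sections left empty (or zero), in template order."""
    return [f.name for f in fields(proc) if not getattr(proc, f.name)]


# Hypothetical draft with two sections still to be written.
draft = Procedure(
    title="Isolate UPS A",
    objective="Safely isolate UPS A prior to maintenance",
    risks=[],
    operators_required=2,
    tools_and_safety_equipment=["arc-flash PPE"],
    notifications_before=["IT operations"],
    notifications_after=["IT operations"],
    steps=[ProcedureStep("Verify the load is supported by UPS B")],
    escalation_plan="",
)
print(missing_sections(draft))  # ['risks', 'escalation_plan']
```

A check like this does not judge the quality of the content, so testing each document with your least knowledgeable operator remains essential.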

Only the procedures program owner should have the ability to edit the documents. All others should have access to PDF copies only. Previous versions should be destroyed each time a procedure is updated. Binders of emergency procedures specific to equipment in each infrastructure system room should be easily accessible within the room to save time when an alarm occurs.



Your investment in implementing the previously described processes will be at risk if you fail to retain those who successfully operate and support your facility’s systems. Continuity greatly contributes to continuous operation. In addition to your own staff, critical team members include contracted electricians, equipment service technicians, design engineers, commissioning agents, network installation contractors, and more. Familiarity with your facility’s unique configurations and processes spawns success. Introducing a new team member will always introduce risk.

Recognition, monetary rewards, swapping some responsibilities periodically within the operations team to keep things fresh, and promoting deserving individuals to “lead person” or “subject matter expert” roles are several means of keeping an established team content. It is important to verify your team’s pay and benefits are at least 10% to 20% higher than the metropolitan or regional average for critical facilities positions. If you receive pushback on this, compare the added expense with the cost of a downtime event.



Because your team works in the same environment every day, it is valuable to have your critical facility assessed every three to five years by someone else with extensive critical facilities experience. Seeing the facility through others’ eyes is always enlightening. Unfortunately, over 95% of assessments are conducted reactively, after the pain of a downtime event has occurred. At that point, requests for assessment funding go unchallenged.

By proactively scheduling this effort at regular intervals, management teams can accurately determine where single points of failure may exist, where system capacities are in jeopardy of being exhausted, and where processes may fail to match the operation’s objectives. Assessments will also highlight efficiency accomplishments and best practices observed. The discovery of deficiencies and risks, as well as positive practices, will provide justification for needed funding, in addition to reinforcing annual budget allocations.



You have likely implemented several of these practices or some that are similar. The intent of this outline is to prompt ideas which, when implemented, will enhance your team’s ability to deliver continuous operation. I encourage you to contact me with questions and additional practices you have found to be beneficial and I wish you continued success.