Perhaps the most critical systems in any critical facility are the monitoring and control systems. Most sites have invested significant money and resources in selecting, designing, and installing robust computer-based monitoring and control systems. These systems are often referred to as building management systems (BMS), building monitoring systems (also BMS), building automation systems (BAS), or, for electrical infrastructure, electric power monitoring systems (EPMS). Specialized high-speed data-capture EPMS systems used for forensic time-stamping and waveform capture are called power quality monitoring (PQM) systems.

Typical office buildings frequently use direct digital control (DDC) systems that rely on “canned” devices designed for specific applications such as controlling pumps, fans, and air handlers. These systems require the least amount of customized programming and setup. A distributed control system, or DCS, features control programming logic that is distributed to multiple field devices that report back to a central computer workstation.

Systems that use more powerful and capable controllers requiring a greater degree of customized programming are called programmable logic controller, or PLC, systems. The most robust and customizable systems with the broadest capabilities, which also require the most programming, are called supervisory control and data acquisition (SCADA) systems and are typically found in more complex industrial applications. I’ll defer to others for an in-depth discussion of the differences and respective pros and cons of these different types of control systems. The focus here is on industry “best practices” for managing the operations and maintenance of these critical systems.


While most of these systems appear similar to the end-user operator in regard to capabilities and functions, the actual capabilities and functions deployed and used vary greatly from site to site. Most of these systems have extensive capabilities that remain unused unless the site staff has in-house, system-specific expertise (i.e., system administration, programming, and certified, trained technicians) or unless the installation contract spelled out in detail exactly which capabilities and functions were to be set up, activated, and configured. This is due in great part to how the systems were originally specified in the construction project contract documents, how well the installation, programming, and staff training were executed, and what kind of post-installation vendor support services were retained. It is helpful to consider the two main functions of these systems: monitoring and control.

Control is mostly invisible. The control system takes in a myriad of input signals and sends out corresponding outputs directing actuators to take action to maintain conditions in normal ranges. The programmed logic and resulting outputs are based on a defined normal “sequence of operations” (SOO). When the input signals indicate that conditions have become abnormal (too hot, loss of power, low frequency, low static pressure, etc.), the control system initiates emergency actions based on emergency SOOs (start redundant equipment, close dampers, transfer power, etc.). The key to sound, stable, and effective control is comprehensive startup and testing during the commissioning of the site infrastructure and controls. It is very important that the testing include normal and emergency modes of operation at various load profiles. Many control systems that are stable during low loads become unstable at higher loads, and vice versa.
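The normal-versus-emergency SOO selection described above can be sketched in pseudocode-style Python. This is a minimal illustration, not any vendor’s logic; the point names, limits, and thresholds are all hypothetical assumptions.

```python
# Hypothetical SOO selection sketch. Limits and point names are
# illustrative assumptions, not taken from any real BMS or site.

NORMAL_LIMITS = {
    "supply_air_temp_f": (50.0, 60.0),    # acceptable supply air range
    "static_pressure_inwc": (0.8, 1.5),   # duct static pressure
    "line_frequency_hz": (59.5, 60.5),    # utility frequency
}

def out_of_range(inputs, limits):
    """Return the names of inputs outside their normal ranges."""
    return [name for name, (lo, hi) in limits.items()
            if not lo <= inputs[name] <= hi]

def select_soo(inputs):
    """Choose the sequence of operations based on current conditions.

    Abnormal inputs trigger the emergency SOO (start redundant
    equipment, close dampers, transfer power, etc.); otherwise the
    normal SOO continues to run.
    """
    abnormal = out_of_range(inputs, NORMAL_LIMITS)
    if abnormal:
        return ("emergency", abnormal)
    return ("normal", [])
```

In a real controller this decision runs continuously against live field inputs, which is why commissioning must exercise both branches at multiple load profiles.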

Monitoring is the aspect that is most visible and where operating staff interact with the system the most. In most cases, the majority of monitoring occurs at the main central computer workstation to which the field devices and panels report (frequently referred to as the “front-end”). The front-end is typically located in a facilities command center (FCC) or other office space where the facility management staff reside. The front-end should be capable of displaying pre-programmed “graphics” and other displays on multiple screens (the more, the better). Properly designed FCCs have three or more large wall-mounted flat-screen TVs at a height where anyone in the room can easily view the displays. There should also be multiple smaller screens, including workstations, for display of additional data and graphics. The purpose is to be able to monitor the most critical systems and site conditions concurrently, as well as any and all alarm conditions, trends, and performance parameters. There should be sufficient screens to monitor specific systems or equipment of concern as well as any critical operations at risk.

A typical best practice is to have the site electric power system displayed in the form of a live single-line diagram that shows real-time equipment status and breaker positions, power meters, and the energized power paths. A separate display should show the central cooling plant with real-time equipment status and valve positions, temperature sensors and flow meters, and flow paths. A third display would normally show the data center computer room(s), including the status and key parameters associated with the power (PDUs, RPPs, etc.) and cooling (CRAHs, humidity control systems, outside air units, etc.).

Other monitors display critical data or system statuses such as recent alarms including active-unacknowledged, active-acknowledged, and archived (resolved) alarms, pertinent trend reports, temporary system control “overrides,” and other useful data. An important aspect of critical systems and equipment that should be monitored is the availability of standby equipment to respond in an emergency or anomaly condition. Hand-Off-Auto (HOA) switch positions should be monitored and should result in an alarm if not in the “auto” position. Local-remote selector switches should be monitored (such as for standby generators) and should generate an alarm if not in the “remote” position.
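The readiness checks above (HOA switches in “auto,” selector switches in “remote”) amount to comparing observed switch positions against expected ones. A minimal sketch, assuming hypothetical point names and a simple polled snapshot of switch states:

```python
# Illustrative readiness-switch check. Point names and expected
# positions are assumptions for this sketch, not a real point list.

EXPECTED_POSITIONS = {
    "chw_pump_2_hoa": "auto",          # standby pump must be in Auto
    "generator_1_selector": "remote",  # generator must be in Remote
}

def readiness_alarms(observed):
    """Return an alarm message for each switch not in its expected position.

    `observed` maps point names to their current switch positions as
    reported by the monitoring system.
    """
    return [f"{point} is '{observed.get(point)}', expected '{expected}'"
            for point, expected in EXPECTED_POSITIONS.items()
            if observed.get(point) != expected]
```

An empty result means all standby equipment is ready to respond automatically; any entry should raise an alarm at the front-end.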


An absolute necessity for critical monitoring systems is the ability to transmit alarm and anomaly notifications remotely. In the most basic scenario, the notification is in the form of a visual and audible annunciation (horn and strobe, etc.) at the “front-end.” This may be sufficient provided the front-end is monitored continuously.

In most cases, critical monitoring systems have the capability of transmitting notifications electronically to smart-phones, pagers, palm-devices, etc., via SMS texts, emails, or other electronic media so multiple staff members can be notified concurrently including off-site personnel. This becomes a fundamental requirement for sites that are not staffed 24/7. Many sites also have remote monitoring stations located in the IT network operations center (NOC) and the security command center (SCC) for added protection with standing policies that alarms that remain unacknowledged past a set period are manually escalated. Regardless of the alarm notification protocol, the system should include periodic test “alarms.”

Ideally, each critical monitoring and control panel should periodically generate a virtual change-of-state for a “test” point and have that alarm communicated back to the front-end and transmitted by text, page, email, or other means to the assigned recipients. Shift staff should receive a test page at least once per shift. This ensures the system network, transmitter, and personnel devices (pagers, smartphones, etc.) are all operational and connected.
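The per-shift test-alarm heartbeat can be sketched as a simple scheduled round trip. The eight-hour interval, message text, and the `transmit` callback are all assumptions for illustration; in practice the transmitter would be the site’s actual paging or messaging gateway.

```python
# Hedged sketch of a per-shift test-alarm heartbeat. Interval and
# message text are illustrative assumptions.

from datetime import datetime, timedelta

TEST_INTERVAL = timedelta(hours=8)  # at least once per shift

def heartbeat_due(last_test_time, now):
    """True when the next virtual change-of-state test should fire."""
    return now - last_test_time >= TEST_INTERVAL

def run_heartbeat(transmit):
    """Push a test alarm through the notification path.

    `transmit` stands in for the real page/text/email transmitter; if
    it raises, the heartbeat has caught a broken link in the chain
    (network, transmitter, or recipient device).
    """
    try:
        transmit("TEST: monitoring heartbeat - please acknowledge")
        return True
    except Exception:
        return False
```

Staff acknowledgment of the test message closes the loop and confirms the personnel devices, not just the transmitter, are working.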

Many critical site monitoring systems handle and display such large quantities of information and data that it becomes a challenge to avoid “information overload” of the operating staff. Well-managed systems incorporate basic conventions, such as flashing red on any display indicating an alarm condition. Anomaly and alarm conditions are categorized by severity and assigned a criticality level. The lowest level (least severe) conditions are reported to the front-end only. The next highest criticality level is routed to the on-site operating engineers. The next highest level is routed to the operating staff as well as supervisors and facilities management. Life-safety related alarms are routed to facilities and security as well as select others (property management, site directors, clients, etc.). Alarms indicating critical operations are impacted or in imminent jeopardy have the highest priority and would be routed to executive management, IT directors, business continuity and disaster recovery staff (including remote backup sites), etc.
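The tiered routing just described is essentially a lookup from criticality level to recipient groups. A minimal sketch, with the level numbering and group names as assumptions mirroring the escalation in the text:

```python
# Illustrative severity-to-recipient routing table. Level numbers and
# group names are assumptions mirroring the tiers described above.

ROUTING = {
    1: ["front_end"],                                  # least severe
    2: ["front_end", "operating_engineers"],
    3: ["front_end", "operating_engineers",
        "supervisors", "facilities_management"],
    4: ["front_end", "operating_engineers", "supervisors",
        "facilities_management", "security",
        "property_management"],                        # life-safety
    5: ["front_end", "operating_engineers", "supervisors",
        "facilities_management", "security",
        "executive_management", "it_directors",
        "business_continuity"],                        # mission impact
}

def recipients(severity):
    """Recipient groups for a given criticality level (1 = lowest).

    Unknown levels fall back to the widest distribution, on the
    principle that over-notifying beats under-notifying.
    """
    return ROUTING.get(severity, ROUTING[max(ROUTING)])
```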

Another best practice is to design in “alarm filtering” protocols that allow the monitoring system to automatically filter out low-level alarms and notifications when anticipated major events occur. An example would be a utility power outage event. When a site loses all utility power, a large number of alarm events occur that are expected and normal, such as chillers, pumps, and cooling towers shutting down, low-flow conditions, breaker operations, standby generators starting, etc. These events can result in dozens or even hundreds of alarms, all of which are routed to the front-end and into a queue to be transmitted out as pages, texts, emails, etc.

Even the FCC alarm display almost instantly receives so many alarms that the screen starts scrolling. In these situations it is crucial that operating staff be able to identify any unexpected conditions (such as a generator failure to start, a UPS on extended battery run, or high secondary chilled water temperature) and know where manual, human intervention is required. Alarm filtering protocols can suppress the expected alarms so operating staff immediately recognize whether the systems are responding normally and what emergency actions are required to avoid mission impacts.
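The outage-filtering idea can be sketched as a set-difference against the alarms expected during a declared event. The alarm names and the single “utility outage” event are illustrative assumptions; a production system would maintain per-event expected-alarm lists.

```python
# Sketch of event-based alarm filtering: during a declared utility
# outage, expected alarms are suppressed from the operator view so
# unexpected ones stand out. Alarm names are illustrative.

EXPECTED_DURING_OUTAGE = {
    "chiller_1_shutdown", "chiller_2_shutdown",
    "chw_pump_low_flow", "utility_breaker_open",
    "generator_1_start", "generator_2_start",
}

def filter_alarms(alarms, outage_active):
    """Return only the alarms requiring operator attention.

    Outside an outage, every alarm passes through unfiltered; during
    one, alarms on the expected list are suppressed from the view
    (they would still be logged and archived).
    """
    if not outage_active:
        return list(alarms)
    return [a for a in alarms if a not in EXPECTED_DURING_OUTAGE]
```

With filtering active, a flood like a breaker opening and generators starting collapses to nothing, while a “generator fails to start” alarm survives and demands attention.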


Monitoring and control systems, like other critical infrastructure, need proper maintenance to remain reliable, deliver optimal performance, and prolong their useful life. Due to the specialized skills and knowledge required to maintain these systems, they are best serviced by a qualified firm with system-specific expertise and critical-facility awareness. The best strategy is to develop a maintenance program and award a service contract as a structured service-level agreement (SLA). The first order of business is to define the qualifications and certifications of not only the service firm, but also the technicians that can be assigned to the account. The exception to this rule is when a site has an in-house monitoring and controls staff (of more than one person) with the requisite skills and knowledge to provide comprehensive maintenance services in-house. Even in such cases, a limited support contract may still be advisable to ensure access to the manufacturer’s spare parts, software updates and patches, and technical support.

The next step is to define the actual tasks that will be performed by in-house staff and by the service firm. Tasks to be performed in house must be matched to the expected level of in-house capabilities and supported by site- and system-specific training and procedures. These could include alarm acknowledgment, setpoint adjustments, developing scheduled activities such as equipment swap-overs to equalize run times, assigning unoccupied times for office spaces, and setting up trends for specific data points. Tasks that are typically outsourced (via the SLA) would be those requiring more technical expertise, such as running diagnostic routines, adding devices, annual calibration of sensors, and updating firmware, software, and virus/malware protections. Other critical tasks include archiving programs and historical data, modifying graphics, adding new points, reviewing transaction logs, and adding, deleting, or modifying access levels for employees.

For sites that lean toward outsourcing the bulk of their monitoring and controls maintenance, it is still advisable to have at least one in-house expert who can manage the contractor and administer the SLA. The monitoring and control systems should be included in the overall facility maintenance program (such as any asset/inventory management system and computerized maintenance management system) and periodic routine tasks should be scheduled and tracked just as other critical system maintenance is.

As with other critical infrastructure, monitoring and control system operations and maintenance tasks should be supported by clear and concise standard operating procedures (SOPs). These should include prerequisites such as backing up programs and software prior to making changes (as a contingency back-out plan), authorizations and approvals required before executing risky procedures, and testing program and software modifications and revisions on a system simulator prior to downloading them onto production systems. The SOPs should also include emergency operating procedures such as how the staff should respond if the front-end fails, if critical field panels fail, if the controls network fails, and, perhaps most importantly, how the site can be operated manually if the critical controls fail or must be taken off-line for replacement or other reasons. Each procedure should be tested and verified, and all appropriate staff should be trained and drilled on the execution of each procedure as applicable.