Things That Kill Gear
Protecting existing server rooms is difficult because most legacy server rooms are dissimilar and the threats are multiple. One tried-and-true configuration may not work in another server room. This chapter will show a variety of methods to monitor server rooms that have proved effective in five years of instrumenting server rooms.
The five fundamental threats are:
- Heat, internal and external
- Water
- Fire (Smoke)
- Power Failure
- Intrusion
Heat Created by Equipment
Almost all the electricity consumed by computer equipment is converted to heat. One feature of heat is its desire to quickly distribute itself. With multiple cabinets and dozens of fans, the heat becomes uniform through the room. The room feels hot, but there is no glowing stove to demonstrate the amount of heat entering the room.
If you moved around with a hand-held thermometer, you would see hot spots, some over ten degrees above the average temperature.
Imagine the equipment replaced by floor heaters, the kind found under almost any desk in Minneapolis. A typical heater produces 500 watts and can easily take the edge off a cold cubicle. Now imagine each piece of equipment replaced with a floor heater. A typical PC uses about 300 to 400 watts, not including the monitor. For the sake of easy multiplication, let's say your server room has 30 devices, each consuming 500 watts, for a total of 15,000 watts.
Cabinets full of floor heaters. An illustration to graphically emphasize how much heat is produced.
If that number doesn't seem impressive enough, try visualizing 30 floor heaters in the room. Turn all the heaters on and come back in 30 minutes.
How hot would the room be?
Depending on the room's size and insulation, the temperature could rise 10 degF in less than 30 minutes. In an hour the room could be over 100 degF.
Computer gear converts almost all the electricity into heat. There is no water pumped or logs sawed (work); other than the internal fans and the disk head arms, 90% of the electricity converts to heat.
A heat gain calculation showed that the internal temperature of the room would increase about 15 degF every thirty minutes. The room rapidly becomes an oven. In most legacy server rooms, the air conditioning system has no margin for error; at best, most legacy air conditioners can barely cope with removing the internally generated heat. There is small or no performance margin.
If the air conditioning partially fails, the room will heat up, and the equipment's maximum operating temperature could be exceeded within only a few hours. If the air conditioner fails completely, the room could be in melt-down condition in as little as an hour, ruining the equipment.
Temperature Inside the Equipment Case
At a recent trade show, two Cisco-certified (CCIE) network maintenance technicians commented on the importance of keeping the equipment within operating temperatures. Here are their remarks:
"It's more critical than many IT guys think," said the first technician. "A typical rack-mounted device can run about 20 degrees hotter inside the unit than the outside temperature."
We told them about our meltdown worries.
"Happens a lot. Heat is the big killer. If the gear doesn't fail outright it gets flaky."
"What's 'flaky'?" we asked.
"Some of the integrated circuits become intermittent because of the over-temperature operation. The silicon junctions in the integrated circuits can't dissipate the heat and they become unreliable, even after the gear cools down. You now have unpredictable equipment.
"For example, a dependable server suddenly becomes unavailable. You can't see the site or read a disk file. Somebody gets mad and starts calling you saying the server's down, so you go racing around looking for server problems. Just as suddenly, the server reappears and the complaints cease. You go nuts. This happens every week. You move stuff to different servers, you drive everyone crazy by pinging the gear every minute. You have flaky gear." He obviously spoke from experience.
The second technician added, "If we have a performance guarantee in our maintenance contract, we can insist on replacing our gear if the room temperature goes above 95 degF for more than an hour. Once gear gets flaky, it can drive our costs up and our reputations down."
Removing the Heat
Taking the floor heater example one step further, let's cool the room with a dedicated air conditioner. Note the word dedicated - many legacy server rooms use the building's existing air conditioning system, which adds more complexity because the heat gain in other rooms may change, affecting the air conditioner's performance. A common example of this is when an office copier is added to an office adjacent to the server room. A large copier can produce 1,000 watts of heat when in operation, which the building air conditioner has to remove.
15,000 watts of heat is about 51,000 BTU per hour, which takes a little over four tons of air conditioning to remove. (See calculations in Appendix.) To put this into perspective, a two-ton air conditioner will cool a three-bedroom home to 72 degF in an Arizona summer.
Added to this calculation is the external temperature. Hot walls require air conditioning. A stand-alone building in a Michigan winter has different cooling requirements than a similar building in a Florida summer. To simplify our example, we will assume the outside temperature is 72 degF.
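As a rough cross-check of the sizing arithmetic (not a reproduction of the Appendix calculation), the sketch below converts an equipment heat load in watts to tons of cooling. It uses the standard conversions of about 3.412 BTU per hour per watt and 12,000 BTU per hour per ton of refrigeration. The function names are illustrative, and heat from walls, lights, and people is deliberately ignored, in line with the simplified example.

```python
WATTS_TO_BTU_PER_HOUR = 3.412   # 1 watt is about 3.412 BTU/h
BTU_PER_HOUR_PER_TON = 12_000   # 1 ton of refrigeration = 12,000 BTU/h


def heat_load_watts(device_count: int, watts_per_device: float) -> float:
    """Total electrical load, treated as heat released into the room."""
    return device_count * watts_per_device


def cooling_tons(watts: float) -> float:
    """Tons of air conditioning needed to remove a steady heat load."""
    return watts * WATTS_TO_BTU_PER_HOUR / BTU_PER_HOUR_PER_TON


if __name__ == "__main__":
    load = heat_load_watts(30, 500)   # the 30-device example from the text
    print(f"{load:.0f} W is about {load * WATTS_TO_BTU_PER_HOUR:.0f} BTU/h")
    print(f"cooling needed: about {cooling_tons(load):.1f} tons")
```

A real installation would add the external heat gain on top of this figure and leave some margin, since the text notes that most legacy rooms have little or none.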
Since we only have about an hour before the equipment is damaged, we need to instrument the air conditioner to get the earliest warning. A well-running air conditioner should have a 20 degF difference between the air inlet (suction) and the output (discharge) sides of the evaporator coil. If the filter begins to clog or the refrigerant starts to leak out, knowing the difference between these two temperatures may provide a week's notice that there is a problem. Having temperature graphs is important here.
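The inlet/outlet comparison can be sketched as a small check. The 20 degF expected drop comes from the text; the 5 degF tolerance and the three status labels are assumptions for illustration and should be tuned to the actual unit.

```python
def evaporator_delta_t(inlet_f: float, outlet_f: float) -> float:
    """Temperature drop across the evaporator coil, in degF."""
    return inlet_f - outlet_f


def check_air_conditioner(inlet_f: float, outlet_f: float,
                          expected_drop_f: float = 20.0,
                          tolerance_f: float = 5.0) -> str:
    """Classify the coil temperature drop.

    expected_drop_f is the figure quoted in the text; tolerance_f is an
    assumed value, not a manufacturer specification.
    """
    drop = evaporator_delta_t(inlet_f, outlet_f)
    if drop >= expected_drop_f - tolerance_f:
        return "ok"
    if drop > 0:
        return "degraded"      # still cooling, but less than expected
    return "not cooling"       # outlet is as warm as the inlet, or warmer
```

Graphing the drop over days or weeks is what gives the early notice; a single reading only says how the unit is doing right now.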
The air conditioner removes the room heat and the equipment heat. An air conditioner that is dedicated to the server room is preferred.
We recommend measuring the temperature in four locations. The first location is at the wall thermostat, so we would know what the thermostat was seeing as it controlled the compressor and fan. A locked cover for the thermostat is highly recommended; one accidental brush of a cleaning person's vacuum hose is all it takes to turn off the air conditioner.
The next location is the hottest part of the server cabinet; between two "pizza-box" servers is a good start. One IT manager uses a hand-held infrared thermometer to find the hottest locations, then places sensors in those hot spots. The hottest locations will be the first temperatures to rise quickly if the cooling system fails.
The air conditioner should be instrumented in two ways. First, the inlet temperature should be measured. A common way to do this is to tie-wrap a Remote Temperature Sensor to the air inlet grill in the ceiling. In the same manner, attach a sensor to the air-conditioning output grill. The difference between these two readings reflects the efficiency of the unit. Some IT managers, particularly in the southern states, recommend monitoring the outside (ambient) temperature as well, in order to see what the building was subjected to.
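| Sensor location | What it tells you |
|---|---|
| Wall thermostat | The temperature the thermostat sees as it controls the compressor and fan |
| Server cabinet, hottest spot | The first temperature to rise if the cooling system fails |
| A/C inlet grill | Air temperature entering the air conditioner |
| A/C outlet grill | Air temperature leaving the air conditioner; compare with the inlet |
| Outside (ambient) | What the building is subjected to |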
Since the air conditioner is vital, experienced IT managers know that an emergency monitoring and repair plan is essential. First, place multiple personnel on the alert list. If somebody is on vacation - and someone is always on vacation or sick when alarms come in - the backup personnel will get the alarm.
Every IT manager who has lived through an air conditioning failure agrees that a service contract with a reliable A/C repair company with backup personnel must be established. The repair service must have 24 hour service. One IT manager even goes so far as to keep spare parts for his air-conditioning systems in-house.
Many IT managers recommend monthly reviews of temperature logs downloaded into a spreadsheet for analysis. Many things can happen during weekends and holidays, and the logs will show if the room has undergone a spike in temperature.
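For readers who would rather script the monthly review than use a spreadsheet, here is a minimal sketch. It assumes a hypothetical two-column log (ISO timestamp, temperature in degF); the real export format of a given monitor may differ, so the parsing would need adjusting.

```python
import csv
import io
from datetime import datetime


def find_spikes(csv_text: str, limit_f: float):
    """Return (timestamp, temp_f, is_weekend) for readings above limit_f.

    Assumes a hypothetical log layout: ISO timestamp, temperature in degF.
    """
    spikes = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue                      # blank or short line
        try:
            when = datetime.fromisoformat(row[0].strip())
            temp_f = float(row[1])
        except ValueError:
            continue                      # header or malformed line
        if temp_f > limit_f:
            # weekday() is 5 for Saturday and 6 for Sunday
            spikes.append((when, temp_f, when.weekday() >= 5))
    return spikes
```

Flagging weekend spikes separately is a design choice that follows the text: those are the events nobody was in the building to notice.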
Monthly testing of temperature sensors - such as heating them with a hair dryer to deliberately trip the alarm - is also highly recommended.
The logic behind all these recommendations is to get the earliest warning possible. High temperatures in the spaces between the pizza-box servers could indicate a problem; once the air around the wall thermostat or the A/C inlet gets hot, it may already be too late to save your equipment. Remote temperature sensors can have wire runs of hundreds of feet.
One well-placed temperature sensor can deliver an hour's early warning, enough time to get a repair technician on site.
How UPS Helps Cook the Gear
Ironically, a UPS unit can help fry your equipment. While a UPS can keep the server equipment running when main power fails, it doesn't keep the air conditioners going.
While keeping the equipment running during a power failure may make the users happy, it also means that the gear will keep making tons of heat. As we saw before, in as little as one or two hours the equipment could fry itself into scrap.
If main power fails for more than a few minutes, the safest thing for the server room equipment is to shut it down until the air conditioning is restored. The servers should be gracefully shut down: programs exited, files closed, and the machines powered off. While an hour or two without service may frustrate your users, just imagine how much worse it will be if your system is off-line for days or weeks because you had to rebuild the entire server room after the equipment cooks itself!
Hot Spots, Temperature Variations
When the thermostat on the wall reads 72 degF, the natural tendency of most people is to assume this means that the room is 72 degF everywhere. But of course, this simply isn't the case; temperatures can vary by 10 degrees or more in a room, and variations of 20 degF inside server cabinets are common.
One WeatherGoose user attempted to check the accuracy of his remote temperature sensors by arranging them on an eight-foot-long workbench located against one end of his server room, spaced equally apart across the bench. He discovered, to his surprise, that one end of the workbench was 8 degrees hotter than the other end! At first, the user suspected faulty sensors, but by placing an industrial digital thermometer at each end of the bench, the user verified the temperature difference was real.
A combination temperature, humidity, and air flow sensor with mounting clip.
To minimize the "workbench" effect he tie-wrapped all six temperature sensors together so that they would all be reading temperatures in the same spot on the bench. Once tied together, all the temperature sensors read within half a degree F of each other.
Another IT manager, who built a 50' x 50' server room (large by our standards, small by data center standards), related her experience:
"We have hot spots in our data center. I can show you a 10' wall section 15 degF higher than the rest of the room. I'm putting in an additional air conditioning duct to hit that spot," she commented.
These experiences illustrate the importance of monitoring your server room's temperature with several sensors at multiple locations - otherwise, equipment on one side of the room might be running hot while equipment on the other side is well within normal limits, and you might never know until it's too late.
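One simple way to act on readings from several sensors is to compare each one with the room average. The sketch below is an illustration only; the location names and the 8 degF margin are assumptions, not recommended settings.

```python
def find_hot_spots(readings_f, margin_f=8.0):
    """Return locations reading more than margin_f above the room average.

    readings_f maps a sensor location name to its temperature in degF.
    """
    if not readings_f:
        return []
    average = sum(readings_f.values()) / len(readings_f)
    return sorted(name for name, temp in readings_f.items()
                  if temp - average > margin_f)


readings = {
    "wall thermostat": 72.0,   # hypothetical sensor names and values
    "cabinet A top": 85.0,
    "A/C inlet": 74.0,
    "workbench": 73.0,
}
print(find_hot_spots(readings))
```

With only a wall thermostat in the room, the cabinet in this example would go unnoticed, which is the point the experiences above make.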
Using A City Power Monitor: Is the Power On?
Many users place a UPS between the city power source and their server equipment, so that the UPS serves as a power conditioner as well as providing backup power.
The downside of this configuration is that an off-site system administrator may not know his system is running on batteries until the batteries are exhausted and someone calls to ask why his e-mail is down. On top of that, the equipment will be running with no air conditioning, and we have already seen how much damage that could cause if the situation isn't dealt with promptly.
The City Power Monitor emits a 5 VDC signal when power is present. When power fails, an email alarm is sent, which gains extra time.
Or perhaps you're not actually losing power for long periods of time, but simply experiencing brief blackouts lasting a minute or less. (Faulty breakers, bad wiring, unreliable power from your local substation - who knows?) If the building was unoccupied, who would know the power had ever gone out at all?
ITW offers a small accessory called the City Power Monitor for just these kinds of scenarios. Even if the power is only lost for as little as a second or two, the CPM will cause the WeatherGoose to log the event and send out an alert to the system administrator.
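To show how brief outages can be pulled out of a logged power-present signal, here is a minimal sketch. It assumes the signal is available as time-ordered (seconds, volts) samples and treats anything below 2.5 V as "power absent"; both the sample format and the threshold are assumptions, not details of the City Power Monitor.

```python
def find_outages(samples, threshold_v=2.5):
    """Return (start, end) times for each stretch where power was absent.

    samples: time-ordered list of (time_seconds, volts). An outage still
    in progress at the last sample is reported with end=None.
    """
    outages = []
    start = None
    for t, volts in samples:
        powered = volts >= threshold_v
        if not powered and start is None:
            start = t                      # power just dropped
        elif powered and start is not None:
            outages.append((start, t))     # power came back
            start = None
    if start is not None:
        outages.append((start, None))      # still out at end of log
    return outages
```

An outage shorter than the sampling interval would be missed by this approach, so the sampling rate sets the shortest blackout that can be detected.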
Backup Air Conditioners
From a peace-of-mind perspective, the most secure method of preventing a meltdown is to have a backup air-conditioning system that is completely separate from the primary system.
Whether a stand-alone portable unit or a duplicate in-ceiling machine, this is a practical way to keep equipment in operation while the primary air conditioner is being repaired.
One user reported that he had a contractor install a backup cooling system built from used components for $5,000, including installation.
We mentioned this to a local IT manager and she said, "A back-up air-conditioning system, which I still don't have, is next on my budget list. By the way, you're right about the UPS helping cook your gear. I hadn't thought about that until you mentioned it."
Graphing Key Elements
After watching dozens of server rooms, we came to the conclusion that any variable, such as temperature and humidity, tells a more valuable story with the information graphed. It's not the absolute value, however accurate, that matters most; it's the trends in those measurements that tell the story of what's happening in your server room.
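A trend can also be put into a number. The sketch below fits a least-squares line through (hours, value) readings and returns the slope, so a slow climb shows up even while every individual reading is still inside its alarm limits. The function name and the sample readings are illustrative.

```python
def trend_per_hour(readings):
    """Least-squares slope of (hours, value) pairs, in units per hour."""
    n = len(readings)
    if n < 2:
        raise ValueError("need at least two readings")
    mean_t = sum(t for t, _ in readings) / n
    mean_v = sum(v for _, v in readings) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in readings)
    den = sum((t - mean_t) ** 2 for t, _ in readings)
    if den == 0:
        raise ValueError("readings must span more than one point in time")
    return num / den


# Hypothetical hourly temperature readings in degF
hourly = [(0, 72.0), (1, 72.5), (2, 73.0), (3, 73.5)]
print(f"{trend_per_hour(hourly):+.2f} degF per hour")
```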
Water is Everywhere - Floor, Ceiling and the Walls
One IT manager we spoke to had survived an accidental discharge of a fire-protection sprinkler. He tells about his experience with water:
"The sprinkler just turned itself on. Don't know why. A software guy walked in right as the sprinkler started spraying down on a line of five server cabinets. A maintenance guy found a big Styrofoam food cooler in a closet and climbed up on a chair to place it directly under the sprinkler head. He wrapped his sweater around the sprinkler head to help catch the water.
"It took ten minutes to figure out how to shut off the sprinkler water. Before they shut it down, three developers emptied the food chest eight times into trash cans. Could have been a disaster.
The circles show the range of the overhead sprinklers. Note the potential water sources in the wall and under the floor.
"We started counting water pipes around the server room. Water pipes were in the ceiling, the walls and under the raised floor.
"A slow flood could take an hour maybe but the sprinkler eruption would be catastrophic. What we needed was a permanent version of the Styrofoam cooler to catch sprinkler leaks and other overhead leaks. The illustration shows the sheet metal tray we built to collect the ceiling water.
"The metal tray was simpler to build than we had imagined. A local air conditioning sheet metal shop made two trays with hanger brackets and half-inch drains for less than $150. We punched pencil sized holes in the acoustic tile ceiling and hung the catch trays over the cabinets. The drain was a real problem. The maintenance guy for the building helped us route a garden hose to a sink drain on the first floor."
A 4' x 3' metal tray suspended over the server cabinets catches sprinkler water and drains it away, providing low-cost insurance against sprinkler or other water leak damage.
Water Sensors
Most water sensors measure the difference between the conductivity of air and the conductivity of water. ITW water sensors have a low voltage applied to some metal brads on a plastic case.
When the water touches the brads, it completes the circuit and the current begins to flow. If the metal brads are touching a surface such as a concrete floor, the graphs will show a decrease from reading 99 (dry, no current flow) to 80 (damp) to 55 (full conductivity).
A water sensor turned upside-down to show the metal water-sensing brads.
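The example readings above (99 dry, 80 damp, 55 fully conductive) suggest a simple three-way classification. The cut-off values in this sketch are assumptions placed roughly midway between those example readings; they are not published thresholds and should be checked against a real sensor with the dunk and wet-napkin tests.

```python
DRY_CUTOFF = 90    # assumed: roughly midway between 99 (dry) and 80 (damp)
DAMP_CUTOFF = 68   # assumed: roughly midway between 80 (damp) and 55 (wet)


def classify_water_reading(reading: float) -> str:
    """Map a water-sensor reading to 'dry', 'damp', or 'wet'.

    Higher readings mean less current is flowing between the brads.
    """
    if reading >= DRY_CUTOFF:
        return "dry"
    if reading >= DAMP_CUTOFF:
        return "damp"
    return "wet"
```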
The water sensors plug into the C123C analog-sensor ports on WeatherDucks and WeatherGeese. Up to three water sensors can be connected and monitored individually. More can be used by wiring multiple sensors in parallel; the only disadvantage of parallel-wired sensors is that you can't tell which sensor is wet, only that one of them is wet.
Make sure that the surface that the sensor is placed upon is non-conductive. Since the sensor detects water by electrical conductivity between the metal brads, if those brads are in contact with a conductive surface - such as a metal tray - the sensor will always show full conductivity, wet or dry.
A water sensor mounted on a piece of vinyl for insulation. Note that the metal water sensors are face-down.
Installation rules for water sensors:
- The sensor must face down - metal brads against the floor.
- Placement locations:
  - Water collection trays such as the trays below sprinklers. Don't forget to insulate the sensors with a piece of vinyl floor tile.
  - Lowest point on the floor or below raised flooring. Find the lowest place on your floor. Spill some water and see where it puddles in your server room. That's a perfect place for a water sensor.
  - Below a water pipe junction where the chances of a pipe leaking are good.
  - In back-up air-conditioning condensation trays. If the primary tray's drain clogs up (very common because of algae growth) and the water flows into the back-up tray, once the back-up tray is full the next path for the water may be right onto your server cabinets.
- Place a heavy weight or clamp on the sensor so it cannot move. We have seen some sensor installations where the connecting wire had raised the sensor to 3" off the surface. The water would have had to reach 3" in depth before the sensor would signal the alarm.
- Test the sensor. Dunk it in a glass of water and see if you get an alarm. Wipe it off, then place it on a wet napkin and note the reading.
- Make sure the sensor is installed on an insulated surface. If the sensor must be placed on a metal surface, such as inside an air-conditioner's condensation tray, place a vinyl floor tile under the sensor, in between the sensor and the galvanized metal.
Routine testing of water sensors is essential. Unlike temperature, which can show revealing trends over time, water is likely to be an all-or-nothing event. Periodically dunking water sensors in a cup will confirm the sensors are operational.
Correct orientation of the sensors is also essential. We were looking at another installation in a data center. I asked the tour guide to see the water sensors, and he lifted a floor tile to show me a sensor he had installed near a floor drain point. The sensor had been installed upside down. Be sure the metal sensor brads are against the floor. Nylon tie-wraps are a good way to restrain a water sensor.
Testing the Alarms - Do They Work?
We asked a customer if they had ever received alarms from their monitoring equipment.
"No," was the answer, "we have good infrastructure."
This is not a good answer. In a collection of two dozen sensors of mixed types, it is highly unlikely that you would never see a single out-of-bounds condition, even if it was just a false alarm. If you go for weeks at a time without a single alarm being raised, your alarm set-points may be set too generously, or your sensors may not be located in the best spots to give you early warning of potential problems.
Try setting your alarm limits closer to what you think the "normal" operating environment should be, and consider surveying your installation with a hand-held infrared thermometer, to see if there are any hot spots you might be missing.
Smoke Alarms
Most buildings have existing smoke alarms. The problem is that when they sense smoke there may be no one around to hear them. That problem is solved with an ITW smoke alarm, which interfaces directly to the climate monitors.
A complete smoke alarm kit, ready to go. Plug the cord into the wall and the control wires into the climate monitor.
Many smoke alarms have a third wire that enables one smoke alarm to set off other alarms, such as in a hallway or long building. ITW uses this extra wire to relay an alarm to a WeatherGoose. The smoke alarm still operates in the normal mode, but now an e-mail or page can be sent in addition to the loud buzzer alarm.
Door Sensors
These tiny sensors, long used in the security industry, have two parts, a magnet and a magnetically-activated switch. If the magnet gets close enough to the switch, usually within an inch or less, the switch closes or opens, depending on which type you select.
These sensors offer a low-cost option to monitor whether the doors to the server room or equipment cabinets are opened or closed. Server and UPS cabinets are excellent candidates for this kind of monitoring, since under normal conditions no one should be opening those cabinets without the IT administrators' knowledge.
These little switches can tell a door's position. They are the same kind used in building security systems.
Depending on your server room's location and security requirements, you might also place sensors on any doorways allowing access to the room - although this option should be considered carefully; if personnel frequently come and go during the day, the steady stream of alarm mail will quickly become an irritant. However, such alarms can be valuable for weekend monitoring; if a door sensor goes off on a Sunday, when the building is supposedly unoccupied, something is wrong!
Note that door sensors will take some carpentry skill and various hand tools to install correctly.
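One way to keep room-door alarms from becoming an irritant is to alert only outside normal working hours. The sketch below illustrates that policy; the 7:00 to 19:00 weekday window is an assumption, and the filtering would live in whatever script or mail rule processes the alerts, not in the sensor itself.

```python
from datetime import datetime


def door_alert_needed(opened_at: datetime,
                      workday_start: int = 7,
                      workday_end: int = 19) -> bool:
    """Return True if a door-open event should raise an alert.

    Weekends always alert; weekdays alert only outside the assumed
    working hours (workday_start <= hour < workday_end).
    """
    if opened_at.weekday() >= 5:          # Saturday or Sunday
        return True
    return not (workday_start <= opened_at.hour < workday_end)
```

Cabinet doors are a different case: as noted above, nobody should open them without the administrator's knowledge, so those events are probably worth alerting on at any hour.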
Light Level
The WeatherGoose has an internal light-level sensor, which can show you at a glance if the room lights are on or off. Reading the graphs of the light conditions can show you when, or if, the room is occupied at various times of the day, or whether someone may have been working in that room over the weekend.
Some innovative users have run fiber-optic light pipes from alarm lights on the buildings' alarm-control panels into the WeatherGoose's light sensor, to make the WeatherGoose send them an e-mail alert if the building systems went into alarm state.
Sound
An internal microphone measures the sound levels every five seconds and stores the peak value. (It does not actually record the room sounds; it measures and remembers the highest audio peak values, sampling every five seconds.) The purpose of the sound-level reading is to tell if any general change in the noise of the server room has occurred, such as fans shutting off, audio alarms sounding, or personnel being in the room when they shouldn't be.
This function can be useful in many ways. Several users reported a substantial increase in sound levels when UPS units went into alarm mode and sounded an alarm horn, for example. Another user reported that you could detect bad bearings in cooling fans; the low frequency rumble of the failing bearings is easily seen as an increase in noise on the graphs.
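The peak-hold idea can be illustrated in a few lines. How the WeatherGoose implements it internally is not described here, so treat this as a generic sketch: it keeps only the largest absolute sample from each fixed-size interval and discards the rest, which is why no audio is actually recorded.

```python
def peak_per_interval(samples, samples_per_interval):
    """Keep only the largest absolute sample from each interval."""
    if samples_per_interval <= 0:
        raise ValueError("samples_per_interval must be positive")
    peaks = []
    for i in range(0, len(samples), samples_per_interval):
        window = samples[i:i + samples_per_interval]
        peaks.append(max(abs(s) for s in window))
    return peaks
```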
Power - How to Keep the Juice On
The nightmare of power strips stems from the internal circuit breakers in the power strips. An electrician explained there are two types of circuit breakers in power strips, thermal and magnetic.
In most installations, web or e-mail servers are expected to turn on and stay running. If the power consumed exceeds the limit of the circuit breaker, the entire computer system will be abruptly disabled. There is no advance warning of a breaker tripping; one day all the equipment plugged into the affected power strip goes off. The IT manager will wonder whether the normal operating load (amps) was exceeded or the circuit breaker became defective.
Suppose a power strip is rated at 15 amps. If the strip uses a thermal breaker, and the total load on the power strip held steady at 16 amps, the thermal breaker would eventually trip and turn the entire strip off. But if the strip uses a magnetic breaker, and a power surge (higher voltage) occurred which briefly raised the current above 15 amps, the faster-acting magnetic circuit breaker would trip even if the normal load of the equipment connected to the strip was considerably less.
Many IT managers create a policy not to load a power strip beyond 60% of its rated capacity. A 20 amp power strip would be restricted to 12 amps. The hard part is learning how much current is going into the strip.
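The 60% rule is easy to turn into a quick check once the device currents are known. In this sketch the function names and the per-device amp figures are illustrative; the 60% fraction is the policy figure from the text.

```python
def max_planned_load_amps(rated_amps: float, fraction: float = 0.60) -> float:
    """Largest load the policy allows on a strip of the given rating."""
    return rated_amps * fraction


def headroom_amps(rated_amps: float, device_amps, fraction: float = 0.60) -> float:
    """Amps still available under the policy; negative means over the limit."""
    return max_planned_load_amps(rated_amps, fraction) - sum(device_amps)


# Hypothetical measured draws for three devices on a 20 amp strip
print(f"{headroom_amps(20, [3.5, 4.0, 2.5]):.1f} amps of headroom")
```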
An electrician can measure the draw of each power strip by powering down the existing power strip and inserting a break-out box, or by using a hand-held clamp-on ammeter. But the usefulness of these measurements is usually short-lived, because equipment tends to migrate from cabinet to cabinet and new equipment is always being added. If the breaker trips, a trip to the server room is needed to reset the breakers, and the IT manager has to determine why the breakers tripped. Usually, this is not easy.
There are two ways to monitor current in existing power systems:
- Current Transformers
- In-line Current Meters
Current Transformers
A current transformer (CT) surrounds a single current-carrying wire and converts the current into a 0-5 VDC signal suitable for input to the WeatherDuck or WeatherGoose C123C analog-sensor jacks. The CT supplied by ITW can be selected for 30, 60, or 120 amp scales with a slide switch.
The most common place to mount a CT is around one of the wires entering or leaving the breaker box. It cannot be used around power cords unless a single power conductor is extracted from the cable.
Current transformers clamp around a single power-carrying wire and convert the current into a 0-5 VDC signal suitable for the C123C port of a WeatherDuck or WeatherGoose. This model has swing-open jaws to clamp around an existing cable.
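Turning the CT's 0-5 VDC output back into amps depends on the selected scale. The sketch below assumes the output is linear, with 5 V corresponding to the full selected scale; that linearity is an assumption for illustration and should be confirmed against the sensor's documentation.

```python
CT_SCALES_AMPS = (30, 60, 120)   # slide-switch ranges named in the text
FULL_SCALE_VOLTS = 5.0           # top of the 0-5 VDC output range


def ct_volts_to_amps(volts: float, scale_amps: int) -> float:
    """Convert a CT output voltage to amps, assuming a linear response."""
    if scale_amps not in CT_SCALES_AMPS:
        raise ValueError(f"scale must be one of {CT_SCALES_AMPS}")
    if not 0.0 <= volts <= FULL_SCALE_VOLTS:
        raise ValueError("voltage outside the 0-5 V output range")
    return volts / FULL_SCALE_VOLTS * scale_amps
```

Choosing the smallest scale that still covers the expected peak current gives the best resolution from the same 0-5 V range.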
In-line Power Meters
A PowerEgg inserts between a power source, typically the wall power receptacles, and an existing power strip. Internal circuitry measures amps, volts, and watts along with a number of other variables such as power factor. The units come with a number of power plugs and receptacles, with a maximum load of 20 amps.
A back-lit LCD continuously displays the current measurements, and an RJ-11 connector permits the unit to send its data to a WeatherDuck or WeatherGoose for logging, trend graphing, and alarming.
A simple way to monitor existing power strips is to insert a PowerEgg2 between the power source and an existing power strip. Both receptacles can be controlled via the web.
Three Current Transformers monitor the amperage in the three-phase power source. Three PowerEggs monitor the current going to individual power strips. The WeatherGoose web-enables the data.
Dual Power Strips
Many devices, such as Cisco routers, have A and B power inputs. If you haven't used equipment with two power inputs, the utility seems vague until you need to power down an entire power strip to add or remove equipment. With dual power-inputs, the equipment automatically operates off of whichever input is "hot", switching between them without interruption. If the equipment you plan to install in your server room is available with this feature, we highly recommend it.
PowerGoose
PowerGoose, a WeatherGoose with internal power measurement, is an excellent alternative to current transformers and PowerEggs since the power measurement capability and ten 15-amp receptacles are included in one convenient rack-mounted box.
Power Trends Seen in the Graphs
Measuring and graphing your power will sometimes reveal surprising variations in voltage. Normal line voltage in the U.S. is supposed to be 120VAC - and as long as the lights are on and everything is running, it rarely occurs to us that it might be anything else.
But when you monitor and graph your power over a period of days, you just might be surprised at what you find! One customer discovered, after installing his PowerEggs, that his incoming line voltage sagged at least twice a week to about 100 volts - almost brown-out condition. Once again, graphs tell the story.
The Cisco installer technicians we spoke to said their gear starts shutting down automatically at about 105 volts. They recommend running all critical gear off the UPS units, which are far more forgiving of voltage changes and produce a steady voltage output.
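Sags like the ones described above can be picked out of a voltage log with a simple filter. The sketch uses the 105 volt figure quoted by the technicians as the default limit and 120 volts as nominal; the log layout (a label and a voltage per entry) is an assumption for illustration.

```python
def find_sags(voltage_log, sag_limit_v=105.0, nominal_v=120.0):
    """Return (label, volts, percent_below_nominal) for each sag.

    voltage_log: iterable of (label, volts) entries, where label is any
    timestamp or description the log provides.
    """
    sags = []
    for label, volts in voltage_log:
        if volts <= sag_limit_v:
            percent_low = (nominal_v - volts) / nominal_v * 100
            sags.append((label, volts, round(percent_low, 1)))
    return sags
```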
We recommend all server room power be on dedicated circuits; that is, the wires should run from the power transformer on the power pole outside the building directly to the computer equipment inside, and the server room should have its own breaker box independent from the rest of the building circuits. Many facility managers oppose this due to the expense, but the benefits can easily outweigh the additional expense.
Otherwise, your server room and all of your valuable data could be at the mercy of anyone who might plug in a copy machine, a floor polisher, or even an arc-welder without realizing the wall socket they used is on the same circuit as the server room.
Video Cameras
The features of low-cost video cameras are remarkable. We recommend every server room use a video camera. The addition of a video image to server room monitoring gives an added dimension. If an alarm is sent, a quick look at a Web browser tells you if the lights are on and who is in the room, something you would otherwise have found out about only by driving over to the room.
We recommend and support the D-Link DCS-950. The camera produces superb pictures, and is supported by the ITW climate monitor and Console software.
The D-Link 950 provides excellent still and motion pictures plus motion detection with e-mail alarms. The threshold of detection can easily be adjusted.
The D-Link camera also offers motion detection with e-mail alerts, making it very useful; if something moves, you'll soon know what or who it was.
Once installed, you can get an e-mail if the camera detects motion in the server room. The DCS-950 can detect motion in any of three preset fields. This feature was previously available only in cameras costing over a thousand dollars.
These cameras are well worth their price if they save even one trip to visit a remote server room.
One low-cost webcam can monitor three zones in a server room. The threshold of detection can easily be adjusted. If the camera detects motion an email alert is generated.
The Well-Monitored Server Room
Whether you have two server rooms to manage, or two hundred, it's nice knowing you have a good chance of coming out of a climate failure with minimum damage. Remember to test the sensors.
A fully monitored room: temperature, humidity, air flow, doors, and power are monitored. The video camera is not shown.
Written by the staff of IT WatchDogs, who have helped users install thousands of WeatherGoose and SuperGoose server room monitors.