Purdue's supercomputer is among the most powerful commercial machines in the world.
In Mike Shuey’s line of work, turning off a computer means losing up to 3 million processing hours—a few weeks of work, gone in an instant.
Until Shuey and his team of supercomputing experts at Purdue University found a way to cool down the massive machines when they overheated during the blazing summer months, there were only two options: “You turn on a few fans and hope for the best, or you turn off [the supercomputers] and wait until the temperature stabilizes,” he said.
Shutting down the machines would save the university’s expensive computers—but it could cost researchers weeks or months of work.
Some research conducted with Purdue’s supercomputers requires several months of continuous operation, and pulling the plug would force researchers and scientists to start from square one.
“They have to start all over again where they began weeks or months ago,” said Shuey, Purdue University’s high-performance computer systems manager for 10 years.
But computer code written by Patrick Finnegan, a Unix systems administrator at the university, now lets Purdue IT officials slow the supercomputers by about 30 percent when they reach about 85 degrees Fahrenheit, so researchers don’t lose their work to a complete shutdown and have to start from scratch.
Shuey said Finnegan wrote the code and created the cool-down program in the spring when a nearby utilities plant announced it would undergo maintenance in the summer, and meteorologists predicted higher-than-normal temperatures for the coming months.
Researchers working on the supercomputing systems will never notice a slowdown as Shuey and his team reduce power and bring the computers’ temperatures back to reasonable figures, usually in the low 70s.
“We can coast through the emergency for an hour or two and [then] ramp it up, and nobody’s the wiser,” he said.
The university has had to gear down its supercomputing capacity twice this summer, Shuey said, a decision that, considering the alternative, was a no-brainer.
“It’s very much a risk-free option here,” Shuey said, adding that it takes about five minutes to slow down the computers when they overheat and five minutes to ramp them up when temperatures return to normal. “And sure enough, it has worked like a champ.”
The biggest hurdle was finding a way to slow down 8,000 processors at the same time. With so many processors involved, Shuey said, the chances of something going awry were high. That’s where Finnegan’s code came in.
“It can be a fairly simple thing in practice,” Shuey said. “We’ve just never seen it done at this scale.”
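The article does not describe how Finnegan's code works internally, so the following is only an illustrative sketch of the general idea, not Purdue's implementation. It assumes a simple two-threshold rule: slow down by about 30 percent at 85 degrees Fahrenheit, and return to full speed once the temperature falls back to the low 70s (taken here as 72 degrees, an assumption). The names `ThrottleController` and `simulate` are hypothetical.

```python
from dataclasses import dataclass

THROTTLE_AT_F = 85.0    # from the article: slow down at about 85 degrees Fahrenheit
RESUME_AT_F = 72.0      # assumption: "low 70s" taken as 72 degrees
THROTTLED_SPEED = 0.70  # from the article: run about 30 percent slower


@dataclass
class ThrottleController:
    """Hypothetical controller that decides how fast the machines should run."""

    throttled: bool = False

    def update(self, temp_f: float) -> float:
        """Return the speed fraction (1.0 = full speed) for one temperature reading."""
        if not self.throttled and temp_f >= THROTTLE_AT_F:
            # Too hot: slow down instead of shutting off.
            self.throttled = True
        elif self.throttled and temp_f <= RESUME_AT_F:
            # Cooled back to normal: ramp up again.
            self.throttled = False
        return THROTTLED_SPEED if self.throttled else 1.0


def simulate(readings):
    """Feed a sequence of temperature readings through a fresh controller."""
    controller = ThrottleController()
    return [controller.update(t) for t in readings]


if __name__ == "__main__":
    # Temperature climbs past 85, then falls back into the low 70s.
    print(simulate([78, 84, 86, 80, 73, 71, 75]))
```

Using separate slow-down and ramp-up thresholds keeps the machines from flipping back and forth when the temperature hovers near a single cutoff. The harder part Shuey describes, applying the change to 8,000 processors at once, is outside the scope of this sketch.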