At Oregon State University’s (OSU’s) College of Engineering, interest in graphics processing unit (GPU) computing has been growing steadily. Students, staff, and faculty agreed that the university should fully embrace the emerging trend.
“Artificial intelligence [AI], machine learning [ML], parallel programming — those are all really hot items right now,” said Todd Shechter, director of information technology at the Oregon State University College of Engineering. “We are seeing a lot of interest in the undergraduate curriculum space.”
As the College of Engineering developed a strategy for providing high-power GPU computing capabilities, it worked with industry leaders NVIDIA® and Microway. The resulting investment in six new supercomputers has already dramatically improved the school’s educational and research capabilities, helping the ML and AI group grow into an increasingly important presence.
The College of Engineering makes up about a third of OSU, with about 10,000 students, staff, and faculty. At the outset of the decision-making process, faculty and administrators gathered to define the characteristics of an ideal campuswide computing resource. The university required enough GPU capacity to serve the diverse needs of undergraduate classes as well as research workloads, plus super-fast storage. The solution had to scale, yet it also had to represent a dramatic leap in capability.
The team opted for the NVIDIA DGX-2™ enterprise AI research system, partly because of the Docker images carrying NVIDIA’s containerized software and partly because of technical support considerations. Mostly, though, the decision came down to what the team considered unmatched computational horsepower: each appliance delivers up to 2 petaFLOPS of AI performance, an entire supercomputer’s worth of compute.
Once OSU identified the DGX-2 as the right fit, the team had to determine how many would be needed. The university hosted workshops with faculty, administrators, and NVIDIA to learn how researchers planned to use the new technology in areas such as medical imaging, nuclear research, bridge construction, robotics, and driverless vehicles.
“What we learned is there is a lot of interest in GPU computing, but we had a hardware gap,” Shechter said.
OSU was not entirely without GPU capabilities before the upgrade, but its single-precision consumer-model GPUs were built for gaming and lacked both double-precision compute capabilities matched to critical scientific applications and an effective way to stitch GPUs and systems together.
Scattered across campus and lacking any efficient interconnect, the preexisting GPU infrastructure was simply unwieldy. There was no way to harness the true aggregate power of the available systems.
All that would change. In addition to adding raw hardware to OSU’s GPU resources, the new DGX-2 systems would solve common scaling predicaments by leveraging NVSwitch scalable architecture and NVLink™ communication for high-speed GPU-to-GPU interconnects inside the appliances.
Each DGX-2 packs 16 fully connected Tesla® V100 GPUs stitched together by these technologies. The result is a single system with capability equivalent to dozens of existing GPU nodes or hundreds of CPU-only servers, enough capacity for many of the individual jobs run on campus.
After factoring in myriad use cases from researchers, enrollment figures for technology-enabled undergraduate courses, and an analysis of the overuse of existing infrastructure, OSU calculated that a major upscale was warranted.
“So that the experience remains authentic and we aren’t trying to cram everyone onto a single machine, we came up with the number six,” Shechter said.
Unlocking Scale for Science
The proposed increase in scale was not only about capacity planning; it was also about unlocking new possibilities. Researchers across the community had asked to dramatically scale up their computational science.
That is where clustering of the DGX-2 systems came into play. Newly installed Mellanox InfiniBand would bridge the DGX-2s, enabling jobs that span multiple supercomputers. Students and scientists would be able to run larger experiments than ever before.
With the ability to link these systems, OSU purchased not six functionally separate supercomputers but a battery of computational muscle that can work as a cohesive unit. The cluster of six InfiniBand-connected DGX-2s gave OSU a linked network of 96 of the world’s most powerful computational engines, all of which can be put to work on a single problem for unmatched performance.
The improved fabric stitched the computing infrastructure together far more completely, with higher bandwidth and lower latency than any of OSU’s past computational resources. Larger datasets, more experimental runs, and higher-precision, higher-accuracy simulations all became feasible across the campus community.
Deciding on a solution and getting it up and running are two separate and distinct achievements. OSU’s IT professionals had a number of questions regarding installation and setup of the new systems, particularly about power. What type of power delivery should be used? How could they make sure the amperage, voltage, wattage, and other details matched the systems’ requirements?
These were not trivial questions. Each DGX-2 consumes about 10 kW of power, as much as eight homes or three to four traditional server racks full of systems. While extremely efficient for the throughput delivered, the DGX-2s would require very careful installation planning.
Microway Inc., an NVIDIA Partner Network HPC Partner of the Year, installed the DGX-2 deployment at OSU.
“Microway was really great at helping us through the nitty-gritty details,” Shechter said.
The teams collaborated to get each of these details right: higher-amperage power was run to the racks, installation paths were cleared, and space was made available at the bottom of the racks for the DGX-2 equipment.
The company worked on-site to rack the systems, integrate the cluster software, properly burn in the cluster, and transfer knowledge about the systems to OSU’s IT staff.
“The goal was that when they left, we’d have a good understanding of how it all needs to work together,” Shechter said. “And we got there. The experience has been very, very positive.”
Supercomputing as a Tool
For OSU, the new DGX-2s optimize functionality and flexibility. They handle single-precision and double-precision workloads and crunch data in essentially any form. They are tied together in a cohesive computing unit when large jobs are required, yet they are also easily partitioned for smaller projects.
“We have a hugely broad set of users who will be making use of the investment, and the DGX-2 was the best solution to take care of all of their needs,” Shechter said.
The systems serve electrical engineers and computer scientists as well as mechanical engineers, civil engineers, nuclear engineers, biologists, chemists, and others.
Importantly, OSU’s six DGX-2 systems will not be reserved for researchers, graduate students, and faculty. The world-class computational offerings will also serve as a pedagogical tool for undergraduates interested in experimenting with the most current technologies.
“When we teach an undergraduate class in parallel programming, machine learning, or artificial intelligence, we have the processing power to back up what we teach,” Shechter said.
These are not idle words. Undergraduate education at Oregon State imparts knowledge and skills that result in great things. The co-founder of NVIDIA, Jensen Huang, graduated from Oregon State University with an undergraduate degree in electrical engineering in 1984. For the next generation of technologists from OSU, hands-on experience with advanced equipment from his company will play a key role in their introduction to the future of computing.
The DGX-2s are already changing the experiences of the school’s researchers and students. The ML and AI group continues to grow into an increasingly important presence on campus.
The enhanced scale at which researchers can model has proved highly beneficial. “I’ve had faculty members who have run their simulations on existing hardware and then run it on DGX hardware, and the difference just blows your mind away,” Shechter said. “We’re really hoping that this helps our faculty members produce results that they can then share broadly in their communities.”
From the student perspective, the resource provides opportunities that extend beyond the classroom. A student-led team developing a driverless electric car uses the DGX-2s for many of the simulations the young engineers must run to hone their designs and code. The student project had previously worked only on standard combustion-engine vehicles, but the new computational capabilities have enabled the team to tackle the future of mobility.
Collaborating With Experts
OSU’s investment in six new supercomputers is meant to strengthen its educational community.
“We want to attract the very best to our campus, whether student, staff, or faculty, and we think the DGX-2 is going to go a long way to show how serious we are about that,” Shechter said.
However, integrating the new technology into the institution’s research and educational workflows would have been impossible without expert assistance from experienced vendors.
“Enabling barrier-breaking cluster deployment is Microway’s prized skill,” said Ann Fried, CEO, Microway. “We are proud to have been selected to deliver the product and services for such an important cluster and to usher in the future of computing at Oregon State University’s College of Engineering.”
Microway’s ability to train OSU’s IT professionals to operate the supercomputers will play a key role in Oregon State’s future success, both in scientific publications and on the road with driverless vehicles.
“I hope there are additional opportunities to work with Microway,” Shechter said. “When you use the term value-added reseller, they truly do add value to the relationship.”