Born out of a groundbreaking regional high-performance computing (HPC) project, the Northeast Storage Exchange (NESE) aims to break further ground — to create a long-term, growing, self-sustaining data storage facility serving both regional researchers and national- and international-scale science and engineering projects.

To achieve these goals, a diverse team has built a regional technological achievement: New England’s largest data lake.

The story of creating this data lake is a lesson in cross-organizational collaboration, the growth of oceans of research data, changes in storage technology, and even vendor management.

Finding the right technology — hardware, firmware, and software — for such a large-scale project is challenging. Now that the project has launched, though, both the NESE team and industry partners like Microway are confident in their capacity to meet growing research computing data storage demands in a way that facilitates end-user buy-in and unprecedented collaboration.

The Beginnings of MGHPCC and NESE

The Massachusetts Green High Performance Computing Center (MGHPCC) brings together the major research computing deployments from five Boston-area universities into a single, massive data center in Holyoke, Massachusetts.

The 15-MW, 780-rack data center is built to be an energy- and space-efficient hub of research computing, with a single computing floor shared by thousands of researchers from Boston University, Harvard University, Massachusetts Institute of Technology, Northeastern University, and the entire University of Massachusetts system. Because the data center is powered by hydroelectric and nuclear sources, it leaves nearly no carbon footprint. By joining together in the Holyoke site, all of the member institutions gain the benefits of lower space and energy costs, as well as the significant intangible benefits of simplified collaboration across research teams and institutions.

As of 2018, the facility was more than two-thirds full, at 330,000 computing cores total. The facility currently holds the main research computing facilities for the five founding universities, as well as those of teams of national and international collaborative data science researchers.

Naturally, an innovative research computing project like MGHPCC would require an equally innovative corresponding data storage solution. Enter NESE, supported by the National Science Foundation. The institutions involved are Boston University, Harvard University, MGHPCC, Massachusetts Institute of Technology, Northeastern University, and the entire University of Massachusetts system. Scott Yockel and his 25-person team at Harvard’s Faculty of Arts and Sciences Research Computing, which includes a dedicated NESE storage engineer, lead development, deployment, and operations of NESE for the whole collaboration. NESE is already New England’s largest data lake, with over 20 PB of storage capacity and rapid growth both planned and projected.

An Innovative Data Architecture

NESE doesn’t rely on traditional storage design. Its architects have instead chosen Ceph: an open-source, software-defined object storage platform that runs on commodity, nonproprietary hardware.

By avoiding proprietary enterprise storage solutions and eliminating the need for individual research teams or institutions to manage their own storage infrastructure, NESE meets MGHPCC’s storage needs economically and efficiently. It breaks new ground for cost and collaboration at once.
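Ceph’s key idea is that clients compute where data lives instead of asking a central server. The toy Python sketch below (not Ceph’s actual CRUSH code; the constants and function names are illustrative inventions) shows the flavor of hash-based placement:

```python
import hashlib

# Toy sketch of Ceph-style placement (illustrative, not real CRUSH):
# an object name hashes to a placement group (PG), and each PG maps
# deterministically onto a set of object storage daemons (OSDs).
# Because placement is computed, no central metadata lookup is needed.

NUM_PGS = 128    # placement groups (illustrative)
NUM_OSDS = 60    # storage daemons across the cluster (illustrative)
REPLICAS = 3     # copies kept of each object

def pg_for_object(name: str) -> int:
    """Hash an object name to a placement group."""
    digest = hashlib.md5(name.encode()).hexdigest()
    return int(digest, 16) % NUM_PGS

def osds_for_pg(pg: int) -> list[int]:
    """Map a placement group to REPLICAS distinct OSDs."""
    return [(pg * 2654435761 + i) % NUM_OSDS for i in range(REPLICAS)]

# Any client computes the same answer for the same object name:
replicas = osds_for_pg(pg_for_object("genome-sample-42.bam"))
```

Because every client recomputes the same mapping, there is no metadata bottleneck, and growing the cluster means updating the map that everyone computes against rather than migrating a central index.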

The project design has attracted notice: NESE was launched with funding from the National Science Foundation’s Data Infrastructure Building Blocks (DIBBs) program, which aims to foster data-centric infrastructures that accelerate interdisciplinary and collaborative research for science, engineering, education, and economic development.

In addition, NESE has attracted major industry partners who help the team achieve the goals of both individual projects and the NSF as a whole. Microway, which designs and builds customized, fully integrated, turn-key computational clusters, servers, and workstations for demanding users in HPC and AI, has supplied NESE’s hardware and will continue to partner with NESE as it grows. Additionally, Red Hat, the leading commercial developer of Ceph, has been working with the NESE team from design and testing through to implementation.

Building NESE

Of course, a storage infrastructure of this scale poses real challenges. Building such an immense data lake requires knowledgeable project management and partners committed to delivering a solution tailored to research computing users.

First, the research done at MGHPCC and each of its member institutions is highly diverse in terms of its storage demands. From types and volume of data to retrievability and front-end needs, the NESE team needed to account for many different users in building out new storage infrastructure. What’s more, the system needed to be easily scalable; while the initial storage capacity is large, the NESE team expects it to grow rapidly over the next several years. Finally, with such a huge volume of data storage and a large number of users, the system needed to be highly fault-tolerant, so that outages do not affect huge swaths of data.

With these challenges in hand, the NESE team, including Saul Youssef of Boston University and Yockel, reached out to Microway for help in designing the ideal solution. Yockel and others at Harvard had previously worked with Microway on dense GPU computing solutions. Based on this trust, they gave Eliot Eshelman, vice president, Strategic Accounts and HPC Initiatives, and the rest of the Microway team the task of helping them design and deploy the right data storage solution. The team went through multiple rounds of consultation and possible iterations before selecting the final system design.

Originally, the NESE team was interested in both dense and deep hardware systems, with 80-90 drives per node. After learning from the extended Ceph community that this kind of configuration could lead to backlogs, failures, and system outages, they selected single-socket, 1U, 12-drive systems.
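The reasoning behind that choice can be sketched with back-of-the-envelope arithmetic: the fewer drives per node, the smaller the share of cluster capacity that a single node failure takes offline and must be re-replicated. (The cluster-wide drive count below is an assumed round number chosen for illustration, not NESE’s actual total.)

```python
# Compare the "blast radius" of a single node failure for dense nodes
# (~84 drives, in the 80-90 range originally considered) versus the
# 12-drive nodes ultimately selected. TOTAL_DRIVES is illustrative.

TOTAL_DRIVES = 1008  # assumed cluster-wide drive count for the example

def blast_radius(drives_per_node: int, total_drives: int = TOTAL_DRIVES) -> float:
    """Fraction of cluster capacity that must be re-replicated
    when a single node fails."""
    return drives_per_node / total_drives

for drives_per_node in (84, 12):
    nodes = TOTAL_DRIVES // drives_per_node
    print(f"{drives_per_node:3d}-drive nodes: {nodes:3d} nodes, "
          f"one node failure affects {blast_radius(drives_per_node):.1%} of capacity")
```

Under these assumed numbers, a dense node failure forces roughly seven times as much recovery traffic as a 12-drive node failure, which is the kind of backlog and outage risk the Ceph community warned about.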

“Microway understands our particular approach and needs,” Youssef said. “They provided us quotes that we could use throughout the consortium to gain significant buy-in and worked with us to iterate design based on Ceph best practices and this project’s specific demands. Our relationship with them has been straightforward in terms of purchasing, but the systems we’ve created are really at the edge of what’s possible in data storage.”

The initial NESE deployment has five racks, each with space for 36 nodes. As of September 2019, it includes roughly 100 nodes in total. All nodes are connected to MGHPCC’s data network and contain high-density storage in a mix of traditional hard drives and high-speed SSDs.

The net result is over 20 PB of overall capacity, which can expand seamlessly, by as much as a factor of 10, as future demand requires.

The overall solution also provides the diversity of storage that NESE needs, enabling a mix of high-performance, active, and archival storage across users. This has allowed for cost optimization, while the use of Ceph has ensured that all of that data is easily retrievable, regardless of a user’s storage use type.

Impact of an Innovative Data Storage Solution

With the implementation of NESE within MGHPCC, Massachusetts data science researchers now have a data storage resource that is both large and able to grow, with no need to migrate data across physical storage systems over time. The project’s distributed Ceph architecture lets the NESE team add new resources or decommission old ones while the system remains active.
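As a rough illustration (the OSD id and device name are hypothetical, and this is not taken from NESE’s actual runbooks), Ceph’s command-line tools support exactly this kind of live expansion and drain-then-remove decommissioning:

```shell
# Add capacity while the cluster stays online:
ceph-volume lvm create --data /dev/sdf    # provision a new drive as an OSD
ceph status                               # watch data rebalance onto it

# Decommission an old OSD without downtime:
ceph osd out 12                           # drain: data migrates off osd.12
ceph osd safe-to-destroy osd.12           # confirm removal would lose no data
ceph osd purge 12 --yes-i-really-mean-it  # remove it from the cluster map
```

In both directions, Ceph rebalances data in the background while clients keep reading and writing, so hardware refreshes never require a migration outage.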

Data storage management by a single team within the consortium lowers administration labor effort and costs, adds greater flexibility for backups, and makes it easy to double storage for a lab or project.

The NESE team has elected to begin relatively “small,” with the 20 PB of storage currently used by a small portion of the consortium’s labs and researchers. Even so, the project has significant buy-in from throughout the MGHPCC consortium.

“It’s not unreasonable to expect our storage capacity to grow fivefold in the next few years,” Youssef said.

Harvard’s overall data storage needs alone have grown by 10 PB per year for each of the last four years; other member institutions have seen similarly skyrocketing data storage needs. That’s because research is creating vast amounts of data, and growth isn’t linear. New generations of instrumentation in the life sciences mean increases in data production of five to 10 times every few years; even the social sciences and humanities, areas that once needed little by way of data storage, have begun to generate data through new research methodologies and other projects like library digitization.

With such vast swaths of data being generated yearly, cost concerns become more significant too. NESE is in a unique position to provide cost savings in data storage, thanks to its efficiently run location within the MGHPCC building and its dedicated management team. As a result, Youssef estimates that the cost of storage within NESE is about one-sixth to one-tenth the cost of comparable commercial data storage solutions deployed on campus. Being on the MGHPCC floor means that high-bandwidth connectivity to the storage is also affordable. With more competitive costs, NESE is freer to grow and expand into the future. These cost benefits, taken together with trust in Yockel’s operations team, are the basis for NESE’s potential rapid growth. Seventy percent of the initial storage has come from external buy-in, and more is expected.

Future Pathways

Though Youssef and Yockel aren’t sure exactly how large NESE will become, they’re certain it will — and has significant capacity to — grow. The current racks were provisioned for more nodes than they currently house, with about one-third of the current space free for buy-in. While the capacity has served Harvard research teams to date, it will be allocated among all of the different universities as shared project space in the future. The start-up NESE storage is mainly used as Globus endpoints across the collaboration, storage for laboratories across the Harvard campus, and storage for the Large Hadron Collider project at CERN.

Shared data storage makes sharing data sets across research teams and universities far easier: the challenges of data locality disappear. A researcher at Harvard can simply point a collaborator at Boston University to a data set already on the same storage, and the effects could be transformative, opening a pathway to more innovative, collaborative research that spans some of the nation’s top universities.