Looking for a cancer-causing variation requires the examination of literally millions of different spots within the human DNA in thousands of individuals—a seemingly overwhelming amount of information. In fact, until recently it was so much data that the storage and processing capacities of computing platforms couldn’t efficiently handle the loads, which limited the work of scientists searching for cures.
The researchers at Arizona State University, however, thought big data computing could be just the tool they needed to start making some serious breakthroughs.
ASU is the largest public university by enrollment in the United States, with more than 83,000 students and 3,300 faculty members. The university’s charter is based on a “New American University” model that encourages teaching and research. A program launched under that model is ASU’s Complex Adaptive Systems Initiative (CASI), which aims to tap different departments across the university to come up with new technologies and solutions to the immense challenges the world faces in health, sustainability, security and education.
The use of big data and analytics technologies has become a huge part of the CASI effort.
Three years ago, Jay Etchings, ASU’s director of research computing, and Ken Buetow, director of the Computational Sciences and Informatics program for Complex Adaptive Systems at ASU, deployed Apache Hadoop, an open-source programming framework. They were drawn by Hadoop’s capacity to process, analyze and store extremely large data sets in a distributed computing environment.
Today, ASU has developed “what is considered a first-generation data science research instrument that is already proving useful in the growing area of personalized medicine,” says Etchings, making reference to a category of medical care becoming common in fighting specific cancers within particular groups of people.
At the center of ASU’s Hadoop environment is HortonWorks’ Data Platform. HDP allows vast computing resources to work together on computations. To process HDP data, ASU chose other Apache tools, including Apache’s YARN (Yet Another Resource Negotiator), which coordinates data inputs and allocates application resources, and Apache’s Spark, a processing engine for data analytics.
The HortonWorks platform is used to store, process, and query huge amounts of data (in what’s known as a data lake), to make the data and analytics tools accessible to diverse researchers, and to do it at a cost that doesn’t increase as the amount of genomics data grows. Currently, the ASU working data volume is nearly 4 petabytes.
“Our project facilitates use and reuse of large-scale genomics data, a challenge common to all research institutions addressing grand challenges in precision medicine,” Etchings says. “The Hadoop ecosystem and related data lake structure [eliminates] the need for each researcher and clinical user to manage the large, complex genomic data footprint.”
The data in a single human genome includes about 20,000 genes. If that data were stored in a traditional platform it would represent several hundred gigabytes. Adding specialized genomic characterization of variation at almost a million DNA locations generates 20 billion rows of gene-variant combinations for each population examined.
And ASU’s Hadoop cluster holds data on thousands of individuals.
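The 20-billion-row figure is consistent with simple multiplication. The sketch below is only a back-of-envelope check, not ASU's actual data model: it assumes every gene is paired with every characterized DNA location, and it rounds "almost a million" up to exactly one million. The function and constant names are hypothetical.

```python
# Back-of-envelope check of the row count quoted in the article.
# Assumption (not stated in the article): one row per (gene, location) pair.
GENES_PER_GENOME = 20_000      # "about 20,000 genes"
VARIANT_LOCATIONS = 1_000_000  # "almost a million DNA locations", rounded up


def gene_variant_rows(genes: int, locations: int) -> int:
    """Return the number of gene-variant combinations for one population."""
    return genes * locations


rows = gene_variant_rows(GENES_PER_GENOME, VARIANT_LOCATIONS)
print(f"{rows:,} rows per population examined")
```

Under those assumptions the product is 20 billion rows per population, which matches the figure cited.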
Using databases of 20 billion rows was not possible with traditional storage and processing technology. But with the Hadoop infrastructure, CASI can run data-intensive queries of large-scale resources and get back results in a matter of seconds.
Leveraging its analytics capabilities, the university continues to identify the mechanisms of cancer and how networks of genes drive cancer susceptibility and outcome. These networks are observed to be common across different types of cancer.
Among the challenges ASU faces in its big data initiatives is ensuring compliance with regulations regarding data management and privacy.
“All elements within the campus cyber infrastructure are subject to various levels of regulatory compliance,” Etchings says. This includes the Family Educational Rights and Privacy Act, a U.S. federal law that governs access to educational information and records. “As the proliferation of data-intensive computation sweeps across campus, we face tighter sets of controls around confidentiality, integrity, authorization and access.”
Like other research institutions, ASU selected FISMA and NIST 800-171 as guideposts to greater levels of confidentiality, integrity and availability, Etchings says. FISMA, or the Federal Information Security Management Act, is a law that requires each U.S. federal agency to develop, document and implement an agency-wide program to provide security for the information and systems that support the operations and assets of the agency, including those provided or managed by a third party. NIST 800-171 is a National Institute of Standards and Technology regulatory compliance model for controlled unclassified information in non-federal systems and organizations, he says. The controls within NIST 800-171 explain a set of compliance measures aligned with FISMA.
With those guidelines, ASU deployed a pair of Apache tools—Atlas and Ranger—to help with its compliance and regulation efforts. Atlas is a set of core foundational governance services that allow organizations to meet their compliance requirements within Hadoop. Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
“The paradigm shift in information management models requires innovative elements such as Apache Atlas for metadata management and data governance … and Apache Ranger to deliver comprehensive security across the institutional data lake,” Etchings says.
These components play critical roles in securing research data at ASU.
“An important factor to remember is that we hold the most unique form of personal health information in the human genome,” Etchings says. “Therefore it behooves data management stewards to tune access controls to the most restrictive standards” outlined by the National Institute of Standards and Technology.
A Foundation for the Future
When creating its data-intensive environment, ASU had flexibility in mind. The university aimed to build a utility computing infrastructure that provides high-performance computing.
This environment enables users to seamlessly move between work environments where they might be developing code for a new high-performance computing application, for example, and the Hadoop space. If users need to run an application on a more traditional, high-performance computing platform, they can output that traditional computing job into data frameworks that could then be processed in the Hadoop environment, Etchings says.
This kind of flexibility, combined with powerful storage, processing and analytics capabilities provided by the latest technology, allows ASU to meet its research goals and contribute to major advances in medicine and other areas.
While there have been challenges, ASU believes its efforts are paying off. One of the key reasons for investing in big data and analytics technologies is so that ASU can find novel, relevant patterns in large, diverse, complex biomedical data, Buetow says. That provides “the foundation for new frontiers in personalized medicine through genomic analysis.”