Past Research Projects

Secure and Resilient Architecture: Scientific Workflow Integrity with Pegasus (SWIP)

The Scientific Workflow Integrity with Pegasus (SWIP) project strengthens cybersecurity controls in the Pegasus Workflow Management System to provide assurances about the integrity of computational scientific methods. These strengthened controls enhance both Pegasus’ handling of science data and its orchestration of software-defined networks and infrastructure. The result is increased trust in computational science and increased assurance in our ability to reproduce it, by allowing scientists to validate that data has not changed since a workflow completed and that the results of multiple workflows are consistent.
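The integrity guarantee described above amounts to recording cryptographic checksums of workflow outputs at completion time and re-verifying them later. A minimal sketch of the idea in Python (illustrative only; Pegasus’ actual integrity checking is built into the workflow system and is more elaborate):

```python
import hashlib
import json
from pathlib import Path

def sha256sum(path):
    """Compute the SHA-256 digest of a file in streaming fashion."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def record_manifest(output_files, manifest_path):
    """Record a checksum for every workflow output at completion time."""
    manifest = {str(p): sha256sum(p) for p in output_files}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest

def verify_manifest(manifest_path):
    """Return the files whose contents changed since the manifest was written."""
    manifest = json.loads(Path(manifest_path).read_text())
    return [p for p, digest in manifest.items() if sha256sum(p) != digest]
```

Comparing manifests from two runs of the same workflow is then a dictionary comparison, which is how consistency across multiple workflows can be checked.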

Learn more

Funding Agency: NSF



Population Architecture using Genomics and Epidemiology (PAGE)

Over recent years, genome-wide association studies (GWAS) have allowed researchers to uncover hundreds of genetic variants associated with common diseases. However, the discovery of genetic variants through GWAS research represents just the first step in the challenging process of piecing together the complex biological picture of common diseases. To help speed the process, the National Human Genome Research Institute is supporting new research in existing large epidemiology studies, all with a rich range of measures of health and potential disease, and many with long-term follow-up. The focus of the new research is on how genetic variants initially identified through GWAS research are related to a person’s biological and physical characteristics, such as weight, cholesterol levels, blood sugar levels, or bone density. Scientists will also examine how non-genetic factors, such as diet, medications, and smoking, may interact with genetic factors or each other to influence health outcomes.

Learn more

Funding Agency: NIH

Model Integration through Knowledge-Rich Data and Process Composition (MINT)

Major societal and environmental challenges require forecasting how natural processes and human activities affect one another. There are many areas of the globe where climate affects water resources and therefore food availability, with major economic and social implications. Today, such analyses require significant effort to integrate highly heterogeneous models from separate disciplines, including geosciences, agriculture, economics, and social sciences. Model integration requires resolving semantic, spatio-temporal, and execution mismatches, which is largely done by hand today and may take more than two years. The Model INTegration (MINT) project will develop a modeling environment which will significantly reduce the time needed to develop new integrated models, while ensuring their utility and accuracy. Research topics to be addressed include:

1) New principle-based semiautomatic ontology generation tools for modeling variables, to ground analytic graphs that describe models and data;
2) A novel workflow compiler using abductive reasoning to hypothesize new models and data transformation steps;
3) A new data discovery and integration framework that finds new sources of data, learns to extract information from both online sources and remote sensing data, and transforms the data into the format required by the models;
4) A new methodology for spatio-temporal scale selection;
5) New knowledge-guided machine learning algorithms for model parameterization to improve accuracy;
6) A novel framework for multi-modal scalable workflow execution; and
7) Novel composable agroeconomic models.

Learn more

Funding Agency: DARPA


dV/dT: Accelerating the Rate of Progress towards Extreme Scale Collaborative Science

The dV/dt project will develop and evaluate, by means of at-scale experimentation, novel algorithms and software architectures that will make it less labor-intensive for a scientist to find the appropriate computing resources, acquire those resources, deploy the desired applications and data on them, and then manage them as the applications run. The proposed research will advance the understanding of resource management within a collaboration in the areas of trust, planning for resource provisioning, and workload, computer, data, and network resource management. This work will result in research artifacts (frameworks, algorithms, simulators, and execution traces) as well as an experimental testbed that will support the proposed research and will be made available to the broader DOE community.

Learn more

Funding Agency: DOE


Precip – Pegasus Repeatable Experiments for the Cloud in Python

Precip is a flexible experiment management API for running experiments on clouds. Precip was developed for use on FutureGrid infrastructures such as OpenStack, Eucalyptus (>=3.2), and Nimbus, as well as on commercial clouds such as Amazon EC2. The API allows you to easily provision resources, on which you can then run commands and copy files to/from subsets of instances identified by tags. The goal of the API is to be flexible and simple to use in Python scripts to control your experiments.
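Precip’s tag-addressed model (provision instances under tags, then target commands at the subset of instances matching a tag) can be illustrated with a small stand-in. The classes and method names below are hypothetical, not the real Precip API:

```python
class Instance:
    """Minimal stand-in for a provisioned cloud instance (hypothetical)."""
    def __init__(self, instance_id, tags):
        self.instance_id = instance_id
        self.tags = set(tags)

class Experiment:
    """Illustrates the tag-addressed model: commands target tagged subsets."""
    def __init__(self):
        self.instances = []

    def provision(self, count, tags):
        # In a real cloud API this would start virtual machines; here we
        # just record placeholder instances carrying the given tags.
        base = len(self.instances)
        for i in range(count):
            self.instances.append(Instance("i-%04d" % (base + i), tags))

    def select(self, tag):
        return [inst for inst in self.instances if tag in inst.tags]

    def run(self, tag, command):
        # Real Precip would execute `command` over SSH on each matching
        # instance; here we report which instances would run it.
        return [(inst.instance_id, command) for inst in self.select(tag)]

exp = Experiment()
exp.provision(2, tags=["master"])
exp.provision(4, tags=["worker"])
print(exp.run("worker", "uname -a"))  # one (instance_id, command) pair per worker
```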

Learn more Source Code

Funding Agency: NSF


WorkflowSim: A Toolkit for Simulating Scientific Workflows in Distributed Environments

WorkflowSim is an open-source workflow simulator that extends CloudSim with workflow-level simulation support. It models workflows as DAGs and supports an elaborate model of node failures, a model of delays occurring at the various levels of the WMS stack, and implementations of several of the most popular dynamic and static workflow schedulers (e.g., HEFT, Min-Min) and task clustering algorithms (e.g., runtime-based, data-oriented, and fault-tolerant clustering algorithms). Parameters are learned directly from traces of real executions. It has recently been used in multiple workflow study areas, such as fault-tolerant clustering, balanced task clustering, cloud brokers, energy-aware scheduling, and cost-oriented scheduling.
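One of the clustering strategies mentioned, runtime-based task clustering, groups short tasks from the same workflow level into larger clusters so that per-task scheduling overhead is amortized. A greedy sketch in Python (WorkflowSim itself is written in Java; this illustrates the idea, not its implementation):

```python
def runtime_clustering(tasks, max_cluster_runtime):
    """Greedy runtime-based horizontal clustering: pack tasks from one
    workflow level into clusters whose summed runtime stays under a cap.
    `tasks` is a list of (task_id, estimated_runtime_seconds) pairs."""
    clusters, current, current_runtime = [], [], 0.0
    # Packing longest-first keeps cluster runtimes balanced.
    for task_id, runtime in sorted(tasks, key=lambda t: -t[1]):
        if current and current_runtime + runtime > max_cluster_runtime:
            clusters.append(current)
            current, current_runtime = [], 0.0
        current.append(task_id)
        current_runtime += runtime
    if current:
        clusters.append(current)
    return clusters

level = [("t1", 10), ("t2", 20), ("t3", 30), ("t4", 40), ("t5", 50)]
print(runtime_clustering(level, max_cluster_runtime=60))
# → [['t5'], ['t4'], ['t3', 't2', 't1']]
```

Each resulting cluster would then be submitted to the scheduler as a single job, trading scheduling overhead against lost parallelism.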

Learn more Source Code

Funding Agency: NSF


Transforming Computational Science with ADAMANT (Adaptive Data-Aware Multi-domain Application Network Topologies)

Project ADAMANT (Adaptive Data-Aware Multi-domain Application Network Topologies) brings together researchers from RENCI/UNC Chapel Hill, Duke University, and USC/ISI, and two successful software tools: the Pegasus workflow management system and the ORCA resource control framework, developed for NSF GENI. The integration of Pegasus and ORCA enables powerful application- and data-driven virtual topology embedding into multiple institutional and national substrates (providers of cyber-resources, like computation, storage, and networks). ADAMANT leverages ExoGENI, an NSF-funded GENI testbed, as well as national providers of on-demand bandwidth services (NLR, I2, ESnet) and existing OSG computational resources, to create elastic, isolated environments for executing complex distributed tasks. This approach improves the performance of these applications and, by explicitly including data movement planning in the application workflow, enables new and unique capabilities for distributed data-driven “Big Science” applications.

Funding Agency: NSF



FutureGrid

FutureGrid is a distributed, high-performance test-bed that allows scientists to collaboratively develop and test innovative approaches to parallel, grid, and cloud computing. The test-bed is composed of a set of distributed high-performance computing resources connected by a high-speed network (with adjustable performance via a network impairment device). Users can access the HPC resources as traditional batch clusters, a computational grid, or as highly configurable cloud resources where users can deploy their own virtual machines. The flexibility in configuration of FutureGrid resources enables its use across a variety of research and education projects. To learn more about how to join FutureGrid, visit the “Getting Started” page in the FutureGrid Manual. Pegasus has two parallel roles to play in the framework of FutureGrid: (1) vanilla Pegasus is deployed in FutureGrid to draw in existing user communities by providing a familiar context on new resources; (2) the Pegasus workflow management system is an essential building block of the Experiment Management capabilities developed within the FutureGrid context.

Learn more

Funding Agency: NSF


Synthesized Tools for Archiving, Monitoring Performance and Enhanced DEbugging (STAMPEDE)

Large-scale applications today make use of distributed resources to support computations and, as part of their execution, generate large amounts of log information. Up to now, we have been using the NetLogger analysis tools to perform offline log analysis. STAMPEDE extends the current offline workflow log analysis capability and develops a comprehensive middleware solution that allows users of complex scientific applications to track the status of their jobs in real time, to detect execution anomalies automatically, and to perform online troubleshooting without logging in to remote nodes or searching through thousands of log files. The system will be able to capture application-level logs from jobs as they are executing on the cyberinfrastructure. At the same time, it will also collect log information from the underlying cyberinfrastructure services, such as resource management and data transfer. These end-to-end logs will be combined and brokered through a subscription interface. External components will use the subscription interface to provide monitoring services.
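The brokered subscription interface can be pictured as a publish/subscribe pattern: jobs and infrastructure services publish log events under topics, and monitoring components subscribe to the topics they need. A minimal sketch, with hypothetical topic and event names (not STAMPEDE's actual middleware):

```python
from collections import defaultdict

class LogBroker:
    """Toy log broker with a subscription interface (illustrative only).
    Application jobs and infrastructure services publish events under a
    topic; monitoring components subscribe to topics of interest."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        for callback in self.subscribers[topic]:
            callback(event)

broker = LogBroker()
anomalies = []
# A monitoring component subscribes to job-status events and flags failures,
# without ever logging in to the remote nodes that produced the logs.
broker.subscribe("job.status",
                 lambda e: anomalies.append(e) if e["state"] == "FAILED" else None)
broker.publish("job.status", {"job": "merge_42", "state": "SUCCESS"})
broker.publish("job.status", {"job": "align_17", "state": "FAILED"})
print(anomalies)  # → [{'job': 'align_17', 'state': 'FAILED'}]
```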

Funding Agency: NSF



Brain Span

The Brain Span project seeks to find when and where in the brain a gene is expressed. This information holds clues to potential causes of disease. A recent study found that forms of a gene associated with schizophrenia are over-expressed in the fetal brain. To make such discoveries about what is abnormal, scientists first need to know what the normal patterns of gene expression are during development. To this end, the National Institute of Mental Health (NIMH), part of the National Institutes of Health (NIH), has funded the creation of TADHB. To map human brain “transcriptomes”, researchers identify the composition of intermediate products, called transcripts or messenger RNAs, which translate genes into proteins throughout development. As part of this project we have enabled geneticists to analyze over 225 human brain RNA sequences using two different mapping algorithms, CASAVA ELAND and PerM.

Learn more

Corral WMS

Corral and glideinWMS currently operate as standalone resource provisioning systems. GlideinWMS was initially developed to meet the needs of the CMS (Compact Muon Solenoid) experiment at the Large Hadron Collider (LHC) at CERN. It generalizes a Condor glidein system developed for CDF (the Collider Detector at Fermilab) and first deployed for production in 2003. It has been in production across the Worldwide LHC Computing Grid (WLCG), with major contributions from the Open Science Grid (OSG), in support of CMS for the past two years, and has recently been adopted for user analysis. GlideinWMS is also currently used by the CDF, DZero, and MINOS experiments, and serves the NEBioGrid and Holland Computing Center communities. GlideinWMS has been used in production with more than 8,000 concurrently running jobs; the CMS use alone totals over 45 million hours.

Learn more


Boutiques: A cross-platform application repository for science gateways

Porting applications to science gateways is a critical step to enable their execution on distributed computing infrastructures and their sharing among scientific communities. However, application porting remains a costly human effort that consists of 1) installing the application on the target execution platform, 2) describing the application in a format compatible with the science gateway, and 3) generating proper user interfaces. Due to the variety of science gateways, application porting efforts are often replicated several times, while sharing them would save cost and improve the quality of the ported applications. Boutiques is an application repository that allows automatic import and exchange of applications among science gateways. Compared to previous initiatives, our repository relies on Linux containers to solve the problem of application installation in a lightweight manner. In addition, it adopts a flexible application description format that is versatile enough to be used in various science gateways.

Learn more Source Code