Déjà Vu: Transparent Checkpointing And Migration Of Parallel Codes Over Grid Infrastructures

Start Date: 02/01/2004
End Date: 03/01/2009

A daunting challenge is the evolution from today’s computational Grid to a true cyberinfrastructure that seamlessly integrates resources ranging from small clusters in academic laboratories to the largest national supercomputing centers and provides ubiquitous access to high performance computing, research instrumentation, data warehouses and visualization. Realizing this future requires fundamental advances in transparent fault recovery mechanisms to mask the component failures endemic to any large-scale computational resource. While previous generations of supercomputers engineered reliability into systems hardware, today’s high performance computing (HPC) environments are based on clusters of commercial off-the-shelf (COTS) components, with no systemic solution for the reliability of the resource as a whole. Bringing stability to ever-growing networked collections of cluster systems requires a software solution that provides reliable access to computing resources through transparent, efficient, and automatic checkpointing and recovery (CPR) mechanisms, a view echoed in the recent emphasis on recovery-oriented computing.

Furthermore, the future of the Grid paradigm hinges on its ability to effectively span computational scales ranging from small clusters to large national supercomputing centers. Facilitating this view requires us to reconcile the starvation concerns of large computational applications that run across distributed resources with the administrative control needs of the smaller individual units that make up the Grid, a significant barrier to participation. Enabling the Grid paradigm calls for fluid control of ever-changing computational resources, a view where subsets of jobs transparently migrate under the control of resource-aware scheduling mechanisms.

This proposal aims to bring about this future through radically new approaches to longstanding problems in CPR and process migration by building an integrated system called Déjà vu. Déjà vu provides (a) a transparent parallel checkpointing and recovery mechanism that recovers from any combination of systems failures without any modification to parallel applications, (b) a novel post-compiler analysis system that transparently captures application state, (c) a systems architecture that seamlessly integrates user-initiated and system-initiated checkpoints in a single framework, enabling the effective use of a wide variety of domain-specific knowledge, (d) novel runtime mechanisms for transparent incremental checkpointing that efficiently capture the least amount of state required to maintain global consistency, (e) a novel communications architecture that enables transparent migration of existing MPI/PVM codes without source-code modifications to either the application or the MPI/PVM libraries, (f) recoverable I/O subsystems that can be tailored to specific storage environments, and (g) interfaces to and augmentation of the Globus Toolkit to effectively use the CPR and migration capabilities provided by this research. The core CPR and migration facilities of Déjà vu will be surrounded by management, security, and scheduling facilities that (a) integrate with local scheduling systems (e.g., OpenPBS) and accounting systems for site-specific accounting and refunding of lost compute cycles and (b) extend the Globus security architecture with fine-grained rights and dynamically created user accounts that allow the fluid resource control available under the Déjà vu system to be fully exploited.

Our design goal is not merely to implement “point” solutions but to build an integrated system that will constitute a fundamental component of both large-scale computing facilities and Grid infrastructures. This proposal is timely; it will lower barriers to participation in nascent Grid infrastructures and provide reliable access to high performance computing resources.

Grant Institution: National Science Foundation

Amount: $715,000

People associated with this grant:

Dennis Kafura
Calvin Ribbens
Srinidhi Varadarajan