CatchAndRetry: Extending Exceptions to Handle Distributed System Failures and Recovery

In this paper, we present CatchAndRetry, an extension of the traditional exception mechanism that provides language-level support for common recovery techniques in distributed systems. We motivate and justify our design by analyzing several case studies taken from Facebook. CatchAndRetry is general enough to apply to multiple tiers of a distributed application; throughout this paper, we illustrate its use in both a large-scale distributed server-side application running in a data center and a JavaScript client-side application running in a web browser.