15 Detecting and Handling Distribution Problems: Fault

This section summarizes the operations of the Fault module and their argument types. Please refer to the Distribution Tutorial for a full specification of the operations and examples of how to use them. This section carefully indicates where the current release is incomplete with respect to the specification (called a limitation) or has a different behavior (called a modification).

15.1 Argument Types

We summarize the argument types for the operations in the Fault module.

Entity

A reference to any Oz language entity that has distributed fault modes, namely any object, cell, lock, port, or logic variable.

Level

Either site or 'thread'(T), where T is a thread reference or the atom this (see footnote 1).

FStates

A set of fault states, i.e., a list that can contain at most one of each of the elements tempFail, permFail, remoteProblem(tempSome), remoteProblem(permSome), remoteProblem(tempAll), and remoteProblem(permAll).

OP

A record that indicates which attempted operation caused the exception or handler invocation. The value of OP is one of:

  • bind(T), wait, isDet (for logic variables).

  • cellExchange(Old New), cellAssign(New), cellAccess(Old) (for cells).

  • 'lock' (for locks).

  • send(Msg) (for ports).

  • objectExchange(Attr Old New), objectAssign(Attr New), objectAccess(Attr Old), objectFetch (for objects). A limitation of the current release is that an attempted operation on an object cannot be retried.

HandlerProc

A handler, i.e., a three-argument procedure that is called as {HandlerProc Entity FStates OP}, where FStates is a set of currently active fault states. A handler replaces an attempted operation on an entity.
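As a minimal sketch of the handler shape described above, the following procedure dispatches on the OP record. The name ReportHandler and the printed records are illustrative assumptions, not part of the Fault module; only the three-argument calling convention and the OP forms come from this section.

```oz
declare ReportHandler in
%% Hypothetical handler: called as {ReportHandler Entity FStates OP}.
%% The Entity argument is ignored here.
proc {ReportHandler _ FStates OP}
   case OP
   of send(Msg) then
      %% A port send was replaced by this handler
      {System.show droppedSend(Msg FStates)}
   [] cellExchange(_ _) then
      {System.show failedCellExchange(FStates)}
   [] 'lock' then
      %% lock is a keyword, so the atom must be quoted
      {System.show failedLock(FStates)}
   else
      %% Any other operation listed under OP
      {System.show failedOperation(OP FStates)}
   end
end
```

Because a handler replaces the attempted operation, this sketch only reports the fault; it does not retry anything.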

WatcherProc

A watcher, i.e., a two-argument procedure that is called in its own thread as {WatcherProc Entity FStates}, where FStates is a set of currently active fault states. A watcher is invoked as soon as the site detects a fault.

15.2 Fault Information

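When there is a distribution problem, three items of information are made available:

  • Entity, the entity on which the operation was attempted.

  • FStates, the set of fault states that are currently active for the entity.

  • OP, the attempted operation.

The system can be configured (see below) so that these items appear in one or more of the following three ways:

  • In an exception raised by the attempted operation (see Fault.enable).

  • As the arguments of a handler call (see Fault.install).

  • As the arguments of a watcher call (see Fault.installWatcher). A watcher receives only Entity and FStates.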

A limitation of the current release is that the Entity argument is undefined for an object operation. For handlers and watchers, this limitation can be bypassed by giving the handler and watcher procedures a reference to the object.
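One way to give a handler its own reference to the object is to build the handler as a closure. The sketch below assumes a hypothetical helper named MakeObjectHandler; it is not part of the Fault module.

```oz
declare MakeObjectHandler in
%% Hypothetical helper: returns a handler that remembers Obj itself,
%% so it does not depend on the Entity argument passed by the system.
fun {MakeObjectHandler Obj}
   proc {$ _ AFStates OP}
      %% Obj comes from the closure, not from the handler arguments
      {System.show objectFault(Obj AFStates OP)}
   end
end
```

A call such as {Fault.install Obj site [permFail] {MakeObjectHandler Obj} B} would then install the closure for the object Obj.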

15.3 Operations

The Fault module contains the following operations. All operations return a boolean flag B that is true if the operation succeeds and false otherwise. All enable and install operations succeed if nothing was enabled or installed at that level. An entity with a successful enable or install at a given level is said to have fault detection at that level. All disable and deInstall operations succeed if nothing was disabled or deinstalled at that level. The system starts up as if {Fault.defaultEnable [tempFail permFail] _} was executed.

All the following operations that have an Entity argument do nothing if Entity does not have distributed fault modes. If a logic variable with fault detection is bound to a nonvariable entity, then the fault detection is transferred to the entity, provided the latter has no fault detection at that level.

{Fault.defaultEnable FStates ?B}

Sets the default fault detection to FStates on the current site. When an operation is attempted on an entity and there is no fault detection on the site or thread level for the entity, then the default fault detection is used. This always succeeds.
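For illustration, the following sketch narrows the default detection to permanent failures only. It assumes the Fault module is accessible in the current environment (inside a functor it would have to be imported).

```oz
declare B in
%% Default detection now covers only permFail on this site
{Fault.defaultEnable [permFail] B}
%% According to the description above, this call always succeeds
{System.show B}
```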

{Fault.defaultDisable ?B}

Sets the default fault detection to nil on the current site. This always succeeds.

{Fault.enable Entity Level FStates ?B}

Enables fault detection on a given entity at a given level for a given set of fault states. An exception is raised if a fault is detected when an operation is attempted on the entity.
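A minimal sketch of how enable might be combined with exception handling follows. The port is created locally only to keep the example self-contained. The exact form of the exception record is not described in this section, so the sketch catches any exception and prints it rather than matching a specific pattern.

```oz
declare P B in
{NewPort _ P}
%% Ask for an exception on permanent failure, for all threads on this site
{Fault.enable P site [permFail] B}
try
   {Send P hello}
catch E then
   %% E carries the fault information; its format is not assumed here
   {System.show caught(E)}
end
```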

{Fault.disable Entity Level ?B}

Disables fault detection on a given entity at a given level.

{Fault.install Entity Level FStates HandlerProc ?B}

Installs a handler for fault detection on a given entity at a given level for a given set of fault states. The handler {HandlerProc Entity AFStates OP} is called if a fault is detected when an operation is attempted on the entity, where AFStates is the set of currently active fault states. A modification of the current release with respect to the specification is that handlers installed on variables always retry the operation after they return.
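As a sketch, a handler can be installed for the current thread only by using the level 'thread'(this). The cell and the printed record are illustrative assumptions.

```oz
declare C B in
{NewCell 0 C}
%% Install an anonymous handler at the level of the current thread
{Fault.install C 'thread'(this) [tempFail permFail]
 proc {$ _ AFStates OP}
    %% The handler replaces the attempted cell operation
    {System.show handled(AFStates OP)}
 end
 B}
%% B is false if fault detection was already present at this level
{System.show B}
```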

{Fault.deInstall Entity Level ?B}

Deinstalls a handler for fault detection on a given entity at a given level.

{Fault.installWatcher Entity FStates WatcherProc ?B}

Installs a watcher for fault detection on a given entity for a given set of fault states. Any number of watchers can be installed on an entity. It is always possible to install a watcher, so this call always succeeds. The watcher {WatcherProc Entity AFStates} is called in its own thread as soon as the site detects a fault, where AFStates is the set of currently active fault states.
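The following sketch installs a watcher on a port. The name LogWatcher is a hypothetical choice; the watcher is bound to a variable so that the same procedure value can later be passed to Fault.deInstallWatcher.

```oz
declare LogWatcher P B in
%% Hypothetical watcher: called in its own thread as {LogWatcher Entity AFStates}
proc {LogWatcher _ AFStates}
   {System.show watcherSaw(AFStates)}
end
{NewPort _ P}
{Fault.installWatcher P [permFail] LogWatcher B}
```

Unlike a handler, the watcher does not replace any operation; it only observes the fault.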

{Fault.deInstallWatcher Entity WatcherProc ?B}

Deinstalls the given watcher on a given entity. This call succeeds if WatcherProc was installed on the entity. If there is more than one instance of WatcherProc installed on the entity, then exactly one is deinstalled.
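To illustrate the rule about multiple instances, this sketch installs the same (hypothetical, do-nothing) watcher W twice and then deinstalls it once.

```oz
declare W P B1 B2 B3 in
proc {W _ _} skip end
{NewPort _ P}
{Fault.installWatcher P [permFail] W B1}
{Fault.installWatcher P [permFail] W B2}
%% Removes exactly one of the two installed instances of W;
%% the other instance stays installed on P
{Fault.deInstallWatcher P W B3}
```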

On a given entity at the global level, at most one enable can be done or one handler installed. For a given entity, the site level can have at most one fault detection per site. The 'thread'(T) level can have at most one fault detection per thread. To have another fault detection, it is necessary to do a disable or deinstall first.
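The deinstall-then-install sequence can be wrapped in a small helper. The sketch below assumes that any existing site-level detection on the entity is a handler; if it was set up with Fault.enable instead, Fault.disable would be the call to use first. The name ReplaceSiteHandler is hypothetical.

```oz
declare ReplaceSiteHandler in
%% Hypothetical helper: swap the site-level handler of Entity.
%% B reports whether the new handler was installed.
proc {ReplaceSiteHandler Entity FStates HandlerProc ?B}
   %% Remove the previous handler, ignoring whether one was present
   {Fault.deInstall Entity site _}
   {Fault.install Entity site FStates HandlerProc B}
end
```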

15.4 Limitations and Modifications

The current release has the following limitations and modifications with respect to the failure model specification. A limitation is an operation that is specified but not possible in the current release. A modification is an operation that is specified but behaves differently in the current release.

Most of the limitations and modifications listed here will be removed in future releases.

15.5 Limitations

The limitations are:

15.6 Modifications

The modifications are:


1. Since thread is already used as a keyword in the language, it has to be quoted to make it an atom.

Denys Duchier, Leif Kornstaedt, Martin Homik, Tobias Müller, Christian Schulte and Peter Van Roy
Version 1.2.3 (20011204)