I/O Resiliency

This document describes how Alluxio handles expected and unexpected downtime, such as when Alluxio workers become unavailable. Multiple mechanisms are in place to handle application requests gracefully and ensure the overall I/O resiliency of the Alluxio system.

UFS fallback mechanism

Alluxio clients read both metadata and data from Alluxio workers. If the requested worker becomes unavailable or returns an error, the client can fall back to requesting this information from the UFS instead.

The UFS fallback feature is enabled by default. In certain situations this may cause undesirable behavior, such as overloading the UFS with a large number of simultaneous requests, commonly referred to as the thundering herd problem.

To disable the feature, set the following property:

alluxio.dora.client.ufs.fallback.enabled=false
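The fallback flow can be sketched as follows. This is a minimal illustration of the behavior described above, not the actual Alluxio client implementation; the function and exception names are hypothetical.

```python
class WorkerUnavailableError(Exception):
    """Raised when an Alluxio worker cannot serve a request (hypothetical)."""


def read_with_fallback(path, read_from_worker, read_from_ufs, fallback_enabled=True):
    """Try the Alluxio worker first; if it fails and fallback is enabled,
    read directly from the UFS instead of surfacing the error."""
    try:
        return read_from_worker(path)
    except WorkerUnavailableError:
        if not fallback_enabled:
            raise  # mirrors alluxio.dora.client.ufs.fallback.enabled=false
        return read_from_ufs(path)


# Example: the worker read fails, so the client transparently reads the UFS.
def failing_worker_read(path):
    raise WorkerUnavailableError(path)

data = read_with_fallback(
    "/bucket/file",
    read_from_worker=failing_worker_read,
    read_from_ufs=lambda p: b"bytes-from-ufs",
)
assert data == b"bytes-from-ufs"
```

With `fallback_enabled=False`, the same worker failure would propagate to the application instead, which is the trade-off the property above controls.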

FUSE UFS fallback while reading a file

If a user is reading files via the FUSE interface and the worker becomes unavailable in the middle of the operation, Alluxio seamlessly switches the read source from the worker to the UFS. This removes the need for the client to explicitly retry the read operation when a worker is removed.

Retry across replications

When a file has multiple replicas and the client fails to get a response from one worker, the client retries the request on the other workers holding replicas. This applies to both data and metadata access. Again, this removes the need for the client to explicitly retry the read operation if the target worker suddenly becomes unresponsive.
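The replica-retry behavior amounts to iterating over the workers that hold a replica and surfacing an error only when all of them fail. A minimal sketch, with hypothetical names (this is not the real Alluxio client API):

```python
class WorkerUnavailableError(Exception):
    """Raised when a worker cannot serve a request (hypothetical)."""


def read_any_replica(path, replica_workers, read_fn):
    """Try each worker holding a replica in turn; raise only if every
    replica worker fails."""
    last_err = None
    for worker in replica_workers:
        try:
            return read_fn(worker, path)
        except WorkerUnavailableError as err:
            last_err = err  # remember the failure and try the next replica
    raise last_err


# Example: worker-1 is unresponsive, so the read succeeds via worker-2.
def read_fn(worker, path):
    if worker == "worker-1":  # simulate an unresponsive worker
        raise WorkerUnavailableError(worker)
    return f"{path} served by {worker}"

result = read_any_replica("/data/f", ["worker-1", "worker-2"], read_fn)
assert result == "/data/f served by worker-2"
```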

Membership changes

Adding workers

When a worker is added to the cluster, the newly added worker takes over part of the hash range from existing workers according to the consistent hashing algorithm. Clients will still read files from the existing set of workers until they detect that the membership has changed. If requests are directed to the new worker, read operations may slow down since the worker needs to load the data from the UFS again, but the operations should not encounter any errors.
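The key property of consistent hashing here is that adding a worker only reassigns the paths that move to the new worker; all other paths keep their existing owner. A toy sketch (the hash function and virtual-node count are illustrative assumptions, not Alluxio's actual parameters):

```python
import bisect
import hashlib


def _hash(key):
    """Stable hash for placing workers and paths on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring; each worker owns several virtual nodes."""

    def __init__(self, workers, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, worker) tuples
        for w in workers:
            self.add(w)

    def add(self, worker):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_hash(f"{worker}#{i}"), worker))

    def remove(self, worker):
        self.ring = [e for e in self.ring if e[1] != worker]

    def locate(self, path):
        """A path is owned by the first virtual node at or after its hash."""
        i = bisect.bisect(self.ring, (_hash(path), ""))
        return self.ring[i % len(self.ring)][1]


paths = [f"/data/file-{i}" for i in range(1000)]
ring = HashRing(["worker-A", "worker-B", "worker-C"])
before = {p: ring.locate(p) for p in paths}

ring.add("worker-D")  # a new worker joins the cluster
after = {p: ring.locate(p) for p in paths}

moved = [p for p in paths if after[p] != before[p]]
# Every path that changed owner moved to the new worker; the rest stayed put.
assert all(after[p] == "worker-D" for p in moved)
assert 0 < len(moved) < len(paths)
```

This is why only the paths handed to the new worker see the one-time slowdown of re-loading from the UFS, while reads of all other paths are unaffected.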

Removing workers

When a worker is removed, the paths previously directed to the removed worker will be distributed to other workers, again according to the consistent hashing algorithm. Ongoing requests served by the removed worker will fail. If the UFS fallback mechanism is enabled, the client will attempt to retry the request directly to the UFS. Once the client detects the membership change, these requests will be properly directed to the remaining live workers.

With the FUSE UFS fallback feature, the FUSE client should not encounter any error when workers are removed. For S3 & fsspec interfaces, the application might see transient read errors for ongoing reads, but should be able to recover once the worker membership is updated.
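Removal is symmetric: only the paths previously owned by the removed worker are redistributed, and every other path keeps its owner. Continuing the toy consistent-hashing sketch (hash function and virtual-node count are illustrative assumptions, not Alluxio's actual parameters):

```python
import bisect
import hashlib


def _hash(key):
    """Stable hash for placing workers and paths on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring; each worker owns several virtual nodes."""

    def __init__(self, workers, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, worker) tuples
        for w in workers:
            self.add(w)

    def add(self, worker):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_hash(f"{worker}#{i}"), worker))

    def remove(self, worker):
        self.ring = [e for e in self.ring if e[1] != worker]

    def locate(self, path):
        """A path is owned by the first virtual node at or after its hash."""
        i = bisect.bisect(self.ring, (_hash(path), ""))
        return self.ring[i % len(self.ring)][1]


paths = [f"/data/file-{i}" for i in range(1000)]
ring = HashRing(["worker-A", "worker-B", "worker-C"])
before = {p: ring.locate(p) for p in paths}

ring.remove("worker-C")  # the worker leaves the cluster
after = {p: ring.locate(p) for p in paths}

# Only paths previously owned by worker-C change owners; the rest stay put.
assert all(before[p] == "worker-C" for p in paths if after[p] != before[p])
assert "worker-C" not in after.values()
```

Once clients observe the updated membership, they resolve these reassigned paths to the remaining workers; in the interim, the UFS fallback path above absorbs the failures.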

Restarting workers

Restarting a worker behaves in the same way as removing a worker and subsequently adding it back. Because each worker has a unique identity, which is used to form the consistent hash ring, the restarted worker will be responsible for the same paths it had been previously managing, such that the cached data on the worker will remain valid. Again, FUSE clients should not observe any read failures in the case of worker restarts.
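Because the worker's identity determines its positions on the hash ring, removing it and adding it back with the same identity restores exactly the original path ownership. Continuing the toy sketch (illustrative parameters, not Alluxio's actual hashing):

```python
import bisect
import hashlib


def _hash(key):
    """Stable hash for placing workers and paths on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring; each worker owns several virtual nodes."""

    def __init__(self, workers, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, worker) tuples
        for w in workers:
            self.add(w)

    def add(self, worker):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_hash(f"{worker}#{i}"), worker))

    def remove(self, worker):
        self.ring = [e for e in self.ring if e[1] != worker]

    def locate(self, path):
        """A path is owned by the first virtual node at or after its hash."""
        i = bisect.bisect(self.ring, (_hash(path), ""))
        return self.ring[i % len(self.ring)][1]


paths = [f"/data/file-{i}" for i in range(1000)]
ring = HashRing(["worker-A", "worker-B", "worker-C"])
before = {p: ring.locate(p) for p in paths}

ring.remove("worker-B")  # the worker goes down...
ring.add("worker-B")     # ...and restarts with the same identity
restored = {p: ring.locate(p) for p in paths}

# Ownership is identical to before the restart, so cached data stays valid.
assert restored == before
```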