System Health Check & Quick Recovery

Health Check

When something unusual happens, or after a change has been applied, the first thing to check is the general health status of the system. This guidebook describes how to determine whether the system is healthy.

Metrics and Dashboard

Reasonable metric values and error thresholds vary greatly with different hardware specs and data workflow patterns. The general approach to diagnosis is to compare the key metrics week-over-week or day-over-day and look for significant differences; see the example query below.
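
For example, a day-over-day comparison can be written directly in PromQL with the offset modifier. The sketch below uses the worker request metric described in the Liveness section; a ratio far from 1 indicates a significant change:

    irate(alluxio_data_access_bytes_count[5m]) / irate(alluxio_data_access_bytes_count[5m] offset 1d)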

Liveness

For workers, the alluxio_data_access_bytes_count metric counts the read/write requests received by the worker. In Prometheus, we can query irate(alluxio_data_access_bytes_count[5m]) to calculate the requests per second (RPS) handled by the worker. The RPS should be steady; if it rises sharply, the worker may be at risk of capacity problems.
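
To get a single RPS value per worker, aggregate away the remaining labels, following the same instance-based grouping used by the cache hit rate queries below:

    sum by (instance) (irate(alluxio_data_access_bytes_count{job="worker"}[5m]))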

UFS Data Flow

The alluxio_ufs_data_access metric records the read/write data flow between the workers and the UFS.

The alluxio_ufs_error metric records the error codes returned by API calls to each type of UFS. If this metric increases, something is likely wrong with access to the UFS. You can use the error_code label to filter out expected errors. For example, a “no such file” error in S3 appears as a 404 error code. Note that the error code for the same type of error can vary between different types of UFS.
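
For example, the following sketch shows the rate of UFS errors while excluding an expected code (the 404 from the example above); it assumes alluxio_ufs_error is a counter like the other metrics, and the error_code filter should be adjusted to whatever is expected in your workload:

    sum by (error_code) (irate(alluxio_ufs_error{error_code!="404"}[5m]))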

Cache Hit Rate

For workers, the cache hit rate can be calculated from the alluxio_cached_data_read_bytes_total and alluxio_ufs_data_access_bytes_total metrics. To calculate it as a per-second rate, apply the irate function in Prometheus and then use sum to aggregate away the unused labels.

  • The cache hit rate of a single worker will be:

    sum by (instance) (irate(alluxio_cached_data_read_bytes_total{job="worker"}[5m])) / (sum by (instance) (irate(alluxio_cached_data_read_bytes_total{job="worker"}[5m])) + sum by (instance) (irate(alluxio_ufs_data_access_bytes_total{job="worker"}[5m])))
    
  • The overall cache hit rate (already integrated in the dashboard):

    sum(irate(alluxio_cached_data_read_bytes_total{job="worker"}[5m])) / (sum(irate(alluxio_cached_data_read_bytes_total{job="worker"}[5m])) + sum(irate(alluxio_ufs_data_access_bytes_total{job="worker"}[5m])))
    

General Status

Alluxio Process Readiness

# check the readiness of the workers in kubernetes
# ensure that every pod shows all of its containers READY (e.g. 1/1); a "Running" STATUS does not always mean the worker is healthy
$ kubectl get pod -l app.kubernetes.io/component=worker
NAME                              READY   STATUS    RESTARTS   AGE
alluxio-worker-59476bf8c5-lg4sc   1/1     Running   0          46h
alluxio-worker-59476bf8c5-vg6lc   1/1     Running   0          46h

# or use the following one-liner to get the readiness percentage
# if there are multiple alluxio clusters, use `app.kubernetes.io/instance=alluxio` to specify the cluster
$ kubectl get pod -l app.kubernetes.io/component=worker -o jsonpath='{range .items[*]}{.status.containerStatuses[0].ready}{"\n"}{end}' | awk 'BEGIN{t=0}{s+=1;if($1=="true")t+=1}END{print t,"ready /",s,"expected =",t/s*100,"%"}'
2 ready / 2 expected = 100 %

ETCD Readiness

# check the readiness of the etcd cluster
# use `app.kubernetes.io/instance=alluxio` to select the etcd cluster integrated with alluxio
$ kubectl get pod -l 'app.kubernetes.io/component=etcd,app.kubernetes.io/instance=alluxio'
NAME             READY   STATUS    RESTARTS   AGE
alluxio-etcd-0   1/1     Running   0          46h
alluxio-etcd-1   1/1     Running   0          46h
alluxio-etcd-2   1/1     Running   0          46h

# use the following one-liner to get the readiness percentage
$ kubectl get pod -l 'app.kubernetes.io/component=etcd,app.kubernetes.io/instance=alluxio' -o jsonpath='{range .items[*]}{.status.containerStatuses[0].ready}{"\n"}{end}' | awk 'BEGIN{t=0}{s+=1;if($1=="true")t+=1}END{print t,"ready /",s,"expected =",t/s*100,"%"}'
3 ready / 3 expected = 100 %

# check if the etcd can handle the normal I/O operation
# the output will be the registration of the alluxio workers. DO NOT change or write values in this way
$ kubectl exec -it alluxio-etcd-0 -c etcd -- bash -c 'ETCDCTL_API=3 etcdctl get --keys-only --prefix /ServiceDiscovery'
/ServiceDiscovery/default-alluxio/worker-1b4bad64-f195-46a8-be5f-27825a8100a4

/ServiceDiscovery/default-alluxio/worker-49ca0d6f-7a0a-482f-bb17-5550bd602a02
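
# optionally, verify the health reported by etcd itself using standard etcdctl commands
$ kubectl exec -it alluxio-etcd-0 -c etcd -- bash -c 'ETCDCTL_API=3 etcdctl endpoint health --cluster'
$ kubectl exec -it alluxio-etcd-0 -c etcd -- bash -c 'ETCDCTL_API=3 etcdctl endpoint status --cluster -w table'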

UFS Readiness

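# verify that Alluxio can access the UFS by running the built-in UFS operation tests against a path in the target UFS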
$ ./bin/alluxio exec ufsTest --path s3://your_bucket/test_path
Running test: createAtomicTest...
Passed the test! time: 5205ms
Running test: createEmptyTest...
Passed the test! time: 4076ms
Running test: createNoParentTest...
Passed the test! time: 0ms
Running test: createParentTest...
Passed the test! time: 4082ms
...
Running test: listStatusS3RootTest...
Passed the test! time: 543ms
Running test: createFileLessThanOnePartTest...
Passed the test! time: 3551ms
Running test: createAndAbortMultipartFileTest...
Passed the test! time: 6227ms
Tests completed with 0 failed.

# This example starts two threads, each writing and then reading a 512MB file against the UFS, and prints the test results
$ ./bin/alluxio exec ufsIOTest --path s3://test_bucket/test_path --io-size 512m --threads 2
{
  "readSpeedStat" : {
    "mTotalDurationSeconds" : 483.992,
    "mTotalSizeBytes" : 1073741824,
    "mMaxSpeedMbps" : 2.0614405926641703,
    "mMinSpeedMbps" : 1.0578687251028942,
    "mAvgSpeedMbps" : 1.5596546588835323,
    "mClusterAvgSpeedMbps" : 2.1157374502057884,
    "mStdDev" : 0.7096324729606261,
    "className" : "alluxio.stress.worker.IOTaskSummary$SpeedStat"
  },
  "writeSpeedStat" : {
    "mTotalDurationSeconds" : 172.136,
    "mTotalSizeBytes" : 1073741824,
    "mMaxSpeedMbps" : 3.5236227246137433,
    "mMinSpeedMbps" : 2.974392340939722,
    "mAvgSpeedMbps" : 3.2490075327767327,
    "mClusterAvgSpeedMbps" : 5.948784681879444,
    "mStdDev" : 0.3883645287295896,
    "className" : "alluxio.stress.worker.IOTaskSummary$SpeedStat"
  },
  "errors" : [ ],
  "baseParameters" : {
    "mCluster" : false,
    "mClusterLimit" : 0,
    "mClusterStartDelay" : "10s",
    "mJavaOpts" : [ ],
    "mProfileAgent" : "",
    "mBenchTimeout" : "20m",
    "mId" : "local-task-0",
    "mIndex" : "local-task-0",
    "mDistributed" : false,
    "mStartMs" : -1,
    "mInProcess" : true,
    "mHelp" : false
  },
  "points" : [ {
    "mMode" : "WRITE",
    "mDurationSeconds" : 145.305,
    "mDataSizeBytes" : 536870912,
    "className" : "alluxio.stress.worker.IOTaskResult$Point"
  }, {
    "mMode" : "WRITE",
    "mDurationSeconds" : 172.136,
    "mDataSizeBytes" : 536870912,
    "className" : "alluxio.stress.worker.IOTaskResult$Point"
  }, {
    "mMode" : "READ",
    "mDurationSeconds" : 248.37,
    "mDataSizeBytes" : 536870912,
    "className" : "alluxio.stress.worker.IOTaskResult$Point"
  }, {
    "mMode" : "READ",
    "mDurationSeconds" : 483.992,
    "mDataSizeBytes" : 536870912,
    "className" : "alluxio.stress.worker.IOTaskResult$Point"
  } ],
  "parameters" : {
    "className" : "alluxio.stress.worker.UfsIOParameters",
    "mThreads" : 2,
    "mDataSize" : "512m",
    "mPath" : "s3://test_bucket/test_path",
    "mUseUfsConf" : false,
    "mConf" : { }
  },
  "className" : "alluxio.stress.worker.IOTaskSummary"
}

Logs

Alluxio Processes

# use `kubectl logs <pod-name>` to get all logs of a pod
# on bare metal, the logs are located in `logs/<component>.log`, e.g.: `logs/worker.log`

# use `grep` to filter the specific log level
# when an exception is caught, there will be additional information after the WARN/ERROR log,
# use `-A` to print more lines
$ kubectl logs alluxio-worker-59476bf8c5-lg4sc | grep -A 1 'WARN\|ERROR'
2024-07-04 17:29:53,499 ERROR HdfsUfsStatusIterator - Failed to list the path hdfs://localhost:9000/
java.net.ConnectException: Call From myhost/192.168.1.10 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

# use `kubectl logs -p` to get logs from a previous failed container
$ kubectl logs -p alluxio-worker-59476bf8c5-lg4sc | grep -A 1 'WARN\|ERROR'
2024-07-05 14:22:28,290 [QuotaCoordinatorTimerThread] ERROR quota.QuotaCoordinator (QuotaCoordinator.java:lambda$new$0) - Failed to run quota janitor job
java.io.IOException: Failed to poll quota usage from worker WorkerInfo{id=7128818659502632567, identity=worker-7128818659502632567, address=WorkerNetAddress{host=sunbowen, containerHost=, rpcPort=64750, dataPort=64751, webPort=64753, domainSocketPath=, secureRpcPort=0, httpServerPort=0}, lastContactSec=283, state=LIVE, capacityBytes=1073741824, usedBytes=0, startTimeMs=1720160253072, capacityBytesOnTiers={MEM=1073741824}, usedBytesOnTiers={MEM=0}, version=3.x-7.0.0-SNAPSHOT, revision=fca83a4688187055d7abfd3a7d710b83a6b62ac6}
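
# to scan recent logs across all worker pods at once, select by label instead of pod name;
# --prefix prepends the pod name to each line, --tail limits the output per pod
# (for clusters with many workers, also raise --max-log-requests)
$ kubectl logs -l app.kubernetes.io/component=worker --prefix --tail=1000 | grep -A 1 'WARN\|ERROR'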

Quick Recovery

ETCD Failure

Alluxio has implemented automatic emergency handling for ETCD failures. When ETCD fails, Alluxio tolerates the failure for a period (usually 24 hours) so that normal I/O operations can continue. During this time, the administrator is expected to resolve the ETCD issue.

When ETCD fails, it can typically be resolved by restarting the ETCD pod. In special cases, if the ETCD node cannot be restarted via Kubernetes, you can follow these steps to rebuild and restore the ETCD service:

  1. Shut down the cluster: kubectl delete -f alluxio-cluster.yaml
  2. Delete the original ETCD PVs and PVCs
  3. Backup the original ETCD PV directory in case we need to recover any data later
  4. Check the ETCD data folder of the ETCD PV on all physical machines and delete its contents
  5. Create three new ETCD PVs. If the PVs are created manually by the administrator, it is recommended to add a nodeAffinity section to each PV so that each PV is bound to one ETCD node and one physical machine, for example:
apiVersion: v1
kind: PersistentVolume
spec:
  # capacity, accessModes, storageClassName and the volume source are omitted here
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - ip-10-0-4-212.ec2.internal
  6. Restart the cluster: kubectl create -f alluxio-cluster.yaml
  7. If the UFS mount points are not defined in UnderFileSystem resources, re-execute the alluxio mount add command to re-mount the UFS

Worker Failure

Alluxio can tolerate worker failures without interrupting I/O operations thanks to its built-in fault-tolerance features. Generally, a small number of worker failures will not affect Alluxio’s functionality.

When a worker fails, Kubernetes restarts the worker pod automatically without affecting the cached data; the recovery can be verified as shown below.
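
To verify recovery, re-run the readiness checks from the General Status section, or inspect the pod directly with standard kubectl commands (the pod name is taken from the earlier example):

# the RESTARTS column shows how many times the worker container has been restarted
$ kubectl get pod -l app.kubernetes.io/component=worker
# describe a specific worker pod to find the restart reason in its events (e.g. OOMKilled, failed probes)
$ kubectl describe pod alluxio-worker-59476bf8c5-lg4sc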

Coordinator Failure

The coordinator handles load jobs and management services. It persists job metadata to the metastore, so unfinished jobs can be recovered after a restart. If the persisted data is corrupted, the unfinished jobs are lost and need to be resubmitted.

When the coordinator fails, Kubernetes will restart the coordinator pod automatically.
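
The coordinator pod can be checked in the same way as the workers. The component label value below is an assumption based on the labels used elsewhere in this guide; verify it against your deployment:

# check the coordinator pod status and restart count (label value assumed; adjust to match your deployment)
$ kubectl get pod -l app.kubernetes.io/component=coordinator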