Running Tensorflow on Alluxio-FUSE
This guide describes how to run Tensorflow on top of the Alluxio POSIX API.
Overview
Tensorflow enables developers to quickly and easily get started with deep learning. This tutorial provides hands-on examples and tips for running Tensorflow on top of the Alluxio POSIX API.
Prerequisites
- Java 8 Update 60 or higher (8u60+), 64-bit, installed.
- Alluxio has been set up and is running.
- Python3 installed.
- Numpy installed. This guide uses numpy 1.19.5.
- Tensorflow installed. This guide uses Tensorflow v1.15.
Setting up Alluxio POSIX API
Run the following command to install FUSE on Linux:
$ yum install fuse fuse-devel
On macOS, download the osxfuse dmg file instead and follow the installation instructions.
In this guide, we use the Alluxio path /training-data as the root of the Alluxio-FUSE mount and the local directory /mnt/fuse as the mount point.
Create a folder at the root in Alluxio:
$ ./bin/alluxio fs mkdir /training-data
Create a folder /mnt/fuse, change its owner to the current user ($(whoami)), and change its permissions to allow read and write:
$ sudo mkdir -p /mnt/fuse
$ sudo chown $(whoami) /mnt/fuse
$ chmod 755 /mnt/fuse
Configure conf/alluxio-site.properties:
alluxio.fuse.mount.alluxio.path=/training-data
alluxio.fuse.mount.point=/mnt/fuse
Follow the instructions for Mount Under Storage Dataset to finish setting up the Alluxio POSIX API and allow Tensorflow applications to access the data through it.
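Before pointing a Tensorflow job at the mount point, it can help to confirm that the directory is actually mounted and accessible. The following is a minimal sketch using only the Python standard library; the helper name `check_mount_point` is hypothetical and not part of Alluxio or Tensorflow, and /mnt/fuse is the mount point assumed throughout this guide.

```python
import os


def check_mount_point(path):
    """Return basic facts about `path` that matter for a FUSE-backed data dir.

    Hypothetical helper for illustration; it only inspects the local filesystem.
    """
    return {
        # The directory itself must exist locally.
        "exists": os.path.isdir(path),
        # True only if a filesystem is mounted at this path.
        "is_mount": os.path.ismount(path),
        # Read and traverse permission for the current user.
        "readable": os.access(path, os.R_OK | os.X_OK),
        # Write permission for the current user.
        "writable": os.access(path, os.W_OK),
    }


if __name__ == "__main__":
    # /mnt/fuse is the mount point assumed in this guide.
    print(check_mount_point("/mnt/fuse"))
```

If `exists` is true but `is_mount` is false, the folder was created but nothing is mounted there yet, so applications would see an empty local directory rather than the Alluxio data.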
Example: Image Recognition
Preparing training data
If the training data is already in a remote data storage, you can mount it as a folder under the Alluxio /training-data directory.
The data is then visible to local applications under /mnt/fuse/.
If the data is not in a remote data storage, you can copy it to the Alluxio namespace:
$ wget http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz
$ ./bin/alluxio fs mkdir /training-data/imagenet
$ ./bin/alluxio fs cp file://inception-2015-12-05.tgz /training-data/imagenet
Suppose the ImageNet data is stored in the S3 bucket s3://alluxio-tensorflow-imagenet/. After the two mount operations, the following three commands show exactly the same data:
aws s3 ls s3://alluxio-tensorflow-imagenet/
# 2019-02-07 03:51:15 0
# 2019-02-07 03:56:09 88931400 inception-2015-12-05.tgz
bin/alluxio fs ls /training-data/imagenet/
# -rwx---rwx ec2-user ec2-user 88931400 PERSISTED 02-07-2019 03:56:09:000 0% /training-data/imagenet/inception-2015-12-05.tgz
ls -l /mnt/fuse/imagenet/
# total 0
# -rwx---rwx 0 ec2-user ec2-user 88931400 Feb 7 03:56 inception-2015-12-05.tgz
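The same comparison can be made from Python, which is how a Tensorflow application will see the data. This is a small sketch using only the standard library; `list_files` is a hypothetical helper name, and it simply walks whatever directory it is given.

```python
import os


def list_files(root):
    """Return sorted (relative_path, size_in_bytes) pairs for files under root."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            # Report paths relative to root so listings are easy to compare.
            entries.append((os.path.relpath(full_path, root),
                            os.path.getsize(full_path)))
    return sorted(entries)
```

Calling `list_files("/mnt/fuse/imagenet")` should report the same file names and sizes as the three listings above, assuming the mounts are in place.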
Run image recognition test
Download the image recognition script and run it with the training data.
$ curl -o classify_image.py -L https://raw.githubusercontent.com/tensorflow/models/v1.11/tutorials/image/imagenet/classify_image.py
$ python classify_image.py --model_dir /mnt/fuse/imagenet/
This will use the input data in /mnt/fuse/imagenet/ to recognize images, and if everything works you will see something like this in your command prompt:
giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca (score = 0.89107)
indri, indris, Indri indri, Indri brevicaudatus (score = 0.00779)
lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens (score = 0.00296)
custard apple (score = 0.00147)
earthstar (score = 0.00117)
Tips
Write Tensorflow applications with data location parameter
Running Tensorflow on top of HDFS, S3, and other under storages can require a different configuration for each, making it difficult to manage and integrate Tensorflow applications with different under storages. Through the Alluxio POSIX API, users only need to mount the under storages to Alluxio once and then mount the parent Alluxio folder that contains the training data to the local filesystem. After the initial mounting, the data becomes immediately available through the Alluxio FUSE mount point and can be transparently accessed in Tensorflow applications. If a Tensorflow application takes the data location as a parameter, we only need to pass a path inside the FUSE mount point to the application, without modifying it. This greatly simplifies application development, which would otherwise require a different integration setup and credential configuration for each under storage.
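As an illustration of a data location parameter, the sketch below shows one way an application could accept its input directory on the command line. The flag name `--data_dir` and the helper names are hypothetical choices for this example (the script used earlier in this guide takes `--model_dir`); the default path is the FUSE location assumed in this guide.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the data location; defaults to the FUSE path used in this guide."""
    parser = argparse.ArgumentParser(description="Training job (sketch)")
    parser.add_argument(
        "--data_dir",  # hypothetical flag name for this example
        default="/mnt/fuse/imagenet",
        help="Directory holding the training data (local or FUSE-mounted)",
    )
    return parser.parse_args(argv)


def find_inputs(data_dir):
    """Return the sorted full paths of regular files directly under data_dir."""
    paths = (os.path.join(data_dir, name) for name in os.listdir(data_dir))
    return sorted(p for p in paths if os.path.isfile(p))
```

Because the application only ever sees a local path, switching from a local copy to data served through Alluxio is a matter of passing a different `--data_dir` value; the code that reads the files stays the same.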
Co-locating Tensorflow with Alluxio worker
When Tensorflow applications are co-located with an Alluxio Worker, Alluxio caches the remote data locally for future access, providing data locality. Without Alluxio, slow remote storage may cause an I/O bottleneck and leave GPU resources underutilized. When concurrently writing or reading big files, the Alluxio POSIX API can provide significantly better performance when running on an Alluxio Worker node. Setting up a Worker node with enough memory to hold all the training data can allow the Alluxio POSIX API to provide a nearly 2X performance improvement.
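One rough way to observe the effect of caching is to time two sequential reads of the same file through the mount point. The sketch below is an assumption-laden measurement aid, not a benchmark: `timed_read` is a hypothetical helper, and the second read may also be sped up by the operating system's own caches, so the numbers only indicate a trend.

```python
import time


def timed_read(path, chunk_size=4 * 1024 * 1024):
    """Read a file sequentially and return (bytes_read, elapsed_seconds)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # an empty read means end of file
                break
            total += len(chunk)
    return total, time.perf_counter() - start
```

For example, calling `timed_read("/mnt/fuse/imagenet/inception-2015-12-05.tgz")` twice and comparing the elapsed times gives a first impression of how much a cached read helps on a given setup.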
Configure Alluxio write type and read type
Many Tensorflow applications generate a lot of small intermediate files during their workflow. Those intermediate files are only useful for a short time and do not need to be persisted to under storages. If we link Tensorflow directly with remote storages, all files (data files, intermediate files, results, and so on) are written to and persisted in the remote storage. With Alluxio as a cache layer between the Tensorflow applications and remote storage, users can avoid unnecessary remote persistence work and speed up reads and writes.
With alluxio.user.file.writetype.default set to MUST_CACHE, we can write to the top tier (usually the memory tier) of Alluxio Worker storage.
With alluxio.user.file.readtype.default set to CACHE_PROMOTE, we can cache the read data in Alluxio for future access.
This will accelerate our Tensorflow workflow by writing to and reading from memory.
If the remote storages are cloud storages like S3, the advantages will be more obvious.
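The two properties can be added by hand, or scripted when many nodes need the same settings. The sketch below assumes the properties are set in conf/alluxio-site.properties, like the FUSE properties earlier in this guide; `set_properties` is a hypothetical helper that edits a Java-style properties file and is not part of Alluxio.

```python
def set_properties(path, updates):
    """Set key=value pairs in a properties file, replacing existing keys.

    Hypothetical helper: keeps comments and unrelated lines untouched and
    appends any key that was not already present.
    """
    try:
        with open(path, "r", encoding="utf-8") as f:
            lines = f.read().splitlines()
    except FileNotFoundError:
        lines = []  # start a new file if none exists yet

    remaining = dict(updates)
    result = []
    for line in lines:
        stripped = line.strip()
        key = stripped.split("=", 1)[0].strip() if "=" in stripped else None
        if key in remaining and not stripped.startswith("#"):
            # Replace the existing value in place.
            result.append(f"{key}={remaining.pop(key)}")
        else:
            result.append(line)
    # Append keys that were not found in the file.
    result.extend(f"{key}={value}" for key, value in remaining.items())

    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(result) + "\n")
```

A call such as `set_properties("conf/alluxio-site.properties", {"alluxio.user.file.writetype.default": "MUST_CACHE", "alluxio.user.file.readtype.default": "CACHE_PROMOTE"})` would apply the settings described in this section.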