Install
Basic Installation
The best way to install Cerebro is via pip:
pip install -U cerebro-dl
Alternatively, you can clone the repository and run the provided Makefile to install the master branch:
git clone https://github.com/ADALabUCSD/cerebro-system.git && cd cerebro-system && make
You must be running Python >= 3.6 with TensorFlow >= 2.3 and Apache Spark >= 2.4.
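If you are unsure whether your environment satisfies these requirements, a quick check like the following can help (a minimal sketch; it assumes pyspark and tensorflow are importable in the current interpreter):

import sys

import pyspark
import tensorflow as tf

# Print the versions Cerebro depends on; compare against the
# minimums above (Python >= 3.6, TensorFlow >= 2.3, Spark >= 2.4).
print("Python:", sys.version.split()[0])
print("TensorFlow:", tf.__version__)
print("Spark:", pyspark.__version__)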
Spark Cluster Setup
As deep learning workloads tend to have very different resource requirements from typical data processing workloads, there are certain considerations for DL Spark cluster setup.
GPU training
For GPU training, one approach is to set up a separate GPU Spark cluster and configure each executor with # of CPU cores = # of GPUs. This can be accomplished in standalone mode as follows:
$ echo "export SPARK_WORKER_CORES=<# of GPUs>" >> /path/to/spark/conf/spark-env.sh
$ /path/to/spark/sbin/start-all.sh
With this setup, the spark.task.cpus setting effectively controls the # of GPUs requested per process (it defaults to 1).
The ongoing SPARK-24615 effort aims to introduce GPU-aware resource scheduling in future versions of Spark.
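As a concrete illustration, under the one-core-per-GPU setup above, a session that requests one core per task is allotted one GPU per training process. The sketch below follows that pattern; the application name and cluster URL are placeholders:

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Each task claims 1 CPU core; with SPARK_WORKER_CORES=<# of GPUs>,
# this corresponds to one GPU per training process.
conf = SparkConf().setAppName('gpu-training') \
    .setMaster('spark://gpu-cluster:7077') \
    .set('spark.task.cpus', '1')
spark = SparkSession.builder.config(conf=conf).getOrCreate()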
CPU training
For CPU training, one approach is to specify the spark.task.cpus setting during the training session creation:
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Reserve 16 cores per task so each training process gets dedicated CPUs.
conf = SparkConf().setAppName('training') \
    .setMaster('spark://training-cluster:7077') \
    .set('spark.task.cpus', '16')
spark = SparkSession.builder.config(conf=conf).getOrCreate()
This approach allows you to reuse the same Spark cluster for data preparation and training.
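For example, the same session created above can first run data preparation before the training step (a minimal sketch; the Parquet path is a placeholder):

# Reuse the session from above for ETL; the same cluster then
# serves the training workload.
df = spark.read.parquet('/path/to/dataset.parquet')
train_df, val_df = df.randomSplit([0.8, 0.2], seed=42)
# train_df / val_df can then be handed to a Cerebro estimator
# for model selection and training on this same cluster.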