# Benchmarks
Below we document key performance benchmarks for common AIR tasks and workflows.
## Bulk Ingest
This task uses the DummyTrainer module to ingest 200 GiB of synthetic data.
We test performance across different cluster sizes.
For this benchmark, we configured each node with reasonable disk size and throughput to account for object spilling:
```yaml
aws:
  BlockDeviceMappings:
    - DeviceName: /dev/sda1
      Ebs:
        Iops: 5000
        Throughput: 1000
        VolumeSize: 1000
        VolumeType: gp3
```
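For reference, an ingest job along these lines can be launched with a few lines of code. This is a minimal sketch rather than the benchmark script; it assumes a Ray 2.x installation where `DummyTrainer` is available from `ray.air.util.check_ingest`, and the dataset below is a small stand-in for the 200 GiB of synthetic data used in the benchmark.

```python
import ray
from ray.air.config import ScalingConfig
from ray.air.util.check_ingest import DummyTrainer

# Small synthetic tensor dataset; scale the row count up to approach the
# 200 GiB used in the benchmark.
dataset = ray.data.range_tensor(1000, shape=(80, 80, 4), parallelism=100)

trainer = DummyTrainer(
    scaling_config=ScalingConfig(num_workers=1),  # one ingest actor; increase to scale out
    datasets={"train": dataset},
)
result = trainer.fit()
print(result.metrics)
```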
| Cluster Setup | Performance | Disk Spill | Command |
|---|---|---|---|
| 1 m5.4xlarge node (1 actor) | 390 s (0.51 GiB/s) | 205 GiB | |
| 5 m5.4xlarge nodes (5 actors) | 70 s (2.85 GiB/s) | 206 GiB | |
| 20 m5.4xlarge nodes (20 actors) | 3.8 s (52.6 GiB/s) | 0 GiB | |
## XGBoost Batch Prediction
This task uses the BatchPredictor module to process different amounts of data with an XGBoost model.
We test performance across different cluster sizes and data sizes.
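A batch prediction job of this shape looks roughly as follows. This is a hedged sketch assuming Ray 2.x AIR APIs; the checkpoint path and the Parquet dataset are placeholders rather than the benchmark's actual inputs, and the checkpoint is assumed to have been produced by a previous `XGBoostTrainer` run.

```python
import ray
from ray.air.checkpoint import Checkpoint
from ray.train.batch_predictor import BatchPredictor
from ray.train.xgboost import XGBoostPredictor

# Placeholder: a directory holding a checkpoint saved by an earlier XGBoostTrainer run.
checkpoint = Checkpoint.from_directory("/tmp/xgboost_checkpoint")

# Placeholder tabular data; the benchmark scores 10-100 GB of data instead.
dataset = ray.data.read_parquet("example://iris.parquet")

predictor = BatchPredictor.from_checkpoint(checkpoint, XGBoostPredictor)
# feature_columns / keep_columns can be passed to predict() to select model inputs.
predictions = predictor.predict(dataset, min_scoring_workers=1, batch_size=4096)
predictions.show(5)
```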
| Cluster Setup | Data Size | Performance | Command |
|---|---|---|---|
| 1 m5.4xlarge node (1 actor) | 10 GB (26M rows) | 275 s (94.5k rows/s) | |
| 10 m5.4xlarge nodes (10 actors) | 100 GB (260M rows) | 331 s (786k rows/s) | |
## XGBoost training
This task uses the XGBoostTrainer module to train on different data sizes with different amounts of parallelism.
XGBoost parameters were kept at their defaults for xgboost==1.6.1 for this task.
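A distributed training job of this kind can be expressed roughly as below. This sketch assumes Ray 2.x; the small public example CSV and the `params` shown are illustrative stand-ins for the benchmark's much larger datasets and default XGBoost parameters.

```python
import ray
from ray.air.config import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Small public example dataset from the Ray docs; the benchmark trains on 10-100 GB.
train_ds = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=1),  # increase num_workers to add parallelism
    label_column="target",
    params={"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
    datasets={"train": train_ds},
)
result = trainer.fit()
print(result.metrics)
```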
| Cluster Setup | Data Size | Performance | Command |
|---|---|---|---|
| 1 m5.4xlarge node (1 actor) | 10 GB (26M rows) | 692 s | |
| 10 m5.4xlarge nodes (10 actors) | 100 GB (260M rows) | 693 s | |
## GPU image batch prediction
This task uses the BatchPredictor module to process different amounts of data with a pre-trained PyTorch ResNet model.
We test performance across different cluster sizes and data sizes.
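A GPU batch prediction job along these lines can be sketched as follows, assuming Ray 2.x AIR and torchvision. The random tensors stand in for already-preprocessed images; the benchmark instead reads raw images from S3 and resizes/normalizes them before inference.

```python
import numpy as np
import ray
from ray.train.batch_predictor import BatchPredictor
from ray.train.torch import TorchCheckpoint, TorchPredictor
from torchvision import models

# Wrap a pre-trained ResNet in an AIR checkpoint.
checkpoint = TorchCheckpoint.from_model(
    models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
)
predictor = BatchPredictor.from_checkpoint(checkpoint, TorchPredictor)

# Stand-in data: a small batch of already-normalized 3x224x224 float tensors.
images = ray.data.from_numpy(np.random.rand(32, 3, 224, 224).astype(np.float32))

# Requires a GPU node; drop num_gpus_per_worker to score on CPU instead.
predictions = predictor.predict(images, num_gpus_per_worker=1, batch_size=16)
predictions.show(1)
```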
| Cluster Setup | Data Size | Performance | Command |
|---|---|---|---|
| 1 g4dn.8xlarge node | 1 GB (1623 images) | 46.12 s (35.19 images/sec) | |
| 1 g4dn.8xlarge node | 20 GB (32460 images) | 285.2 s (113.81 images/sec) | |
| 4 g4dn.12xlarge nodes | 100 GB (162300 images) | 304.01 s (533.86 images/sec) | |
## GPU image training
This task uses the TorchTrainer module to train on different amounts of data with a PyTorch ResNet model.
We test performance across different cluster sizes and data sizes.
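The scaffolding for such a training job is sketched below, assuming Ray 2.x Train APIs and torchvision. The toy loop trains a randomly initialized ResNet on random tensors purely to show the TorchTrainer structure; it is not the benchmark's training script or dataset.

```python
import torch
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer, get_device, prepare_model
from torchvision import models

def train_loop_per_worker(config):
    # Wrap the model for DDP and move it to this worker's device.
    model = prepare_model(models.resnet18(num_classes=10))
    loss_fn = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    device = get_device()
    for epoch in range(config["num_epochs"]):
        # Toy stand-in for real image batches.
        images = torch.randn(32, 3, 224, 224, device=device)
        labels = torch.randint(0, 10, (32,), device=device)
        loss = loss_fn(model(images), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        session.report({"epoch": epoch, "loss": loss.item()})

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 2},
    scaling_config=ScalingConfig(num_workers=1, use_gpu=True),  # scale num_workers across nodes
)
result = trainer.fit()
```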
> **Note**
> For multi-host distributed training on AWS, make sure the EC2 instances are in the same VPC and that all ports are open in the security group.
| Cluster Setup | Data Size | Performance | Command |
|---|---|---|---|
| 1 g3.8xlarge node (1 worker) | 1 GB (1623 images) | 79.76 s (2 epochs, 40.7 images/sec) | |
| 1 g3.8xlarge node (1 worker) | 20 GB (32460 images) | 1388.33 s (2 epochs, 46.76 images/sec) | |
| 4 g3.16xlarge nodes (16 workers) | 100 GB (162300 images) | 434.95 s (2 epochs, 746.29 images/sec) | |
## PyTorch Training Parity
This task checks the performance parity between native PyTorch Distributed and Ray Train's distributed TorchTrainer.
We demonstrate that the performance is similar (within 2.5%) between the two frameworks. Performance may vary greatly across different model, hardware, and cluster configurations.
The reported times are raw training times. Both methods incur an unreported, constant setup overhead of a few seconds that is negligible for longer training runs.
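On the Ray Train side, a comparison like this boils down to launching the same training function with a ScalingConfig sized to match the native setup. The snippet below is an assumed configuration for illustration (reusing the `train_loop_per_worker` from the GPU image training sketch above), not the benchmark script; the native baseline would run an equivalent loop under `torchrun` with the same number of processes.

```python
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

# 16 CPU workers spread across the cluster; the native-PyTorch baseline would
# launch 16 torch.distributed processes via torchrun for the same workload.
trainer = TorchTrainer(
    train_loop_per_worker,  # same training function as in the sketch above
    train_loop_config={"num_epochs": 2},
    scaling_config=ScalingConfig(num_workers=16, use_gpu=False),
)
result = trainer.fit()
```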
| Cluster Setup | Dataset | Performance | Command |
|---|---|---|---|
| 4 m5.2xlarge nodes (4 workers) | FashionMNIST | 196.64 s (vs 194.90 s PyTorch) | |
| 4 m5.2xlarge nodes (16 workers) | FashionMNIST | 430.88 s (vs 475.97 s PyTorch) | |
| 4 g4dn.12xlarge nodes (16 workers) | FashionMNIST | 149.80 s (vs 146.46 s PyTorch) | |
## TensorFlow Training Parity
This task checks the performance parity between native TensorFlow Distributed and Ray Train's distributed TensorflowTrainer.
We demonstrate that the performance is similar (within 1%) between the two frameworks. Performance may vary greatly across different model, hardware, and cluster configurations.
The reported times are raw training times. Both methods incur an unreported, constant setup overhead of a few seconds that is negligible for longer training runs.
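A minimal TensorflowTrainer launch for a workload of this shape might look like the following. This is a sketch assuming Ray 2.x and TensorFlow 2.x, with a small Keras model on FashionMNIST; it is not the benchmark script, and the model, batch size, and epoch count are illustrative.

```python
import tensorflow as tf
from ray.air.config import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

def train_loop_per_worker(config):
    # Ray Train sets TF_CONFIG per worker, so the strategy picks up the cluster.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        )
    (x_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()
    model.fit(x_train / 255.0, y_train, epochs=config["num_epochs"], batch_size=128)

trainer = TensorflowTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 1},
    scaling_config=ScalingConfig(num_workers=4),  # match the number of native TF workers
)
result = trainer.fit()
```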
> **Note**
> The batch size and number of epochs are different for the GPU benchmark, resulting in a longer runtime.
| Cluster Setup | Dataset | Performance | Command |
|---|---|---|---|
| 4 m5.2xlarge nodes (4 workers) | FashionMNIST | 78.81 s (vs 79.67 s TensorFlow) | |
| 4 m5.2xlarge nodes (16 workers) | FashionMNIST | 64.57 s (vs 67.45 s TensorFlow) | |
| 4 g4dn.12xlarge nodes (16 workers) | FashionMNIST | 465.16 s (vs 461.74 s TensorFlow) | |