(clusters-vm-ml-example)= # Ray Train XGBoostTrainer on VMs :::{note} To learn the basics of Ray on VMs, we recommend taking a look at the {ref}`introductory guide ` first. ::: In this guide, we show you how to run a sample Ray machine learning workload on AWS. The similar steps can be used to deploy on GCP or Azure as well. We will run Ray's {ref}`XGBoost training benchmark ` with a 100 gigabyte training set. To learn more about using Ray's XGBoostTrainer, check out {ref}`the XGBoostTrainer documentation `. ## VM cluster setup For the workload in this guide, it is recommended to use the following setup: - 10 nodes total - A capacity of 16 CPU and 64 Gi memory per node. For the major cloud providers, suitable instance types include * m5.4xlarge (Amazon Web Services) * Standard_D5_v2 (Azure) * e2-standard-16 (Google Cloud) - Each node should be configured with 1000 gigabytes of disk space (to store the training set). The corresponding cluster configuration file is as follows: ```{literalinclude} ../configs/xgboost-benchmark.yaml :language: yaml ``` ```{admonition} Optional: Set up an autoscaling cluster **If you would like to try running the workload with autoscaling enabled**, change ``min_workers`` of worker nodes to 0. After the workload is submitted, 9 workers nodes will scale up to accommodate the workload. These nodes will scale back down after the workload is complete. ``` ## Deploy a Ray cluster Now we're ready to deploy the Ray cluster with the configuration that's defined above. Before running the command, make sure your aws credentials are configured correctly. ```shell ray up -y cluster.yaml ``` A Ray head node and 9 Ray worker nodes will be created. ## Run the workload We will use {ref}`Ray Job Submission ` to kick off the workload. ### Connect to the cluster First, we connect to the Job server. Run the following blocking command in a separate shell. ```shell ray dashboard cluster.yaml ``` This will forward remote port 8265 to port 8265 on localhost. ### Submit the workload We'll use the {ref}`Ray Job Python SDK ` to submit the XGBoost workload. ```{literalinclude} /cluster/doc_code/xgboost_submit.py :language: python ``` To submit the workload, run the above Python script. The script is available [in the Ray repository][XGBSubmit]. ```shell # Download the above script. curl https://raw.githubusercontent.com/ray-project/ray/releases/2.0.0/doc/source/cluster/doc_code/xgboost_submit.py -o xgboost_submit.py # Run the script. python xgboost_submit.py ``` ### Observe progress The benchmark may take up to 30 minutes to run. Use the following tools to observe its progress. #### Job logs To follow the job's logs, use the command printed by the above submission script. ```shell # Subsitute the Ray Job's submission id. ray job logs 'raysubmit_xxxxxxxxxxxxxxxx' --address="http://localhost:8265" --follow ``` #### Ray Dashboard View `localhost:8265` in your browser to access the Ray Dashboard. #### Ray Status Observe autoscaling status and Ray resource usage with ```shell ray exec cluster.yaml 'ray status' ``` ### Job completion #### Benchmark results Once the benchmark is complete, the job log will display the results: ``` Results: {'training_time': 1338.488839321999, 'prediction_time': 403.36653568099973} ``` The performance of the benchmark is sensitive to the underlying cloud infrastructure -- you might not match {ref}`the numbers quoted in the benchmark docs `. #### Model parameters The file `model.json` in the Ray head node contains the parameters for the trained model. Other result data will be available in the directory `ray_results` in the head node. Refer to the {ref}`XGBoostTrainer documentation ` for details. ```{admonition} Scale-down If autoscaling is enabled, Ray worker nodes will scale down after the specified idle timeout. ``` #### Clean-up Delete your Ray cluster with the following command: ```shell ray down -y cluster.yaml ``` [XGBSubmit]: https://github.com/ray-project/ray/blob/releases/2.0.0/doc/source/cluster/doc_code/xgboost_submit.py