Distributed fine-tuning of Llama 3.1 8B on AWS Trainium with Ray and PyTorch Lightning#

This example demonstrates how to fine-tune the Llama 3.1 8B model on AWS Trainium instances using Ray Train, PyTorch Lightning, and AWS Neuron SDK.

AWS Trainium is the machine learning (ML) chip that AWS built for deep learning (DL) training of 100B+ parameter models. AWS Neuron SDK helps developers train models on Trainium accelerators.

Prepare the environment#

See Setup EKS cluster and tools for instructions on setting up an Amazon EKS cluster that uses AWS Trainium instances.

Create a Docker image#

When the EKS cluster is ready, create an Amazon ECR repository, then build and upload the Docker image containing the artifacts for fine-tuning the Llama 3.1 8B model:

  1. Clone the repo.

git clone https://github.com/aws-neuron/aws-neuron-eks-samples.git
  2. Go to the llama3.1_8B_finetune_ray_ptl_neuron directory.

cd aws-neuron-eks-samples/llama3.1_8B_finetune_ray_ptl_neuron
  3. Make the script executable and run it.

chmod +x 0-kuberay-trn1-llama3-finetune-build-image.sh
./0-kuberay-trn1-llama3-finetune-build-image.sh
  4. When prompted, enter the region your cluster is running in, for example: us-east-2.

  5. Verify in the AWS console that the Amazon ECR service has the newly created kuberay_trn1_llama3.1_pytorch2 repository.

  6. Update the ECR image ARN in the manifest file used for creating the Ray cluster.

Replace the <AWS_ACCOUNT_ID> and <REGION> placeholders in the 1-llama3-finetune-trn1-create-raycluster.yaml file with actual values using the commands below, so that the manifest references the ECR image created above:

export AWS_ACCOUNT_ID=<enter_your_aws_account_id> # for ex: 111222333444
export REGION=<enter_your_aws_region> # for ex: us-east-2
sed -i "s/<AWS_ACCOUNT_ID>/$AWS_ACCOUNT_ID/g" 1-llama3-finetune-trn1-create-raycluster.yaml
sed -i "s/<REGION>/$REGION/g" 1-llama3-finetune-trn1-create-raycluster.yaml
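As a quick sanity check, you can confirm that the substitution leaves no placeholders behind. The snippet below reproduces the same sed commands on a one-line sample file; the image reference is a hypothetical example, not the manifest's exact contents:

```shell
# Illustrative sanity check: run the same substitution on a sample line.
# The image reference below is a hypothetical example, not the real manifest.
cat > /tmp/sample-raycluster.yaml <<'EOF'
image: <AWS_ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/kuberay_trn1_llama3.1_pytorch2:latest
EOF
export AWS_ACCOUNT_ID=111222333444   # example account ID
export REGION=us-east-2              # example region
sed -i "s/<AWS_ACCOUNT_ID>/$AWS_ACCOUNT_ID/g" /tmp/sample-raycluster.yaml
sed -i "s/<REGION>/$REGION/g" /tmp/sample-raycluster.yaml
cat /tmp/sample-raycluster.yaml
```

Running the same `grep '<AWS_ACCOUNT_ID>\|<REGION>'` check against the real manifest after the substitution should return no matches.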

Configure the Ray cluster#

The llama3.1_8B_finetune_ray_ptl_neuron directory in the AWS Neuron samples repository simplifies the Ray configuration by providing a KubeRay manifest that you can apply to the cluster to set up the head and worker pods.

Run the following command to set up the Ray cluster:

kubectl apply -f 1-llama3-finetune-trn1-create-raycluster.yaml
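Before moving on, it can help to confirm that the head and worker pods come up. A sketch using standard kubectl commands; the `ray.io/node-type` label is the standard KubeRay node-type label, and the timeout value is an arbitrary choice:

```shell
# List the pods created by the RayCluster manifest.
kubectl get pods
# Block until the head pod is Ready; the label selector is the standard
# KubeRay node-type label, and the timeout is an arbitrary choice.
kubectl wait --for=condition=Ready pod -l ray.io/node-type=head --timeout=600s
```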

Access the Ray dashboard#

Port forward the Ray dashboard service from the cluster so that you can view it at http://localhost:8265. Run the port forward in the background with the following command:

kubectl port-forward service/kuberay-trn1-head-svc 8265:8265 &
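With the port forward running, you can optionally confirm that the dashboard is reachable before opening it in a browser; `/api/version` is a Ray dashboard endpoint that returns version metadata:

```shell
# Quick reachability check against the forwarded dashboard port.
curl -s http://localhost:8265/api/version
```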

Launch Ray jobs#

The Ray cluster is now ready to handle workloads. Initiate the data preparation and fine-tuning Ray jobs:

  1. Launch the Ray job for downloading the dolly-15k dataset and the Llama3.1 8B model artifacts:

kubectl apply -f 2-llama3-finetune-trn1-rayjob-create-data.yaml
  2. When the job completes successfully, run the following fine-tuning job:

kubectl apply -f 3-llama3-finetune-trn1-rayjob-submit-finetuning-job.yaml
  3. Monitor the jobs via the Ray dashboard.
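Besides the dashboard, you can check job status from the command line through the RayJob custom resources that the manifests create. The job name below is a placeholder; take the actual names from the `kubectl get rayjobs` output:

```shell
# List the RayJob custom resources created by the two manifests above.
kubectl get rayjobs
# Inspect one job's status and events; substitute a name from the output above.
kubectl describe rayjob "<rayjob_name>"
```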

For detailed information on each of the steps above, see the AWS documentation.