{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "(mmt-core)=\n", "\n", "# Batch Training with Ray Core" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{tip}\n", "The workload showcased in this notebook can be expressed using different Ray components, such as Ray Data, Ray Tune and Ray Core.\n", "For best practices, see {ref}`ref-use-cases-mmt`.\n", "```\n", "\n", "Batch training and tuning are common tasks in simple machine learning use-cases such as time series forecasting. They require fitting of simple models on multiple data batches corresponding to locations, products, etc. This notebook showcases how to conduct batch training on the [NYC Taxi Dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) using only Ray Core and stateless Ray tasks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Batch training in the context of this notebook is understood as creating the same model(s) for different and separate datasets or subsets of a dataset. This task is naively parallelizable and can be easily scaled with Ray.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Contents\n", "In this tutorial, we will walk through the following steps:\n", " 1. Reading parquet data,\n", " 2. Using Ray tasks to preprocess, train and evaluate data batches,\n", " 3. Dividing data into batches and spawning a Ray task for each batch to be run in parallel,\n", " 4. Starting batch training,\n", " 5. [Optional] Optimizing for runtime over memory with centralized data loading.\n", "\n", "# Walkthrough\n", "\n", "We want to analyze the relationship between the dropoff location and the trip duration. The relationship will be very different for each pickup location, therefore we need to have a separate model for each of those. Furthermore, the relationship can change with time. Therefore, our task is to create separate models for each pickup location-month combination. The dataset we are using is already partitioned into months (each file being equal to one), and we can use the `pickup_location_id` column in the dataset to group it into data batches. We will then fit models for each batch and choose the best one.\n", "\n", "Let’s start by importing Ray and initializing a local Ray cluster." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from typing import Callable, Optional, List, Union, Tuple, Iterable\n", "import time\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from sklearn.base import BaseEstimator\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import mean_absolute_error\n", "\n", "import pyarrow as pa\n", "from pyarrow import fs\n", "from pyarrow import dataset as ds\n", "from pyarrow import parquet as pq\n", "import pyarrow.compute as pc" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Python version: | \n", "3.8.13 | \n", "
Ray version: | \n", "2.5.0 | \n", "
Dashboard: | \n", "http://console.anyscale-staging.com/api/v2/sessions/ses_ZmHebxHaZpYkw9x9efJ5wBVX/services?redirect_to=dashboard | \n", "