ray.data.Dataset.to_torch

Dataset.to_torch(*, label_column: str | None = None, feature_columns: List[str] | List[List[str]] | Dict[str, List[str]] | None = None, label_column_dtype: torch.dtype | None = None, feature_column_dtypes: torch.dtype | List[torch.dtype] | Dict[str, torch.dtype] | None = None, batch_size: int = 1, prefetch_batches: int = 1, drop_last: bool = False, local_shuffle_buffer_size: int | None = None, local_shuffle_seed: int | None = None, unsqueeze_label_tensor: bool = True, unsqueeze_feature_tensors: bool = True) -> torch.utils.data.IterableDataset

Return a Torch IterableDataset over this Dataset.

This is only supported for datasets convertible to Arrow records.

It is recommended to use the returned IterableDataset directly instead of passing it into a torch DataLoader.

Each element yielded by the IterableDataset is a tuple of two items: the first contains the feature tensor(s), and the second is the label tensor. Both can take different forms, depending on the specified arguments.

For the features (N is the batch_size and n, m, k are the numbers of features per tensor):

  • If feature_columns is a List[str], the features are a single tensor of shape (N, n), with columns corresponding to feature_columns.

  • If feature_columns is a List[List[str]], the features are a list of tensors of shapes [(N, m), …, (N, k)], with the columns of each tensor corresponding to the matching element of feature_columns.

  • If feature_columns is a Dict[str, List[str]], the features are a dict of key-tensor pairs of shapes {key1: (N, m), …, keyN: (N, k)}, with the columns of each tensor corresponding to the value of feature_columns under that key.

If unsqueeze_label_tensor=True (default), the label tensor is of shape (N, 1). Otherwise, it is of shape (N,). If label_column is specified as None, then no column from the Dataset is treated as the label, and the output label tensor is None.
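The shape rules above can be summarized in a small sketch. It does not call Ray or Torch; the helpers expected_feature_shapes and expected_label_shape are hypothetical names introduced only for illustration. It assumes a full batch of N rows, one scalar value per column per row, and the default unsqueeze_feature_tensors=True.

```python
from typing import Dict, List, Optional, Tuple, Union

Shape = Tuple[int, ...]
FeatureSpec = Union[List[str], List[List[str]], Dict[str, List[str]]]


def expected_feature_shapes(feature_columns: FeatureSpec, batch_size: int):
    """Hypothetical helper: mirror the documented feature shape rules."""
    if isinstance(feature_columns, dict):
        # Dict[str, List[str]] -> dict of (N, k) shapes, one per key.
        return {
            key: (batch_size, len(cols)) for key, cols in feature_columns.items()
        }
    if feature_columns and isinstance(feature_columns[0], list):
        # List[List[str]] -> list of (N, k) shapes, one per inner list.
        return [(batch_size, len(cols)) for cols in feature_columns]
    # List[str] -> a single (N, n) shape.
    return (batch_size, len(feature_columns))


def expected_label_shape(
    label_column: Optional[str],
    batch_size: int,
    unsqueeze_label_tensor: bool = True,
) -> Optional[Shape]:
    """Hypothetical helper: mirror the documented label shape rules."""
    if label_column is None:
        # No label column: the second tuple item is None.
        return None
    return (batch_size, 1) if unsqueeze_label_tensor else (batch_size,)
```

For example, with batch_size=4, feature_columns=[["a", "b"], ["c"]] corresponds to a list of two tensors of shapes (4, 2) and (4, 1). A final incomplete batch has fewer than N rows unless drop_last=True.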

If multiple Torch workers will consume the data, you probably want to call Dataset.split() on this dataset first.

Note

This operation will trigger execution of the lazy transformations performed on this dataset.

Time complexity: O(1)

Parameters:
  • label_column – The name of the column used as the label (second element of the output tuple). Can be None for prediction, in which case the second element of the returned tuple is also None.

  • feature_columns – The names of the columns to use as the features. Can be a list of lists or a dict of string-list pairs for multi-tensor output. If None, then use all columns except the label column as the features.

  • label_column_dtype – The torch dtype to use for the label column. If None, then automatically infer the dtype.

  • feature_column_dtypes – The dtypes to use for the feature tensors. This should match the format of feature_columns, or be a single dtype, in which case it is applied to all tensors. If None, then automatically infer the dtype.

  • batch_size – How many samples per batch to yield at a time. Defaults to 1.

  • prefetch_batches – The number of batches to fetch ahead of the current batch. If set to greater than 0, a separate threadpool is used to fetch the objects to the local node, format the batches, and apply the collate_fn. Defaults to 1.

  • drop_last – Set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of the stream is not divisible by the batch size, then the last batch is smaller. Defaults to False.

  • local_shuffle_buffer_size – If non-None, the data is randomly shuffled using a local in-memory shuffle buffer, and this value is the minimum number of rows that must be in the buffer in order to yield a batch. When there are no more rows to add to the buffer, the remaining rows in the buffer are drained. This buffer size must be greater than or equal to batch_size, and therefore batch_size must also be specified when using local shuffling.

  • local_shuffle_seed – The seed to use for the local random shuffle.

  • unsqueeze_label_tensor – If set to True, the label tensor is unsqueezed (reshaped to (N, 1)). Otherwise, it is left as is, with shape (N,). In general, regression loss functions expect an unsqueezed tensor, while classification loss functions expect a squeezed one. Defaults to True.

  • unsqueeze_feature_tensors – If set to True, the feature tensors are unsqueezed (reshaped to (N, 1)) before being concatenated into the final features tensor. Otherwise, they are left as is, with shape (N,). Defaults to True.
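How batch_size, drop_last, local_shuffle_buffer_size, and local_shuffle_seed interact can be illustrated with a pure-Python sketch. This is not Ray's implementation: the function iter_batches_sketch is a hypothetical name, rows are plain integers rather than tensors, and prefetching is ignored. It only follows the behavior described in the parameter list above.

```python
import random
from typing import Iterable, Iterator, List, Optional


def iter_batches_sketch(
    rows: Iterable[int],
    batch_size: int = 1,
    drop_last: bool = False,
    local_shuffle_buffer_size: Optional[int] = None,
    local_shuffle_seed: Optional[int] = None,
) -> Iterator[List[int]]:
    """Illustrative only: a simplified model of the documented batching rules."""
    if local_shuffle_buffer_size is None:
        # No local shuffle: plain sequential batching.
        batch: List[int] = []
        for row in rows:
            batch.append(row)
            if len(batch) == batch_size:
                yield batch
                batch = []
        # The last batch is smaller unless drop_last discards it.
        if batch and not drop_last:
            yield batch
        return

    if local_shuffle_buffer_size < batch_size:
        raise ValueError("local_shuffle_buffer_size must be >= batch_size")

    rng = random.Random(local_shuffle_seed)
    buffer: List[int] = []

    def take(n: int) -> List[int]:
        # Shuffle the buffer in place, then remove and return n rows.
        rng.shuffle(buffer)
        out = buffer[-n:]
        del buffer[-n:]
        return out

    for row in rows:
        buffer.append(row)
        # Only yield once the buffer holds the minimum number of rows.
        if len(buffer) >= local_shuffle_buffer_size:
            yield take(batch_size)

    # No more rows to add: drain whatever remains in the buffer.
    while len(buffer) >= batch_size:
        yield take(batch_size)
    if buffer and not drop_last:
        yield take(len(buffer))
```

With ten rows and batch_size=4, the sketch yields batches of sizes 4, 4, and 2, or only the two full batches when drop_last=True. Passing the same local_shuffle_seed reproduces the same shuffled order in this sketch.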

Returns:

A Torch IterableDataset.

Warning

DEPRECATED: This API is deprecated and may be removed in future Ray releases.