ray.data.Dataset.to_torch
- Dataset.to_torch(*, label_column: str | None = None, feature_columns: List[str] | List[List[str]] | Dict[str, List[str]] | None = None, label_column_dtype: torch.dtype | None = None, feature_column_dtypes: torch.dtype | List[torch.dtype] | Dict[str, torch.dtype] | None = None, batch_size: int = 1, prefetch_batches: int = 1, drop_last: bool = False, local_shuffle_buffer_size: int | None = None, local_shuffle_seed: int | None = None, unsqueeze_label_tensor: bool = True, unsqueeze_feature_tensors: bool = True) → torch.utils.data.IterableDataset
Return a Torch IterableDataset over this Dataset.
This is only supported for datasets convertible to Arrow records.
It is recommended to use the returned IterableDataset directly instead of passing it into a torch DataLoader.
Each element in the IterableDataset is a tuple consisting of 2 elements. The first item contains the feature tensor(s), and the second item is the label tensor. Those can take on different forms, depending on the specified arguments.
For the features tensor (N is the batch_size and n, m, k are the number of features per tensor):
- If feature_columns is a List[str], the features are a single tensor of shape (N, n), with columns corresponding to feature_columns.
- If feature_columns is a List[List[str]], the features are a list of tensors of shapes [(N, m), …, (N, k)], with the columns of each tensor corresponding to the elements of feature_columns.
- If feature_columns is a Dict[str, List[str]], the features are a dict of key-tensor pairs of shapes {key1: (N, m), …, keyN: (N, k)}, with the columns of each tensor corresponding to the value of feature_columns under the key.
If unsqueeze_label_tensor=True (default), the label tensor is of shape (N, 1). Otherwise, it is of shape (N,). If label_column is specified as None, then no column from the Dataset is treated as the label, and the output label tensor is None.
Note that you probably want to call Dataset.split() on this dataset if there are to be multiple Torch workers consuming the data.
Note
This operation will trigger execution of the lazy transformations performed on this dataset.
Time complexity: O(1)
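The shape rules described above can be summarized in a small pure-Python sketch. It does not call Ray or Torch; the helper names `expected_feature_shapes` and `expected_label_shape` are hypothetical and exist only for illustration. The sketch assumes full batches, scalar-valued columns, an explicit (non-None) feature_columns, and the default unsqueeze_feature_tensors=True.

```python
from typing import Dict, List, Optional, Tuple, Union

Shape = Tuple[int, ...]
FeatureColumns = Union[List[str], List[List[str]], Dict[str, List[str]]]


def expected_feature_shapes(feature_columns: FeatureColumns, batch_size: int):
    """Hypothetical helper: map a feature_columns spec to per-batch shapes.

    Mirrors the documented rules only; it is not part of Ray's API.
    """
    if isinstance(feature_columns, dict):
        # Dict[str, List[str]] -> dict of shapes, one tensor per key.
        return {key: (batch_size, len(cols)) for key, cols in feature_columns.items()}
    if feature_columns and isinstance(feature_columns[0], list):
        # List[List[str]] -> list of shapes, one tensor per inner list.
        return [(batch_size, len(cols)) for cols in feature_columns]
    # List[str] -> a single (N, n) tensor.
    return (batch_size, len(feature_columns))


def expected_label_shape(
    label_column: Optional[str],
    batch_size: int,
    unsqueeze_label_tensor: bool = True,
) -> Optional[Shape]:
    """Hypothetical helper: documented label shape, or None without a label."""
    if label_column is None:
        return None
    return (batch_size, 1) if unsqueeze_label_tensor else (batch_size,)


if __name__ == "__main__":
    print(expected_feature_shapes(["a", "b", "c"], batch_size=4))
    print(expected_feature_shapes([["a", "b"], ["c"]], batch_size=4))
    print(expected_feature_shapes({"x": ["a"], "y": ["b", "c"]}, batch_size=4))
    print(expected_label_shape("label", batch_size=4))
```

When drop_last is False, the final batch may hold fewer than N rows, so the leading dimension of the last element can be smaller than the values computed here.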
- Parameters:
label_column – The name of the column used as the label (second element of the output tuple). Can be None for prediction, in which case the second element of the returned tuple is also None.
feature_columns – The names of the columns to use as the features. Can be a list of lists or a dict of string-list pairs for multi-tensor output. If None, then use all columns except the label column as the features.
label_column_dtype – The torch dtype to use for the label column. If None, then automatically infer the dtype.
feature_column_dtypes – The dtypes to use for the feature tensors. This should match the format of feature_columns, or be a single dtype, in which case it is applied to all tensors. If None, then automatically infer the dtype.
batch_size – How many samples per batch to yield at a time. Defaults to 1.
prefetch_batches – The number of batches to fetch ahead of the current batch. If set to greater than 0, a separate threadpool is used to fetch the objects to the local node, format the batches, and apply the collate_fn. Defaults to 1.
drop_last – Set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of the stream is not divisible by the batch size, then the last batch is smaller. Defaults to False.
local_shuffle_buffer_size – If non-None, the data is randomly shuffled using a local in-memory shuffle buffer, and this value serves as the minimum number of rows that must be in the local in-memory shuffle buffer in order to yield a batch. When there are no more rows to add to the buffer, the remaining rows in the buffer are drained. This buffer size must be greater than or equal to batch_size, and therefore batch_size must also be specified when using local shuffling.
local_shuffle_seed – The seed to use for the local random shuffle.
unsqueeze_label_tensor – If set to True, the label tensor is unsqueezed (reshaped to (N, 1)). Otherwise, it will be left as is, that is (N, ). In general, regression loss functions expect an unsqueezed tensor, while classification loss functions expect a squeezed one. Defaults to True.
unsqueeze_feature_tensors – If set to True, the features tensors are unsqueezed (reshaped to (N, 1)) before being concatenated into the final features tensor. Otherwise, they are left as is, that is (N, ). Defaults to True.
- Returns:
A Torch IterableDataset.
Warning
DEPRECATED: This API is deprecated and may be removed in future Ray releases.
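The interaction between batch_size, drop_last, local_shuffle_buffer_size, and local_shuffle_seed described in the parameter list can be illustrated with a pure-Python sketch. This is not Ray's implementation: the function `iter_batches_sketch` is hypothetical, operates on a plain iterable of rows, and reshuffles the whole buffer before each batch purely for simplicity. It only shows the documented behavior: a batch is yielded once the buffer holds at least the minimum number of rows, the buffer is drained when the input ends, and an incomplete final batch is dropped only when drop_last is True.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def iter_batches_sketch(
    rows: Iterable[T],
    batch_size: int = 1,
    drop_last: bool = False,
    local_shuffle_buffer_size: Optional[int] = None,
    local_shuffle_seed: Optional[int] = None,
) -> Iterator[List[T]]:
    """Hypothetical illustration of the documented batching semantics."""
    shuffle = local_shuffle_buffer_size is not None
    if shuffle and local_shuffle_buffer_size < batch_size:
        raise ValueError("local_shuffle_buffer_size must be >= batch_size")

    rng = random.Random(local_shuffle_seed)
    buffer: List[T] = []
    # Without shuffling, a batch is ready as soon as batch_size rows arrive.
    min_rows = local_shuffle_buffer_size if shuffle else batch_size

    def take_batch() -> List[T]:
        if shuffle:
            rng.shuffle(buffer)  # randomize which rows leave the buffer
        batch = buffer[:batch_size]
        del buffer[:batch_size]
        return batch

    for row in rows:
        buffer.append(row)
        if len(buffer) >= min_rows:
            yield take_batch()

    # No more input: drain the rows still held in the buffer.
    while len(buffer) >= batch_size:
        yield take_batch()
    if buffer and not drop_last:
        yield take_batch()  # final, smaller batch


if __name__ == "__main__":
    for batch in iter_batches_sketch(range(10), batch_size=4):
        print(batch)
```

With ten rows and batch_size=4, the unshuffled sketch yields two full batches followed by one batch of two rows; passing drop_last=True omits that last batch.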