ray.data.DatasetIterator.to_tf

DatasetIterator.to_tf(feature_columns: Union[str, List[str]], label_columns: Union[str, List[str]], *, prefetch_blocks: int = 0, batch_size: int = 1, drop_last: bool = False, local_shuffle_buffer_size: Optional[int] = None, local_shuffle_seed: Optional[int] = None) → tf.data.Dataset

Return a TensorFlow Dataset (tf.data.Dataset) over this dataset.

Warning

This method raises an error if your dataset contains ragged tensors. To avoid the error, resize the tensors to a uniform shape or disable tensor extension casting.
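
As a rough illustration of the "resize tensors" option, the sketch below pads or truncates variable-length rows to one fixed length in plain Python. The helper name pad_or_truncate and the sample rows are hypothetical and are not part of Ray; in a real pipeline, similar logic would presumably run in a preprocessing step (for example inside Dataset.map_batches) before calling to_tf.

```python
from typing import List, Sequence


def pad_or_truncate(row: Sequence[float], length: int, fill: float = 0.0) -> List[float]:
    """Return a copy of `row` resized to exactly `length` items.

    Hypothetical helper for illustration only; not a Ray API.
    """
    resized = list(row[:length])                      # truncate rows that are too long
    resized.extend([fill] * (length - len(resized)))  # pad rows that are too short
    return resized


# Made-up ragged rows: each row has a different number of values.
ragged = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
uniform = [pad_or_truncate(row, length=2) for row in ragged]
print(uniform)  # [[1.0, 2.0], [4.0, 0.0], [5.0, 6.0]]
```

After this step every row has the same length, so the column can be represented as a regular (non-ragged) tensor.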

Examples

>>> import ray
>>> ds = ray.data.read_csv(
...     "s3://anonymous@air-example-data/iris.csv"
... )
>>> it = ds.iterator(); it
DatasetIterator(Dataset(num_blocks=1, num_rows=150, schema={sepal length (cm): double, sepal width (cm): double, petal length (cm): double, petal width (cm): double, target: int64}))

If your model accepts a single tensor as input, specify a single feature column.

>>> it.to_tf(feature_columns="sepal length (cm)", label_columns="target")  
<_OptionsDataset element_spec=(TensorSpec(shape=(None,), dtype=tf.float64, name='sepal length (cm)'), TensorSpec(shape=(None,), dtype=tf.int64, name='target'))>

If your model accepts a dictionary as input, specify a list of feature columns.

>>> it.to_tf(["sepal length (cm)", "sepal width (cm)"], "target")  
<_OptionsDataset element_spec=({'sepal length (cm)': TensorSpec(shape=(None,), dtype=tf.float64, name='sepal length (cm)'), 'sepal width (cm)': TensorSpec(shape=(None,), dtype=tf.float64, name='sepal width (cm)')}, TensorSpec(shape=(None,), dtype=tf.int64, name='target'))>

If your dataset contains multiple features but your model accepts a single tensor as input, combine features with Concatenator.

>>> from ray.data.preprocessors import Concatenator
>>> preprocessor = Concatenator(output_column_name="features", exclude="target")
>>> it = preprocessor.transform(ds).iterator()
>>> it
DatasetIterator(Dataset(num_blocks=1, num_rows=150, schema={target: int64, features: TensorDtype(shape=(4,), dtype=float64)}))
>>> it.to_tf("features", "target")  
<_OptionsDataset element_spec=(TensorSpec(shape=(None, 4), dtype=tf.float64, name='features'), TensorSpec(shape=(None,), dtype=tf.int64, name='target'))>

Parameters

  • feature_columns – Columns that correspond to model inputs. If this is a string, the input data is a tensor. If this is a list, the input data is a dict that maps column names to their tensor representation.

  • label_columns – Columns that correspond to model targets. If this is a string, the target data is a tensor. If this is a list, the target data is a dict that maps column names to their tensor representation.

  • prefetch_blocks – The number of blocks to prefetch ahead of the current block during the scan. Defaults to 0.

  • batch_size – The number of rows in each batch. Defaults to 1.

  • drop_last – Set to True to drop the last incomplete batch if the dataset size is not divisible by the batch size. If False and the dataset size is not divisible by the batch size, the last batch is smaller. Defaults to False.

  • local_shuffle_buffer_size – If not None, the data is randomly shuffled using a local in-memory shuffle buffer, and this value is the minimum number of rows that must be in the buffer before a batch is yielded. When there are no more rows to add to the buffer, the remaining rows in the buffer are drained. This buffer size must be greater than or equal to batch_size, so batch_size must also be specified when using local shuffling.

  • local_shuffle_seed – The seed to use for the local random shuffle.
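
The interaction of batch_size, drop_last, local_shuffle_buffer_size, and local_shuffle_seed described above can be modeled in plain Python. The sketch below is a simplified model written for illustration, not Ray's implementation: the function iter_batches here is a hypothetical stand-in, and reshuffling the whole buffer before each batch is an assumption chosen for brevity.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def iter_batches(
    rows: Iterable[T],
    batch_size: int = 1,
    drop_last: bool = False,
    local_shuffle_buffer_size: Optional[int] = None,
    local_shuffle_seed: Optional[int] = None,
) -> Iterator[List[T]]:
    """Simplified, hypothetical model of the batching parameters (not Ray code)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    shuffle = local_shuffle_buffer_size is not None
    if shuffle and local_shuffle_buffer_size < batch_size:
        raise ValueError("local_shuffle_buffer_size must be >= batch_size")

    rng = random.Random(local_shuffle_seed)
    buffer: List[T] = []
    # Minimum number of buffered rows required before a batch is yielded.
    min_rows = local_shuffle_buffer_size if shuffle else batch_size

    def take_batch(n: int) -> List[T]:
        if shuffle:
            # Assumption: reshuffle the whole buffer, then take the first n rows.
            rng.shuffle(buffer)
        batch = buffer[:n]
        del buffer[:n]
        return batch

    for row in rows:
        buffer.append(row)
        if len(buffer) >= min_rows:
            yield take_batch(batch_size)

    # No more rows to add: drain whatever is left in the buffer.
    while len(buffer) >= batch_size:
        yield take_batch(batch_size)
    if buffer and not drop_last:
        yield take_batch(len(buffer))  # final, smaller batch


print(list(iter_batches(range(10), batch_size=4)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(list(iter_batches(range(10), batch_size=4, drop_last=True)))
# [[0, 1, 2, 3], [4, 5, 6, 7]]
```

With local_shuffle_buffer_size set, the same rows are yielded in a randomized order, and passing the same local_shuffle_seed reproduces that order in this model.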

Returns

A tf.data.Dataset that yields inputs and targets.
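
To make the shape of each yielded element concrete, the sketch below mimics in plain Python how a batch of columns maps to an (inputs, targets) pair: a string selects a single column, and a list selects a dict of columns. The helpers select and to_inputs_and_targets are hypothetical, and ordinary lists stand in for tensors; the sample values are illustrative only.

```python
from typing import Any, Dict, List, Tuple, Union

Columns = Union[str, List[str]]
Batch = Dict[str, List[Any]]


def select(batch: Batch, columns: Columns) -> Union[List[Any], Batch]:
    """Return one column for a string, or a dict of columns for a list."""
    if isinstance(columns, str):
        return batch[columns]
    return {name: batch[name] for name in columns}


def to_inputs_and_targets(
    batch: Batch, feature_columns: Columns, label_columns: Columns
) -> Tuple[Any, Any]:
    """Hypothetical model of one yielded element: (inputs, targets)."""
    return select(batch, feature_columns), select(batch, label_columns)


# Illustrative batch of two rows; lists stand in for tensors.
batch = {
    "sepal length (cm)": [5.1, 4.9],
    "sepal width (cm)": [3.5, 3.0],
    "target": [0, 0],
}

single = to_inputs_and_targets(batch, "sepal length (cm)", "target")
print(single)  # ([5.1, 4.9], [0, 0])

multi = to_inputs_and_targets(batch, ["sepal length (cm)", "sepal width (cm)"], "target")
print(multi[0])  # {'sepal length (cm)': [5.1, 4.9], 'sepal width (cm)': [3.5, 3.0]}
```

This mirrors the element_spec values in the examples: a single TensorSpec for a string argument, and a dict of TensorSpec entries for a list argument.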