ray.data.read_tfrecords#

ray.data.read_tfrecords(paths: str | List[str], *, filesystem: pyarrow.fs.FileSystem | None = None, parallelism: int = -1, num_cpus: float | None = None, num_gpus: float | None = None, memory: float | None = None, ray_remote_args: Dict[str, Any] = None, arrow_open_stream_args: Dict[str, Any] | None = None, meta_provider: BaseFileMetadataProvider | None = None, partition_filter: PathPartitionFilter | None = None, include_paths: bool = False, ignore_missing_paths: bool = False, tf_schema: schema_pb2.Schema | None = None, shuffle: Literal['files'] | FileShuffleConfig | None = None, file_extensions: List[str] | None = None, concurrency: int | None = None, override_num_blocks: int | None = None, tfx_read_options: TFXReadOptions | None = None) → Dataset[source]#

Create a Dataset from TFRecord files that contain tf.train.Example messages.

Tip

Using the tfx-bsl library is more performant when reading large datasets (for example, in production use cases). To use this implementation, you must first install tfx-bsl:

pip install tfx_bsl --no-dependencies
Pass tfx_read_options to read_tfrecords, for example: ds = read_tfrecords(path, ..., tfx_read_options=TFXReadOptions())

Warning

This function exclusively supports tf.train.Example messages. If a file contains a message that isn’t of type tf.train.Example, then this function fails.

Examples

>>> import ray
>>> ray.data.read_tfrecords("s3://anonymous@ray-example-data/iris.tfrecords")
Dataset(num_rows=?, schema=...)

We can also read compressed TFRecord files, which use one of the compression types supported by Arrow:

>>> ray.data.read_tfrecords(
...     "s3://anonymous@ray-example-data/iris.tfrecords.gz",
...     arrow_open_stream_args={"compression": "gzip"},
... )
Dataset(num_rows=?, schema=...)

Parameters:

paths – A single file or directory, or a list of file or directory paths. A list of paths can contain both files and directories.
filesystem – The PyArrow filesystem implementation to read from. These filesystems are specified in the PyArrow docs. Specify this parameter if you need to provide specific configurations to the filesystem. By default, the filesystem is automatically selected based on the scheme of the paths. For example, if the path begins with s3://, the S3FileSystem is used.
parallelism – This argument is deprecated. Use override_num_blocks argument.
num_cpus – The number of CPUs to reserve for each parallel read worker.
num_gpus – The number of GPUs to reserve for each parallel read worker. For example, specify num_gpus=1 to request 1 GPU for each parallel read worker.
memory – The heap memory in bytes to reserve for each parallel read worker.
ray_remote_args – kwargs passed to ray.remote() in the read tasks.
arrow_open_stream_args – kwargs passed to pyarrow.fs.FileSystem.open_input_file. when opening input files to read. To read a compressed TFRecord file, pass the corresponding compression type (e.g., for GZIP or ZLIB), use arrow_open_stream_args={'compression': 'gzip'}).
meta_provider – [Deprecated] A file metadata provider. Custom metadata providers may be able to resolve file metadata more quickly and/or accurately. In most cases, you do not need to set this. If None, this function uses a system-chosen implementation.
partition_filter – A PathPartitionFilter. Use with a custom callback to read only selected partitions of a dataset.
include_paths – If True, include the path to each file. File paths are stored in the 'path' column.
ignore_missing_paths – If True, ignores any file paths in paths that are not found. Defaults to False.
tf_schema – Optional TensorFlow Schema which is used to explicitly set the schema of the underlying Dataset.
shuffle – If setting to “files”, randomly shuffle input files order before read. If setting to FileShuffleConfig, you can pass a seed to shuffle the input files. Defaults to not shuffle with None.
file_extensions – A list of file extensions to filter files by.
concurrency – The maximum number of Ray tasks to run concurrently. Set this to control number of tasks to run concurrently. This doesn’t change the total number of tasks run or the total number of output blocks. By default, concurrency is dynamically decided based on the available resources.
override_num_blocks – Override the number of output blocks from all read tasks. By default, the number of output blocks is dynamically decided based on input data size and available resources. You shouldn’t manually set this value in most cases.
tfx_read_options – Specifies read options when reading TFRecord files with TFX. When no options are provided, the default version without tfx-bsl will be used to read the tfrecords.

Returns:

A Dataset that contains the example features.

Raises:

ValueError – If a file contains a message that isn’t a tf.train.Example.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.