ray.data.read_tfrecords
- ray.data.read_tfrecords(paths: str | List[str], *, filesystem: pyarrow.fs.FileSystem | None = None, parallelism: int = -1, arrow_open_stream_args: Dict[str, Any] | None = None, meta_provider: BaseFileMetadataProvider | None = None, partition_filter: PathPartitionFilter | None = None, include_paths: bool = False, ignore_missing_paths: bool = False, tf_schema: schema_pb2.Schema | None = None, shuffle: Literal['files'] | FileShuffleConfig | None = None, file_extensions: List[str] | None = None, concurrency: int | None = None, override_num_blocks: int | None = None, tfx_read_options: TFXReadOptions | None = None) -> Dataset
Create a Dataset from TFRecord files that contain tf.train.Example messages.
Tip
Using the tfx-bsl library is more performant when reading large datasets (for example, in production use cases). To use this implementation, you must first install tfx-bsl:
pip install tfx_bsl --no-dependencies
Pass tfx_read_options to read_tfrecords, for example:
ds = read_tfrecords(path, ..., tfx_read_options=TFXReadOptions())
Warning
This function exclusively supports tf.train.Example messages. If a file contains a message that isn’t of type tf.train.Example, then this function fails.
Examples
>>> import ray
>>> ray.data.read_tfrecords("s3://anonymous@ray-example-data/iris.tfrecords")
Dataset(
   num_rows=?,
   schema={...}
)
We can also read compressed TFRecord files, which use one of the compression types supported by Arrow:
>>> ray.data.read_tfrecords(
...     "s3://anonymous@ray-example-data/iris.tfrecords.gz",
...     arrow_open_stream_args={"compression": "gzip"},
... )
Dataset(
   num_rows=?,
   schema={...}
)
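When compressed and uncompressed files are read by the same code path, the compression argument can be derived from the file suffix. The following is a minimal sketch, not part of Ray: `compression_args_for` and `read_maybe_compressed` are hypothetical helper names, and the assumption that only a `.gz` suffix indicates gzip compression is ours. Ray is imported lazily so the suffix helper can be used without it.

```python
from typing import Any, Dict, Optional


def compression_args_for(path: str) -> Optional[Dict[str, Any]]:
    """Return arrow_open_stream_args for a path, or None if uncompressed.

    Hypothetical helper: only the ".gz" suffix is recognized here.
    """
    if path.lower().endswith(".gz"):
        return {"compression": "gzip"}
    return None


def read_maybe_compressed(path: str):
    """Read a TFRecord file, passing the compression type only when needed."""
    import ray  # imported lazily so compression_args_for works without Ray

    return ray.data.read_tfrecords(
        path,
        arrow_open_stream_args=compression_args_for(path),
    )
```

Passing `None` for `arrow_open_stream_args` matches the default in the signature, so uncompressed paths are read exactly as in the first example.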
- Parameters:
paths – A single file or directory, or a list of file or directory paths. A list of paths can contain both files and directories.
filesystem – The PyArrow filesystem implementation to read from. These filesystems are specified in the PyArrow docs. Specify this parameter if you need to provide specific configurations to the filesystem. By default, the filesystem is automatically selected based on the scheme of the paths. For example, if the path begins with s3://, the S3FileSystem is used.
parallelism – This argument is deprecated. Use the override_num_blocks argument instead.
arrow_open_stream_args – Keyword arguments passed to pyarrow.fs.FileSystem.open_input_file when opening input files to read. To read a compressed TFRecord file, pass the corresponding compression type (for example, for GZIP or ZLIB, use arrow_open_stream_args={'compression': 'gzip'}).
meta_provider – [Deprecated] A file metadata provider. Custom metadata providers may be able to resolve file metadata more quickly and/or accurately. In most cases, you do not need to set this. If None, this function uses a system-chosen implementation.
partition_filter – A PathPartitionFilter. Use with a custom callback to read only selected partitions of a dataset.
include_paths – If True, include the path to each file. File paths are stored in the 'path' column.
ignore_missing_paths – If True, ignores any file paths in paths that are not found. Defaults to False.
tf_schema – Optional TensorFlow Schema which is used to explicitly set the schema of the underlying Dataset.
shuffle – If set to "files", randomly shuffles the order of the input files before reading. If set to a FileShuffleConfig, you can pass a seed to shuffle the input files. Defaults to None, which doesn’t shuffle.
file_extensions – A list of file extensions to filter files by.
concurrency – The maximum number of Ray tasks to run concurrently. This doesn’t change the total number of tasks run or the total number of output blocks. By default, concurrency is dynamically decided based on the available resources.
override_num_blocks – Override the number of output blocks from all read tasks. By default, the number of output blocks is dynamically decided based on input data size and available resources. You shouldn’t manually set this value in most cases.
tfx_read_options – Specifies read options when reading TFRecord files with TFX. If no options are provided, the default implementation, which doesn’t use tfx-bsl, reads the TFRecord files.
- Returns:
A Dataset that contains the example features.
- Raises:
ValueError – If a file contains a message that isn’t a tf.train.Example.
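Several of the parameters above are often set together. The sketch below shows one way to collect them. `build_read_options` and `read_with_options` are hypothetical helper names introduced for illustration; only the keyword names (`include_paths`, `ignore_missing_paths`, `shuffle`, `file_extensions`) come from the signature documented here. Ray is imported lazily so the option builder can be exercised on its own.

```python
from typing import Any, Dict, List, Optional


def build_read_options(
    *,
    track_paths: bool = False,
    skip_missing: bool = False,
    shuffle_files: bool = False,
    extensions: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Translate a few choices into read_tfrecords keyword arguments.

    Hypothetical helper: the defaults mirror the defaults in the signature.
    """
    return {
        "include_paths": track_paths,
        "ignore_missing_paths": skip_missing,
        # "files" shuffles the input file order; None leaves it unshuffled.
        "shuffle": "files" if shuffle_files else None,
        "file_extensions": extensions,
    }


def read_with_options(paths, **choices):
    """Call ray.data.read_tfrecords with the options built above."""
    import ray  # imported lazily so build_read_options works without Ray

    return ray.data.read_tfrecords(paths, **build_read_options(**choices))
```

For instance, `read_with_options("s3://anonymous@ray-example-data/iris.tfrecords", track_paths=True)` would request a dataset whose rows also carry the 'path' column described under include_paths.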
PublicAPI (alpha): This API is in alpha and may change before becoming stable.