ray.data.Dataset.write_numpy
- Dataset.write_numpy(path: str, *, column: str, filesystem: Optional[pyarrow.fs.FileSystem] = None, try_create_dir: bool = True, arrow_open_stream_args: Optional[Dict[str, Any]] = None, block_path_provider: ray.data.datasource.block_path_provider.BlockWritePathProvider = <ray.data.datasource.block_path_provider.DefaultBlockWritePathProvider object>, ray_remote_args: Dict[str, Any] = None) -> None [source]
Writes a column of the Dataset to .npy files. This is only supported for columns in the dataset that can be converted to NumPy arrays.
The number of files is determined by the number of blocks in the dataset. To control the number of blocks, call repartition().
By default, the format of the output files is {uuid}_{block_idx}.npy, where uuid is a unique id for the dataset. To modify this behavior, implement a custom BlockWritePathProvider and pass it in as the block_path_provider argument.
Note
This operation will trigger execution of the lazy transformations performed on this dataset.
Examples
>>> import ray
>>> ds = ray.data.range(100)
>>> ds.write_numpy("local:///tmp/data/", column="id")
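Because each block becomes one .npy file, you can control how many files are produced by repartitioning before the write. The snippet below is a minimal sketch; the output directory and the block count of 4 are arbitrary choices for illustration.

import ray

ds = ray.data.range(100)
# Each block is written as one .npy file, so 4 blocks produce 4 files.
ds = ds.repartition(4)
ds.write_numpy("local:///tmp/npy_out", column="id")

To change the file names themselves, you can subclass BlockWritePathProvider. The sketch below assumes the provider exposes a _get_write_path_for_block() hook, as in recent Ray releases; the hook name and keyword arguments may differ in your version, so treat this as an illustration rather than a guaranteed interface.

from ray.data.datasource import BlockWritePathProvider

class PrefixedPathProvider(BlockWritePathProvider):
    # Hypothetical provider that names files my_data_{block_index}.npy.
    # The hook name and its keyword arguments are assumptions; verify them
    # against the BlockWritePathProvider source in your Ray version.
    def _get_write_path_for_block(
        self, base_path, *, block_index=None, file_format=None, **kwargs
    ):
        return f"{base_path}/my_data_{block_index}.{file_format}"

ds.write_numpy(
    "local:///tmp/npy_out",
    column="id",
    block_path_provider=PrefixedPathProvider(),
)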
Time complexity: O(dataset size / parallelism)
- Parameters
path – The path to the destination root directory where the .npy files are written.
column – The name of the column that contains the data to be written.
filesystem – The pyarrow filesystem implementation to write to. These filesystems are specified in the pyarrow docs. Specify this if you need to provide specific configurations to the filesystem. By default, the filesystem is automatically selected based on the scheme of the paths. For example, if the path begins with s3://, the S3FileSystem is used. See the sketch after this parameter list for an example that passes a filesystem explicitly.
try_create_dir – If True, attempts to create all directories in the destination path. Does nothing if all directories already exist. Defaults to True.
arrow_open_stream_args – kwargs passed to pyarrow.fs.FileSystem.open_output_stream, which is used when opening the file to write to.
block_path_provider – A BlockWritePathProvider implementation specifying the filename structure for each output .npy file. By default, the format of the output files is {uuid}_{block_idx}.npy, where uuid is a unique id for the dataset.
ray_remote_args – kwargs passed to remote() in the write tasks.
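As a sketch of the filesystem and ray_remote_args parameters, the example below passes an explicitly configured pyarrow S3 filesystem and reserves one CPU per write task. The bucket name, region, and resource value are placeholders, not defaults.

import pyarrow.fs
import ray

ds = ray.data.range(100)

# Explicitly configured S3 filesystem; region and bucket are placeholders.
fs = pyarrow.fs.S3FileSystem(region="us-west-2")

ds.write_numpy(
    "s3://example-bucket/npy_out",  # hypothetical bucket
    column="id",
    filesystem=fs,
    ray_remote_args={"num_cpus": 1},  # forwarded to remote() for the write tasks
)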