Saving Data#
Ray Data lets you save data in files or other Python objects.
This guide shows you how to:
Writing data to files#
Ray Data writes to local disk and cloud storage.
Writing data to local disk#
To save your Dataset
to local disk, call a method
like Dataset.write_parquet
and specify a local
directory with the local://
scheme.
Warning
If your cluster contains multiple nodes and you don’t use local://
, Ray Data
writes different partitions of data to different nodes.
import ray
ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")
ds.write_parquet("local:///tmp/iris/")
To write data to formats other than Parquet, read the Input/Output reference.
Writing data to cloud storage#
To save your Dataset
to cloud storage, authenticate all nodes
with your cloud service provider. Then, call a method like
Dataset.write_parquet
and specify a URI with
the appropriate scheme. URI can point to buckets or folders.
To write data to formats other than Parquet, read the Input/Output reference.
To save data to Amazon S3, specify a URI with the s3://
scheme.
import ray
ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")
ds.write_parquet("s3://my-bucket/my-folder")
Ray Data relies on PyArrow for authenticaion with Amazon S3. For more on how to configure your credentials to be compatible with PyArrow, see their S3 Filesytem docs.
To save data to Google Cloud Storage, install the Filesystem interface to Google Cloud Storage
pip install gcsfs
Then, create a GCSFileSystem
and specify a URI with the gcs://
scheme.
import ray
ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")
filesystem = gcsfs.GCSFileSystem(project="my-google-project")
ds.write_parquet("gcs://my-bucket/my-folder", filesystem=filesystem)
Ray Data relies on PyArrow for authenticaion with Google Cloud Storage. For more on how to configure your credentials to be compatible with PyArrow, see their GCS Filesytem docs.
To save data to Azure Blob Storage, install the Filesystem interface to Azure-Datalake Gen1 and Gen2 Storage
pip install adlfs
Then, create a AzureBlobFileSystem
and specify a URI with the az://
scheme.
import ray
ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")
filesystem = adlfs.AzureBlobFileSystem(account_name="azureopendatastorage")
ds.write_parquet("az://my-bucket/my-folder", filesystem=filesystem)
Ray Data relies on PyArrow for authenticaion with Azure Blob Storage. For more on how to configure your credentials to be compatible with PyArrow, see their fsspec-compatible filesystems docs.
Writing data to NFS#
To save your Dataset
to NFS file systems, call a method
like Dataset.write_parquet
and specify a
mounted directory.
import ray
ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")
ds.write_parquet("/mnt/cluster_storage/iris")
To write data to formats other than Parquet, read the Input/Output reference.
Changing the number of output files#
When you call a write method, Ray Data writes your data to several files. To control the
number of output files, configure num_rows_per_file
.
Note
num_rows_per_file
is a hint, not a strict limit. Ray Data might write more or
fewer rows to each file.
import os
import ray
ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")
ds.write_csv("/tmp/few_files/", num_rows_per_file=75)
print(os.listdir("/tmp/few_files/"))
['0_000001_000000.csv', '0_000000_000000.csv', '0_000002_000000.csv']
Converting Datasets to other Python libraries#
Converting Datasets to pandas#
To convert a Dataset
to a pandas DataFrame, call
Dataset.to_pandas()
. Your data must fit in memory
on the head node.
import ray
ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")
df = ds.to_pandas()
print(df)
sepal length (cm) sepal width (cm) ... petal width (cm) target
0 5.1 3.5 ... 0.2 0
1 4.9 3.0 ... 0.2 0
2 4.7 3.2 ... 0.2 0
3 4.6 3.1 ... 0.2 0
4 5.0 3.6 ... 0.2 0
.. ... ... ... ... ...
145 6.7 3.0 ... 2.3 2
146 6.3 2.5 ... 1.9 2
147 6.5 3.0 ... 2.0 2
148 6.2 3.4 ... 2.3 2
149 5.9 3.0 ... 1.8 2
<BLANKLINE>
[150 rows x 5 columns]
Converting Datasets to distributed DataFrames#
Ray Data interoperates with distributed data processing frameworks like Dask, Spark, Modin, and Mars.
To convert a Dataset
to a
Dask DataFrame, call
Dataset.to_dask()
.
import ray
ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")
df = ds.to_dask()
To convert a Dataset
to a Spark DataFrame,
call Dataset.to_spark()
.
import ray
import raydp
spark = raydp.init_spark(
app_name = "example",
num_executors = 1,
executor_cores = 4,
executor_memory = "512M"
)
ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")
df = ds.to_spark(spark)
To convert a Dataset
to a Modin DataFrame, call
Dataset.to_modin()
.
import ray
ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")
mdf = ds.to_modin()
To convert a Dataset
from a Mars DataFrame, call
Dataset.to_mars()
.
import ray
ds = ray.data.read_csv("s3://anonymous@ray-example-data/iris.csv")
mdf = ds.to_mars()