ray.data.Dataset.write_mongo

Dataset.write_mongo(uri: str, database: str, collection: str, ray_remote_args: Dict[str, Any] = None, concurrency: int | None = None) → None

Writes the Dataset to a MongoDB database.

This method is only supported for datasets convertible to pyarrow tables.

The number of parallel writes is determined by the number of blocks in the dataset. To control the number of blocks, call repartition().
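As a rough sketch of the point above, the helper below repartitions a dataset before writing so that the number of blocks, and therefore the number of parallel writes, is set explicitly. The name write_with_blocks and its num_blocks parameter are hypothetical and introduced only for illustration; repartition() and write_mongo() are the only Ray Data calls it relies on.

```python
def write_with_blocks(ds, uri, database, collection, num_blocks):
    """Repartition `ds` into `num_blocks` blocks, then write it to MongoDB.

    Hypothetical helper (not part of Ray): the number of parallel writes
    follows the number of blocks, so `num_blocks` makes it explicit.
    """
    repartitioned = ds.repartition(num_blocks)
    repartitioned.write_mongo(uri=uri, database=database, collection=collection)
    return repartitioned


# Usage sketch (assumes Ray is installed and MongoDB is reachable at `uri`):
# import ray
# write_with_blocks(ray.data.range(100), uri, "my_db", "my_collection", num_blocks=8)
```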

Warning

This method supports only a subset of PyArrow’s types because of limitations in pymongoarrow, which is used underneath. Writing unsupported types fails at type checking. For the full list of supported types, see https://mongo-arrow.readthedocs.io/en/stable/api/types.html.

Note

The records are inserted into MongoDB as new documents. If a record has an _id field, that _id must not already exist in MongoDB; otherwise the write is rejected and fails (so preexisting documents are protected from being mutated). A record may omit the _id field, in which case MongoDB generates one at insertion.
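If the incoming records carry _id values that might collide with existing documents, one option is to drop the field before writing and let MongoDB generate fresh ones. The snippet below is a minimal sketch of that idea; drop_id is a hypothetical helper, and applying it with Dataset.map() is an assumption about how you would wire it into a pipeline.

```python
def drop_id(row):
    """Return a copy of `row` without the `_id` field, if present."""
    return {key: value for key, value in row.items() if key != "_id"}


# Usage sketch (assumes Ray is installed and `ds` yields dict-like rows):
# ds = ds.map(drop_id)
# ds.write_mongo(uri=uri, database="my_db", collection="my_collection")
```

Dropping _id discards the original identifiers, so this only fits cases where the documents do not need to keep them.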

Note

This operation will trigger execution of the lazy transformations performed on this dataset.

Examples

import ray

ds = ray.data.range(100)
ds.write_mongo(
    uri="mongodb://username:password@mongodb0.example.com:27017/?authSource=admin",
    database="my_db",
    collection="my_collection"
)

Parameters:

uri – The URI of the destination MongoDB instance to write the dataset to. For the URI format, see the MongoDB docs.
database – The name of the database. This database must exist; otherwise, a ValueError is raised.
collection – The name of the collection in the database. This collection must exist; otherwise, a ValueError is raised.
ray_remote_args – kwargs passed to ray.remote() in the write tasks.
concurrency – The maximum number of Ray tasks to run concurrently. This doesn’t change the total number of tasks run. By default, concurrency is decided dynamically based on the available resources.
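The uri parameter expects a standard MongoDB connection string, in which special characters in the username or password must be percent-encoded. The sketch below shows one way to assemble such a string with the standard library. build_mongo_uri is a hypothetical helper, and the host name, credentials, and authSource value are placeholders.

```python
from urllib.parse import quote_plus


def build_mongo_uri(username, password, host, port=27017, auth_source="admin"):
    """Build a MongoDB connection URI with percent-encoded credentials."""
    user = quote_plus(username)
    pwd = quote_plus(password)
    return f"mongodb://{user}:{pwd}@{host}:{port}/?authSource={auth_source}"


uri = build_mongo_uri("username", "p@ss/word", "mongodb0.example.com")
# uri == "mongodb://username:p%40ss%2Fword@mongodb0.example.com:27017/?authSource=admin"

# Usage sketch (assumes Ray is installed and the server is reachable);
# the resource and concurrency values are illustrative only:
# ds.write_mongo(
#     uri=uri,
#     database="my_db",
#     collection="my_collection",
#     ray_remote_args={"num_cpus": 1},
#     concurrency=4,
# )
```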

Raises:

ValueError – if database doesn’t exist.
ValueError – if collection doesn’t exist.
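Because both errors come from a missing target, a small pre-flight step can create the collection before the write. The following is a sketch under the assumption that pymongo is available and that client behaves like a pymongo.MongoClient; ensure_collection is a hypothetical helper. In MongoDB, creating a collection also brings its database into existence.

```python
def ensure_collection(client, database, collection):
    """Create `collection` in `database` if it is missing.

    `client` is assumed to behave like a `pymongo.MongoClient`.
    Returns True if the collection was created, False if it already existed.
    """
    db = client[database]
    if collection in db.list_collection_names():
        return False
    db.create_collection(collection)
    return True


# Usage sketch (assumes pymongo is installed and `uri` points at a live server):
# from pymongo import MongoClient
# ensure_collection(MongoClient(uri), "my_db", "my_collection")
```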

PublicAPI (alpha): This API is in alpha and may change before becoming stable.