Dataset.write_mongo(uri: str, database: str, collection: str, ray_remote_args: Optional[Dict[str, Any]] = None) -> None

Write the dataset to a MongoDB datasource.

This is only supported for datasets convertible to Arrow records. To control the number of parallel write tasks, use .repartition() before calling this method.


Currently, this supports only a subset of pyarrow's types, due to a limitation of pymongoarrow, which is used under the hood. Writing unsupported types fails during type checking. See all the supported types at: https://mongo-arrow.readthedocs.io/en/latest/supported_types.html.


The records are inserted into MongoDB as new documents. If a record contains an _id field, that _id must not already exist in MongoDB; otherwise the write is rejected and fails (so preexisting documents are protected from mutation). It's fine to omit the _id field from a record, in which case MongoDB auto-generates one at insertion.
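The insert-only semantics described above can be illustrated with a minimal in-memory sketch. Note this is a toy stand-in for a MongoDB collection written for illustration, not the pymongo or Ray API: a write with a preexisting _id raises an error and leaves the stored document untouched, while a record without an _id gets one auto-generated.

```python
import itertools


class ToyCollection:
    """Toy in-memory stand-in for a MongoDB collection (illustration only,
    not the pymongo API), mirroring the insert-only semantics above."""

    _auto_id = itertools.count(1)

    def __init__(self):
        self._docs = {}

    def insert_one(self, doc):
        # Auto-generate an _id when the record omits one, as MongoDB does.
        _id = doc.get("_id", f"auto-{next(self._auto_id)}")
        # A preexisting _id rejects the write: stored docs are never mutated.
        if _id in self._docs:
            raise ValueError(f"duplicate _id: {_id!r}")
        self._docs[_id] = {**doc, "_id": _id}
        return _id


coll = ToyCollection()
coll.insert_one({"_id": 1, "title": "first"})
auto_id = coll.insert_one({"title": "no _id supplied"})  # _id auto-generated
try:
    coll.insert_one({"_id": 1, "title": "clobber attempt"})
except ValueError as err:
    print(err)  # the duplicate write fails; the original document survives
```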


This operation will trigger execution of the lazy transformations performed on this dataset, and will block until execution completes.


>>> import ray
>>> import pandas as pd
>>> docs = [{"title": "MongoDB Datasource test"} for _ in range(4)]
>>> ds = ray.data.from_pandas(pd.DataFrame(docs))
>>> ds.write_mongo(
...     uri="mongodb://username:[email protected]:27017/?authSource=admin",
...     database="my_db",
...     collection="my_collection",
... )
Parameters

  • uri – The URI of the destination MongoDB to which the dataset is written. For the URI format, see https://www.mongodb.com/docs/manual/reference/connection-string/.

  • database – The name of the database. This database must already exist; otherwise a ValueError is raised.

  • collection – The name of the collection in the database. This collection must already exist; otherwise a ValueError is raised.

  • ray_remote_args – Kwargs passed to ray.remote in the write tasks.