ray.data.Dataset.write_sql#

Dataset.write_sql(sql: str, connection_factory: Callable[[], Any], ray_remote_args: Dict[str, Any] | None = None, concurrency: int | None = None) None[source]#

Write to a database that provides a Python DB API2-compliant connector.

Note

This method writes data in parallel using the DB API2 executemany method. To learn more about executemany, see PEP 249.

Note

This operation triggers execution of the lazy transformations performed on this dataset.

Examples

import sqlite3
import ray

connection = sqlite3.connect("example.db")
connection.cursor().execute("CREATE TABLE movie(title, year, score)")
dataset = ray.data.from_items([
    {"title": "Monty Python and the Holy Grail", "year": 1975, "score": 8.2},
    {"title": "And Now for Something Completely Different", "year": 1971, "score": 7.5}
])

dataset.write_sql(
    "INSERT INTO movie VALUES(?, ?, ?)", lambda: sqlite3.connect("example.db")
)

result = connection.cursor().execute("SELECT * FROM movie ORDER BY year")
print(result.fetchall())
[('And Now for Something Completely Different', 1971, 7.5), ('Monty Python and the Holy Grail', 1975, 8.2)]
Parameters:
  • sql – An INSERT INTO statement that specifies the table to write to. The number of parameter placeholders must match the number of columns in the table.

  • connection_factory – A function that takes no arguments and returns a Python DB API2 Connection object.

  • ray_remote_args – Keyword arguments passed to remote() in the write tasks.

  • concurrency – The maximum number of Ray tasks to run concurrently. Use this to limit how many write tasks execute at once; it doesn’t change the total number of tasks run. By default, concurrency is decided dynamically based on the available resources.
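As noted for the sql parameter, the placeholder count must match the table's column count. A minimal sqlite3 sketch of what happens otherwise: the DB API2 connector rejects the statement rather than writing partial rows (the exact exception class can vary by driver):

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE movie(title, year, score)")

error = None
try:
    # Three columns but only two placeholders: the statement is
    # rejected by the connector before any rows are written.
    connection.executemany(
        "INSERT INTO movie VALUES(?, ?)",
        [("Monty Python and the Holy Grail", 1975)],
    )
except sqlite3.Error as exc:
    error = exc
connection.close()
```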