ray.data.read_sql
ray.data.read_sql#
- ray.data.read_sql(sql: str, connection_factory: Callable[[], Any], *, parallelism: int = - 1, ray_remote_args: Optional[Dict[str, Any]] = None) ray.data.dataset.Dataset [source]#
Read from a database that provides a Python DB API2-compliant connector.
Note
By default,
read_sql
launches multiple read tasks, and each task executes aLIMIT
andOFFSET
to fetch a subset of the rows. However, for many databases,OFFSET
is slow.As a workaround, set
parallelism=1
to directly fetch all rows in a single task. Note that this approach requires all result rows to fit in the memory of single task. If the rows don’t fit, your program may raise an out of memory error.Examples
For examples of reading from larger databases like MySQL and PostgreSQL, see Reading from SQL Databases.
import sqlite3 import ray # Create a simple database connection = sqlite3.connect("example.db") connection.execute("CREATE TABLE movie(title, year, score)") connection.execute( """ INSERT INTO movie VALUES ('Monty Python and the Holy Grail', 1975, 8.2), ("Monty Python Live at the Hollywood Bowl", 1982, 7.9), ("Monty Python's Life of Brian", 1979, 8.0), ("Rocky II", 1979, 7.3) """ ) connection.commit() connection.close() def create_connection(): return sqlite3.connect("example.db") # Get all movies ds = ray.data.read_sql("SELECT * FROM movie", create_connection) # Get movies after the year 1980 ds = ray.data.read_sql( "SELECT title, score FROM movie WHERE year >= 1980", create_connection ) # Get the number of movies per year ds = ray.data.read_sql( "SELECT year, COUNT(*) FROM movie GROUP BY year", create_connection )
- Parameters
sql – The SQL query to execute.
connection_factory – A function that takes no arguments and returns a Python DB API2 Connection object.
parallelism – The requested parallelism of the read. Defaults to -1, which automatically determines the optimal parallelism for your configuration. You should not need to manually set this value in most cases. For details on how the parallelism is automatically determined and guidance on how to tune it, see Tuning read parallelism.
ray_remote_args – kwargs passed to
remote()
in the read tasks.
- Returns
A
Dataset
containing the queried data.
PublicAPI (alpha): This API is in alpha and may change before becoming stable.