ray.data.read_clickhouse

ray.data.read_clickhouse(*, table: str, dsn: str, columns: List[str] | None = None, filter: str | None = None, order_by: Tuple[List[str], bool] | None = None, client_settings: Dict[str, Any] | None = None, client_kwargs: Dict[str, Any] | None = None, ray_remote_args: Dict[str, Any] | None = None, concurrency: int | None = None, override_num_blocks: int | None = None) → Dataset

Create a Dataset from a ClickHouse table or view.

Examples

>>> import ray
>>> ds = ray.data.read_clickhouse(
...     table="default.table",
...     dsn="clickhouse+http://username:password@host:8124/default",
...     columns=["timestamp", "age", "status", "text", "label"],
...     filter="age > 18 AND status = 'active'",
...     order_by=(["timestamp"], False),
... )
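
For orientation, the arguments in the example above map onto the parts of a SQL query. The sketch below assembles such a query from table, columns, filter, and order_by. Note that sketch_query is a hypothetical helper written for illustration only; it is not Ray's internal implementation, and the exact query Ray sends to ClickHouse may differ.

```python
from typing import List, Optional, Tuple


def sketch_query(
    table: str,
    columns: Optional[List[str]] = None,
    filter: Optional[str] = None,
    order_by: Optional[Tuple[List[str], bool]] = None,
) -> str:
    """Hypothetical helper: show how the arguments could map onto SQL."""
    # If no columns are specified, all columns are selected.
    select_clause = ", ".join(columns) if columns else "*"
    query = f"SELECT {select_clause} FROM {table}"
    if filter:
        # The filter string is used verbatim as the WHERE condition.
        query += f" WHERE {filter}"
    if order_by:
        order_columns, descending = order_by
        # True means DESC, False means ASC; apply the direction to every column.
        direction = "DESC" if descending else "ASC"
        query += " ORDER BY " + ", ".join(
            f"{column} {direction}" for column in order_columns
        )
    return query


print(
    sketch_query(
        table="default.table",
        columns=["timestamp", "age", "status", "text", "label"],
        filter="age > 18 AND status = 'active'",
        order_by=(["timestamp"], False),
    )
)
```

Because the filter text is placed into the query as-is in this sketch, it has to be a valid WHERE condition on its own, which matches the requirement stated for the filter parameter.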

Parameters:

table – Fully qualified table or view identifier (e.g., “default.table_name”).
dsn – A string in standard DSN (Data Source Name) HTTP format (e.g., “clickhouse+http://username:password@host:8124/default”). For more information, see the ClickHouse connection string documentation.
columns – Optional list of columns to select from the data source. If omitted, all columns are selected.
filter – Optional SQL filter string used in the WHERE clause (e.g., “label = 2 AND text IS NOT NULL”). The string must be valid for use in a ClickHouse SQL WHERE clause. Note: Parallel reads are not currently supported when a filter is set; specifying a filter forces the parallelism to 1 to ensure deterministic and consistent results. For more information, see the ClickHouse SQL WHERE clause documentation.
order_by – Optional tuple containing a list of columns to order by and a boolean indicating whether the order is descending (True for DESC, False for ASC). Note: order_by is required for parallel reads. If it is not provided, the data is read in a single task. This ensures that the data is read in a consistent order across all tasks.
client_settings – Optional ClickHouse server settings applied to the session and to every request. For more information, see ClickHouse Client Settings.
client_kwargs – Optional additional arguments to pass to the ClickHouse client. For more information, see ClickHouse Core Settings.
ray_remote_args – Keyword arguments passed to ray.remote() in the read tasks.
concurrency – The maximum number of Ray tasks to run concurrently. This doesn’t change the total number of tasks run or the total number of output blocks. By default, concurrency is decided dynamically based on the available resources.
override_num_blocks – Override the number of output blocks from all read tasks. By default, the number of output blocks is dynamically decided based on input data size and available resources. You shouldn’t manually set this value in most cases.
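
The notes on filter and order_by combine into a simple rule for when a read can be split across tasks. The following is a minimal sketch of that rule as stated in the parameter descriptions; can_read_in_parallel is a hypothetical helper for reasoning about a call before making it, not part of the Ray API, and Ray's actual decision logic may include additional conditions.

```python
from typing import List, Optional, Tuple


def can_read_in_parallel(
    filter: Optional[str] = None,
    order_by: Optional[Tuple[List[str], bool]] = None,
) -> bool:
    """Hypothetical helper encoding the documented parallelism rules."""
    # A filter forces the parallelism to 1.
    if filter:
        return False
    # Without order_by, the data is read in a single task.
    if order_by is None:
        return False
    return True


# Ordered, unfiltered read: eligible for parallel reads.
print(can_read_in_parallel(order_by=(["timestamp"], False)))

# Filtered read: single task, even though order_by is set.
print(can_read_in_parallel(
    filter="age > 18 AND status = 'active'",
    order_by=(["timestamp"], False),
))
```

Under these rules, the call in the Examples section, which sets both filter and order_by, would be expected to run as a single read task.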

Returns:

A Dataset producing records read from the ClickHouse table or view.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.