ray.data.read_clickhouse

ray.data.read_clickhouse(*, table: str, dsn: str, columns: List[str] | None = None, order_by: Tuple[List[str], bool] | None = None, client_settings: Dict[str, Any] | None = None, client_kwargs: Dict[str, Any] | None = None, ray_remote_args: Dict[str, Any] | None = None, concurrency: int | None = None, override_num_blocks: int | None = None) → Dataset

Create a Dataset from a ClickHouse table or view.

Examples

>>> import ray
>>> ds = ray.data.read_clickhouse( 
...     table="default.table",
...     dsn="clickhouse+http://username:password@host:8124/default",
...     columns=["timestamp", "age", "status", "text", "label"],
...     order_by=(["timestamp"], False),
... )
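
As a rough sanity check before passing a DSN like the one above, the string can be split with Python's standard urllib.parse. The describe_dsn helper below is hypothetical and is not part of Ray or ClickHouse. It assumes the DSN follows generic URL syntax and only shows which parts the string carries. The password is left out of the result on purpose so that it does not end up in logs.

```python
from urllib.parse import urlsplit


def describe_dsn(dsn: str) -> dict:
    """Split a URL-style DSN into its parts (hypothetical helper, not a Ray API)."""
    parts = urlsplit(dsn)
    return {
        "scheme": parts.scheme,
        "username": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        # The path component carries the database name, e.g. "/default".
        "database": parts.path.lstrip("/") or None,
    }


info = describe_dsn("clickhouse+http://username:password@host:8124/default")
print(info)
```

A missing port or database shows up as None in the result, which makes a malformed DSN easier to spot before any read task starts.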
Parameters:
  • table – Fully qualified table or view identifier (e.g., “default.table_name”).

  • dsn – A string in standard DSN (Data Source Name) HTTP format (e.g., “clickhouse+http://username:password@host:8124/default”). For more information, see ClickHouse Connection String doc.

  • columns – Optional list of columns to select from the data source. If no columns are specified, all columns will be selected by default.

  • order_by – Optional tuple containing a list of columns to order by and a boolean indicating whether the order is descending (True for DESC, False for ASC). Note: order_by is required for parallel reads. If it is not provided, the data is read in a single task. Requiring an explicit order ensures that the data is read in a consistent order across all tasks.

  • client_settings – Optional ClickHouse server settings to be used with the session/every request. For more information, see ClickHouse Client Settings.

  • client_kwargs – Optional additional arguments to pass to the ClickHouse client. For more information, see ClickHouse Core Settings.

  • ray_remote_args – kwargs passed to ray.remote() in the read tasks.

  • concurrency – The maximum number of Ray tasks to run concurrently. This doesn’t change the total number of tasks run or the total number of output blocks. By default, concurrency is decided dynamically based on the available resources.

  • override_num_blocks – Override the number of output blocks from all read tasks. By default, the number of output blocks is dynamically decided based on input data size and available resources. You shouldn’t manually set this value in most cases.
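
The parameters above can be assembled programmatically. The sketch below is one possible way to do that, not the library's own approach: build_read_kwargs and read_table are hypothetical helpers, and only the keyword names (table, dsn, columns, order_by, concurrency) come from the signature on this page. The order_by tuple is added only when sort columns are given, which makes it visible at the call site whether a read can be parallelized or will run in a single task.

```python
from typing import Any, Dict, List, Optional


def build_read_kwargs(
    table: str,
    dsn: str,
    columns: Optional[List[str]] = None,
    sort_columns: Optional[List[str]] = None,
    descending: bool = False,
    concurrency: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments for ray.data.read_clickhouse (hypothetical helper)."""
    kwargs: Dict[str, Any] = {"table": table, "dsn": dsn}
    if columns:
        kwargs["columns"] = list(columns)
    if sort_columns:
        # order_by is a (columns, descending) tuple; per the parameter
        # description, omitting it means the data is read in a single task.
        kwargs["order_by"] = (list(sort_columns), descending)
    if concurrency is not None:
        kwargs["concurrency"] = concurrency
    return kwargs


def read_table(**kwargs: Any):
    """Run the read; needs Ray installed and a reachable ClickHouse server."""
    import ray  # imported lazily so the helper above works without Ray

    return ray.data.read_clickhouse(**kwargs)


parallel_kwargs = build_read_kwargs(
    table="default.table",
    dsn="clickhouse+http://username:password@host:8124/default",
    columns=["timestamp", "age"],
    sort_columns=["timestamp"],
    concurrency=4,
)
single_task_kwargs = build_read_kwargs(
    table="default.table",
    dsn="clickhouse+http://username:password@host:8124/default",
)
print(parallel_kwargs)
print(single_task_kwargs)
```

Calling read_table(**parallel_kwargs) would then issue the read. It is not called here because it requires a running ClickHouse instance.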

Returns:

A Dataset producing records read from the ClickHouse table or view.

PublicAPI (alpha): This API is in alpha and may change before becoming stable.