ray.data.Dataset.drop_columns

Dataset.drop_columns(cols: List[str], *, compute: str | None = None, concurrency: int | None = None, **ray_remote_args) → Dataset

Drop one or more columns from the dataset.

Examples

>>> import ray
>>> ds = ray.data.read_parquet("s3://anonymous@ray-example-data/iris.parquet")
>>> ds.schema()
Column        Type
------        ----
sepal.length  double
sepal.width   double
petal.length  double
petal.width   double
variety       string
>>> ds.drop_columns(["variety"]).schema()
Column        Type
------        ----
sepal.length  double
sepal.width   double
petal.length  double
petal.width   double

Time complexity: O(dataset size / parallelism)

Parameters:
  • cols – Names of the columns to drop. If any name does not exist, an exception is raised. Column names must be unique. When the input schema is known statically, missing columns are reported at the drop_columns call; otherwise the error surfaces during materialization.

  • compute – Deprecated. Use the concurrency argument instead.

  • concurrency – The maximum number of Ray workers to use concurrently.

  • **ray_remote_args – Additional resource requirements to request from Ray (e.g., num_gpus=1 to request GPUs for the map tasks). See ray.remote() for details.
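
The two rules stated for cols (every name must exist, and names must be unique) can be modeled in plain Python. The following is a minimal sketch only, not Ray's implementation: the helper drop_columns_from_batch, the dict-of-lists batch layout, and the choice of ValueError and KeyError are all assumptions made for illustration.

```python
from typing import Dict, List


def drop_columns_from_batch(batch: Dict[str, list], cols: List[str]) -> Dict[str, list]:
    """Hypothetical stand-in for a per-batch column drop (not Ray's code)."""
    # Rule 1: column names must be unique.
    if len(cols) != len(set(cols)):
        raise ValueError(f"column names must be unique, got {cols}")
    # Rule 2: every requested name must exist in the batch.
    missing = [c for c in cols if c not in batch]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    to_drop = set(cols)
    # Build a new mapping so the input batch is left untouched.
    return {name: values for name, values in batch.items() if name not in to_drop}


batch = {"sepal.length": [5.1, 4.9], "variety": ["Setosa", "Setosa"]}
result = drop_columns_from_batch(batch, ["variety"])
```

Here result holds only the sepal.length column, while batch still contains both columns. Passing ["variety", "variety"] or a name that is absent from the batch raises an exception in this sketch. The exception types Ray itself raises may differ.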

Returns:

A new Dataset with the specified columns removed.