ray.data.Dataset.drop_columns

Dataset.drop_columns(cols: List[str], *, compute: str | None = None, **ray_remote_args) → Dataset

Drop one or more columns from the dataset.

Examples

>>> import ray
>>> ds = ray.data.read_parquet("s3://anonymous@ray-example-data/iris.parquet")
>>> ds.schema()
Column        Type
------        ----
sepal.length  double
sepal.width   double
petal.length  double
petal.width   double
variety       string
>>> ds.drop_columns(["variety"]).schema()
Column        Type
------        ----
sepal.length  double
sepal.width   double
petal.length  double
petal.width   double

Time complexity: O(dataset size / parallelism)

Parameters:
  • cols – Names of the columns to drop. If any name does not exist, an exception is raised.

  • compute – The compute strategy, either “tasks” (default) to use Ray tasks, ray.data.ActorPoolStrategy(size=n) to use a fixed-size actor pool, or ray.data.ActorPoolStrategy(min_size=m, max_size=n) for an autoscaling actor pool.

  • ray_remote_args – Additional resource requirements to request from Ray (e.g., num_gpus=1 to request GPUs for the map tasks).