ray.data.Dataset.select_columns
- Dataset.select_columns(cols: List[str], *, compute: Optional[Union[str, ray.data._internal.compute.ComputeStrategy]] = None, **ray_remote_args) -> ray.data.dataset.Dataset
Select one or more columns from the dataset.
Specified columns must be in the dataset schema.
Examples
>>> import ray
>>> ds = ray.data.read_parquet("s3://anonymous@ray-example-data/iris.parquet")
>>> ds.schema()
Column        Type
------        ----
sepal.length  double
sepal.width   double
petal.length  double
petal.width   double
variety       string
>>> ds.select_columns(["sepal.length", "sepal.width"]).schema()
Column        Type
------        ----
sepal.length  double
sepal.width   double
Time complexity: O(dataset size / parallelism)
- Parameters
cols – Names of the columns to select. If a name isn’t in the dataset schema, an exception is raised.
compute – The compute strategy: either “tasks” (the default) to use Ray tasks, ray.data.ActorPoolStrategy(size=n) to use a fixed-size actor pool, or ray.data.ActorPoolStrategy(min_size=m, max_size=n) to use an autoscaling actor pool.
ray_remote_args – Additional resource requirements to request from Ray (e.g., num_gpus=1 to request GPUs for the map tasks).
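The selection rule described above (keep only the named columns, and raise if a name is not in the schema) can be sketched in plain Python without Ray. This is an illustration of the semantics only, not Ray's implementation: the helper name `select_columns_from_batch`, the dict-of-lists batch layout, and the choice of `KeyError` are all assumptions made for this example.

```python
from typing import Dict, List, Sequence


def select_columns_from_batch(
    batch: Dict[str, Sequence], cols: List[str]
) -> Dict[str, Sequence]:
    """Return a new batch holding only `cols`, in the order requested.

    Hypothetical helper for illustration; Ray Data applies its own
    projection logic to each block of the dataset.
    """
    missing = [name for name in cols if name not in batch]
    if missing:
        # The docs only say "an exception is raised"; KeyError is an assumption.
        raise KeyError(f"Columns not in schema: {missing}")
    return {name: batch[name] for name in cols}


# A tiny stand-in for one block of the iris dataset.
batch = {
    "sepal.length": [5.1, 4.9],
    "sepal.width": [3.5, 3.0],
    "variety": ["Setosa", "Setosa"],
}

selected = select_columns_from_batch(batch, ["sepal.length", "sepal.width"])
print(list(selected))  # ['sepal.length', 'sepal.width']
```

Because each block can be projected independently of the others, the work divides across parallel tasks, which is consistent with the stated O(dataset size / parallelism) time complexity.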