class ray.air.DatasetConfig(fit: Optional[bool] = None, split: Optional[bool] = None, required: Optional[bool] = None, transform: Optional[bool] = None, max_object_store_memory_fraction: Optional[float] = None, global_shuffle: Optional[bool] = None, randomize_block_order: Optional[bool] = None, per_epoch_preprocessor: Optional[ray.data.preprocessor.Preprocessor] = None, use_stream_api: Optional[int] = None, stream_window_size: Optional[int] = None)[source]#

Bases: object

Configuration for ingest of a single Dataset.

See the AIR Dataset configuration guide for usage examples.

This config defines how the Dataset should be read into the DataParallelTrainer. It configures the preprocessing, splitting, and ingest strategy per-dataset.

DataParallelTrainers declare default DatasetConfigs for each dataset passed in the datasets argument. Users can selectively override these configs by passing the dataset_config argument. Trainers can also restrict which values are user-customizable (e.g., XGBoostTrainer doesn’t support streaming ingest).

Parameters

  • fit – Whether to fit preprocessors on this dataset. This can be set on at most one dataset at a time. True by default for the “train” dataset only.

  • split – Whether the dataset should be split across multiple workers. True by default for the “train” dataset only.

  • required – Whether to raise an error if the Dataset isn’t provided by the user. False by default.

  • transform – Whether to transform the dataset with the fitted preprocessor. This must be enabled at least for the dataset that is fit. True by default.

  • max_object_store_memory_fraction – [Experimental] The maximum fraction of Ray’s shared-memory object store to use for the dataset. The default value is -1, meaning that the preprocessed dataset should be cached, which may cause spilling if its size is larger than the object store’s capacity. All other values (0 or higher) enable pipelined ingest, which is experimental. Note that the absolute memory capacity used is based on the object store capacity at invocation time; this does not currently cover autoscaling cases where the size of the cluster may change.

  • global_shuffle – Whether to enable global shuffle (per pipeline window in streaming mode). Note that this is an expensive all-to-all operation, and most likely you want to use local shuffle instead. See https://docs.ray.io/en/master/data/faq.html and https://docs.ray.io/en/master/ray-air/check-ingest.html. False by default.

  • randomize_block_order – Whether to randomize the iteration order over blocks. The main purpose of this is to prevent data-fetching hotspots in the cluster when running many parallel workers / trials on the same data. We recommend always enabling it. True by default.

  • per_epoch_preprocessor – [Experimental] A preprocessor to re-apply on each pass over the dataset. The main use case for this is to apply a random transform to the training dataset on each epoch. The per-epoch preprocessor is applied after all other preprocessors and in parallel with the dataset consumer.

  • use_stream_api – Deprecated. Use max_object_store_memory_fraction instead.

  • stream_window_size – Deprecated. Use max_object_store_memory_fraction instead.

PublicAPI (beta): This API is in beta and may change before becoming stable.

fill_defaults() → ray.air.config.DatasetConfig[source]#

Return a copy of this config with all default values filled in.

static merge(a: Dict[str, ray.air.config.DatasetConfig], b: Optional[Dict[str, ray.air.config.DatasetConfig]]) → Dict[str, ray.air.config.DatasetConfig][source]#

Merge two given dicts of DatasetConfigs, with the second taking precedence.

Raises:
    ValueError – if validation fails on the merged configs.

static validated(config: Dict[str, DatasetConfig], datasets: Dict[str, Dataset]) → Dict[str, DatasetConfig][source]#

Validate that the given config and datasets are usable.

Returns a dict of validated configs with defaults filled out.