Global configuration#

class ray.data.DataContext(target_max_block_size: int = 134217728, target_shuffle_max_block_size: int = 1073741824, target_min_block_size: int = 1048576, streaming_read_buffer_size: int = 33554432, enable_pandas_block: bool = True, actor_prefetcher_enabled: bool = False, use_push_based_shuffle: bool = False, pipeline_push_based_shuffle_reduce_tasks: bool = True, scheduling_strategy: None | str | ~ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy | ~ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy | ~ray.util.scheduling_strategies.NodeLabelSchedulingStrategy = 'SPREAD', scheduling_strategy_large_args: None | str | ~ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy | ~ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy | ~ray.util.scheduling_strategies.NodeLabelSchedulingStrategy = 'DEFAULT', large_args_threshold: int = 52428800, use_polars: bool = False, eager_free: bool = True, decoding_size_estimation: bool = True, min_parallelism: int = 200, read_op_min_num_blocks: int = 200, enable_tensor_extension_casting: bool = True, use_arrow_tensor_v2: bool = True, enable_auto_log_stats: bool = False, verbose_stats_logs: bool = False, trace_allocations: bool = False, execution_options: ExecutionOptions = <factory>, use_ray_tqdm: bool = True, enable_progress_bars: bool = True, enable_operator_progress_bars: bool = True, enable_progress_bar_name_truncation: bool = True, enable_get_object_locations_for_metrics: bool = False, write_file_retry_on_errors: ~typing.List[str] = ('AWS Error INTERNAL_FAILURE', 'AWS Error NETWORK_CONNECTION', 'AWS Error SLOW_DOWN', 'AWS Error UNKNOWN (HTTP status 503)'), warn_on_driver_memory_usage_bytes: int = 2147483648, actor_task_retry_on_errors: bool | ~typing.List[BaseException] = False, op_resource_reservation_enabled: bool = True, op_resource_reservation_ratio: float = 0.5, max_errored_blocks: int = 0, log_internal_stack_trace_to_stdout: bool = False, raise_original_map_exception: bool = False, print_on_execution_start: bool = True, s3_try_create_dir: bool = False, wait_for_min_actors_s: int = 600, retried_io_errors: ~typing.List[str] = <factory>)[source]#

Global settings for Ray Data.

Configure this class to enable advanced features and tune performance.

Warning

Apply changes before creating a Dataset. Changes made after won’t take effect.

Note

This object is automatically propagated to workers. Access it from the driver and remote workers with DataContext.get_current().

Examples

>>> from ray.data import DataContext
>>> DataContext.get_current().enable_progress_bars = False
Parameters:
  • target_max_block_size – The max target block size in bytes for reads and transformations.

  • target_shuffle_max_block_size – The max target block size in bytes for shuffle ops like random_shuffle, sort, and repartition.

  • target_min_block_size – Ray Data avoids creating blocks smaller than this size in bytes on read. This takes precedence over read_op_min_num_blocks.

  • streaming_read_buffer_size – Buffer size when doing streaming reads from local or remote storage.

  • enable_pandas_block – Whether pandas block format is enabled.

  • actor_prefetcher_enabled – Whether to use actor based block prefetcher.

  • use_push_based_shuffle – Whether to use push-based shuffle.

  • pipeline_push_based_shuffle_reduce_tasks

  • scheduling_strategy – The global scheduling strategy. For tasks with large args, scheduling_strategy_large_args takes precedence.

  • scheduling_strategy_large_args – Scheduling strategy for tasks with large args.

  • large_args_threshold – Size in bytes after which point task arguments are considered large. Choose a value so that the data transfer overhead is significant in comparison to task scheduling (i.e., low tens of ms).

  • use_polars – Whether to use Polars for tabular dataset sorts, groupbys, and aggregations.

  • eager_free – Whether to eagerly free memory.

  • decoding_size_estimation – Whether to estimate in-memory decoding data size for data source.

  • min_parallelism – This setting is deprecated. Use read_op_min_num_blocks instead.

  • read_op_min_num_blocks – Minimum number of read output blocks for a dataset.

  • enable_tensor_extension_casting – Whether to automatically cast NumPy ndarray columns in Pandas DataFrames to tensor extension columns.

  • use_arrow_tensor_v2 – Config enabling V2 version of ArrowTensorArray supporting tensors > 2Gb in size (off by default)

  • enable_fallback_to_arrow_object_ext_type – Enables fallback to serialize column values not suppported by Arrow natively (like user-defined custom Python classes for ex, etc) using ArrowPythonObjectType (simply serializing these as bytes)

  • enable_auto_log_stats – Whether to automatically log stats after execution. If disabled, you can still manually print stats with Dataset.stats().

  • verbose_stats_logs – Whether stats logs should be verbose. This includes fields such as extra_metrics in the stats output, which are excluded by default.

  • trace_allocations – Whether to trace allocations / eager free. This adds significant performance overheads and should only be used for debugging.

  • execution_options – The ExecutionOptions to use.

  • use_ray_tqdm – Whether to enable distributed tqdm.

  • enable_progress_bars – Whether to enable progress bars.

  • enable_progress_bar_name_truncation – If True, the name of the progress bar (often the operator name) will be truncated if it exceeds ProgressBar.MAX_NAME_LENGTH. Otherwise, the full operator name is shown.

  • enable_get_object_locations_for_metrics – Whether to enable get_object_locations for metrics.

  • write_file_retry_on_errors – A list of substrings of error messages that should trigger a retry when writing files. This is useful for handling transient errors when writing to remote storage systems.

  • warn_on_driver_memory_usage_bytes – If driver memory exceeds this threshold, Ray Data warns you. For now, this only applies to shuffle ops because most other ops are unlikely to use as much driver memory.

  • actor_task_retry_on_errors – The application-level errors that actor task should retry. This follows same format as retry_exceptions in Ray Core. Default to False to not retry on any errors. Set to True to retry all errors, or set to a list of errors to retry.

  • enable_op_resource_reservation – Whether to reserve resources for each operator.

  • op_resource_reservation_ratio – The ratio of the total resources to reserve for each operator.

  • max_errored_blocks – Max number of blocks that are allowed to have errors, unlimited if negative. This option allows application-level exceptions in block processing tasks. These exceptions may be caused by UDFs (e.g., due to corrupted data samples) or IO errors. Data in the failed blocks are dropped. This option can be useful to prevent a long-running job from failing due to a small number of bad blocks.

  • log_internal_stack_trace_to_stdout – Whether to include internal Ray Data/Ray Core code stack frames when logging to stdout. The full stack trace is always written to the Ray Data log file.

  • raise_original_map_exception – Whether to raise the original exception encountered in map UDF instead of wrapping it in a UserCodeException.

  • print_on_execution_start – If True, print execution information when execution starts.

  • s3_try_create_dir – If True, try to create directories on S3 when a write call is made with a S3 URI.

  • wait_for_min_actors_s – The default time to wait for minimum requested actors to start before raising a timeout, in seconds.

  • retried_io_errors – A list of substrings of error messages that should trigger a retry when reading or writing files. This is useful for handling transient errors when reading from remote storage systems.

DeveloperAPI: This API may change across minor Ray releases.

DataContext.get_current

Get or create a singleton context.