Global configuration

class ray.data.context.DataContext(target_max_block_size: int | None = 134217728, target_min_block_size: int = 1048576, streaming_read_buffer_size: int = 33554432, enable_pandas_block: bool = True, actor_prefetcher_enabled: bool = False, autoscaling_config: AutoscalingConfig = ..., use_push_based_shuffle: bool = False, _shuffle_strategy: ShuffleStrategy = ShuffleStrategy.HASH_SHUFFLE, pipeline_push_based_shuffle_reduce_tasks: bool = True, default_hash_shuffle_parallelism: int = 200, max_hash_shuffle_aggregators: int | None = None, min_hash_shuffle_aggregator_wait_time_in_s: int = 300, hash_shuffle_aggregator_health_warning_interval_s: int = 30, max_hash_shuffle_finalization_batch_size: int | None = None, join_operator_actor_num_cpus_override: float = None, hash_shuffle_operator_actor_num_cpus_override: float = None, hash_aggregate_operator_actor_num_cpus_override: float = None, scheduling_strategy: None | str | PlacementGroupSchedulingStrategy | NodeAffinitySchedulingStrategy | NodeLabelSchedulingStrategy = 'SPREAD', scheduling_strategy_large_args: None | str | PlacementGroupSchedulingStrategy | NodeAffinitySchedulingStrategy | NodeLabelSchedulingStrategy = 'DEFAULT', large_args_threshold: int = 52428800, use_polars: bool = False, use_polars_sort: bool = False, eager_free: bool = False, decoding_size_estimation: bool = True, min_parallelism: int = 200, read_op_min_num_blocks: int = 200, enable_tensor_extension_casting: bool = True, use_arrow_tensor_v2: bool = True, enable_fallback_to_arrow_object_ext_type: bool | None = None, enable_auto_log_stats: bool = False, verbose_stats_logs: bool = False, trace_allocations: bool = False, execution_options: ExecutionOptions = ..., use_ray_tqdm: bool = True, enable_progress_bars: bool = True, enable_operator_progress_bars: bool = True, enable_progress_bar_name_truncation: bool = True, enable_get_object_locations_for_metrics: bool = False, write_file_retry_on_errors: List[str] = ('AWS Error INTERNAL_FAILURE', 'AWS Error NETWORK_CONNECTION', 'AWS Error SLOW_DOWN', 'AWS Error UNKNOWN (HTTP status 503)'), warn_on_driver_memory_usage_bytes: int = 2147483648, actor_task_retry_on_errors: bool | List[BaseException] = False, op_resource_reservation_enabled: bool = True, op_resource_reservation_ratio: float = 0.5, max_errored_blocks: int = 0, log_internal_stack_trace_to_stdout: bool = False, raise_original_map_exception: bool = False, print_on_execution_start: bool = True, s3_try_create_dir: bool = False, wait_for_min_actors_s: int = -1, max_tasks_in_flight_per_actor: int | None = 4, retried_io_errors: List[str] = ..., enable_per_node_metrics: bool = False, override_object_store_memory_limit_fraction: float = None, memory_usage_poll_interval_s: float | None = 1, dataset_logger_id: str | None = None, _enable_actor_pool_on_exit_hook: bool = False, issue_detectors_config: IssueDetectorsConfiguration = ..., downstream_capacity_backpressure_ratio: float = None, downstream_capacity_backpressure_max_queued_bundles: int = None, enforce_schemas: bool = False, pandas_block_ignore_metadata: bool = False)

Global settings for Ray Data.

Configure this class to enable advanced features and tune performance.

Warning

Apply changes before creating a Dataset. Changes made afterward don’t take effect.

Note

This object is automatically propagated to workers. Access it from the driver and remote workers with DataContext.get_current().

Examples

>>> from ray.data import DataContext
>>> DataContext.get_current().enable_progress_bars = False

Parameters:

target_max_block_size – The max target block size in bytes for reads and transformations. If None, the block size is unbounded.
target_min_block_size – Ray Data avoids creating blocks smaller than this size in bytes on read. This takes precedence over read_op_min_num_blocks.
streaming_read_buffer_size – Buffer size when doing streaming reads from local or remote storage.
enable_pandas_block – Whether pandas block format is enabled.
actor_prefetcher_enabled – Whether to use actor based block prefetcher.
autoscaling_config – Autoscaling configuration.
use_push_based_shuffle – Whether to use push-based shuffle.
pipeline_push_based_shuffle_reduce_tasks
scheduling_strategy – The global scheduling strategy. For tasks with large args, scheduling_strategy_large_args takes precedence.
scheduling_strategy_large_args – Scheduling strategy for tasks with large args.
large_args_threshold – Size in bytes after which point task arguments are considered large. Choose a value so that the data transfer overhead is significant in comparison to task scheduling (i.e., low tens of ms).
use_polars – Whether to use Polars for tabular dataset sorts, groupbys, and aggregations.
eager_free – Whether to eagerly free memory.
decoding_size_estimation – Whether to estimate in-memory decoding data size for data source.
min_parallelism – This setting is deprecated. Use read_op_min_num_blocks instead.
read_op_min_num_blocks – Minimum number of read output blocks for a dataset.
enable_tensor_extension_casting – Whether to automatically cast NumPy ndarray columns in Pandas DataFrames to tensor extension columns.
use_arrow_tensor_v2 – Whether to enable the V2 version of ArrowTensorArray, which supports tensors larger than 2 GB. Defaults to True.
enable_fallback_to_arrow_object_ext_type – Enables falling back to ArrowPythonObjectType to serialize column values that Arrow doesn’t support natively (for example, user-defined custom Python classes) by serializing them as bytes.
enable_auto_log_stats – Whether to automatically log stats after execution. If disabled, you can still manually print stats with Dataset.stats().
verbose_stats_logs – Whether stats logs should be verbose. This includes fields such as extra_metrics in the stats output, which are excluded by default.
trace_allocations – Whether to trace allocations / eager free. This adds significant performance overheads and should only be used for debugging.
execution_options – The ExecutionOptions to use.
use_ray_tqdm – Whether to enable distributed tqdm.
enable_progress_bars – Whether to enable progress bars.
enable_operator_progress_bars – Whether to enable progress bars for individual operators during execution.
enable_progress_bar_name_truncation – If True, the name of the progress bar (often the operator name) will be truncated if it exceeds ProgressBar.MAX_NAME_LENGTH. Otherwise, the full operator name is shown.
enable_get_object_locations_for_metrics – Whether to enable get_object_locations for metrics.
write_file_retry_on_errors – A list of substrings of error messages that should trigger a retry when writing files. This is useful for handling transient errors when writing to remote storage systems.
warn_on_driver_memory_usage_bytes – If driver memory exceeds this threshold, Ray Data warns you. For now, this only applies to shuffle ops because most other ops are unlikely to use as much driver memory.
actor_task_retry_on_errors – The application-level errors that actor tasks should retry on. This follows the same format as retry_exceptions in Ray Core. Defaults to False, which doesn’t retry on any errors. Set to True to retry on all errors, or set to a list of errors to retry on.
op_resource_reservation_enabled – Whether to enable resource reservation for operators to prevent resource contention.
op_resource_reservation_ratio – The ratio of the total resources to reserve for each operator.
max_errored_blocks – Max number of blocks that are allowed to have errors, unlimited if negative. This option allows application-level exceptions in block processing tasks. These exceptions may be caused by UDFs (e.g., due to corrupted data samples) or IO errors. Data in the failed blocks are dropped. This option can be useful to prevent a long-running job from failing due to a small number of bad blocks.
log_internal_stack_trace_to_stdout – Whether to include internal Ray Data/Ray Core code stack frames when logging to stdout. The full stack trace is always written to the Ray Data log file.
raise_original_map_exception – Whether to raise the original exception encountered in map UDF instead of wrapping it in a UserCodeException.
print_on_execution_start – If True, print execution information when execution starts.
s3_try_create_dir – If True, try to create directories on S3 when a write call is made with a S3 URI.
wait_for_min_actors_s – The default time to wait for minimum requested actors to start before raising a timeout, in seconds.
max_tasks_in_flight_per_actor – Max number of tasks that can be submitted to an individual actor at the same time. Only up to max_concurrency of these tasks execute concurrently; the remaining ones wait in the actor’s queue. Buffering tasks in the queue overlaps pulling the blocks (which are task arguments) with the execution of prior tasks, maximizing each actor’s utilization.
retried_io_errors – A list of substrings of error messages that should trigger a retry when reading or writing files. This is useful for handling transient errors when reading from remote storage systems.
default_hash_shuffle_parallelism – Default parallelism level for hash-based shuffle operations if the number of partitions is unspecified.
max_hash_shuffle_aggregators – Maximum number of aggregating actors that can be provisioned for hash-shuffle aggregations.
min_hash_shuffle_aggregator_wait_time_in_s – Minimum time to wait for hash shuffle aggregators to become available, in seconds.
hash_shuffle_aggregator_health_warning_interval_s – Interval for health warning checks on hash shuffle aggregators, in seconds.
max_hash_shuffle_finalization_batch_size – Maximum batch size for concurrent hash-shuffle finalization tasks. If None, defaults to max_hash_shuffle_aggregators.
join_operator_actor_num_cpus_override – Override CPU allocation per partition for join operator actors.
hash_shuffle_operator_actor_num_cpus_override – Override CPU allocation per partition for hash shuffle operator actors.
hash_aggregate_operator_actor_num_cpus_override – Override CPU allocation per partition for hash aggregate operator actors.
use_polars_sort – Whether to use Polars for tabular dataset sorting operations.
enable_per_node_metrics – Enable per node metrics reporting for Ray Data, disabled by default.
override_object_store_memory_limit_fraction – Override the fraction of object store memory limit. If None, uses Ray’s default.
memory_usage_poll_interval_s – The interval, in seconds, at which to poll the USS (unique set size) of map tasks. If None, map tasks won’t record memory stats.
dataset_logger_id – Optional logger ID for dataset operations. If None, uses default logging configuration.
issue_detectors_config – Configuration for issue detection and monitoring during dataset operations.
downstream_capacity_backpressure_ratio – Ratio for downstream capacity backpressure control. A higher ratio causes backpressure to kick in later. If None, this type of backpressure is disabled.
downstream_capacity_backpressure_max_queued_bundles – Maximum number of queued bundles before applying backpressure. If None, no limit is applied.
enforce_schemas – Whether to enforce schema consistency across dataset operations.
pandas_block_ignore_metadata – Whether to ignore pandas metadata when converting between Arrow and pandas formats for better type inference.
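
write_file_retry_on_errors and retried_io_errors are both described as lists of substrings of error messages. The following is a minimal sketch of how substring matching could drive a retry decision. The helpers should_retry and call_with_retries and the max_attempts parameter are hypothetical names introduced for illustration; this is not Ray Data's implementation.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

# Patterns copied from the default value of write_file_retry_on_errors
# shown in the class signature.
RETRY_PATTERNS = (
    "AWS Error INTERNAL_FAILURE",
    "AWS Error NETWORK_CONNECTION",
    "AWS Error SLOW_DOWN",
    "AWS Error UNKNOWN (HTTP status 503)",
)


def should_retry(error: Exception, patterns: Sequence[str]) -> bool:
    """Return True if any pattern occurs as a substring of the error message."""
    message = str(error)
    return any(pattern in message for pattern in patterns)


def call_with_retries(
    fn: Callable[[], T], patterns: Sequence[str], max_attempts: int = 3
) -> T:
    """Call fn, retrying only when the raised error matches a pattern.

    Hypothetical helper: assumes max_attempts >= 1 and retries immediately,
    without backoff.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as error:
            # Re-raise on the last attempt or when the error is not transient.
            if attempt == max_attempts or not should_retry(error, patterns):
                raise
    raise AssertionError("unreachable when max_attempts >= 1")
```

Because matching is by substring, a pattern such as "AWS Error SLOW_DOWN" also matches longer messages that contain it, while unrelated errors (for example, a permission error) are raised immediately.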

DeveloperAPI: This API may change across minor Ray releases.

DataContext.get_current

Get or create the current DataContext.

class ray.data.context.AutoscalingConfig(actor_pool_util_upscaling_threshold: float = 2.0, actor_pool_util_downscaling_threshold: float = 0.5)

Configuration for autoscaling of Ray Data.

Parameters:

actor_pool_util_upscaling_threshold – Actor pool utilization threshold for upscaling. Once the actor pool exceeds this utilization threshold, it starts adding new actors. Actor pool utilization is defined as the ratio of the number of submitted tasks to the number of concurrency slots available to run them in the current set of actors. This value can exceed 100% when the number of submitted tasks exceeds the available concurrency slots, which is possible when max_tasks_in_flight_per_actor (defaults to 2x max_concurrency) is greater than the actor’s max_concurrency. The extra in-flight tasks let task execution overlap with fetching the blocks for the next task, so this threshold lets you trade off autoscaling speed against resource efficiency (i.e., making tasks wait instead of immediately triggering execution).
actor_pool_util_downscaling_threshold – Actor Pool utilization threshold for downscaling.
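
The utilization definition above can be made concrete with a small sketch. PoolState, utilization, and scaling_decision are hypothetical names used only for illustration, and the rule that the pool scales down when utilization falls below the downscaling threshold is an assumption; Ray Data's actual autoscaler may apply additional conditions.

```python
from dataclasses import dataclass


@dataclass
class PoolState:
    """Hypothetical snapshot of an actor pool, for illustration only."""

    num_actors: int
    max_concurrency: int  # concurrency slots per actor
    num_submitted_tasks: int  # tasks running plus tasks queued on actors


def utilization(pool: PoolState) -> float:
    """Ratio of submitted tasks to available concurrency slots."""
    slots = pool.num_actors * pool.max_concurrency
    if slots == 0:
        raise ValueError("pool has no concurrency slots")
    return pool.num_submitted_tasks / slots


def scaling_decision(
    pool: PoolState,
    upscaling_threshold: float = 2.0,
    downscaling_threshold: float = 0.5,
) -> str:
    """Compare utilization against the two thresholds."""
    util = utilization(pool)
    if util > upscaling_threshold:
        return "upscale"
    # Assumption: downscaling triggers when utilization drops below the threshold.
    if util < downscaling_threshold:
        return "downscale"
    return "hold"
```

For example, 2 actors with max_concurrency of 1 and 5 submitted tasks give a utilization of 2.5 (250%), which exceeds the default upscaling threshold of 2.0. This is only possible because more tasks can be in flight per actor than the actor runs concurrently.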

DeveloperAPI: This API may change across minor Ray releases.