Global configuration
- class ray.data.DataContext(target_max_block_size: int = 134217728, target_shuffle_max_block_size: int = 1073741824, target_min_block_size: int = 1048576, streaming_read_buffer_size: int = 33554432, enable_pandas_block: bool = True, actor_prefetcher_enabled: bool = False, use_push_based_shuffle: bool = False, pipeline_push_based_shuffle_reduce_tasks: bool = True, scheduling_strategy: None | str | ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy | ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy | ray.util.scheduling_strategies.NodeLabelSchedulingStrategy = 'SPREAD', scheduling_strategy_large_args: None | str | ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy | ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy | ray.util.scheduling_strategies.NodeLabelSchedulingStrategy = 'DEFAULT', large_args_threshold: int = 52428800, use_polars: bool = False, eager_free: bool = True, decoding_size_estimation: bool = True, min_parallelism: int = 200, read_op_min_num_blocks: int = 200, enable_tensor_extension_casting: bool = True, enable_auto_log_stats: bool = False, verbose_stats_logs: bool = False, trace_allocations: bool = False, execution_options: ExecutionOptions = <factory>, use_ray_tqdm: bool = True, enable_progress_bars: bool = True, enable_get_object_locations_for_metrics: bool = False, write_file_retry_on_errors: typing.List[str] = ('AWS Error INTERNAL_FAILURE', 'AWS Error NETWORK_CONNECTION', 'AWS Error SLOW_DOWN'), warn_on_driver_memory_usage_bytes: int = 2147483648, actor_task_retry_on_errors: bool | typing.List[BaseException] = False, op_resource_reservation_enabled: bool = True, op_resource_reservation_ratio: float = 0.5, max_errored_blocks: int = 0, log_internal_stack_trace_to_stdout: bool = False, print_on_execution_start: bool = True)
Global settings for Ray Data.
Configure this class to enable advanced features and tune performance.
Warning
Apply changes before creating a Dataset. Changes made after won’t take effect.
Note
This object is automatically propagated to workers. Access it from the driver and remote workers with DataContext.get_current().
Examples
>>> from ray.data import DataContext
>>> DataContext.get_current().enable_progress_bars = False
- Parameters:
target_max_block_size – The max target block size in bytes for reads and transformations.
target_shuffle_max_block_size – The max target block size in bytes for shuffle ops like random_shuffle, sort, and repartition.
target_min_block_size – Ray Data avoids creating blocks smaller than this size in bytes on read. This takes precedence over read_op_min_num_blocks.
streaming_read_buffer_size – Buffer size when doing streaming reads from local or remote storage.
enable_pandas_block – Whether pandas block format is enabled.
actor_prefetcher_enabled – Whether to use actor based block prefetcher.
use_push_based_shuffle – Whether to use push-based shuffle.
pipeline_push_based_shuffle_reduce_tasks –
scheduling_strategy – The global scheduling strategy. For tasks with large args, scheduling_strategy_large_args takes precedence.
scheduling_strategy_large_args – Scheduling strategy for tasks with large args.
large_args_threshold – Size in bytes above which task arguments are considered large. Choose a value so that the data transfer overhead is significant in comparison to task scheduling overhead (i.e., low tens of ms).
use_polars – Whether to use Polars for tabular dataset sorts, groupbys, and aggregations.
eager_free – Whether to eagerly free memory.
decoding_size_estimation – Whether to estimate in-memory decoding data size for data source.
min_parallelism – This setting is deprecated. Use read_op_min_num_blocks instead.
read_op_min_num_blocks – Minimum number of read output blocks for a dataset.
enable_tensor_extension_casting – Whether to automatically cast NumPy ndarray columns in Pandas DataFrames to tensor extension columns.
enable_auto_log_stats – Whether to automatically log stats after execution. If disabled, you can still manually print stats with Dataset.stats().
verbose_stats_logs – Whether stats logs should be verbose. This includes fields such as extra_metrics in the stats output, which are excluded by default.
trace_allocations – Whether to trace allocations / eager free. This adds significant performance overheads and should only be used for debugging.
execution_options – The ExecutionOptions to use.
use_ray_tqdm – Whether to enable distributed tqdm.
enable_progress_bars – Whether to enable progress bars.
enable_get_object_locations_for_metrics – Whether to enable get_object_locations for metrics.
write_file_retry_on_errors – A list of substrings of error messages that should trigger a retry when writing files. This is useful for handling transient errors when writing to remote storage systems.
warn_on_driver_memory_usage_bytes – If driver memory exceeds this threshold, Ray Data warns you. For now, this only applies to shuffle ops because most other ops are unlikely to use as much driver memory.
actor_task_retry_on_errors – The application-level errors that actor tasks should retry. This follows the same format as retry_exceptions in Ray Core. Defaults to False, which doesn't retry on any errors. Set to True to retry all errors, or set to a list of errors to retry.
op_resource_reservation_enabled – Whether to reserve resources for each operator.
op_resource_reservation_ratio – The ratio of the total resources to reserve for each operator.
max_errored_blocks – Max number of blocks that are allowed to have errors, unlimited if negative. This option allows application-level exceptions in block processing tasks. These exceptions may be caused by UDFs (e.g., due to corrupted data samples) or IO errors. Data in the failed blocks are dropped. This option can be useful to prevent a long-running job from failing due to a small number of bad blocks.
log_internal_stack_trace_to_stdout – Whether to include internal Ray Data/Ray Core code stack frames when logging to stdout. The full stack trace is always written to the Ray Data log file.
print_on_execution_start – If True, print execution information when execution starts.
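Several defaults in the signature are raw byte counts, which can be hard to read at a glance. The sketch below converts them to binary units using only the standard library. The helper name human_bytes is hypothetical and not part of Ray; the numbers are copied from the signature above.

```python
# Byte-valued defaults copied from the DataContext signature.
DEFAULTS_IN_BYTES = {
    "target_max_block_size": 134217728,
    "target_shuffle_max_block_size": 1073741824,
    "target_min_block_size": 1048576,
    "streaming_read_buffer_size": 33554432,
    "large_args_threshold": 52428800,
    "warn_on_driver_memory_usage_bytes": 2147483648,
}


def human_bytes(n: float) -> str:
    """Format a byte count in binary units (hypothetical helper, not part of Ray)."""
    for unit in ("B", "KiB", "MiB"):
        if n < 1024:
            return f"{n:g} {unit}"
        n /= 1024
    return f"{n:g} GiB"


for name, value in DEFAULTS_IN_BYTES.items():
    print(f"{name}: {human_bytes(value)}")
```

In the order listed, these defaults work out to 128 MiB, 1 GiB, 1 MiB, 32 MiB, 50 MiB, and 2 GiB.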
DeveloperAPI: This API may change across minor Ray releases.
get_current() – Get or create a singleton context.