ray.data.preprocessors.BatchMapper
class ray.data.preprocessors.BatchMapper(fn: Union[Callable[[pandas.DataFrame], pandas.DataFrame], Callable[[Union[numpy.ndarray, Dict[str, numpy.ndarray]]], Union[numpy.ndarray, Dict[str, numpy.ndarray]]]], batch_format: Optional[ray.air.util.data_batch_conversion.BatchFormat], batch_size: Optional[Union[int, Literal["default"]]] = "default")
Bases: ray.data.preprocessor.Preprocessor
Apply an arbitrary operation to a dataset.

BatchMapper applies a user-defined function to batches of a dataset. A batch is a Pandas DataFrame that represents a small amount of data. By modifying batches instead of individual records, this class can efficiently transform a dataset with vectorized operations.

Use this preprocessor to apply stateless operations that aren't already built-in.
Tip

BatchMapper doesn't need to be fit. You can call transform without calling fit.

Examples
Use BatchMapper to apply arbitrary operations like dropping a column.

>>> import pandas as pd
>>> import numpy as np
>>> from typing import Dict
>>> import ray
>>> from ray.data.preprocessors import BatchMapper
>>>
>>> df = pd.DataFrame({"X": [0, 1, 2], "Y": [3, 4, 5]})
>>> ds = ray.data.from_pandas(df)
>>>
>>> def fn(batch: pd.DataFrame) -> pd.DataFrame:
...     return batch.drop("Y", axis="columns")
>>>
>>> preprocessor = BatchMapper(fn, batch_format="pandas")
>>> preprocessor.transform(ds)
Dataset(num_blocks=1, num_rows=3, schema={X: int64})
>>>
>>> def fn_numpy(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
...     return {"X": batch["X"]}
>>> preprocessor = BatchMapper(fn_numpy, batch_format="numpy")
>>> preprocessor.transform(ds)
Dataset(num_blocks=1, num_rows=3, schema={X: int64})
Parameters

fn – The function to apply to data batches.

batch_size – The desired number of rows in each data batch provided to fn. Semantics are the same as in dataset.map_batches(): specifying None will use the entire underlying blocks as batches (blocks may contain different numbers of rows), and the actual size of the batch provided to fn may be smaller than batch_size if batch_size doesn't evenly divide the block(s) sent to a given map task. Defaults to 4096, which is the same default value as dataset.map_batches().

batch_format – The preferred batch format to use in the UDF. If not given, the format is inferred from the input dataset's data format.
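The batch_size semantics above can be illustrated with a minimal pure-Python sketch (no Ray required). The helper function below is hypothetical, not part of the Ray API; it only mirrors the described behavior, where the last batch cut from a block may hold fewer rows than batch_size.

```python
from typing import List


def split_into_batches(rows: List[int], batch_size: int) -> List[List[int]]:
    """Split a block of rows into batches of at most ``batch_size`` rows.

    Mirrors the documented semantics: if ``batch_size`` doesn't evenly
    divide the block, the final batch is smaller than ``batch_size``.
    """
    return [rows[i : i + batch_size] for i in range(0, len(rows), batch_size)]


# A block of 10 rows with batch_size=4 yields batches of 4, 4, and 2 rows.
batches = split_into_batches(list(range(10)), batch_size=4)
print([len(b) for b in batches])  # → [4, 4, 2]
```

With batch_size=None, the whole block would instead be passed to fn as a single batch, so batch sizes then track the (possibly uneven) block sizes of the dataset.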
PublicAPI (alpha): This API is in alpha and may change before becoming stable.