ray.data.extensions.tensor_extension.TensorArray
- class ray.data.extensions.tensor_extension.TensorArray(values: Union[numpy.ndarray, pandas.core.dtypes.generic.ABCSeries, Sequence[Union[numpy.ndarray, ray.air.util.tensor_extensions.pandas.TensorArrayElement]], ray.air.util.tensor_extensions.pandas.TensorArrayElement, Any])
Bases: pandas.core.arrays.base.ExtensionArray, ray.air.util.tensor_extensions.pandas._TensorOpsMixin, ray.air.util.tensor_extensions.pandas._TensorScalarCastMixin
Pandas ExtensionArray representing a tensor column, i.e. a column consisting of ndarrays as elements.
This extension supports tensors in which the elements have different shapes. However, each individual tensor element must be non-ragged, i.e. it must have a well-defined shape.
Examples
>>> # Create a DataFrame with a list of ndarrays as a column.
>>> import pandas as pd
>>> import numpy as np
>>> import ray
>>> from ray.data.extensions import TensorArray
>>> df = pd.DataFrame({
...     "one": [1, 2, 3],
...     "two": TensorArray(np.arange(24).reshape((3, 2, 2, 2)))})
>>> # Note that the column dtype is TensorDtype.
>>> df.dtypes
one                                           int64
two    TensorDtype(shape=(3, 2, 2, 2), dtype=int64)
dtype: object
>>> # Pandas is aware of this tensor column, and we can do the
>>> # typical DataFrame operations on this column.
>>> col = 2 * (df["two"] + 1)
>>> # The ndarrays underlying the tensor column will be manipulated,
>>> # but the column itself will continue to be a Pandas type.
>>> type(col)
pandas.core.series.Series
>>> col
0    [[[ 2  4]
       [ 6  8]]
      [[10 12]
       [14 16]]]
1    [[[18 20]
       [22 24]]
      [[26 28]
       [30 32]]]
2    [[[34 36]
       [38 40]]
      [[42 44]
       [46 48]]]
Name: two, dtype: TensorDtype(shape=(3, 2, 2, 2), dtype=int64)
>>> # Once you do an aggregation on that column that returns a single
>>> # row's value, you get back our TensorArrayElement type.
>>> tensor = col.mean()
>>> type(tensor)
ray.data.extensions.tensor_extension.TensorArrayElement
>>> tensor
array([[[18., 20.],
        [22., 24.]],
       [[26., 28.],
        [30., 32.]]])
>>> # This is a light wrapper around a NumPy ndarray, and can easily
>>> # be converted to an ndarray.
>>> type(tensor.to_numpy())
numpy.ndarray
>>> # In addition to doing Pandas operations on the tensor column,
>>> # you can now put the DataFrame into a Dataset.
>>> ds = ray.data.from_pandas(df)
>>> # Internally, this column is represented as the corresponding
>>> # Arrow tensor extension type.
>>> ds.schema()
one: int64
two: extension<arrow.py_extension_type<ArrowTensorType>>
>>> # You can write the dataset to Parquet.
>>> ds.write_parquet("/some/path")
>>> # And you can read it back.
>>> read_ds = ray.data.read_parquet("/some/path")
>>> read_ds.schema()
one: int64
two: extension<arrow.py_extension_type<ArrowTensorType>>
>>> read_df = ray.get(read_ds.to_pandas_refs())[0]
>>> read_df.dtypes
one                                           int64
two    TensorDtype(shape=(3, 2, 2, 2), dtype=int64)
dtype: object
>>> # The tensor extension type is preserved along the
>>> # Pandas --> Arrow --> Parquet --> Arrow --> Pandas
>>> # conversion chain.
>>> read_df.equals(df)
True
PublicAPI (beta): This API is in beta and may change before becoming stable.
- property dtype: pandas.core.dtypes.base.ExtensionDtype
An instance of ‘ExtensionDtype’.
- property is_variable_shaped
Whether this TensorArray holds variable-shaped tensor elements.
- property nbytes: int
The number of bytes needed to store this object in memory.
- isna() → ray.air.util.tensor_extensions.pandas.TensorArray
A 1-D array indicating if each value is missing.
- Returns
na_values – In most cases, this should return a NumPy ndarray. For exceptional cases like SparseArray, where returning an ndarray would be expensive, an ExtensionArray may be returned.
- Return type
Union[np.ndarray, ExtensionArray]
Notes
If returning an ExtensionArray, then:
na_values._is_boolean should be True
na_values should implement ExtensionArray._reduce()
na_values.any and na_values.all should be implemented
- take(indices: Sequence[int], allow_fill: bool = False, fill_value: Optional[Any] = None) → ray.air.util.tensor_extensions.pandas.TensorArray
Take elements from an array.
- Parameters
indices (sequence of int) – Indices to be taken.
allow_fill (bool, default False) – How to handle negative values in indices.
False: negative values in indices indicate positional indices from the right (the default). This is similar to numpy.take().
True: -1 in indices indicates a missing value, which is set to fill_value. Any other negative value raises a ValueError.
fill_value (any, optional) – Fill value to use for NA-indices when allow_fill is True. This may be None, in which case the default NA value for the type, self.dtype.na_value, is used.
For many ExtensionArrays, there will be two representations of fill_value: a user-facing “boxed” scalar, and a low-level physical NA value. fill_value should be the user-facing version, and the implementation should handle translating that to the physical version for processing the take if necessary.
- Returns
- Return type
ExtensionArray
- Raises
IndexError – When the indices are out of bounds for the array.
ValueError – When indices contains negative values other than -1 and allow_fill is True.
See also
numpy.take – Take elements from an array along an axis.
api.extensions.take – Take elements from an array.
Notes
ExtensionArray.take is called by Series.__getitem__, .loc, and .iloc when indices is a sequence of values. Additionally, it’s called by Series.reindex(), or any other method that causes realignment, with a fill_value.
Examples
Here’s an example implementation, which relies on casting the extension array to object dtype. This uses the helper method pandas.api.extensions.take().
def take(self, indices, allow_fill=False, fill_value=None):
    from pandas.api.extensions import take

    # If the ExtensionArray is backed by an ndarray, then
    # just pass that here instead of coercing to object.
    data = self.astype(object)

    if allow_fill and fill_value is None:
        fill_value = self.dtype.na_value

    # fill value should always be translated from the scalar
    # type for the array, to the physical storage type for
    # the data, before passing to take.
    result = take(data, indices, fill_value=fill_value,
                  allow_fill=allow_fill)
    return self._from_sequence(result, dtype=self.dtype)
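The index-handling rules for allow_fill can also be sketched in plain Python, independent of pandas or Ray. The function below, take_sketch, is a hypothetical illustration written for this page; it is not part of TensorArray, and it only models the indexing rules documented above on an ordinary sequence.

```python
from typing import Any, List, Optional, Sequence


def take_sketch(
    values: Sequence[Any],
    indices: Sequence[int],
    allow_fill: bool = False,
    fill_value: Optional[Any] = None,
) -> List[Any]:
    """Hypothetical model of the documented take() indexing rules."""
    n = len(values)
    result = []
    for i in indices:
        if allow_fill:
            if i == -1:
                # -1 marks a missing position and is replaced by fill_value.
                result.append(fill_value)
                continue
            if i < -1:
                # Any other negative index is rejected when allow_fill=True.
                raise ValueError("negative indices other than -1 are invalid")
            if i >= n:
                raise IndexError("index out of bounds")
            result.append(values[i])
        else:
            # Negative indices count from the right, as in numpy.take().
            if i < -n or i >= n:
                raise IndexError("index out of bounds")
            result.append(values[i])
    return result


data = ["a", "b", "c"]
print(take_sketch(data, [0, -1]))                   # ['a', 'c']
print(take_sketch(data, [0, -1], allow_fill=True))  # ['a', None]
```

With allow_fill=False the -1 selects the last element, while with allow_fill=True the same index produces the fill value instead.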
- copy() → ray.air.util.tensor_extensions.pandas.TensorArray
Return a copy of the array.
- Returns
- Return type
ExtensionArray
- to_numpy(dtype: Optional[numpy.dtype] = None, copy: bool = False, na_value: Any = NoDefault.no_default)
Convert to a NumPy ndarray.
New in version 1.0.0.
This is similar to numpy.asarray(), but may provide additional control over how the conversion is done.
- Parameters
dtype (str or numpy.dtype, optional) – The dtype to pass to numpy.asarray().
copy (bool, default False) – Whether to ensure that the returned value is not a view on another array. Note that copy=False does not ensure that to_numpy() is no-copy. Rather, copy=True ensures that a copy is made, even if not strictly necessary.
na_value (Any, optional) – The value to use for missing values. The default value depends on dtype and the type of the array.
- Returns
- Return type
numpy.ndarray
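The difference between "no copy requested" and "copy guaranteed" can be seen with NumPy alone. The sketch below uses a plain ndarray named backing as a stand-in for the data behind a tensor column; it does not call TensorArray.to_numpy() itself, so treat it as an illustration of the semantics of numpy.asarray() that the description refers to.

```python
import numpy as np

# Stand-in for the ndarray behind a tensor column (an assumption for this sketch).
backing = np.arange(6, dtype=np.int64).reshape(3, 2)

# No copy requested: NumPy hands back the same buffer when it can.
maybe_view = np.asarray(backing)

# Copy requested: the result never shares memory with the input.
forced_copy = np.array(backing, copy=True)

# A dtype change cannot be served from the same buffer, so a copy is
# made even though none was requested.
converted = np.asarray(backing, dtype=np.float64)

print(np.shares_memory(backing, maybe_view))   # True
print(np.shares_memory(backing, forced_copy))  # False
print(np.shares_memory(backing, converted))    # False
```

This mirrors the note above: leaving copy at False is only a permission to avoid copying, not a promise that no copy happens.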
- property numpy_dtype
Get the dtype of the tensor.
- Returns
The NumPy dtype of the backing ndarray.
- property numpy_ndim
Get the number of tensor dimensions.
- Returns
Integer for the number of dimensions.
- property numpy_shape
Get the shape of the tensor.
- Returns
A tuple of integers for the NumPy shape of the backing ndarray.
- property numpy_size
Get the size of the tensor.
- Returns
Integer for the number of elements in the tensor.
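As a rough guide to what the four numpy_* properties report, the sketch below inspects a plain ndarray shaped like the one used in the class example. It assumes that the properties mirror the same-named attributes of the backing ndarray, which the descriptions suggest but this page does not spell out; the variable backing is a stand-in, not a TensorArray.

```python
import numpy as np

# Stand-in for the backing ndarray of a 3-row tensor column with
# 2x2x2 elements (an assumption for this sketch).
backing = np.arange(24, dtype=np.int64).reshape((3, 2, 2, 2))

print(backing.dtype)  # int64
print(backing.ndim)   # 4
print(backing.shape)  # (3, 2, 2, 2)
print(backing.size)   # 24
```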
- astype(dtype, copy=True)
Cast to a NumPy array with ‘dtype’.
- Parameters
dtype (str or dtype) – Typecode or data-type to which the array is cast.
copy (bool, default True) – Whether to copy the data, even if not necessary. If False, a copy is made only if the old dtype does not match the new dtype.
- Returns
array – NumPy ndarray with ‘dtype’ for its dtype.
- Return type
ndarray
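The copy rule described for astype matches the behavior of numpy.ndarray.astype, which the following sketch demonstrates on a plain ndarray. Whether TensorArray.astype forwards to NumPy in exactly this way for every dtype is an assumption here; the sketch only illustrates the documented rule.

```python
import numpy as np

arr = np.arange(4, dtype=np.int64)

# copy=True (the default): always a new array, even for the same dtype.
same_dtype_copy = arr.astype(np.int64)

# copy=False with a matching dtype: no copy, the input is returned as-is.
same_dtype_nocopy = arr.astype(np.int64, copy=False)

# copy=False with a different dtype: a copy is unavoidable.
new_dtype_nocopy = arr.astype(np.float32, copy=False)

print(same_dtype_copy is arr)    # False
print(same_dtype_nocopy is arr)  # True
print(new_dtype_nocopy.dtype)    # float32
```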
- any(axis=None, out=None, keepdims=False)
Test whether any array element along a given axis evaluates to True.
See numpy.any() documentation for more information https://numpy.org/doc/stable/reference/generated/numpy.any.html#numpy.any
- Parameters
axis – Axis or axes along which a logical OR reduction is performed.
out – Alternate output array in which to place the result.
keepdims – If this is set to True, the axes which are reduced are left in the result as dimensions with size one.
- Returns
A single boolean if axis is None; otherwise a TensorArray.
- all(axis=None, out=None, keepdims=False)
Test whether all array elements along a given axis evaluate to True.
- Parameters
axis – Axis or axes along which a logical AND reduction is performed.
out – Alternate output array in which to place the result.
keepdims – If this is set to True, the axes which are reduced are left in the result as dimensions with size one.
- Returns
A single boolean if axis is None; otherwise a TensorArray.
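The effect of axis and keepdims on any() and all() follows numpy.any() and numpy.all(). The sketch below shows those reductions on a plain boolean ndarray named mask; it is a stand-in for illustration, so the results are NumPy values rather than the TensorArray that the methods above return when axis is given.

```python
import numpy as np

mask = np.array([[True, False],
                 [False, False]])

# axis=None reduces over every element to a single boolean.
print(bool(np.any(mask)))  # True
print(bool(np.all(mask)))  # False

# A given axis reduces along that axis only (logical OR / logical AND).
print(np.any(mask, axis=0).tolist())  # [True, False]
print(np.all(mask, axis=1).tolist())  # [False, False]

# keepdims=True leaves the reduced axis in place with size one.
print(np.any(mask, axis=0, keepdims=True).shape)  # (1, 2)
```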