ray.data.from_spark#
- ray.data.from_spark(df: pyspark.sql.DataFrame, *, parallelism: int | None = None) MaterializedDataset [source]#
Create a
Dataset
from a Spark DataFrame.- Parameters:
df – A Spark DataFrame, which must be created by RayDP (Spark-on-Ray).
parallelism – The amount of parallelism to use for the dataset. If not provided, the parallelism is equal to the number of partitions of the original Spark DataFrame.
- Returns:
A
MaterializedDataset
holding rows read from the DataFrame.