ray.data.from_spark
- ray.data.from_spark(df: pyspark.sql.DataFrame, *, parallelism: int | None = None, override_num_blocks: int | None = None) -> MaterializedDataset
Create a Dataset from a Spark DataFrame.
- Parameters:
df – A Spark DataFrame, which must be created by RayDP (Spark-on-Ray).
parallelism – This argument is deprecated. Use the override_num_blocks argument instead.
override_num_blocks – Override the number of output blocks from all read tasks. By default, the number of output blocks is dynamically decided based on input data size and available resources. You shouldn’t manually set this value in most cases.
- Returns:
A MaterializedDataset holding rows read from the DataFrame.
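- Example:
The following is a minimal sketch of how the call might be used, not an official example. It assumes that Ray, RayDP, and PySpark are installed. The helper name `spark_to_ray_dataset`, the function `run_example`, the application name, and the executor sizing values are all hypothetical choices made for illustration. The helper leaves `override_num_blocks` unset unless a value is passed, which follows the advice above to let Ray pick the block count in most cases.

```python
from typing import Any, Optional


def spark_to_ray_dataset(df: Any, num_blocks: Optional[int] = None):
    """Convert a RayDP-created Spark DataFrame into a Ray Dataset.

    Hypothetical helper: forwards override_num_blocks only when the
    caller explicitly asks for a specific number of output blocks.
    """
    # Imported lazily so this helper can be defined without Ray installed.
    import ray

    if num_blocks is None:
        # Default path: Ray decides the number of output blocks.
        return ray.data.from_spark(df)
    return ray.data.from_spark(df, override_num_blocks=num_blocks)


def run_example() -> None:
    """Illustrative end-to-end run; requires Ray, RayDP, and PySpark."""
    import ray
    import raydp

    ray.init()
    # The DataFrame must come from a Spark session started by RayDP.
    # The sizing values below are placeholders, not recommendations.
    spark = raydp.init_spark(
        app_name="from_spark_example",
        num_executors=1,
        executor_cores=1,
        executor_memory="500M",
    )
    try:
        df = spark.createDataFrame(
            [(i, str(i)) for i in range(100)], ["id", "label"]
        )
        ds = spark_to_ray_dataset(df)
        print(ds.schema())
    finally:
        # Release Spark and Ray resources even if the conversion fails.
        raydp.stop_spark()
        ray.shutdown()
```

Calling `run_example()` on a machine with RayDP available should start a local Spark-on-Ray session, convert the DataFrame, and print the resulting schema. Passing `num_blocks` to the helper is only worthwhile when the default block count is known to be a poor fit for the workload.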