Using Mars on Ray#

Mars is a tensor-based unified framework for large-scale data computation which scales Numpy, Pandas and Scikit-learn. Mars on Ray makes it easy to scale your programs with a Ray cluster. Currently Mars on Ray supports both Ray actors and tasks as execution backend. The task will be scheduled by mars scheduler if Ray actors is used. This mode can reuse all mars scheduler optimizations. If ray tasks mode is used, all tasks will be scheduled by ray, which can reuse failover and pipeline capabilities provided by ray futures.

Installation#

You can simply install Mars via pip:

pip install pymars>=0.8.3

Getting started#

It’s easy to run Mars jobs on a Ray cluster.

Starting a new Mars on Ray runtime locally via:

import ray
ray.init()
import mars
mars.new_ray_session()
import mars.tensor as mt
mt.random.RandomState(0).rand(1000_0000, 5).sum().execute()

Or connecting to a Mars on Ray runtime which is already initialized:

import mars
mars.new_ray_session('http://<web_ip>:<ui_port>')
# perform computation

Interact with Dataset:

import mars.tensor as mt
import mars.dataframe as md
df = md.DataFrame(
    mt.random.rand(1000_0000, 4),
    columns=list('abcd'))
# Convert mars dataframe to ray dataset
import ray
# ds = md.to_ray_dataset(df)
ds = ray.data.from_mars(df)
print(ds.schema(), ds.count())
ds.filter(lambda row: row["a"] > 0.5).show(5)
# Convert ray dataset to mars dataframe
# df2 = md.read_ray_dataset(ds)
df2 = ds.to_mars()
print(df2.head(5).execute())

Refer to Mars on Ray: https://mars-project.readthedocs.io/en/latest/installation/ray.html#mars-ray for more information.