Ray State API

Note

These APIs are in alpha. This feature requires a full installation of Ray using pip install "ray[default]".

For an overview with examples see Monitoring Ray States.

For the CLI reference see Ray State CLI Reference or Ray Log CLI Reference.

State Python SDK

State APIs are also exported as functions.

Summary APIs

ray.experimental.state.api.summarize_actors(address: Optional[str] = None, timeout: int = 30, raise_on_missing_output: bool = True, _explain: bool = False) Dict[source]

Summarize the actors in the cluster.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • timeout – Max timeout for requests made when getting the states.

  • raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.

  • _explain – Print the API information such as API latency or failed query information.

Returns

Dictionarified ActorSummaries

Raises

RayStateApiException if the CLI failed to query the data.
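
For example, a minimal usage sketch for summarize_actors (it assumes a Ray cluster is already running on this machine; the printed dictionary layout follows the ActorSummaries/StateSummary schema described later on this page):

>>> import ray
>>> from ray.experimental.state.api import summarize_actors
>>> ray.init(address="auto")  # connect to the running cluster
>>> summary = summarize_actors()  # address is resolved from the initialized ray
>>> print(summary)  # dictionarified ActorSummaries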

ray.experimental.state.api.summarize_objects(address: Optional[str] = None, timeout: int = 30, raise_on_missing_output: bool = True, _explain: bool = False) Dict[source]

Summarize the objects in the cluster.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • timeout – Max timeout for requests made when getting the states.

  • raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.

  • _explain – Print the API information such as API latency or failed query information.

Returns

Dictionarified ObjectSummaries

Raises

RayStateApiException if the CLI failed to query the data.

ray.experimental.state.api.summarize_tasks(address: Optional[str] = None, timeout: int = 30, raise_on_missing_output: bool = True, _explain: bool = False) Dict[source]

Summarize the tasks in the cluster.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • timeout – Max timeout for requests made when getting the states.

  • raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.

  • _explain – Print the API information such as API latency or failed query information.

Returns

Dictionarified TaskSummaries

Raises

RayStateApiException if the CLI failed to query the data.

List APIs

ray.experimental.state.api.list_actors(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict][source]

List actors in the cluster.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("id", "=", "abcd")

  • limit – Max number of entries returned by the state backend.

  • timeout – Max timeout value for the state APIs requests made.

  • detail – When True, more detailed info (specified in ActorState) will be queried and returned. See ActorState.

  • raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.

  • _explain – Print the API information such as API latency or failed query information.

Returns

List of dictionarified ActorState.

Raises

RayStateApiException if the CLI failed to query the data.
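
A usage sketch for list_actors, filtering on one of the filterable columns listed in the ActorState schema below (the returned dictionaries use the ActorState field names):

>>> from ray.experimental.state.api import list_actors
>>> # Only ALIVE actors, including the detail-only ActorState fields.
>>> actors = list_actors(filters=[("state", "=", "ALIVE")], detail=True)
>>> for actor in actors:
...     print(actor["actor_id"], actor["class_name"], actor["pid"])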

ray.experimental.state.api.list_placement_groups(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict][source]

List placement groups in the cluster.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("state", "=", "abcd")

  • limit – Max number of entries returned by the state backend.

  • timeout – Max timeout value for the state APIs requests made.

  • detail – When True, more detailed info (specified in PlacementGroupState) will be queried and returned. See PlacementGroupState.

  • raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.

  • _explain – Print the API information such as API latency or failed query information.

Returns

List of dictionarified PlacementGroupState.

Raises

RayStateApiException if the CLI failed to query the data.

ray.experimental.state.api.list_nodes(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict][source]

List nodes in the cluster.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("node_name", "=", "abcd")

  • limit – Max number of entries returned by the state backend.

  • timeout – Max timeout value for the state APIs requests made.

  • detail – When True, more detailed info (specified in NodeState) will be queried and returned. See NodeState.

  • raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.

  • _explain – Print the API information such as API latency or failed query information.

Returns

List of dictionarified NodeState.

Raises

RayStateApiException if the CLI failed to query the data.
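
For example, a short sketch that lists only the alive nodes (field names follow the NodeState schema below):

>>> from ray.experimental.state.api import list_nodes
>>> alive_nodes = list_nodes(filters=[("state", "=", "ALIVE")])
>>> for node in alive_nodes:
...     print(node["node_id"], node["node_ip"], node["resources_total"])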

ray.experimental.state.api.list_jobs(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict][source]

List jobs submitted to the cluster via the Ray Job Submission API.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("status", "=", "abcd")

  • limit – Max number of entries returned by the state backend.

  • timeout – Max timeout value for the state APIs requests made.

  • detail – When True, more detailed info (specified in JobState) will be queried and returned. See JobState.

  • raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.

  • _explain – Print the API information such as API latency or failed query information.

Returns

List of dictionarified JobState.

Raises

RayStateApiException if the CLI failed to query the data.

ray.experimental.state.api.list_workers(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict][source]

List workers in the cluster.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("is_alive", "=", "True")

  • limit – Max number of entries returned by the state backend.

  • timeout – Max timeout value for the state APIs requests made.

  • detail – When True, more detailed info (specified in WorkerState) will be queried and returned. See WorkerState.

  • raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.

  • _explain – Print the API information such as API latency or failed query information.

Returns

List of dictionarified WorkerState.

Raises

RayStateApiException if the CLI failed to query the data.

ray.experimental.state.api.list_tasks(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict][source]

List tasks in the cluster.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("scheduling_state", "=", "RUNNING")

  • limit – Max number of entries returned by the state backend.

  • timeout – Max timeout value for the state APIs requests made.

  • detail – When True, more detailed info (specified in TaskState) will be queried and returned. See TaskState.

  • raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.

  • _explain – Print the API information such as API latency or failed query information.

Returns

List of dictionarified TaskState.

Raises

RayStateApiException if the CLI failed to query the data.
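
A usage sketch for list_tasks, filtering on scheduling_state (one of the filterable TaskState columns below); the limit is raised because a busy cluster can easily exceed the default of 100 entries:

>>> from ray.experimental.state.api import list_tasks
>>> running = list_tasks(filters=[("scheduling_state", "=", "RUNNING")], limit=1000)
>>> for task in running:
...     print(task["task_id"], task["func_or_class_name"])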

ray.experimental.state.api.list_objects(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict][source]

List objects in the cluster.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("ip", "=", "0.0.0.0")

  • limit – Max number of entries returned by the state backend.

  • timeout – Max timeout value for the state APIs requests made.

  • detail – When True, more detailed info (specified in ObjectState) will be queried and returned. See ObjectState.

  • raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.

  • _explain – Print the API information such as API latency or failed query information.

Returns

List of dictionarified ObjectState.

Raises

RayStateApiException if the CLI failed to query the data.

ray.experimental.state.api.list_runtime_envs(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict][source]

List runtime environments in the cluster.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("node_id", "=", "abcdef")

  • limit – Max number of entries returned by the state backend.

  • timeout – Max timeout value for the state APIs requests made.

  • detail – When True, more detailed info (specified in RuntimeEnvState) will be queried and returned. See RuntimeEnvState.

  • raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.

  • _explain – Print the API information such as API latency or failed query information.

Returns

List of dictionarified RuntimeEnvState.

Raises

RayStateApiException if the CLI failed to query the data.

Get APIs

ray.experimental.state.api.get_actor(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) Optional[Dict][source]

Get an actor by id.

Parameters
  • id – Id of the actor

  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • timeout – Max timeout value for the state API requests made.

  • _explain – Print the API information such as API latency or failed query information.

Returns

None if the actor is not found, or dictionarified ActorState.

Raises

RayStateApiException if the CLI failed to query the data.
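
A minimal sketch for get_actor, assuming at least one actor exists in the cluster (the actor id is taken from a list_actors() call rather than hard-coded):

>>> from ray.experimental.state.api import get_actor, list_actors
>>> actor_id = list_actors(limit=1)[0]["actor_id"]
>>> actor = get_actor(id=actor_id)
>>> print(actor["state"] if actor else "actor not found")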

ray.experimental.state.api.get_placement_group(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) Optional[Dict][source]

Get a placement group by id.

Parameters
  • id – Id of the placement group

  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • timeout – Max timeout value for the state APIs requests made.

  • _explain – Print the API information such as API latency or failed query information.

Returns

None if the placement group is not found, or dictionarified PlacementGroupState.

Raises

RayStateApiException if the CLI failed to query the data.

ray.experimental.state.api.get_node(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) Optional[Dict][source]

Get a node by id.

Parameters
  • id – Id of the node.

  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • timeout – Max timeout value for the state APIs requests made.

  • _explain – Print the API information such as API latency or failed query information.

Returns

None if the node is not found, or dictionarified NodeState.

Raises

RayStateApiException if the CLI failed to query the data.

ray.experimental.state.api.get_worker(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) Optional[Dict][source]

Get a worker by id.

Parameters
  • id – Id of the worker

  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • timeout – Max timeout value for the state APIs requests made.

  • _explain – Print the API information such as API latency or failed query information.

Returns

None if the worker is not found, or dictionarified WorkerState.

Raises

RayStateApiException if the CLI failed to query the data.

ray.experimental.state.api.get_task(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) Optional[Dict][source]

Get a task by id.

Parameters
  • id – Id of the task

  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • timeout – Max timeout value for the state APIs requests made.

  • _explain – Print the API information such as API latency or failed query information.

Returns

None if the task is not found, or dictionarified TaskState.

Raises

RayStateApiException if the CLI failed to query the data.

ray.experimental.state.api.get_objects(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) List[Dict][source]

Get objects by id.

There could be more than one entry returned since an object could be referenced in different places.

Parameters
  • id – Id of the object

  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • timeout – Max timeout value for the state APIs requests made.

  • _explain – Print the API information such as API latency or failed query information.

Returns

List of dictionarified ObjectState.

Raises

RayStateApiException if the CLI failed to query the data.
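
A minimal sketch for get_objects, assuming ray.init() has already been called in this process; the object id is obtained from the ObjectRef's hex() method:

>>> import ray
>>> from ray.experimental.state.api import get_objects
>>> ref = ray.put(1)
>>> entries = get_objects(id=ref.hex())  # one entry per place the object is referenced
>>> for entry in entries:
...     print(entry["reference_type"], entry["pid"], entry["ip"])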

Log APIs

ray.experimental.state.api.list_logs(address: Optional[str] = None, node_id: Optional[str] = None, node_ip: Optional[str] = None, glob_filter: Optional[str] = None, timeout: int = 30) Dict[str, List[str]][source]

List the available log files.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If not specified, it will be retrieved from the initialized ray cluster.

  • node_id – Id of the node containing the logs.

  • node_ip – Ip of the node containing the logs.

  • glob_filter – Name of the file (relative to the ray log directory) to be retrieved. E.g. glob_filter="*worker*" for all worker logs.

  • timeout – Max timeout for requests made when getting the logs.

Returns

A dictionary where the keys are log groups (e.g. gcs, raylet, worker), and the values are lists of log filenames.

Raises

RayStateApiException if the CLI failed to query the data, or ConnectionError if it failed to resolve the ray address.
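
For example, a sketch that lists the GCS log files on one node (the node id is taken from ray.nodes(); the exact grouping keys in the returned dictionary depend on which log files are present):

>>> import ray
>>> from ray.experimental.state.api import list_logs
>>> ray.init(address="auto")
>>> node_id = ray.nodes()[0]["NodeID"]
>>> print(list_logs(node_id=node_id, glob_filter="*gcs*"))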

ray.experimental.state.api.get_log(address: Optional[str] = None, node_id: Optional[str] = None, node_ip: Optional[str] = None, filename: Optional[str] = None, actor_id: Optional[str] = None, task_id: Optional[str] = None, pid: Optional[int] = None, follow: bool = False, tail: int = 1000, timeout: int = 30, suffix: Optional[str] = None, _interval: Optional[float] = None) Generator[str, None, None][source]

Retrieve a log file based on the file name or some entity ids (pid, actor id, task id).

Examples

>>> import ray
>>> from ray.experimental.state.api import get_log
>>> # Connect to an existing Ray instance if there is one.
>>> ray.init("auto")
>>> # The node IP can be retrieved from list_nodes() or ray.nodes().
>>> node_ip = "172.31.47.143"
>>> filename = "gcs_server.out"
>>> for line in get_log(filename=filename, node_ip=node_ip):
...     print(line)
Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If not specified, it will be retrieved from the initialized ray cluster.

  • node_id – Id of the node containing the logs.

  • node_ip – Ip of the node containing the logs. (At least one of node_id and node_ip has to be supplied when identifying a node.)

  • filename – Name of the file (relative to the ray log directory) to be retrieved.

  • actor_id – Id of the actor if getting logs from an actor.

  • task_id – Id of the task if getting logs generated by a task.

  • pid – PID of the worker if getting logs generated by a worker. When querying with pid, either node_id or node_ip must be supplied.

  • follow – When set to True, logs will be streamed and followed.

  • tail – Number of lines to get from the end of the log file. Set to -1 for getting the entire log.

  • timeout – Max timeout for requests made when getting the logs.

  • suffix – The suffix of the log file when querying by the id of tasks/workers/actors.

  • _interval – The interval in secs to print new logs when follow=True.

Returns

A generator of log lines; the generator's SendType and ReturnType are None.

Raises

RayStateApiException if the CLI failed to query the data.

State APIs Schema

ActorState

class ray.experimental.state.common.ActorState(actor_id: str, class_name: str, state: typing_extensions.Literal[DEPENDENCIES_UNREADY, PENDING_CREATION, ALIVE, RESTARTING, DEAD], job_id: str, name: Optional[str], node_id: str, pid: int, ray_namespace: str, serialized_runtime_env: str, required_resources: dict, death_cause: Optional[dict], is_detached: bool)[source]

Actor State

Below columns can be used for the --filter option.

job_id

state

ray_namespace

class_name

name

actor_id

pid

node_id

Below columns are available only when get API is used,

--detail is specified through CLI, or detail=True is given to Python APIs.

is_detached

death_cause

required_resources

serialized_runtime_env

actor_id: str

The id of the actor.

class_name: str

The class name of the actor.

state: typing_extensions.Literal[DEPENDENCIES_UNREADY, PENDING_CREATION, ALIVE, RESTARTING, DEAD]

The state of the actor.

  • DEPENDENCIES_UNREADY: The actor is waiting for its dependencies to be ready. E.g., a new actor is waiting for an object ref created by another remote task.

  • PENDING_CREATION: The actor's dependencies are ready, but it has not been created yet. This could be because there are not enough resources, there are too many actor entries in the scheduler queue, or the actor creation is slow (e.g., slow runtime environment creation or slow worker startup).

  • ALIVE: The actor is created, and it is alive.

  • RESTARTING: The actor is dead, and it is restarting. It is equivalent to PENDING_CREATION, but means the actor was dead more than once.

  • DEAD: The actor is permanently dead.

job_id: str

The job id of this actor.

name: Optional[str]

The name of the actor given by the name argument.

node_id: str

The node id of this actor. If the actor is restarting, it could be the node id of the dead actor (and it will be re-updated when the actor is successfully restarted).

pid: int

The pid of the actor. 0 if it is not created yet.

ray_namespace: str

The namespace of the actor.

serialized_runtime_env: str

The runtime environment information of the actor.

required_resources: dict

The resource requirement of the actor.

death_cause: Optional[dict]

Actor’s death information in detail. None if the actor is not dead yet.

is_detached: bool

True if the actor is detached. False otherwise.

TaskState

class ray.experimental.state.common.TaskState(task_id: str, name: str, scheduling_state: typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED], job_id: str, node_id: str, actor_id: str, type: typing_extensions.Literal[NORMAL_TASK, ACTOR_CREATION_TASK, ACTOR_TASK, DRIVER_TASK], func_or_class_name: str, language: str, required_resources: dict, runtime_env_info: str)[source]

Task State

Below columns can be used for the --filter option.

job_id

func_or_class_name

language

name

actor_id

type

task_id

scheduling_state

node_id

Below columns are available only when get API is used,

--detail is specified through CLI, or detail=True is given to Python APIs.

language

required_resources

runtime_env_info

task_id: str

The id of the task.

name: str

The name of the task if it is given by the name argument.

scheduling_state: typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED]

The state of the task.

Refer to src/ray/protobuf/common.proto for a detailed explanation of the state breakdowns and typical state transition flow.

job_id: str

The job id of this task.

node_id: str

Id of the node that runs the task. If the task is retried, it could contain the node id of the previously executed task. If empty, it means the task hasn't been scheduled yet.

actor_id: str

The actor id that's associated with this task. It is empty if there are no relevant actors.

type: typing_extensions.Literal[NORMAL_TASK, ACTOR_CREATION_TASK, ACTOR_TASK, DRIVER_TASK]

The type of the task.

  • NORMAL_TASK: Tasks created by func.remote().

  • ACTOR_CREATION_TASK: Actors created by class.remote()

  • ACTOR_TASK: Actor tasks submitted by actor.method.remote()

  • DRIVER_TASK: The driver (a script that calls ray.init).

func_or_class_name: str

The name of the task: it is the name of the function if the type is a normal task or an actor task, and the name of the class if it is an actor creation task.

language: str

The language of the task. E.g., Python, Java, or Cpp.

required_resources: dict

The required resources to execute the task.

runtime_env_info: str

The runtime environment information for the task.

NodeState

class ray.experimental.state.common.NodeState(node_id: str, node_ip: str, state: typing_extensions.Literal[ALIVE, DEAD], node_name: str, resources_total: dict)[source]

Node State

Below columns can be used for the --filter option.

state

node_ip

node_name

node_id

node_id: str

The id of the node.

node_ip: str

The ip address of the node.

state: typing_extensions.Literal[ALIVE, DEAD]

The state of the node.

ALIVE: The node is alive. DEAD: The node is dead.

node_name: str

The name of the node if it is given by the name argument.

resources_total: dict

The total resources of the node.

PlacementGroupState

class ray.experimental.state.common.PlacementGroupState(placement_group_id: str, name: str, state: typing_extensions.Literal[PENDING, CREATED, REMOVED, RESCHEDULING], bundles: dict, is_detached: bool, stats: dict)[source]

PlacementGroup State

Below columns can be used for the --filter option.

placement_group_id

state

is_detached

name

Below columns are available only when get API is used,

--detail is specified through CLI, or detail=True is given to Python APIs.

is_detached

bundles

stats

placement_group_id: str

The id of the placement group.

name: str

The name of the placement group if it is given by the name argument.

state: typing_extensions.Literal[PENDING, CREATED, REMOVED, RESCHEDULING]

The state of the placement group.

  • PENDING: The placement group creation is pending scheduling. This could be because there are not enough resources or some creation stage has failed (e.g., failing to commit placement groups because the node is dead).

  • CREATED: The placement group is created.

  • REMOVED: The placement group is removed.

  • RESCHEDULING: The placement group is rescheduling because some of its bundles were placed on nodes that have died.

bundles: dict

The bundle specification of the placement group.

is_detached: bool

True if the placement group is detached. False otherwise.

stats: dict

The scheduling stats of the placement group.

WorkerState

class ray.experimental.state.common.WorkerState(worker_id: str, is_alive: bool, worker_type: typing_extensions.Literal[WORKER, DRIVER, SPILL_WORKER, RESTORE_WORKER], exit_type: Optional[typing_extensions.Literal[SYSTEM_ERROR, INTENDED_SYSTEM_EXIT, USER_ERROR, INTENDED_USER_EXIT, NODE_OUT_OF_MEMORY]], node_id: str, ip: str, pid: str, exit_detail: Optional[str])[source]

Worker State

Below columns can be used for the --filter option.

exit_type

worker_id

is_alive

ip

worker_type

pid

node_id

Below columns are available only when get API is used,

--detail is specified through CLI, or detail=True is given to Python APIs.

exit_detail

worker_id: str

The id of the worker.

is_alive: bool

Whether or not the worker is alive.

worker_type: typing_extensions.Literal[WORKER, DRIVER, SPILL_WORKER, RESTORE_WORKER]

The type of the worker.

  • WORKER: The regular Ray worker process that executes tasks or instantiates an actor.

  • DRIVER: The driver (Python script that calls ray.init).

  • SPILL_WORKER: The worker that spills objects.

  • RESTORE_WORKER: The worker that restores objects.

exit_type: Optional[typing_extensions.Literal[SYSTEM_ERROR, INTENDED_SYSTEM_EXIT, USER_ERROR, INTENDED_USER_EXIT, NODE_OUT_OF_MEMORY]]

The exit type of the worker if the worker is dead.

  • SYSTEM_ERROR: Worker exits due to system-level failures (i.e., worker crash).

  • INTENDED_SYSTEM_EXIT: System-level exit that is intended. E.g., Workers are killed because they are idle for a long time.

  • USER_ERROR: Worker exits because of user error. E.g., exceptions from the actor initialization.

  • INTENDED_USER_EXIT: Intended exit from users (e.g., users exit workers with exit code 0 or exit initiated by a Ray API such as ray.kill).

node_id: str

The node id of the worker.

ip: str

The ip address of the worker.

pid: str

The pid of the worker.

exit_detail: Optional[str]

The exit detail of the worker if the worker is dead.

ObjectState

class ray.experimental.state.common.ObjectState(object_id: str, object_size: int, task_status: typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED], reference_type: typing_extensions.Literal[ACTOR_HANDLE, PINNED_IN_MEMORY, LOCAL_REFERENCE, USED_BY_PENDING_TASK, CAPTURED_IN_OBJECT, UNKNOWN_STATUS], call_site: str, type: typing_extensions.Literal[WORKER, DRIVER, SPILL_WORKER, RESTORE_WORKER], pid: int, ip: str)[source]

Object State

Below columns can be used for the --filter option.

task_status

call_site

type

object_id

ip

pid

object_size

reference_type

object_id: str

The id of the object.

object_size: int

The size of the object in MB.

task_status: typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED]

The status of the task that creates the object.

  • NIL: We don’t have a status for this task because we are not the owner or the task metadata has already been deleted.

  • WAITING_FOR_DEPENDENCIES: The task is waiting for its dependencies to be created.

  • SCHEDULED: All dependencies have been created and the task is scheduled to execute. It could be because the task is waiting for resources, runtime environment creation, fetching dependencies to the local node, etc.

  • FINISHED: The task finished successfully.

  • WAITING_FOR_EXECUTION: The task is scheduled properly and waiting for execution. It includes time to deliver the task to the remote worker + queueing time from the execution side.

  • RUNNING: The task is running.

reference_type: typing_extensions.Literal[ACTOR_HANDLE, PINNED_IN_MEMORY, LOCAL_REFERENCE, USED_BY_PENDING_TASK, CAPTURED_IN_OBJECT, UNKNOWN_STATUS]

The reference type of the object. See Debugging with Ray Memory for more details.

  • ACTOR_HANDLE: The reference is an actor handle.

  • PINNED_IN_MEMORY: The object is pinned in memory, meaning there's an in-flight ray.get on this reference.

  • LOCAL_REFERENCE: There's a local reference (e.g., Python reference) to this object reference. The object won't be GC'ed until all of them are gone.

  • USED_BY_PENDING_TASK: The object reference is passed to other tasks. E.g., a = ray.put() -> task.remote(a). In this case, a is used by the pending task task.

  • CAPTURED_IN_OBJECT: The object is serialized inside another object. E.g., a = ray.put(1) -> b = ray.put([a]). Here a is serialized within a list.

  • UNKNOWN_STATUS: The object ref status is unknown.

call_site: str

The callsite of the object.

type: typing_extensions.Literal[WORKER, DRIVER, SPILL_WORKER, RESTORE_WORKER]

The worker type that creates the object.

  • WORKER: The regular Ray worker process that executes tasks or instantiates an actor.

  • DRIVER: The driver (Python script that calls ray.init).

  • SPILL_WORKER: The worker that spills objects.

  • RESTORE_WORKER: The worker that restores objects.

pid: int

The pid of the owner.

ip: str

The ip address of the owner.

RuntimeEnvState

class ray.experimental.state.common.RuntimeEnvState(runtime_env: str, success: bool, creation_time_ms: Optional[float], node_id: str, ref_cnt: int, error: Optional[str])[source]

Runtime Environment State

Below columns can be used for the --filter option.

success

runtime_env

error

node_id

Below columns are available only when get API is used,

--detail is specified through CLI, or detail=True is given to Python APIs.

ref_cnt

error

runtime_env: str

The runtime environment spec.

success: bool

Whether or not the runtime env creation has succeeded.

creation_time_ms: Optional[float]

The latency of creating the runtime environment. Available if the runtime env is successfully created.

node_id: str

The node id of this runtime environment.

ref_cnt: int

The number of actors and tasks that use this runtime environment.

error: Optional[str]

The error message if the runtime environment creation has failed. Available if the runtime env failed to be created.

JobState

class ray.experimental.state.common.JobState(status: ray.dashboard.modules.job.common.JobStatus, entrypoint: str, message: Optional[str] = None, error_type: Optional[str] = None, start_time: Optional[int] = None, end_time: Optional[int] = None, metadata: Optional[Dict[str, str]] = None, runtime_env: Optional[Dict[str, Any]] = None, entrypoint_num_cpus: Optional[Union[int, float]] = None, entrypoint_num_gpus: Optional[Union[int, float]] = None, entrypoint_resources: Optional[Dict[str, float]] = None, driver_agent_http_address: Optional[str] = None, driver_node_id: Optional[str] = None)[source]

The state of the job that's submitted by Ray's Job APIs.

Below columns can be used for the --filter option.

status

entrypoint

error_type

classmethod list_columns() List[str][source]

Return a list of columns.

classmethod filterable_columns() Set[str][source]

Return a set of filterable columns.

StateSummary

class ray.experimental.state.common.StateSummary(node_id_to_summary: Dict[str, Union[ray.experimental.state.common.TaskSummaries, ray.experimental.state.common.ActorSummaries, ray.experimental.state.common.ObjectSummaries]])[source]
node_id_to_summary: Dict[str, Union[ray.experimental.state.common.TaskSummaries, ray.experimental.state.common.ActorSummaries, ray.experimental.state.common.ObjectSummaries]]

Node ID -> summary per node. If the data is not required to be organized per node, it contains a single key, "cluster".

TaskSummary

class ray.experimental.state.common.TaskSummaries(summary: Dict[str, ray.experimental.state.common.TaskSummaryPerFuncOrClassName], total_tasks: int, total_actor_tasks: int, total_actor_scheduled: int, summary_by: str = 'func_name')[source]
total_tasks: int

Total Ray tasks.

total_actor_tasks: int

Total actor tasks.

total_actor_scheduled: int

Total scheduled actors.

class ray.experimental.state.common.TaskSummaryPerFuncOrClassName(func_or_class_name: str, type: str, state_counts: Dict[typing_extensions.Literal['NIL', 'PENDING_ARGS_AVAIL', 'PENDING_NODE_ASSIGNMENT', 'PENDING_OBJ_STORE_MEM_AVAIL', 'PENDING_ARGS_FETCH', 'SUBMITTED_TO_WORKER', 'RUNNING', 'RUNNING_IN_RAY_GET', 'RUNNING_IN_RAY_WAIT', 'FINISHED', 'FAILED'], int] = <factory>)[source]
func_or_class_name: str

The function or class name of this task.

type: str

The type of the task. Equivalent to protobuf TaskType.

state_counts: Dict[typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED], int]

State name to the count dict. State name is equivalent to the protobuf TaskStatus.

ActorSummary

class ray.experimental.state.common.ActorSummaries(summary: Dict[str, ray.experimental.state.common.ActorSummaryPerClass], total_actors: int, summary_by: str = 'class')[source]
summary: Dict[str, ray.experimental.state.common.ActorSummaryPerClass]

Group key (actor class name) -> summary

total_actors: int

Total number of actors

class ray.experimental.state.common.ActorSummaryPerClass(class_name: str, state_counts: Dict[typing_extensions.Literal['DEPENDENCIES_UNREADY', 'PENDING_CREATION', 'ALIVE', 'RESTARTING', 'DEAD'], int] = <factory>)[source]
class_name: str

The class name of the actor.

state_counts: Dict[typing_extensions.Literal[DEPENDENCIES_UNREADY, PENDING_CREATION, ALIVE, RESTARTING, DEAD], int]

State name to the count dict. State name is equivalent to the protobuf ActorState.

ObjectSummary

class ray.experimental.state.common.ObjectSummaries(summary: Dict[str, ray.experimental.state.common.ObjectSummaryPerKey], total_objects: int, total_size_mb: float, callsite_enabled: bool, summary_by: str = 'callsite')[source]
summary: Dict[str, ray.experimental.state.common.ObjectSummaryPerKey]

Group key (callsite) -> summary

total_objects: int

Total number of referenced objects in the cluster.

total_size_mb: float

Total size of referenced objects in the cluster in MB.

callsite_enabled: bool

Whether or not the callsite collection is enabled.

class ray.experimental.state.common.ObjectSummaryPerKey(total_objects: int, total_size_mb: float, total_num_workers: int, total_num_nodes: int, task_state_counts: Dict[typing_extensions.Literal['NIL', 'PENDING_ARGS_AVAIL', 'PENDING_NODE_ASSIGNMENT', 'PENDING_OBJ_STORE_MEM_AVAIL', 'PENDING_ARGS_FETCH', 'SUBMITTED_TO_WORKER', 'RUNNING', 'RUNNING_IN_RAY_GET', 'RUNNING_IN_RAY_WAIT', 'FINISHED', 'FAILED'], int] = <factory>, ref_type_counts: Dict[typing_extensions.Literal['ACTOR_HANDLE', 'PINNED_IN_MEMORY', 'LOCAL_REFERENCE', 'USED_BY_PENDING_TASK', 'CAPTURED_IN_OBJECT', 'UNKNOWN_STATUS'], int] = <factory>)[source]
total_objects: int

Total number of objects of the type.

total_size_mb: float

Total size in MB.

total_num_workers: int

Total number of workers that reference the type of objects.

total_num_nodes: int

Total number of nodes that reference the type of objects.

task_state_counts: Dict[typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED], int]

State name to the count dict. State name is equivalent to ObjectState.

ref_type_counts: Dict[typing_extensions.Literal[ACTOR_HANDLE, PINNED_IN_MEMORY, LOCAL_REFERENCE, USED_BY_PENDING_TASK, CAPTURED_IN_OBJECT, UNKNOWN_STATUS], int]

Ref count type to the count dict. State name is equivalent to ObjectState.

State APIs Exceptions

class ray.experimental.state.exception.RayStateApiException(err_msg, *args)[source]