Ray State API#

Note

APIs are alpha. This feature requires a full installation of Ray using pip install "ray[default]".

For an overview with examples see Monitoring Ray States.

For the CLI reference see Ray State CLI Reference or Ray Log CLI Reference.

State Python SDK#

State APIs are also exported as functions.

Summary APIs#

ray.experimental.state.api.summarize_actors(address: Optional[str] = None, timeout: int = 30, raise_on_missing_output: bool = True, _explain: bool = False) Dict[source]#

Summarize the actors in the cluster.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • timeout – Max timeout for requests made when getting the states.

  • raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.

  • _explain – Print the API information such as API latency or failed query information.

Returns

Dictionarified ActorSummaries

Raises

Exceptions – RayStateApiException if the CLI failed to query the data.
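
For reference, a minimal usage sketch (it assumes a Ray instance is already running on this machine; the printed structure follows the return description above):

>>> import ray
>>> from ray.experimental.state.api import summarize_actors
>>> ray.init(address="auto")  # connect to the running cluster
>>> summary = summarize_actors()  # dictionarified ActorSummaries
>>> print(summary)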

ray.experimental.state.api.summarize_objects(address: Optional[str] = None, timeout: int = 30, raise_on_missing_output: bool = True, _explain: bool = False) Dict[source]#

Summarize the objects in the cluster.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • timeout – Max timeout for requests made when getting the states.

  • raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.

  • _explain – Print the API information such as API latency or failed query information.

Returns

Dictionarified ObjectSummaries

Raises

Exceptions – RayStateApiException if the CLI failed to query the data.

ray.experimental.state.api.summarize_tasks(address: Optional[str] = None, timeout: int = 30, raise_on_missing_output: bool = True, _explain: bool = False) Dict[source]#

Summarize the tasks in the cluster.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • timeout – Max timeout for requests made when getting the states.

  • raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.

  • _explain – Print the API information such as API latency or failed query information.

Returns

Dictionarified TaskSummaries

Raises

Exceptions – RayStateApiException if the CLI failed to query the data.

List APIs#

ray.experimental.state.api.list_actors(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict][source]#

List actors in the cluster.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("id", "=", "abcd")

  • limit – Max number of entries returned by the state backend.

  • timeout – Max timeout value for the state APIs requests made.

  • detail – When True, more detailed info (specified in ActorState) will be queried and returned. See ActorState.

  • raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.

  • _explain – Print the API information such as API latency or failed query information.

Returns

List of dictionarified ActorState.

Raises

Exceptions – RayStateApiException if the CLI failed to query the data.
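
For example, a hedged sketch of listing actors with a filter and detailed output (the filter value is illustrative; state is one of the filterable columns of ActorState):

>>> from ray.experimental.state.api import list_actors
>>> # Only ALIVE actors, up to the default limit of 100 entries.
>>> alive = list_actors(filters=[("state", "=", "ALIVE")])
>>> # detail=True also returns the detail-only ActorState columns.
>>> for actor in list_actors(filters=[("state", "=", "ALIVE")], detail=True):
...     print(actor["actor_id"], actor["class_name"])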

ray.experimental.state.api.list_placement_groups(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict][source]#

List placement groups in the cluster.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("state", "=", "abcd")

  • limit – Max number of entries returned by the state backend.

  • timeout – Max timeout value for the state APIs requests made.

  • detail – When True, more detailed info (specified in PlacementGroupState) will be queried and returned. See PlacementGroupState.

  • raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.

  • _explain – Print the API information such as API latency or failed query information.

Returns

List of dictionarified PlacementGroupState.

Raises

Exceptions – RayStateApiException if the CLI failed to query the data.

ray.experimental.state.api.list_nodes(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict][source]#

List nodes in the cluster.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("node_name", "=", "abcd")

  • limit – Max number of entries returned by the state backend.

  • timeout – Max timeout value for the state APIs requests made.

  • detail – When True, more detailed info (specified in NodeState) will be queried and returned. See NodeState.

  • raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.

  • _explain – Print the API information such as API latency or failed query information.

Returns

List of dictionarified NodeState.

Raises

Exceptions – RayStateApiException if the CLI failed to query the data.
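
As an illustration, a small sketch that lists only nodes currently alive (state is a filterable column of NodeState):

>>> from ray.experimental.state.api import list_nodes
>>> for node in list_nodes(filters=[("state", "=", "ALIVE")]):
...     print(node["node_id"], node["node_ip"])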

ray.experimental.state.api.list_jobs(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict][source]#

List jobs submitted to the cluster via the Ray Job Submission API.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("status", "=", "abcd")

  • limit – Max number of entries returned by the state backend.

  • timeout – Max timeout value for the state APIs requests made.

  • detail – When True, more detailed info (specified in JobState) will be queried and returned. See JobState.

  • raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.

  • _explain – Print the API information such as API latency or failed query information.

Returns

List of dictionarified JobState.

Raises

Exceptions – RayStateApiException if the CLI failed to query the data.

ray.experimental.state.api.list_workers(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict][source]#

List workers in the cluster.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("is_alive", "=", "True")

  • limit – Max number of entries returned by the state backend.

  • timeout – Max timeout value for the state APIs requests made.

  • detail – When True, more detailed info (specified in WorkerState) will be queried and returned. See WorkerState.

  • raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.

  • _explain – Print the API information such as API latency or failed query information.

Returns

List of dictionarified WorkerState.

Raises

Exceptions – RayStateApiException if the CLI failed to query the data.

ray.experimental.state.api.list_tasks(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict][source]#

List tasks in the cluster.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("scheduling_state", "=", "RUNNING")

  • limit – Max number of entries returned by the state backend.

  • timeout – Max timeout value for the state APIs requests made.

  • detail – When True, more detailed info (specified in TaskState) will be queried and returned. See TaskState.

  • raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.

  • _explain – Print the API information such as API latency or failed query information.

Returns

List of dictionarified TaskState.

Raises

Exceptions – RayStateApiException if the CLI failed to query the data.
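
For instance, a minimal sketch that keeps only running tasks (scheduling_state is a filterable column of TaskState; the value is illustrative):

>>> from ray.experimental.state.api import list_tasks
>>> running = list_tasks(filters=[("scheduling_state", "=", "RUNNING")])
>>> for task in running:
...     print(task["task_id"], task["func_or_class_name"])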

ray.experimental.state.api.list_objects(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict][source]#

List objects in the cluster.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("ip", "=", "0.0.0.0")

  • limit – Max number of entries returned by the state backend.

  • timeout – Max timeout value for the state APIs requests made.

  • detail – When True, more detailed info (specified in ObjectState) will be queried and returned. See ObjectState.

  • raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.

  • _explain – Print the API information such as API latency or failed query information.

Returns

List of dictionarified ObjectState.

Raises

Exceptions – RayStateApiException if the CLI failed to query the data.

ray.experimental.state.api.list_runtime_envs(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict][source]#

List runtime environments in the cluster.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("node_id", "=", "abcdef")

  • limit – Max number of entries returned by the state backend.

  • timeout – Max timeout value for the state APIs requests made.

  • detail – When True, more detailed info (specified in RuntimeEnvState) will be queried and returned. See RuntimeEnvState.

  • raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.

  • _explain – Print the API information such as API latency or failed query information.

Returns

List of dictionarified RuntimeEnvState.

Raises

Exceptions – RayStateApiException if the CLI failed to query the data.

Get APIs#

ray.experimental.state.api.get_actor(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) Optional[Dict][source]#

Get an actor by id.

Parameters
  • id – Id of the actor

  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • timeout – Max timeout value for the state API requests made.

  • _explain – Print the API information such as API latency or failed query information.

Returns

None if the actor is not found, or dictionarified ActorState.

Raises

Exceptions – RayStateApiException if the CLI failed to query the data.
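
A small sketch of fetching a single actor record; here the id is taken from list_actors() purely for illustration:

>>> from ray.experimental.state.api import list_actors, get_actor
>>> actors = list_actors(limit=1)
>>> if actors:
...     state = get_actor(id=actors[0]["actor_id"])
...     print(state)  # None if the id no longer resolves to an actor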

ray.experimental.state.api.get_placement_group(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) Optional[Dict][source]#

Get a placement group by id.

Parameters
  • id – Id of the placement group

  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • timeout – Max timeout value for the state APIs requests made.

  • _explain – Print the API information such as API latency or failed query information.

Returns

None if the placement group is not found, or dictionarified PlacementGroupState.

Raises

Exceptions – RayStateApiException if the CLI failed to query the data.

ray.experimental.state.api.get_node(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) Optional[Dict][source]#

Get a node by id.

Parameters
  • id – Id of the node.

  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • timeout – Max timeout value for the state APIs requests made.

  • _explain – Print the API information such as API latency or failed query information.

Returns

None if the node is not found, or dictionarified NodeState.

Raises

Exceptions – RayStateApiException if the CLI failed to query the data.

ray.experimental.state.api.get_worker(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) Optional[Dict][source]#

Get a worker by id.

Parameters
  • id – Id of the worker

  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • timeout – Max timeout value for the state APIs requests made.

  • _explain – Print the API information such as API latency or failed query information.

Returns

None if the worker is not found, or dictionarified WorkerState.

Raises

Exceptions – RayStateApiException if the CLI failed to query the data.

ray.experimental.state.api.get_task(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) Optional[Dict][source]#

Get a task by id.

Parameters
  • id – Id of the task

  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • timeout – Max timeout value for the state APIs requests made.

  • _explain – Print the API information such as API latency or failed query information.

Returns

None if the task is not found, or dictionarified TaskState.

Raises

Exceptions – RayStateApiException if the CLI failed to query the data.

ray.experimental.state.api.get_objects(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) List[Dict][source]#

Get objects by id.

More than one entry can be returned, since an object can be referenced in different places.

Parameters
  • id – Id of the object

  • address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.

  • timeout – Max timeout value for the state APIs requests made.

  • _explain – Print the API information such as API latency or failed query information.

Returns

List of dictionarified ObjectState.

Raises

Exceptions – RayStateApiException if the CLI failed to query the data.
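
For example, a hedged sketch showing that a single object id can map to several entries, one per place the object is referenced (the id is taken from list_objects() for illustration):

>>> from ray.experimental.state.api import list_objects, get_objects
>>> objects = list_objects(limit=1)
>>> if objects:
...     for entry in get_objects(id=objects[0]["object_id"]):
...         print(entry["reference_type"], entry["ip"], entry["pid"])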

Log APIs#

ray.experimental.state.api.list_logs(address: Optional[str] = None, node_id: Optional[str] = None, node_ip: Optional[str] = None, glob_filter: Optional[str] = None, timeout: int = 30) Dict[str, List[str]][source]#

List available log files.

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If not specified, it will be retrieved from the initialized ray cluster.

  • node_id – Id of the node containing the logs.

  • node_ip – Ip of the node containing the logs.

  • glob_filter – Name of the file (relative to the ray log directory) to be retrieved. E.g. glob_filter="*worker*" for all worker logs.

  • timeout – Max timeout for requests made when getting the logs.

Returns

A dictionary where the keys are log groups (e.g. gcs, raylet, worker) and the values are lists of log filenames.

Raises

Exceptions – RayStateApiException if the CLI failed to query the data, or ConnectionError if failed to resolve the ray address.
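
A minimal sketch, assuming a running cluster; the node id is taken from list_nodes() and the glob pattern is illustrative:

>>> from ray.experimental.state.api import list_nodes, list_logs
>>> node_id = list_nodes(limit=1)[0]["node_id"]
>>> logs = list_logs(node_id=node_id, glob_filter="*worker*")
>>> for group, filenames in logs.items():
...     print(group, filenames)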

ray.experimental.state.api.get_log(address: Optional[str] = None, node_id: Optional[str] = None, node_ip: Optional[str] = None, filename: Optional[str] = None, actor_id: Optional[str] = None, task_id: Optional[str] = None, pid: Optional[int] = None, follow: bool = False, tail: int = 1000, timeout: int = 30, suffix: Optional[str] = None, _interval: Optional[float] = None) Generator[str, None, None][source]#

Retrieve a log file based on its file name or on an entity id (pid, actor id, task id).

Examples

>>> import ray
>>> from ray.experimental.state.api import get_log
>>> # Connect to an existing Ray instance, if there is one.
>>> ray.init("auto")
>>> # The node IP can be retrieved from list_nodes() or ray.nodes().
>>> node_ip = "172.31.47.143"
>>> filename = "gcs_server.out"
>>> for line in get_log(filename=filename, node_ip=node_ip):
...     print(line)

Parameters
  • address – Ray bootstrap address, could be auto, localhost:6379. If not specified, it will be retrieved from the initialized ray cluster.

  • node_id – Id of the node containing the logs.

  • node_ip – Ip of the node containing the logs. (At least one of node_id and node_ip has to be supplied when identifying a node.)

  • filename – Name of the file (relative to the ray log directory) to be retrieved.

  • actor_id – Id of the actor if getting logs from an actor.

  • task_id – Id of the task if getting logs generated by a task.

  • pid – PID of the worker if getting logs generated by a worker. When querying with pid, either node_id or node_ip must be supplied.

  • follow – When set to True, logs will be streamed and followed.

  • tail – Number of lines to get from the end of the log file. Set to -1 for getting the entire log.

  • timeout – Max timeout for requests made when getting the logs.

  • suffix – The suffix of the log file if querying by the id of a task/worker/actor.

  • _interval – The interval in secs to print new logs when follow=True.

Returns

A generator of log lines; None for SendType and ReturnType.

Raises

Exceptions – RayStateApiException if the CLI failed to query the data.

State APIs Schema#

ActorState#

class ray.experimental.state.common.ActorState(actor_id: str, class_name: str, state: typing_extensions.Literal[DEPENDENCIES_UNREADY, PENDING_CREATION, ALIVE, RESTARTING, DEAD], job_id: str, name: Optional[str], node_id: str, pid: int, ray_namespace: str, serialized_runtime_env: str, required_resources: dict, death_cause: Optional[dict], is_detached: bool)[source]#

Actor State

Below columns can be used for the --filter option.

actor_id

job_id

ray_namespace

state

node_id

pid

class_name

name

Below columns are available only when get API is used, --detail is specified through CLI, or detail=True is given to Python APIs.

serialized_runtime_env

death_cause

is_detached

required_resources

actor_id: str#

The id of the actor.

class_name: str#

The class name of the actor.

state: typing_extensions.Literal[DEPENDENCIES_UNREADY, PENDING_CREATION, ALIVE, RESTARTING, DEAD]#

The state of the actor.

  • DEPENDENCIES_UNREADY: The actor is waiting for its dependencies to be ready. E.g., a new actor is waiting for an object ref created by another remote task.

  • PENDING_CREATION: The actor’s dependencies are ready, but it is not created yet. It could be because there are not enough resources, too many actor entries in the scheduler queue, or the actor creation is slow (e.g., slow runtime environment creation, slow worker startup, etc.).

  • ALIVE: The actor is created, and it is alive.

  • RESTARTING: The actor is dead, and it is restarting. It is equivalent to PENDING_CREATION, but indicates that the actor has died at least once.

  • DEAD: The actor is permanently dead.

job_id: str#

The job id of this actor.

name: Optional[str]#

The name of the actor given by the name argument.

node_id: str#

The node id of this actor. If the actor is restarting, it could be the node id of the dead actor (and it will be re-updated when the actor is successfully restarted).

pid: int#

The pid of the actor. 0 if it is not created yet.

ray_namespace: str#

The namespace of the actor.

serialized_runtime_env: str#

The runtime environment information of the actor.

required_resources: dict#

The resource requirement of the actor.

death_cause: Optional[dict]#

Actor’s death information in detail. None if the actor is not dead yet.

is_detached: bool#

True if the actor is detached. False otherwise.
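
As an illustration, the filterable columns above can be combined in a single list_actors() call from the Python SDK (the class name below is hypothetical):

>>> from ray.experimental.state.api import list_actors
>>> # Alive actors of a hypothetical class named "MyActor".
>>> list_actors(filters=[("state", "=", "ALIVE"), ("class_name", "=", "MyActor")])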

TaskState#

class ray.experimental.state.common.TaskState(task_id: str, name: str, scheduling_state: typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED], job_id: str, node_id: str, actor_id: str, type: typing_extensions.Literal[NORMAL_TASK, ACTOR_CREATION_TASK, ACTOR_TASK, DRIVER_TASK], func_or_class_name: str, language: str, required_resources: dict, runtime_env_info: str)[source]#

Task State

Below columns can be used for the --filter option.

actor_id

job_id

scheduling_state

node_id

language

type

func_or_class_name

name

task_id

Below columns are available only when get API is used, --detail is specified through CLI, or detail=True is given to Python APIs.

language

required_resources

runtime_env_info

task_id: str#

The id of the task.

name: str#

The name of the task if it is given by the name argument.

scheduling_state: typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED]#

The state of the task.

Refer to src/ray/protobuf/common.proto for a detailed explanation of the state breakdowns and typical state transition flow.

job_id: str#

The job id of this task.

node_id: str#

Id of the node that runs the task. If the task is retried, it could contain the node id of the previously executed task. If empty, it means the task hasn’t been scheduled yet.

actor_id: str#

The actor id that’s associated with this task. It is empty if there are no relevant actors.

type: typing_extensions.Literal[NORMAL_TASK, ACTOR_CREATION_TASK, ACTOR_TASK, DRIVER_TASK]#

The type of the task.

  • NORMAL_TASK: Tasks created by func.remote()

  • ACTOR_CREATION_TASK: Actors created by class.remote()

  • ACTOR_TASK: Actor tasks submitted by actor.method.remote()

  • DRIVER_TASK: Driver (A script that calls ray.init).

func_or_class_name: str#

The name of the task. It is the name of the function if the type is a normal task or an actor task, and the name of the class if it is an actor creation task.

language: str#

The language of the task. E.g., Python, Java, or Cpp.

required_resources: dict#

The required resources to execute the task.

runtime_env_info: str#

The runtime environment information for the task.

NodeState#

class ray.experimental.state.common.NodeState(node_id: str, node_ip: str, state: typing_extensions.Literal[ALIVE, DEAD], node_name: str, resources_total: dict)[source]#

Node State

Below columns can be used for the --filter option.

node_id

node_ip

node_name

state

node_id: str#

The id of the node.

node_ip: str#

The ip address of the node.

state: typing_extensions.Literal[ALIVE, DEAD]#

The state of the node.

  • ALIVE: The node is alive.

  • DEAD: The node is dead.

node_name: str#

The name of the node if it is given by the name argument.

resources_total: dict#

The total resources of the node.

PlacementGroupState#

class ray.experimental.state.common.PlacementGroupState(placement_group_id: str, name: str, state: typing_extensions.Literal[PENDING, CREATED, REMOVED, RESCHEDULING], bundles: dict, is_detached: bool, stats: dict)[source]#

PlacementGroup State

Below columns can be used for the --filter option.

is_detached

name

placement_group_id

state

Below columns are available only when get API is used, --detail is specified through CLI, or detail=True is given to Python APIs.

bundles

is_detached

stats

placement_group_id: str#

The id of the placement group.

name: str#

The name of the placement group if it is given by the name argument.

state: typing_extensions.Literal[PENDING, CREATED, REMOVED, RESCHEDULING]#

The state of the placement group.

  • PENDING: The placement group creation is pending scheduling. It could be because there are not enough resources, or because some creation stage has failed (e.g., failing to commit placement groups because the node is dead).

  • CREATED: The placement group is created.

  • REMOVED: The placement group is removed.

  • RESCHEDULING: The placement group is rescheduling because some of its bundles were placed on nodes that have died.

bundles: dict#

The bundle specification of the placement group.

is_detached: bool#

True if the placement group is detached. False otherwise.

stats: dict#

The scheduling stats of the placement group.

WorkerState#

class ray.experimental.state.common.WorkerState(worker_id: str, is_alive: bool, worker_type: typing_extensions.Literal[WORKER, DRIVER, SPILL_WORKER, RESTORE_WORKER], exit_type: Optional[typing_extensions.Literal[SYSTEM_ERROR, INTENDED_SYSTEM_EXIT, USER_ERROR, INTENDED_USER_EXIT, NODE_OUT_OF_MEMORY]], node_id: str, ip: str, pid: str, exit_detail: Optional[str])[source]#

Worker State

Below columns can be used for the --filter option.

exit_type

worker_id

node_id

worker_type

pid

is_alive

ip

Below columns are available only when get API is used, --detail is specified through CLI, or detail=True is given to Python APIs.

exit_detail

worker_id: str#

The id of the worker.

is_alive: bool#

Whether or not the worker is alive.

worker_type: typing_extensions.Literal[WORKER, DRIVER, SPILL_WORKER, RESTORE_WORKER]#

The type of the worker.

  • WORKER: The regular Ray worker process that executes tasks or instantiates an actor.

  • DRIVER: The driver (Python script that calls ray.init).

  • SPILL_WORKER: The worker that spills objects.

  • RESTORE_WORKER: The worker that restores objects.

exit_type: Optional[typing_extensions.Literal[SYSTEM_ERROR, INTENDED_SYSTEM_EXIT, USER_ERROR, INTENDED_USER_EXIT, NODE_OUT_OF_MEMORY]]#

The exit type of the worker if the worker is dead.

  • SYSTEM_ERROR: Worker exits due to system-level failures (i.e., worker crash).

  • INTENDED_SYSTEM_EXIT: System-level exit that is intended. E.g., workers are killed because they have been idle for a long time.

  • USER_ERROR: Worker exits because of user error. E.g., exceptions raised from the actor initialization.

  • INTENDED_USER_EXIT: Intended exit from users (e.g., users exit workers with exit code 0, or exit is initiated by a Ray API such as ray.kill).

node_id: str#

The node id of the worker.

ip: str#

The ip address of the worker.

pid: str#

The pid of the worker.

exit_detail: Optional[str]#

The exit detail of the worker if the worker is dead.

ObjectState#

class ray.experimental.state.common.ObjectState(object_id: str, object_size: int, task_status: typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED], reference_type: typing_extensions.Literal[ACTOR_HANDLE, PINNED_IN_MEMORY, LOCAL_REFERENCE, USED_BY_PENDING_TASK, CAPTURED_IN_OBJECT, UNKNOWN_STATUS], call_site: str, type: typing_extensions.Literal[WORKER, DRIVER, SPILL_WORKER, RESTORE_WORKER], pid: int, ip: str)[source]#

Object State

Below columns can be used for the --filter option.

task_status

object_size

ip

reference_type

pid

type

call_site

object_id

object_id: str#

The id of the object.

object_size: int#

The size of the object in MB.

task_status: typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED]#

The status of the task that creates the object.

  • NIL: We don’t have a status for this task because we are not the owner or the task metadata has already been deleted.

  • WAITING_FOR_DEPENDENCIES: The task is waiting for its dependencies to be created.

  • SCHEDULED: All dependencies have been created and the task is scheduled to execute. It could be because the task is waiting for resources, runtime environment creation, fetching dependencies to the local node, etc.

  • FINISHED: The task finished successfully.

  • WAITING_FOR_EXECUTION: The task is scheduled properly and waiting for execution. It includes time to deliver the task to the remote worker + queueing time from the execution side.

  • RUNNING: The task is running.

reference_type: typing_extensions.Literal[ACTOR_HANDLE, PINNED_IN_MEMORY, LOCAL_REFERENCE, USED_BY_PENDING_TASK, CAPTURED_IN_OBJECT, UNKNOWN_STATUS]#

The reference type of the object. See Debugging with Ray Memory for more details.

  • ACTOR_HANDLE: The reference is an actor handle.

  • PINNED_IN_MEMORY: The object is pinned in memory, meaning there is an in-flight ray.get on this reference.

  • LOCAL_REFERENCE: There’s a local reference (e.g., Python reference) to this object reference. The object won’t be GC’ed until all of them are gone.

  • USED_BY_PENDING_TASK: The object reference is passed to other tasks. E.g., a = ray.put() -> task.remote(a). In this case, a is used by the pending task task.

  • CAPTURED_IN_OBJECT: The object is serialized by other objects. E.g., a = ray.put(1) -> b = ray.put([a]). a is serialized within a list.

  • UNKNOWN_STATUS: The object ref status is unknown.

call_site: str#

The callsite of the object.

type: typing_extensions.Literal[WORKER, DRIVER, SPILL_WORKER, RESTORE_WORKER]#

The worker type that creates the object.

  • WORKER: The regular Ray worker process that executes tasks or instantiates an actor.

  • DRIVER: The driver (Python script that calls ray.init).

  • SPILL_WORKER: The worker that spills objects.

  • RESTORE_WORKER: The worker that restores objects.

pid: int#

The pid of the owner.

ip: str#

The ip address of the owner.
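
For instance, a hedged sketch that uses the reference_type column above to list only objects currently pinned in memory:

>>> from ray.experimental.state.api import list_objects
>>> pinned = list_objects(filters=[("reference_type", "=", "PINNED_IN_MEMORY")])
>>> print(len(pinned))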

RuntimeEnvState#

class ray.experimental.state.common.RuntimeEnvState(runtime_env: str, success: bool, creation_time_ms: Optional[float], node_id: str, ref_cnt: int, error: Optional[str])[source]#

Runtime Environment State

Below columns can be used for the --filter option.

node_id

error

success

runtime_env

Below columns are available only when get API is used, --detail is specified through CLI, or detail=True is given to Python APIs.

ref_cnt

error

runtime_env: str#

The runtime environment spec.

success: bool#

Whether or not the runtime env creation has succeeded.

creation_time_ms: Optional[float]#

The latency of creating the runtime environment. Available if the runtime env is successfully created.

node_id: str#

The node id of this runtime environment.

ref_cnt: int#

The number of actors and tasks that use this runtime environment.

error: Optional[str]#

The error message if the runtime environment creation has failed. Available if the runtime env failed to be created.

JobState#

class ray.experimental.state.common.JobState(status: ray.dashboard.modules.job.common.JobStatus, entrypoint: str, message: Optional[str] = None, error_type: Optional[str] = None, start_time: Optional[int] = None, end_time: Optional[int] = None, metadata: Optional[Dict[str, str]] = None, runtime_env: Optional[Dict[str, Any]] = None, entrypoint_num_cpus: Optional[Union[int, float]] = None, entrypoint_num_gpus: Optional[Union[int, float]] = None, entrypoint_resources: Optional[Dict[str, float]] = None, driver_agent_http_address: Optional[str] = None, driver_node_id: Optional[str] = None)[source]#

The state of the job that’s submitted by Ray’s Job APIs.

Below columns can be used for the --filter option.

status

entrypoint

error_type

classmethod list_columns() List[str][source]#

Return a list of columns.

classmethod filterable_columns() Set[str][source]#

Return the set of filterable columns.
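
A small sketch of the two classmethods; they only describe the schema, so no running cluster is needed:

>>> from ray.experimental.state.common import JobState
>>> JobState.list_columns()        # every column of the schema
>>> JobState.filterable_columns()  # subset usable with --filter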

StateSummary#

class ray.experimental.state.common.StateSummary(node_id_to_summary: Dict[str, Union[ray.experimental.state.common.TaskSummaries, ray.experimental.state.common.ActorSummaries, ray.experimental.state.common.ObjectSummaries]])[source]#
node_id_to_summary: Dict[str, Union[ray.experimental.state.common.TaskSummaries, ray.experimental.state.common.ActorSummaries, ray.experimental.state.common.ObjectSummaries]]#

Node ID -> summary per node. If the data is not required to be organized per node, it will contain a single key, "cluster".

TaskSummary#

class ray.experimental.state.common.TaskSummaries(summary: Dict[str, ray.experimental.state.common.TaskSummaryPerFuncOrClassName], total_tasks: int, total_actor_tasks: int, total_actor_scheduled: int, summary_by: str = 'func_name')[source]#
total_tasks: int#

Total Ray tasks.

total_actor_tasks: int#

Total actor tasks.

total_actor_scheduled: int#

Total scheduled actors.

class ray.experimental.state.common.TaskSummaryPerFuncOrClassName(func_or_class_name: str, type: str, state_counts: Dict[typing_extensions.Literal['NIL', 'PENDING_ARGS_AVAIL', 'PENDING_NODE_ASSIGNMENT', 'PENDING_OBJ_STORE_MEM_AVAIL', 'PENDING_ARGS_FETCH', 'SUBMITTED_TO_WORKER', 'RUNNING', 'RUNNING_IN_RAY_GET', 'RUNNING_IN_RAY_WAIT', 'FINISHED', 'FAILED'], int] = <factory>)[source]#
func_or_class_name: str#

The function or class name of this task.

type: str#

The type of the class. Equivalent to protobuf TaskType.

state_counts: Dict[typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED], int]#

State name to the count dict. State name is equivalent to the protobuf TaskStatus.

ActorSummary#

class ray.experimental.state.common.ActorSummaries(summary: Dict[str, ray.experimental.state.common.ActorSummaryPerClass], total_actors: int, summary_by: str = 'class')[source]#
summary: Dict[str, ray.experimental.state.common.ActorSummaryPerClass]#

Group key (actor class name) -> summary

total_actors: int#

Total number of actors

class ray.experimental.state.common.ActorSummaryPerClass(class_name: str, state_counts: Dict[typing_extensions.Literal['DEPENDENCIES_UNREADY', 'PENDING_CREATION', 'ALIVE', 'RESTARTING', 'DEAD'], int] = <factory>)[source]#
class_name: str#

The class name of the actor.

state_counts: Dict[typing_extensions.Literal[DEPENDENCIES_UNREADY, PENDING_CREATION, ALIVE, RESTARTING, DEAD], int]#

State name to the count dict. State name is equivalent to the protobuf ActorState.

ObjectSummary#

class ray.experimental.state.common.ObjectSummaries(summary: Dict[str, ray.experimental.state.common.ObjectSummaryPerKey], total_objects: int, total_size_mb: float, callsite_enabled: bool, summary_by: str = 'callsite')[source]#
summary: Dict[str, ray.experimental.state.common.ObjectSummaryPerKey]#

Group key (e.g., callsite) -> summary

total_objects: int#

Total number of referenced objects in the cluster.

total_size_mb: float#

Total size of referenced objects in the cluster in MB.

callsite_enabled: bool#

Whether or not the callsite collection is enabled.

class ray.experimental.state.common.ObjectSummaryPerKey(total_objects: int, total_size_mb: float, total_num_workers: int, total_num_nodes: int, task_state_counts: Dict[typing_extensions.Literal['NIL', 'PENDING_ARGS_AVAIL', 'PENDING_NODE_ASSIGNMENT', 'PENDING_OBJ_STORE_MEM_AVAIL', 'PENDING_ARGS_FETCH', 'SUBMITTED_TO_WORKER', 'RUNNING', 'RUNNING_IN_RAY_GET', 'RUNNING_IN_RAY_WAIT', 'FINISHED', 'FAILED'], int] = <factory>, ref_type_counts: Dict[typing_extensions.Literal['ACTOR_HANDLE', 'PINNED_IN_MEMORY', 'LOCAL_REFERENCE', 'USED_BY_PENDING_TASK', 'CAPTURED_IN_OBJECT', 'UNKNOWN_STATUS'], int] = <factory>)[source]#
total_objects: int#

Total number of objects of the type.

total_size_mb: float#

Total size in MB.

total_num_workers: int#

Total number of workers that reference the type of objects.

total_num_nodes: int#

Total number of nodes that reference the type of objects.

task_state_counts: Dict[typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED], int]#

State name to the count dict. State name is equivalent to ObjectState.

ref_type_counts: Dict[typing_extensions.Literal[ACTOR_HANDLE, PINNED_IN_MEMORY, LOCAL_REFERENCE, USED_BY_PENDING_TASK, CAPTURED_IN_OBJECT, UNKNOWN_STATUS], int]#

Ref count type to the count dict. State name is equivalent to ObjectState.

State APIs Exceptions#

class ray.experimental.state.exception.RayStateApiException(err_msg, *args)[source]#