Ray State API#
Note
APIs are alpha. This feature requires a full installation of Ray using pip install "ray[default]".
For an overview with examples see Monitoring Ray States.
For the CLI reference see Ray State CLI Reference or Ray Log CLI Reference.
State Python SDK#
State APIs are also exported as functions.
Summary APIs#
- ray.experimental.state.api.summarize_actors(address: Optional[str] = None, timeout: int = 30, raise_on_missing_output: bool = True, _explain: bool = False) Dict [source]#
Summarize the actors in the cluster.
- Parameters
address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.
timeout – Max timeout for requests made when getting the states.
raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.
_explain – Print the API information such as API latency or failed query information.
- Returns
Dictionarified ActorSummaries
- Raises
Exceptions – RayStateApiException if the CLI failed to query the data.
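A minimal usage sketch (not from the original reference; it assumes a running Ray cluster reachable via ray.init("auto"), and the summary contents depend on your cluster):
>>> import ray
>>> from ray.experimental.state.api import summarize_actors
>>> ray.init("auto")  # connect to the existing cluster so the address is resolved automatically
>>> summary = summarize_actors()
>>> print(summary)  # dictionarified ActorSummaries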
- ray.experimental.state.api.summarize_objects(address: Optional[str] = None, timeout: int = 30, raise_on_missing_output: bool = True, _explain: bool = False) Dict [source]#
Summarize the objects in the cluster.
- Parameters
address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.
timeout – Max timeout for requests made when getting the states.
raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.
_explain – Print the API information such as API latency or failed query information.
- Returns
Dictionarified ObjectSummaries
- Raises
Exceptions – RayStateApiException if the CLI failed to query the data.
- ray.experimental.state.api.summarize_tasks(address: Optional[str] = None, timeout: int = 30, raise_on_missing_output: bool = True, _explain: bool = False) Dict [source]#
Summarize the tasks in the cluster.
- Parameters
address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.
timeout – Max timeout for requests made when getting the states.
raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.
_explain – Print the API information such as API latency or failed query information.
- Returns
Dictionarified TaskSummaries
- Raises
Exceptions – RayStateApiException if the CLI failed to query the data.
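A hedged sketch of summarizing tasks against an explicit bootstrap address (the localhost:6379 address and the 60-second timeout are illustrative assumptions):
>>> from ray.experimental.state.api import summarize_tasks
>>> task_summary = summarize_tasks(address="localhost:6379", timeout=60)
>>> print(task_summary)  # dictionarified TaskSummaries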
List APIs#
- ray.experimental.state.api.list_actors(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict] [source]#
List actors in the cluster.
- Parameters
address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.
filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("id", "=", "abcd")
limit – Max number of entries returned by the state backend.
timeout – Max timeout value for the state APIs requests made.
detail – When True, more detailed info (specified in ActorState) will be queried and returned. See ActorState.
raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.
_explain – Print the API information such as API latency or failed query information.
- Returns
List of dictionarified ActorState.
- Raises
Exceptions – RayStateApiException if the CLI failed to query the data.
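For example, a sketch of listing only alive actors with detailed output (assumes an initialized Ray cluster; "state", "actor_id", and "class_name" follow the ActorState schema below):
>>> from ray.experimental.state.api import list_actors
>>> alive_actors = list_actors(filters=[("state", "=", "ALIVE")], detail=True, limit=1000)
>>> for actor in alive_actors:
...     print(actor["actor_id"], actor["class_name"])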
- ray.experimental.state.api.list_placement_groups(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict] [source]#
List placement groups in the cluster.
- Parameters
address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.
filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("state", "=", "abcd")
limit – Max number of entries returned by the state backend.
timeout – Max timeout value for the state APIs requests made.
detail – When True, more detailed info (specified in PlacementGroupState) will be queried and returned. See PlacementGroupState.
raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.
_explain – Print the API information such as API latency or failed query information.
- Returns
List of dictionarified PlacementGroupState.
- Raises
Exceptions – RayStateApiException if the CLI failed to query the data.
- ray.experimental.state.api.list_nodes(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict] [source]#
List nodes in the cluster.
- Parameters
address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.
filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("node_name", "=", "abcd")
limit – Max number of entries returned by the state backend.
timeout – Max timeout value for the state APIs requests made.
detail – When True, more detailed info (specified in NodeState) will be queried and returned. See NodeState.
raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.
_explain – Print the API information such as API latency or failed query information.
- Returns
List of dictionarified NodeState.
- Raises
Exceptions – RayStateApiException if the CLI failed to query the data.
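A short sketch that lists alive nodes (assumes an initialized Ray cluster; "state", "node_id", and "node_ip" are NodeState columns documented below):
>>> from ray.experimental.state.api import list_nodes
>>> for node in list_nodes(filters=[("state", "=", "ALIVE")]):
...     print(node["node_id"], node["node_ip"])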
- ray.experimental.state.api.list_jobs(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict] [source]#
List jobs submitted to the cluster by Ray Job Submission.
- Parameters
address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.
filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("status", "=", "abcd")
limit – Max number of entries returned by the state backend.
timeout – Max timeout value for the state APIs requests made.
detail – When True, more detailed info (specified in JobState) will be queried and returned. See JobState.
raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.
_explain – Print the API information such as API latency or failed query information.
- Returns
List of dictionarified JobState.
- Raises
Exceptions – RayStateApiException if the CLI failed to query the data.
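For instance, a hedged sketch that lists submitted jobs and prints their status (assumes an initialized Ray cluster with at least one submitted job; "status" and "entrypoint" follow the JobState schema below):
>>> from ray.experimental.state.api import list_jobs
>>> for job in list_jobs(detail=True):
...     print(job["status"], job["entrypoint"])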
- ray.experimental.state.api.list_workers(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict] [source]#
List workers in the cluster.
- Parameters
address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.
filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("is_alive", "=", "True")
limit – Max number of entries returned by the state backend.
timeout – Max timeout value for the state APIs requests made.
detail – When True, more detailed info (specified in WorkerState) will be queried and returned. See WorkerState.
raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.
_explain – Print the API information such as API latency or failed query information.
- Returns
List of dictionarified WorkerState.
- Raises
Exceptions – RayStateApiException if the CLI failed to query the data.
- ray.experimental.state.api.list_tasks(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict] [source]#
List tasks in the cluster.
- Parameters
address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.
filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("scheduling_state", "=", "RUNNING")
limit – Max number of entries returned by the state backend.
timeout – Max timeout value for the state APIs requests made.
detail – When True, more detailed info (specified in TaskState) will be queried and returned. See TaskState.
raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.
_explain – Print the API information such as API latency or failed query information.
- Returns
List of dictionarified TaskState.
- Raises
Exceptions – RayStateApiException if the CLI failed to query the data.
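A sketch of counting currently running tasks (assumes an initialized Ray cluster; "scheduling_state" is a filterable TaskState column and RUNNING is one of its documented values):
>>> from ray.experimental.state.api import list_tasks
>>> running = list_tasks(filters=[("scheduling_state", "=", "RUNNING")])
>>> print(len(running), "tasks are running")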
- ray.experimental.state.api.list_objects(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict] [source]#
List objects in the cluster.
- Parameters
address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.
filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("ip", "=", "0.0.0.0")
limit – Max number of entries returned by the state backend.
timeout – Max timeout value for the state APIs requests made.
detail – When True, more detailed info (specified in ObjectState) will be queried and returned. See ObjectState.
raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.
_explain – Print the API information such as API latency or failed query information.
- Returns
List of dictionarified ObjectState.
- Raises
Exceptions – RayStateApiException if the CLI failed to query the data.
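For example, a hedged sketch of finding objects pinned in memory (assumes an initialized Ray cluster; "reference_type" is a filterable ObjectState column documented below):
>>> from ray.experimental.state.api import list_objects
>>> pinned = list_objects(filters=[("reference_type", "=", "PINNED_IN_MEMORY")], detail=True)
>>> print(len(pinned), "objects are pinned in memory")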
- ray.experimental.state.api.list_runtime_envs(address: Optional[str] = None, filters: Optional[List[Tuple[str, str, Union[str, bool, int, float]]]] = None, limit: int = 100, timeout: int = 30, detail: bool = False, raise_on_missing_output: bool = True, _explain: bool = False) List[Dict] [source]#
List runtime environments in the cluster.
- Parameters
address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.
filters – List of tuples of filter key, predicate (=, or !=), and the filter value. E.g., ("node_id", "=", "abcdef")
limit – Max number of entries returned by the state backend.
timeout – Max timeout value for the state APIs requests made.
detail – When True, more detailed info (specified in RuntimeEnvState) will be queried and returned. See RuntimeEnvState.
raise_on_missing_output – When True, exceptions will be raised if there is missing data due to truncation/data source unavailable.
_explain – Print the API information such as API latency or failed query information.
- Returns
List of dictionarified RuntimeEnvState.
- Raises
Exceptions – RayStateApiException if the CLI failed to query the data.
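A sketch of listing runtime environments whose creation failed (assumes an initialized Ray cluster; "success" is a filterable RuntimeEnvState column, and the string value "False" mirrors the string-valued filter examples above):
>>> from ray.experimental.state.api import list_runtime_envs
>>> failed = list_runtime_envs(filters=[("success", "=", "False")], detail=True)
>>> for env in failed:
...     print(env["runtime_env"], env["error"])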
Get APIs#
- ray.experimental.state.api.get_actor(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) Optional[Dict] [source]#
Get an actor by id.
- Parameters
id – Id of the actor
address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.
timeout – Max timeout value for the state API requests made.
_explain – Print the API information such as API latency or failed query information.
- Returns
None if actor not found, or dictionarified ActorState.
- Raises
Exceptions – RayStateApiException if the CLI failed to query the data.
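For example, a hedged sketch that fetches one actor's state by id (assumes an initialized Ray cluster with at least one actor; the id is taken from list_actors rather than hard-coded):
>>> from ray.experimental.state.api import get_actor, list_actors
>>> actor_id = list_actors(limit=1)[0]["actor_id"]
>>> print(get_actor(id=actor_id))  # None if the actor is not found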
- ray.experimental.state.api.get_placement_group(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) Optional[Dict] [source]#
Get a placement group by id.
- Parameters
id – Id of the placement group
address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.
timeout – Max timeout value for the state APIs requests made.
_explain – Print the API information such as API latency or failed query information.
- Returns
None if placement group not found, or dictionarified PlacementGroupState.
- Raises
Exceptions – RayStateApiException if the CLI failed to query the data.
- ray.experimental.state.api.get_node(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) Optional[Dict] [source]#
Get a node by id.
- Parameters
id – Id of the node.
address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.
timeout – Max timeout value for the state APIs requests made.
_explain – Print the API information such as API latency or failed query information.
- Returns
None if node not found, or dictionarified NodeState.
- Raises
Exceptions – RayStateApiException if the CLI failed to query the data.
- ray.experimental.state.api.get_worker(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) Optional[Dict] [source]#
Get a worker by id.
- Parameters
id – Id of the worker
address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.
timeout – Max timeout value for the state APIs requests made.
_explain – Print the API information such as API latency or failed query information.
- Returns
None if worker not found, or dictionarified WorkerState.
- Raises
Exceptions – RayStateApiException if the CLI failed to query the data.
- ray.experimental.state.api.get_task(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) Optional[Dict] [source]#
Get a task by id.
- Parameters
id – Id of the task
address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.
timeout – Max timeout value for the state APIs requests made.
_explain – Print the API information such as API latency or failed query information.
- Returns
None if task not found, or dictionarified TaskState.
- Raises
Exceptions – RayStateApiException if the CLI failed to query the data.
- ray.experimental.state.api.get_objects(id: str, address: Optional[str] = None, timeout: int = 30, _explain: bool = False) List[Dict] [source]#
Get objects by id.
There could be more than one entry returned since an object could be referenced in different places.
- Parameters
id – Id of the object
address – Ray bootstrap address, could be auto, localhost:6379. If None, it will be resolved automatically from an initialized ray.
timeout – Max timeout value for the state APIs requests made.
_explain – Print the API information such as API latency or failed query information.
- Returns
List of dictionarified ObjectState.
- Raises
Exceptions – RayStateApiException if the CLI failed to query the data.
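A minimal sketch (assuming a driver already connected via ray.init); the object id is the hex string of an ObjectRef created in the same session:
>>> import ray
>>> from ray.experimental.state.api import get_objects
>>> ref = ray.put(1)
>>> entries = get_objects(id=ref.hex())
>>> print(entries)  # possibly more than one entry, one per place the object is referenced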
Log APIs#
- ray.experimental.state.api.list_logs(address: Optional[str] = None, node_id: Optional[str] = None, node_ip: Optional[str] = None, glob_filter: Optional[str] = None, timeout: int = 30) Dict[str, List[str]] [source]#
List the available log files.
- Parameters
address – Ray bootstrap address, could be auto, localhost:6379. If not specified, it will be retrieved from the initialized ray cluster.
node_id – Id of the node containing the logs.
node_ip – Ip of the node containing the logs.
glob_filter – Name of the file (relative to the ray log directory) to be retrieved. E.g. glob_filter="*worker*" for all worker logs.
timeout – Max timeout for requests made when getting the logs.
- Returns
A dictionary where the keys are log groups (e.g. gcs, raylet, worker), and values are list of log filenames.
- Raises
Exceptions – RayStateApiException if the CLI failed to query the data, or ConnectionError if failed to resolve the ray address.
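For example, a hedged sketch that lists GCS-related log files on one node (assumes an initialized Ray cluster; the node id is taken from list_nodes and the "*gcs*" glob is illustrative):
>>> from ray.experimental.state.api import list_logs, list_nodes
>>> node_id = list_nodes(limit=1)[0]["node_id"]
>>> print(list_logs(node_id=node_id, glob_filter="*gcs*"))  # log group -> list of filenames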
- ray.experimental.state.api.get_log(address: Optional[str] = None, node_id: Optional[str] = None, node_ip: Optional[str] = None, filename: Optional[str] = None, actor_id: Optional[str] = None, task_id: Optional[str] = None, pid: Optional[int] = None, follow: bool = False, tail: int = 1000, timeout: int = 30, suffix: Optional[str] = None, _interval: Optional[float] = None) Generator[str, None, None] [source]#
Retrieve a log file based on the file name or some entity ids (pid, actor id, task id).
Examples
>>> import ray
>>> from ray.experimental.state.api import get_log
>>> # To connect to an existing ray instance if there is
>>> ray.init("auto")
>>> # Node IP could be retrieved from list_nodes() or ray.nodes()
>>> node_ip = "172.31.47.143"
>>> filename = "gcs_server.out"
>>> for l in get_log(filename=filename, node_ip=node_ip):
...     print(l)
- Parameters
address – Ray bootstrap address, could be auto, localhost:6379. If not specified, it will be retrieved from the initialized ray cluster.
node_id – Id of the node containing the logs.
node_ip – Ip of the node containing the logs. (At least one of node_id and node_ip has to be supplied when identifying a node.)
filename – Name of the file (relative to the ray log directory) to be retrieved.
actor_id – Id of the actor if getting logs from an actor.
task_id – Id of the task if getting logs generated by a task.
pid – PID of the worker if getting logs generated by a worker. When querying with pid, either node_id or node_ip must be supplied.
follow – When set to True, logs will be streamed and followed.
tail – Number of lines to get from the end of the log file. Set to -1 for getting the entire log.
timeout – Max timeout for requests made when getting the logs.
suffix – The suffix of the log file if query by id of tasks/workers/actors.
_interval – The interval in secs to print new logs when follow=True.
- Returns
A Generator of log lines; None for SendType and ReturnType.
- Raises
Exceptions – RayStateApiException if the CLI failed to query the data.
State APIs Schema#
ActorState#
- class ray.experimental.state.common.ActorState(actor_id: str, class_name: str, state: typing_extensions.Literal[DEPENDENCIES_UNREADY, PENDING_CREATION, ALIVE, RESTARTING, DEAD], job_id: str, name: Optional[str], node_id: str, pid: int, ray_namespace: str, serialized_runtime_env: str, required_resources: dict, death_cause: Optional[dict], is_detached: bool)[source]#
Actor State
Below columns can be used for the --filter option.
actor_id
job_id
ray_namespace
state
node_id
pid
class_name
name
Below columns are available only when the get API is used, --detail is specified through CLI, or detail=True is given to Python APIs.
serialized_runtime_env
death_cause
is_detached
required_resources
- actor_id: str#
The id of the actor.
- class_name: str#
The class name of the actor.
- state: typing_extensions.Literal[DEPENDENCIES_UNREADY, PENDING_CREATION, ALIVE, RESTARTING, DEAD]#
The state of the actor.
DEPENDENCIES_UNREADY: The actor is waiting for its dependencies to be ready. E.g., a new actor is waiting for an object ref that's created from another remote task.
PENDING_CREATION: The actor's dependencies are ready, but it is not created yet. It could be because there are not enough resources, too many actor entries in the scheduler queue, or the actor creation is slow (e.g., slow runtime environment creation, slow worker startup, etc.).
ALIVE: The actor is created, and it is alive.
RESTARTING: The actor is dead, and it is restarting. It is equivalent to PENDING_CREATION, but means the actor was dead more than once.
DEAD: The actor is permanently dead.
- job_id: str#
The job id of this actor.
- node_id: str#
The node id of this actor. If the actor is restarting, it could be the node id of the dead actor (and it will be re-updated when the actor is successfully restarted).
- pid: int#
The pid of the actor. 0 if it is not created yet.
- ray_namespace: str#
The namespace of the actor.
- serialized_runtime_env: str#
The runtime environment information of the actor.
- required_resources: dict#
The resource requirement of the actor.
- death_cause: Optional[dict]#
Actor’s death information in detail. None if the actor is not dead yet.
- is_detached: bool#
True if the actor is detached. False otherwise.
TaskState#
- class ray.experimental.state.common.TaskState(task_id: str, name: str, scheduling_state: typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED], job_id: str, node_id: str, actor_id: str, type: typing_extensions.Literal[NORMAL_TASK, ACTOR_CREATION_TASK, ACTOR_TASK, DRIVER_TASK], func_or_class_name: str, language: str, required_resources: dict, runtime_env_info: str)[source]#
Task State
Below columns can be used for the --filter option.
actor_id
job_id
scheduling_state
node_id
language
type
func_or_class_name
name
task_id
Below columns are available only when the get API is used, --detail is specified through CLI, or detail=True is given to Python APIs.
language
required_resources
runtime_env_info
- task_id: str#
The id of the task.
- name: str#
The name of the task if it is given by the name argument.
- scheduling_state: typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED]#
The state of the task.
Refer to src/ray/protobuf/common.proto for a detailed explanation of the state breakdowns and typical state transition flow.
- job_id: str#
The job id of this task.
- node_id: str#
Id of the node that runs the task. If the task is retried, it could contain the node id of the previously executed task. If empty, it means the task hasn't been scheduled yet.
- actor_id: str#
The actor id that's associated with this task. It is empty if there are no relevant actors.
- type: typing_extensions.Literal[NORMAL_TASK, ACTOR_CREATION_TASK, ACTOR_TASK, DRIVER_TASK]#
The type of the task.
NORMAL_TASK: Tasks created by func.remote()
ACTOR_CREATION_TASK: Actors created by class.remote()
ACTOR_TASK: Actor tasks submitted by actor.method.remote()
DRIVER_TASK: Driver (a script that calls ray.init).
- func_or_class_name: str#
The name of the task. It is the name of the function if the type is a normal task or an actor task, and the name of the class if it is an actor creation task.
- language: str#
The language of the task. E.g., Python, Java, or Cpp.
- required_resources: dict#
The required resources to execute the task.
- runtime_env_info: str#
The runtime environment information for the task.
NodeState#
- class ray.experimental.state.common.NodeState(node_id: str, node_ip: str, state: typing_extensions.Literal[ALIVE, DEAD], node_name: str, resources_total: dict)[source]#
Node State
Below columns can be used for the --filter option.
node_id
node_ip
node_name
state
- node_id: str#
The id of the node.
- node_ip: str#
The ip address of the node.
- state: typing_extensions.Literal[ALIVE, DEAD]#
The state of the node.
ALIVE: The node is alive. DEAD: The node is dead.
- node_name: str#
The name of the node if it is given by the name argument.
- resources_total: dict#
The total resources of the node.
PlacementGroupState#
- class ray.experimental.state.common.PlacementGroupState(placement_group_id: str, name: str, state: typing_extensions.Literal[PENDING, CREATED, REMOVED, RESCHEDULING], bundles: dict, is_detached: bool, stats: dict)[source]#
PlacementGroup State
Below columns can be used for the --filter option.
is_detached
name
placement_group_id
state
Below columns are available only when the get API is used, --detail is specified through CLI, or detail=True is given to Python APIs.
bundles
is_detached
stats
- placement_group_id: str#
The id of the placement group.
- name: str#
The name of the placement group if it is given by the name argument.
- state: typing_extensions.Literal[PENDING, CREATED, REMOVED, RESCHEDULING]#
The state of the placement group.
PENDING: The placement group creation is pending scheduling. It could be because there are not enough resources, or some of the creation stages have failed (e.g., failed to commit placement groups because the node is dead).
CREATED: The placement group is created.
REMOVED: The placement group is removed.
RESCHEDULING: The placement group is rescheduling because some of its bundles are dead because they were on dead nodes.
- bundles: dict#
The bundle specification of the placement group.
- is_detached: bool#
True if the placement group is detached. False otherwise.
- stats: dict#
The scheduling stats of the placement group.
WorkerState#
- class ray.experimental.state.common.WorkerState(worker_id: str, is_alive: bool, worker_type: typing_extensions.Literal[WORKER, DRIVER, SPILL_WORKER, RESTORE_WORKER], exit_type: Optional[typing_extensions.Literal[SYSTEM_ERROR, INTENDED_SYSTEM_EXIT, USER_ERROR, INTENDED_USER_EXIT, NODE_OUT_OF_MEMORY]], node_id: str, ip: str, pid: str, exit_detail: Optional[str])[source]#
Worker State
Below columns can be used for the --filter option.
exit_type
worker_id
node_id
worker_type
pid
is_alive
ip
Below columns are available only when the get API is used, --detail is specified through CLI, or detail=True is given to Python APIs.
exit_detail
- worker_id: str#
The id of the worker.
- is_alive: bool#
Whether or not the worker is alive.
- worker_type: typing_extensions.Literal[WORKER, DRIVER, SPILL_WORKER, RESTORE_WORKER]#
The type of the worker.
WORKER: The regular Ray worker process that executes tasks or instantiates an actor.
DRIVER: The driver (Python script that calls ray.init).
SPILL_WORKER: The worker that spills objects.
RESTORE_WORKER: The worker that restores objects.
- exit_type: Optional[typing_extensions.Literal[SYSTEM_ERROR, INTENDED_SYSTEM_EXIT, USER_ERROR, INTENDED_USER_EXIT, NODE_OUT_OF_MEMORY]]#
The exit type of the worker if the worker is dead.
SYSTEM_ERROR: Worker exit due to system level failures (i.e. worker crash).
INTENDED_SYSTEM_EXIT: System-level exit that is intended. E.g., Workers are killed because they are idle for a long time.
USER_ERROR: Worker exits because of user error. E.g., exceptions from the actor initialization.
INTENDED_USER_EXIT: Intended exit from users (e.g., users exit workers with exit code 0 or exit initiated by Ray API such as ray.kill).
- node_id: str#
The node id of the worker.
- ip: str#
The ip address of the worker.
- pid: str#
The pid of the worker.
- exit_detail: Optional[str]#
The exit detail of the worker if the worker is dead.
ObjectState#
- class ray.experimental.state.common.ObjectState(object_id: str, object_size: int, task_status: typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED], reference_type: typing_extensions.Literal[ACTOR_HANDLE, PINNED_IN_MEMORY, LOCAL_REFERENCE, USED_BY_PENDING_TASK, CAPTURED_IN_OBJECT, UNKNOWN_STATUS], call_site: str, type: typing_extensions.Literal[WORKER, DRIVER, SPILL_WORKER, RESTORE_WORKER], pid: int, ip: str)[source]#
Object State
Below columns can be used for the --filter option.
task_status
object_size
ip
reference_type
pid
type
call_site
object_id
- object_id: str#
The id of the object.
- object_size: int#
The size of the object in mb.
- task_status: typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED]#
The status of the task that creates the object.
NIL: We don’t have a status for this task because we are not the owner or the task metadata has already been deleted.
WAITING_FOR_DEPENDENCIES: The task is waiting for its dependencies to be created.
SCHEDULED: All dependencies have been created and the task is scheduled to execute. It could be because the task is waiting for resources, runtime environment creation, fetching dependencies to the local node, etc.
FINISHED: The task finished successfully.
WAITING_FOR_EXECUTION: The task is scheduled properly and waiting for execution. It includes time to deliver the task to the remote worker + queueing time from the execution side.
RUNNING: The task that is running.
- reference_type: typing_extensions.Literal[ACTOR_HANDLE, PINNED_IN_MEMORY, LOCAL_REFERENCE, USED_BY_PENDING_TASK, CAPTURED_IN_OBJECT, UNKNOWN_STATUS]#
The reference type of the object. See Debugging with Ray Memory for more details.
ACTOR_HANDLE: The reference is an actor handle.
PINNED_IN_MEMORY: The object is pinned in memory, meaning there's an in-flight ray.get on this reference.
LOCAL_REFERENCE: There's a local reference (e.g., Python reference) to this object reference. The object won't be GC'ed until all of them are gone.
USED_BY_PENDING_TASK: The object reference is passed to other tasks. E.g., a = ray.put() -> task.remote(a). In this case, a is used by a pending task task.
CAPTURED_IN_OBJECT: The object is serialized by other objects. E.g., a = ray.put(1) -> b = ray.put([a]). a is serialized within a list.
UNKNOWN_STATUS: The object ref status is unknown.
- call_site: str#
The callsite of the object.
- type: typing_extensions.Literal[WORKER, DRIVER, SPILL_WORKER, RESTORE_WORKER]#
The worker type that creates the object.
WORKER: The regular Ray worker process that executes tasks or instantiates an actor.
DRIVER: The driver (Python script that calls ray.init).
SPILL_WORKER: The worker that spills objects.
RESTORE_WORKER: The worker that restores objects.
- pid: int#
The pid of the owner.
- ip: str#
The ip address of the owner.
RuntimeEnvState#
- class ray.experimental.state.common.RuntimeEnvState(runtime_env: str, success: bool, creation_time_ms: Optional[float], node_id: str, ref_cnt: int, error: Optional[str])[source]#
Runtime Environment State
Below columns can be used for the --filter option.
node_id
error
success
runtime_env
Below columns are available only when the get API is used, --detail is specified through CLI, or detail=True is given to Python APIs.
ref_cnt
error
- runtime_env: str#
The runtime environment spec.
- success: bool#
Whether or not the runtime env creation has succeeded.
- creation_time_ms: Optional[float]#
The latency of creating the runtime environment. Available if the runtime env is successfully created.
- node_id: str#
The node id of this runtime environment.
- ref_cnt: int#
The number of actors and tasks that use this runtime environment.
- error: Optional[str]#
The error message if the runtime environment creation has failed. Available if the runtime env failed to be created.
JobState#
- class ray.experimental.state.common.JobState(status: ray.dashboard.modules.job.common.JobStatus, entrypoint: str, message: Optional[str] = None, error_type: Optional[str] = None, start_time: Optional[int] = None, end_time: Optional[int] = None, metadata: Optional[Dict[str, str]] = None, runtime_env: Optional[Dict[str, Any]] = None, entrypoint_num_cpus: Optional[Union[int, float]] = None, entrypoint_num_gpus: Optional[Union[int, float]] = None, entrypoint_resources: Optional[Dict[str, float]] = None, driver_agent_http_address: Optional[str] = None, driver_node_id: Optional[str] = None)[source]#
The state of the job that’s submitted by Ray’s Job APIs
Below columns can be used for the --filter option.
status
entrypoint
error_type
StateSummary#
- class ray.experimental.state.common.StateSummary(node_id_to_summary: Dict[str, Union[ray.experimental.state.common.TaskSummaries, ray.experimental.state.common.ActorSummaries, ray.experimental.state.common.ObjectSummaries]])[source]#
- node_id_to_summary: Dict[str, Union[ray.experimental.state.common.TaskSummaries, ray.experimental.state.common.ActorSummaries, ray.experimental.state.common.ObjectSummaries]]#
Node ID -> summary per node. If the data is not required to be organized per node, it will contain a single key, "cluster".
TaskSummary#
- class ray.experimental.state.common.TaskSummaries(summary: Dict[str, ray.experimental.state.common.TaskSummaryPerFuncOrClassName], total_tasks: int, total_actor_tasks: int, total_actor_scheduled: int, summary_by: str = 'func_name')[source]#
- total_tasks: int#
Total Ray tasks.
- total_actor_tasks: int#
Total actor tasks.
- total_actor_scheduled: int#
Total scheduled actors.
- class ray.experimental.state.common.TaskSummaryPerFuncOrClassName(func_or_class_name: str, type: str, state_counts: Dict[typing_extensions.Literal['NIL', 'PENDING_ARGS_AVAIL', 'PENDING_NODE_ASSIGNMENT', 'PENDING_OBJ_STORE_MEM_AVAIL', 'PENDING_ARGS_FETCH', 'SUBMITTED_TO_WORKER', 'RUNNING', 'RUNNING_IN_RAY_GET', 'RUNNING_IN_RAY_WAIT', 'FINISHED', 'FAILED'], int] = <factory>)[source]#
- func_or_class_name: str#
The function or class name of this task.
- type: str#
The type of the class. Equivalent to protobuf TaskType.
- state_counts: Dict[typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED], int]#
State name to the count dict. State name is equivalent to the protobuf TaskStatus.
ActorSummary#
- class ray.experimental.state.common.ActorSummaries(summary: Dict[str, ray.experimental.state.common.ActorSummaryPerClass], total_actors: int, summary_by: str = 'class')[source]#
- summary: Dict[str, ray.experimental.state.common.ActorSummaryPerClass]#
Group key (actor class name) -> summary
- total_actors: int#
Total number of actors
- class ray.experimental.state.common.ActorSummaryPerClass(class_name: str, state_counts: Dict[typing_extensions.Literal['DEPENDENCIES_UNREADY', 'PENDING_CREATION', 'ALIVE', 'RESTARTING', 'DEAD'], int] = <factory>)[source]#
- class_name: str#
The class name of the actor.
- state_counts: Dict[typing_extensions.Literal[DEPENDENCIES_UNREADY, PENDING_CREATION, ALIVE, RESTARTING, DEAD], int]#
State name to the count dict. State name is equivalent to the protobuf ActorState.
ObjectSummary#
- class ray.experimental.state.common.ObjectSummaries(summary: Dict[str, ray.experimental.state.common.ObjectSummaryPerKey], total_objects: int, total_size_mb: float, callsite_enabled: bool, summary_by: str = 'callsite')[source]#
- summary: Dict[str, ray.experimental.state.common.ObjectSummaryPerKey]#
Group key (e.g., callsite) -> summary
- total_objects: int#
Total number of referenced objects in the cluster.
- total_size_mb: float#
Total size of referenced objects in the cluster in MB.
- callsite_enabled: bool#
Whether or not the callsite collection is enabled.
- class ray.experimental.state.common.ObjectSummaryPerKey(total_objects: int, total_size_mb: float, total_num_workers: int, total_num_nodes: int, task_state_counts: Dict[typing_extensions.Literal['NIL', 'PENDING_ARGS_AVAIL', 'PENDING_NODE_ASSIGNMENT', 'PENDING_OBJ_STORE_MEM_AVAIL', 'PENDING_ARGS_FETCH', 'SUBMITTED_TO_WORKER', 'RUNNING', 'RUNNING_IN_RAY_GET', 'RUNNING_IN_RAY_WAIT', 'FINISHED', 'FAILED'], int] = <factory>, ref_type_counts: Dict[typing_extensions.Literal['ACTOR_HANDLE', 'PINNED_IN_MEMORY', 'LOCAL_REFERENCE', 'USED_BY_PENDING_TASK', 'CAPTURED_IN_OBJECT', 'UNKNOWN_STATUS'], int] = <factory>)[source]#
- total_objects: int#
Total number of objects of the type.
- total_size_mb: float#
Total size in mb.
- total_num_workers: int#
Total number of workers that reference the type of objects.
- total_num_nodes: int#
Total number of nodes that reference the type of objects.
- task_state_counts: Dict[typing_extensions.Literal[NIL, PENDING_ARGS_AVAIL, PENDING_NODE_ASSIGNMENT, PENDING_OBJ_STORE_MEM_AVAIL, PENDING_ARGS_FETCH, SUBMITTED_TO_WORKER, RUNNING, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT, FINISHED, FAILED], int]#
State name to the count dict. State name is equivalent to ObjectState.
- ref_type_counts: Dict[typing_extensions.Literal[ACTOR_HANDLE, PINNED_IN_MEMORY, LOCAL_REFERENCE, USED_BY_PENDING_TASK, CAPTURED_IN_OBJECT, UNKNOWN_STATUS], int]#
Ref count type to the count dict. State name is equivalent to ObjectState.