Distributed LightGBM on Ray¶
LightGBM-Ray is a distributed backend for LightGBM, built on top of the distributed computing framework Ray.
LightGBM-Ray

- enables multi-node and multi-GPU training,
- integrates seamlessly with the distributed hyperparameter optimization library Ray Tune,
- comes with fault tolerance handling mechanisms, and
- supports distributed dataframes and distributed data loading.
All releases are tested on large clusters and workloads.
This package is based on XGBoost-Ray. As of now, XGBoost-Ray is a dependency for LightGBM-Ray.
Installation¶
You can install the latest LightGBM-Ray release via pip:
pip install lightgbm_ray
If you’d like to install the latest master, use this command instead:
pip install git+https://github.com/ray-project/lightgbm_ray.git#lightgbm_ray
Usage¶
LightGBM-Ray provides a drop-in replacement for LightGBM’s train function. To pass data, a RayDMatrix object is required, shared with XGBoost-Ray. You can also use a scikit-learn interface - see the next section.
Just as in the original lgbm.train() function, the training parameters are passed as the params dictionary.
Ray-specific distributed training parameters are configured with a
lightgbm_ray.RayParams
object. For instance, you can set
the num_actors
property to specify how many distributed actors
you would like to use.
Here is a simplified example (which requires sklearn):
Training:
from lightgbm_ray import RayDMatrix, RayParams, train
from sklearn.datasets import load_breast_cancer

train_x, train_y = load_breast_cancer(return_X_y=True)
train_set = RayDMatrix(train_x, train_y)

evals_result = {}
bst = train(
    {
        "objective": "binary",
        "metric": ["binary_logloss", "binary_error"],
    },
    train_set,
    evals_result=evals_result,
    valid_sets=[train_set],
    valid_names=["train"],
    verbose_eval=False,
    ray_params=RayParams(num_actors=2, cpus_per_actor=2))

bst.booster_.save_model("model.lgbm")
print("Final training error: {:.4f}".format(
    evals_result["train"]["binary_error"][-1]))
Prediction:
from lightgbm_ray import RayDMatrix, RayParams, predict
from sklearn.datasets import load_breast_cancer
import lightgbm as lgbm
data, labels = load_breast_cancer(return_X_y=True)
dpred = RayDMatrix(data, labels)
bst = lgbm.Booster(model_file="model.lgbm")
pred_ray = predict(bst, dpred, ray_params=RayParams(num_actors=2))
print(pred_ray)
scikit-learn API¶
LightGBM-Ray also features a scikit-learn API that fully mirrors the pure LightGBM scikit-learn API, providing a complete drop-in replacement. The following estimators are available:
RayLGBMClassifier
RayLGBMRegressor
Example usage of RayLGBMClassifier
:
from lightgbm_ray import RayLGBMClassifier, RayParams
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
seed = 42
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.25, random_state=seed)

clf = RayLGBMClassifier(
    n_jobs=2,  # In LightGBM-Ray, n_jobs sets the number of actors
    random_state=seed)
# scikit-learn API will automatically convert the data
# to RayDMatrix format as needed.
# You can also pass X as a RayDMatrix, in which case
# y will be ignored.
clf.fit(X_train, y_train)
pred_ray = clf.predict(X_test)
print(pred_ray)
pred_proba_ray = clf.predict_proba(X_test)
print(pred_proba_ray)
# It is also possible to pass a RayParams object
# to fit/predict/predict_proba methods - will override
# n_jobs set during initialization
clf.fit(X_train, y_train, ray_params=RayParams(num_actors=2))
pred_ray = clf.predict(X_test, ray_params=RayParams(num_actors=2))
print(pred_ray)
Things to keep in mind:

- The n_jobs parameter controls the number of actors spawned. You can pass a RayParams object to the fit/predict/predict_proba methods as the ray_params argument for greater control over resource allocation. Doing so will override the value of n_jobs with the value of the ray_params.num_actors attribute. For more information, refer to the Resources section below.
- By default, n_jobs is set to 1, which means the training will not be distributed. Make sure to either set n_jobs to a higher value or pass a RayParams object as outlined above in order to take advantage of LightGBM-Ray’s functionality.
- After calling fit, additional evaluation results (e.g. training time, number of rows, callback results) will be available under the additional_results_ attribute (see the sketch below).
- eval_ arguments are supported, but early stopping is not.
- LightGBM-Ray’s scikit-learn API is based on LightGBM 3.2.1. While we try to support older LightGBM versions, please note that this library is only fully tested and supported for LightGBM >= 3.2.1.
For more information on the scikit-learn API, refer to the LightGBM documentation.
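As an illustration of the additional_results_ attribute mentioned above, here is a minimal sketch (the exact keys in the dictionary depend on the training run):

import lightgbm_ray
from lightgbm_ray import RayLGBMClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

clf = RayLGBMClassifier(n_jobs=2)  # two actors
clf.fit(X, y)

# Additional information collected during fit, e.g. training time
# and number of rows. The exact keys depend on the run.
print(clf.additional_results_)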
Data loading¶
Data is passed to LightGBM-Ray via a RayDMatrix object. The RayDMatrix lazily loads data and stores it sharded in the Ray object store. The Ray LightGBM actors then access these shards to run their training on.
A RayDMatrix supports various data and file types, like Pandas DataFrames, Numpy arrays, CSV files and Parquet files.
Example loading multiple parquet files:
import glob
from lightgbm_ray import RayDMatrix, RayFileType

# We can also pass a list of files
path = list(sorted(glob.glob("/data/nyc-taxi/*/*/*.parquet")))

# This argument will be passed to `pd.read_parquet()`
columns = [
    "passenger_count",
    "trip_distance", "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude",
    "fare_amount", "extra", "mta_tax", "tip_amount",
    "tolls_amount", "total_amount",
]

dtrain = RayDMatrix(
    path,
    label="passenger_count",  # Will select this column as the label
    columns=columns,
    # ignore=["total_amount"], # Optional list of columns to ignore
    filetype=RayFileType.PARQUET)
Hyperparameter Tuning¶
LightGBM-Ray integrates with Ray Tune to provide distributed hyperparameter tuning for your distributed LightGBM models. You can run multiple LightGBM-Ray training runs in parallel, each with a different hyperparameter configuration, and each training run parallelized by itself. All you have to do is move your training code to a function, and pass the function to tune.run. Internally, train will detect if tune is being used and will automatically report results to tune.
Example using LightGBM-Ray with Ray Tune:
from lightgbm_ray import RayDMatrix, RayParams, train
from sklearn.datasets import load_breast_cancer

num_actors = 2
num_cpus_per_actor = 2

ray_params = RayParams(
    num_actors=num_actors, cpus_per_actor=num_cpus_per_actor)

def train_model(config):
    train_x, train_y = load_breast_cancer(return_X_y=True)
    train_set = RayDMatrix(train_x, train_y)

    evals_result = {}
    bst = train(
        params=config,
        dtrain=train_set,
        evals_result=evals_result,
        valid_sets=[train_set],
        valid_names=["train"],
        verbose_eval=False,
        ray_params=ray_params)
    bst.booster_.save_model("model.lgbm")

from ray import tune

# Specify the hyperparameter search space.
config = {
    "objective": "binary",
    "metric": ["binary_logloss", "binary_error"],
    "eta": tune.loguniform(1e-4, 1e-1),
    "subsample": tune.uniform(0.5, 1.0),
    "max_depth": tune.randint(1, 9),
}

# Make sure to use the `get_tune_resources` method to set the `resources_per_trial`
analysis = tune.run(
    train_model,
    config=config,
    metric="train-binary_error",
    mode="min",
    num_samples=4,
    resources_per_trial=ray_params.get_tune_resources())

print("Best hyperparameters", analysis.best_config)
Also see examples/simple_tune.py for another example.
Fault tolerance¶
LightGBM-Ray leverages the stateful Ray actor model to enable fault tolerant training. Currently, only non-elastic training is supported.
Non-elastic training (warm restart)¶
When an actor or node dies, LightGBM-Ray will retain the state of the remaining actors. In non-elastic training, the failed actors will be replaced as soon as resources are available again. Only these actors will reload their parts of the data. Training will resume once all actors are ready for training again.
You can configure this mode in the RayParams:
from lightgbm_ray import RayParams
ray_params = RayParams(
    max_actor_restarts=2,  # How often are actors allowed to fail, Default = 0
)
Resources¶
By default, LightGBM-Ray tries to determine the number of CPUs available and distributes them evenly across actors.
It is important to note that distributed LightGBM needs at least two CPUs per actor to function efficiently (without blocking). Therefore, by default, at least two CPUs will be assigned to each actor, and an exception will be raised if an actor has fewer than two CPUs. It is possible to override this check by setting the allow_less_than_two_cpus argument to True, though it is not recommended, as it will negatively impact training performance.
In the case of very large clusters or clusters with many different
machine sizes, it makes sense to limit the number of CPUs per actor
by setting the cpus_per_actor
argument. Consider always
setting this explicitly.
The number of LightGBM actors always has to be set manually with
the num_actors
argument.
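For example, here is a minimal sketch of an explicit resource configuration, assuming a machine with at least four CPUs available for training:

from lightgbm_ray import RayParams

# Two actors with two CPUs each (the recommended minimum per actor)
ray_params = RayParams(num_actors=2, cpus_per_actor=2)

# Possible, but not recommended: bypass the two-CPU check
# ray_params = RayParams(num_actors=2, cpus_per_actor=1,
#                        allow_less_than_two_cpus=True)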
Multi GPU training¶
LightGBM-Ray enables multi GPU training. The LightGBM core backend
will automatically handle communication.
All you have to do is start one actor per GPU and set LightGBM’s device_type to a GPU-compatible option, e.g. gpu (see the LightGBM documentation for more details).
For instance, if you have 2 machines with 4 GPUs each, you will want
to start 8 remote actors, and set gpus_per_actor=1
. There is usually
no benefit in allocating less (e.g. 0.5) or more than one GPU per actor.
You should divide the CPUs evenly across actors per machine, so if your machines have 16 CPUs in addition to the 4 GPUs, each actor should have 4 CPUs to use.
from lightgbm_ray import RayParams
ray_params = RayParams(
    num_actors=8,
    gpus_per_actor=1,
    cpus_per_actor=4,  # Divide evenly across actors per machine
)
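On the LightGBM side, the matching device_type then goes into the params dictionary passed to train. A minimal sketch, assuming dtrain is an existing RayDMatrix (see Data loading above):

from lightgbm_ray import train

bst = train(
    {
        "objective": "binary",
        "device_type": "gpu",  # GPU-compatible option, see LightGBM docs
    },
    dtrain,  # assumed: an existing RayDMatrix
    ray_params=ray_params)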
How many remote actors should I use?¶
This depends on your workload and your cluster setup. Generally, there is no inherent benefit in running more than one remote actor per node for CPU-only training, because the LightGBM core can already leverage multiple CPUs via threading.
However, there are some cases when you should consider starting more than one actor per node:

- For multi GPU training, each GPU should have a separate remote actor. Thus, if your machine has 24 CPUs and 4 GPUs, you will want to start 4 remote actors with 6 CPUs and 1 GPU each.
- In a heterogeneous cluster, you might want to find the greatest common divisor for the number of CPUs. E.g. for a cluster with three nodes of 4, 8, and 12 CPUs, respectively, you should set the number of actors to 6 and the CPUs per actor to 4, as shown below.
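For the heterogeneous cluster above (4 + 8 + 12 = 24 CPUs with a greatest common divisor of 4), the configuration would look like this:

from lightgbm_ray import RayParams

# 24 / 4 = 6 actors with 4 CPUs each fill the cluster evenly
ray_params = RayParams(num_actors=6, cpus_per_actor=4)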
Distributed data loading¶
LightGBM-Ray can leverage both centralized and distributed data loading.
In centralized data loading, the data is partitioned by the head node and stored in the object store. Each remote actor then retrieves their partitions by querying the Ray object store. Centralized loading is used when you pass centralized in-memory dataframes, such as Pandas dataframes or Numpy arrays, or when you pass a single source file, such as a single CSV or Parquet file.
from lightgbm_ray import RayDMatrix
# This will use centralized data loading, as only one source file is specified
# `label_col` is a column in the CSV, used as the target label
ray_params = RayDMatrix("./source_file.csv", label="label_col")
In distributed data loading, each remote actor loads their data directly from the source (e.g. local hard disk, NFS, HDFS, S3), without a central bottleneck. The data is still stored in the object store, but locally to each actor. This mode is used automatically when loading data from multiple CSV or Parquet files. Please note that we do not check or enforce partition sizes in this case - it is your job to make sure the data is evenly distributed across the source files.
from lightgbm_ray import RayDMatrix
# This will use distributed data loading, as four source files are specified
# Please note that you cannot schedule more than four actors in this case.
# `label_col` is a column in the Parquet files, used as the target label
dtrain = RayDMatrix([
"hdfs:///tmp/part1.parquet",
"hdfs:///tmp/part2.parquet",
"hdfs:///tmp/part3.parquet",
"hdfs:///tmp/part4.parquet",
], label="label_col")
Lastly, LightGBM-Ray supports distributed dataframe representations, such as Ray Datasets, Modin and Dask dataframes (used with Dask on Ray). Here, LightGBM-Ray will check on which nodes the distributed partitions are currently located, and will assign partitions to actors in order to minimize cross-node data transfer. Please note that we also assume here that partition sizes are uniform.
from lightgbm_ray import RayDMatrix
# This will try to allocate the existing Modin partitions
# to co-located Ray actors. If this is not possible, data will
# be transferred across nodes
dtrain = RayDMatrix(existing_modin_df)
Data sources¶
The following data sources can be used with a RayDMatrix
object.
| Type | Centralized loading | Distributed loading |
| --- | --- | --- |
| Numpy array | Yes | No |
| Pandas dataframe | Yes | No |
| Single CSV | Yes | No |
| Multi CSV | Yes | Yes |
| Single Parquet | Yes | No |
| Multi Parquet | Yes | Yes |
| Petastorm | Yes | Yes |
| Ray Dataset | Yes | Yes |
| Dask dataframe | Yes | Yes |
| Modin dataframe | Yes | Yes |
Memory usage¶
Details coming soon.
Best practices
In order to reduce peak memory usage, consider the following suggestions:
- Store data as float32 or less. More precision is often not needed, and keeping data in a smaller format will help reduce peak memory usage for initial data loading.
- Pass the dtype when loading data from CSV. Otherwise, floating point values will be loaded as np.float64 by default, increasing peak memory usage by 33% (see the sketch after this list).
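For instance, keyword arguments passed to RayDMatrix are forwarded to the underlying loading function (pandas.read_csv for CSV files), so a smaller dtype can be requested up front. A sketch with placeholder file and column names:

import numpy as np
from lightgbm_ray import RayDMatrix

# "train.csv" and "label" are placeholders. `dtype` is forwarded to
# `pandas.read_csv`, loading floats as float32 instead of np.float64.
dtrain = RayDMatrix("train.csv", label="label", dtype=np.float32)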
Placement Strategies¶
LightGBM-Ray leverages Ray’s Placement Group API (https://docs.ray.io/en/master/placement-group.html) to implement placement strategies for better fault tolerance.
By default, a SPREAD strategy is used for training, which attempts to spread all of the training workers across the nodes in a cluster on a best-effort basis. This improves fault tolerance since it minimizes the number of worker failures when a node goes down, but comes at the cost of increased inter-node communication.
To disable this strategy, set the RXGB_USE_SPREAD_STRATEGY
environment variable to 0. If disabled, no
particular placement strategy will be used.
When LightGBM-Ray is used with Ray Tune for hyperparameter tuning, a PACK strategy is used. This strategy attempts to place all workers for each trial on the same node on a best-effort basis. This means that if a node goes down, it will be less likely to impact multiple trials.
When placement strategies are used, LightGBM-Ray will wait for 100 seconds for the required resources
to become available, and will fail if the required resources cannot be reserved and the cluster cannot autoscale
to increase the number of resources. You can change the RXGB_PLACEMENT_GROUP_TIMEOUT_S
environment variable to modify
how long this timeout should be.
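A minimal sketch of setting both environment variables from Python before training starts:

import os

# Disable the SPREAD placement strategy (usually not recommended)
os.environ["RXGB_USE_SPREAD_STRATEGY"] = "0"

# Wait up to 10 minutes (instead of 100 seconds) for resources
os.environ["RXGB_PLACEMENT_GROUP_TIMEOUT_S"] = "600"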
More examples¶
For complete end-to-end examples, please have a look at the examples folder:

- Simple sklearn breast cancer dataset example (requires sklearn)
- HIGGS classification example with Parquet (uses the same dataset)
- Test data classification (uses a self-generated dataset)
API reference¶
- class lightgbm_ray.RayParams(num_actors: int = 0, cpus_per_actor: int = 0, gpus_per_actor: int = -1, resources_per_actor: Optional[Dict] = None, elastic_training: bool = False, max_failed_actors: int = 0, max_actor_restarts: int = 0, checkpoint_frequency: int = 5, distributed_callbacks: Optional[List[xgboost_ray.callback.DistributedCallback]] = None, allow_less_than_two_cpus: bool = False)[source]¶
Parameters to configure Ray-specific behavior.
- Parameters
num_actors (int) – Number of parallel Ray actors.
cpus_per_actor (int) – Number of CPUs to be used per Ray actor. If smaller than 2, training might be substantially slower because communication work and training work will block each other. This will raise an exception unless allow_less_than_two_cpus is True.
gpus_per_actor (int) – Number of GPUs to be used per Ray actor.
resources_per_actor (Optional[Dict]) – Dict of additional resources required per Ray actor.
allow_less_than_two_cpus (bool) – If True, an exception will not be raised if cpus_per_actor is set to a value smaller than 2. Default False.
max_failed_actors (int) – If elastic_training is True, this specifies the maximum number of failed actors with which we still continue training.
max_actor_restarts (int) – Number of retries when Ray actors fail. Defaults to 0 (no retries). Set to -1 for unlimited retries.
checkpoint_frequency (int) – How often to save checkpoints. Defaults to 5 (every 5th iteration).
PublicAPI (beta): This API is in beta and may change before becoming stable.
Note
The xgboost_ray.RayDMatrix
class is shared with XGBoost-Ray.
- class xgboost_ray.RayDMatrix(data: Union[str, List[str], numpy.ndarray, pandas.core.frame.DataFrame, pandas.core.series.Series, ray.util.data.dataset.MLDataset], label: Optional[Union[str, List[str], numpy.ndarray, pandas.core.frame.DataFrame, pandas.core.series.Series, ray.util.data.dataset.MLDataset]] = None, weight: Optional[Union[str, List[str], numpy.ndarray, pandas.core.frame.DataFrame, pandas.core.series.Series, ray.util.data.dataset.MLDataset]] = None, base_margin: Optional[Union[str, List[str], numpy.ndarray, pandas.core.frame.DataFrame, pandas.core.series.Series, ray.util.data.dataset.MLDataset]] = None, missing: Optional[float] = None, label_lower_bound: Optional[Union[str, List[str], numpy.ndarray, pandas.core.frame.DataFrame, pandas.core.series.Series, ray.util.data.dataset.MLDataset]] = None, label_upper_bound: Optional[Union[str, List[str], numpy.ndarray, pandas.core.frame.DataFrame, pandas.core.series.Series, ray.util.data.dataset.MLDataset]] = None, feature_names: Optional[List[str]] = None, feature_types: Optional[List[numpy.dtype]] = None, qid: Optional[Union[str, List[str], numpy.ndarray, pandas.core.frame.DataFrame, pandas.core.series.Series, ray.util.data.dataset.MLDataset]] = None, num_actors: Optional[int] = None, filetype: Optional[xgboost_ray.data_sources.data_source.RayFileType] = None, ignore: Optional[List[str]] = None, distributed: Optional[bool] = None, sharding: xgboost_ray.matrix.RayShardingMode = RayShardingMode.INTERLEAVED, lazy: bool = False, **kwargs)[source]
XGBoost on Ray DMatrix class.
This is the data object that the training and prediction functions expect. This wrapper manages distributed data by sharding the data for the workers and storing the shards in the object store.
If this class is called without the num_actors argument, it will be lazy loaded. Thus, it will return immediately and only load the data and store it in the Ray object store after load_data(num_actors) or get_data(rank, num_actors) is called.

If this class is instantiated with the num_actors argument, it will directly load the data and store it in the object store. If this should be deferred, pass lazy=True as an argument.

Loading the data will store it in the Ray object store. This object then stores references to the data shards in the Ray object store. Actors can request these shards with the get_data(rank) method, returning dataframes according to the actor rank.

The total number of actors has to remain constant and cannot be changed once it has been set.
- Parameters
data – Data object. Can be a pandas dataframe, pandas series, numpy array, Ray MLDataset, modin dataframe, string pointing to a csv or parquet file, or list of strings pointing to csv or parquet files.

label – Optional label object. Can be a pandas series, numpy array, modin series, string pointing to a csv or parquet file, or a string indicating the column of the data dataframe that contains the label. If this is not a string it must be of the same type as the data argument.

num_actors – Number of actors to shard this data for. If this is not None, data will be loaded and stored in the object store after initialization. If this is None, it will be set by the xgboost_ray.train() function, and the data will be loaded and stored in the object store then. Defaults to None.

filetype (Optional[RayFileType]) – Type of data to read. This is disregarded if a data object like a pandas dataframe is passed as the data argument. For filenames, the filetype is automatically detected via the file name (e.g. .csv will be detected as RayFileType.CSV). Passing this argument will overwrite the detected filetype. If the filetype cannot be determined from the data object, passing this is mandatory. Defaults to None (auto detection).

ignore (Optional[List[str]]) – Exclude these columns from the dataframe after loading the data.

distributed (Optional[bool]) – If True, use distributed loading (each worker loads a share of the dataset). If False, use central loading (the head node loads the whole dataset and distributes it). If None, auto-detect and default to distributed loading, if possible.

sharding (RayShardingMode) – How to shard the data for different workers. RayShardingMode.INTERLEAVED will divide the data per row, i.e. every i-th row will be passed to the first worker, every (i+1)-th row to the second worker, etc. RayShardingMode.BATCH will divide the data in batches, i.e. the first 0-(m-1) rows will be passed to the first worker, the m-(2m-1) rows to the second worker, etc. Defaults to RayShardingMode.INTERLEAVED. If using distributed data loading, sharding happens on a per-file basis, and not on a per-row basis, i.e. for interleaved sharding every i-th file will be passed to the first worker, etc.

lazy (bool) – If num_actors is passed, setting this to True will defer data loading and storing until load_data() or get_data() is called. Defaults to False.

**kwargs – Keyword arguments will be passed to the data loading function. For instance, with RayFileType.PARQUET, these arguments will be passed to pandas.read_parquet().
from xgboost_ray import RayDMatrix, RayFileType

files = ["data_one.parquet", "data_two.parquet"]

columns = ["feature_1", "feature_2", "label_column"]

dtrain = RayDMatrix(
    files,
    num_actors=4,  # Will shard the data for four workers
    label="label_column",  # Will select this column as the label
    columns=columns,  # Will be passed to `pandas.read_parquet()`
    filetype=RayFileType.PARQUET)
PublicAPI (beta): This API is in beta and may change before becoming stable.
- load_data(num_actors: Optional[int] = None, rank: Optional[int] = None)[source]
Load data, putting it into the Ray object store.
If a rank is given, only data for this rank is loaded (for distributed data sources only).
- get_data(rank: int, num_actors: Optional[int] = None) Dict[str, Union[None, pandas.core.frame.DataFrame, List[Optional[pandas.core.frame.DataFrame]]]] [source]
Get data, i.e. return dataframe for a specific actor.
This method is called from an actor, given its rank and the total number of actors. If the data is not yet loaded, loading is triggered.
- unload_data()[source]
Delete object references to clear the object store.
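A short sketch of the loading lifecycle described above, with placeholder file and column names:

from xgboost_ray import RayDMatrix

dmat = RayDMatrix("data.csv", label="target")  # no num_actors: lazy

dmat.load_data(num_actors=2)  # shard the data and fill the object store
shard = dmat.get_data(rank=0, num_actors=2)  # shard for the first actor
dmat.unload_data()  # drop the object store references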
- lightgbm_ray.train(params: Dict, dtrain: xgboost_ray.matrix.RayDMatrix, model_factory: Type[lightgbm.sklearn.LGBMModel] = <class 'lightgbm.sklearn.LGBMModel'>, num_boost_round: int = 10, *args, valid_sets: Optional[List[xgboost_ray.matrix.RayDMatrix]] = None, valid_names: Optional[List[str]] = None, verbose_eval: Union[bool, int] = True, evals: Union[List[Tuple[xgboost_ray.matrix.RayDMatrix, str]], Tuple[xgboost_ray.matrix.RayDMatrix, str]] = (), evals_result: Optional[Dict] = None, additional_results: Optional[Dict] = None, ray_params: Union[None, lightgbm_ray.main.RayParams, Dict] = None, _remote: Optional[bool] = None, **kwargs) lightgbm.sklearn.LGBMModel [source]¶
Distributed LightGBM training via Ray.
This function will connect to a Ray cluster, create num_actors remote actors, send data shards to them, and have them train a LightGBM model using LightGBM’s built-in distributed mode.

This method handles setting up the following network parameters:

- local_listen_port: port that each LightGBM worker opens a listening socket on, to accept connections from other workers. This can differ from LightGBM worker to LightGBM worker, but does not have to.
- machines: a comma-delimited list of all workers in the cluster, in the form ip:port,ip:port. If running multiple workers on the same Ray node, use different ports for each worker. For example, for ray_params.num_actors=3, you might pass "127.0.0.1:12400,127.0.0.1:12401,127.0.0.1:12402".

The default behavior of this function is to generate machines based on Ray workers, and to search for an open port on each worker to be used as local_listen_port.

If machines is provided explicitly in params, this function uses the hosts and ports in that list directly, and will try to start Ray workers on the nodes with the given IPs. If that is not possible, or any of those ports are not free when training starts, training will fail.

If local_listen_port is provided in params and machines is not, this function constructs machines automatically from auto-assigned Ray workers, assuming that each one will use the same local_listen_port.

Failure handling:

LightGBM on Ray supports automatic failure handling that can be configured with the ray_params argument. If an actor or local training task dies, the Ray actor is marked as dead, and if the number of restarts is below ray_params.max_actor_restarts, Ray will try to schedule the dead actor again, load the data shard on this actor, and then continue training from the latest checkpoint.

Otherwise, training is aborted.
- Parameters
params (Dict) – parameter dict passed to LGBMModel.

dtrain (RayDMatrix) – Data object containing the training data.

model_factory (Type[LGBMModel]) – Model class used for training (defaults to LGBMModel).

valid_sets (Optional[List[RayDMatrix]]) – List of data to be evaluated on during training. Mutually exclusive with evals.

valid_names (Optional[List[str]]) – Names of valid_sets.

evals (Union[List[Tuple[RayDMatrix, str]], Tuple[RayDMatrix, str]]) – evals tuple passed to LGBMModel.fit(). Mutually exclusive with valid_sets.

evals_result (Optional[Dict]) – Dict to store evaluation results in.

verbose_eval (Union[bool, int]) – Requires at least one validation data. If True, the eval metric on the valid set is printed at each boosting stage. If int, the eval metric on the valid set is printed at every verbose_eval boosting stage. The last boosting stage or the boosting stage found by using early_stopping_rounds is also printed. With verbose_eval = 4 and at least one item in valid_sets, an evaluation metric is printed every 4 (instead of 1) boosting stages.

additional_results (Optional[Dict]) – Dict to store additional results.

ray_params (Union[None, lightgbm_ray.RayParams, Dict]) – Parameters to configure Ray-specific behavior. See lightgbm_ray.RayParams for a list of valid configuration parameters.

_remote (bool) – Whether to run the driver process in a remote function. This is enabled by default in Ray client mode.

**kwargs – Keyword arguments will be passed to the local model_factory.fit() calls.

Returns: An LGBMModel object.

PublicAPI (beta): This API is in beta and may change before becoming stable.
- lightgbm_ray.predict(model: Union[lightgbm.sklearn.LGBMModel, lightgbm.basic.Booster], data: xgboost_ray.matrix.RayDMatrix, method: str = 'predict', ray_params: Union[None, lightgbm_ray.main.RayParams, Dict] = None, _remote: Optional[bool] = None, **kwargs) Optional[numpy.ndarray] [source]¶
Distributed LightGBM predict via Ray.
This function will connect to a Ray cluster, create num_actors remote actors, send data shards to them, and have them predict labels using a LightGBM model. The results are then combined and returned.

- Parameters

model (Union[LGBMModel, Booster]) – Model or booster object to call for prediction.

data (RayDMatrix) – Data object containing the prediction data.

method (str) – Name of estimator method to use for prediction.

ray_params (Union[None, lightgbm_ray.RayParams, Dict]) – Parameters to configure Ray-specific behavior. See lightgbm_ray.RayParams for a list of valid configuration parameters.

_remote (bool) – Whether to run the driver process in a remote function. This is enabled by default in Ray client mode.

**kwargs – Keyword arguments will be passed to the local predict() calls.

Returns: np.ndarray containing the predicted labels.

PublicAPI (beta): This API is in beta and may change before becoming stable.
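As an illustration of the method argument, here is a sketch that runs distributed predict_proba with a locally fitted LGBMClassifier:

import lightgbm as lgbm
from lightgbm_ray import RayDMatrix, RayParams, predict
from sklearn.datasets import load_breast_cancer

data, labels = load_breast_cancer(return_X_y=True)
clf = lgbm.LGBMClassifier(n_estimators=10).fit(data, labels)

dpred = RayDMatrix(data, labels)
# `method` selects which estimator method the actors call
proba = predict(clf, dpred, method="predict_proba",
                ray_params=RayParams(num_actors=2))
print(proba)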
scikit-learn API¶
- class lightgbm_ray.RayLGBMClassifier(boosting_type: str = 'gbdt', num_leaves: int = 31, max_depth: int = -1, learning_rate: float = 0.1, n_estimators: int = 100, subsample_for_bin: int = 200000, objective: Optional[Union[str, Callable]] = None, class_weight: Optional[Union[Dict, str]] = None, min_split_gain: float = 0.0, min_child_weight: float = 0.001, min_child_samples: int = 20, subsample: float = 1.0, subsample_freq: int = 0, colsample_bytree: float = 1.0, reg_alpha: float = 0.0, reg_lambda: float = 0.0, random_state: Optional[Union[int, numpy.random.mtrand.RandomState]] = None, n_jobs: int = -1, silent: Union[bool, str] = 'warn', importance_type: str = 'split', **kwargs)[source]¶
PublicAPI (beta): This API is in beta and may change before becoming stable.
- fit(X, y, sample_weight=None, init_score=None, eval_set=None, eval_names: Optional[List[str]] = None, eval_sample_weight=None, eval_class_weight: Optional[List[Union[dict, str]]] = None, eval_init_score=None, eval_metric: Optional[Union[Callable, str, List[Union[Callable, str]]]] = None, ray_params: Union[None, lightgbm_ray.main.RayParams, Dict] = None, _remote: Optional[bool] = None, ray_dmatrix_params: Optional[Dict] = None, **kwargs: Any) lightgbm_ray.sklearn.RayLGBMClassifier [source]¶
Build a gradient boosting model from the training set (X, y).
- Parameters
X (array-like or sparse matrix of shape = [n_samples, n_features]) – Input feature matrix.
y (array-like of shape = [n_samples]) – The target values (class labels in classification, real numbers in regression).
sample_weight (array-like of shape = [n_samples] or None, optional (default=None)) – Weights of training data.
init_score (array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task) or shape = [n_samples, n_classes] (for multi-class task) or None, optional (default=None)) – Init score of training data.
eval_set (list or None, optional (default=None)) – A list of (X, y) tuple pairs to use as validation sets.
eval_names (list of str, or None, optional (default=None)) – Names of eval_set.
eval_sample_weight (list of array, or None, optional (default=None)) – Weights of eval data.
eval_class_weight (list or None, optional (default=None)) – Class weights of eval data.
eval_init_score (list of array, or None, optional (default=None)) – Init score of eval data.
eval_metric (str, callable, list or None, optional (default=None)) – If str, it should be a built-in evaluation metric to use. If callable, it should be a custom evaluation metric, see note below for more details. If list, it can be a list of built-in metrics, a list of custom evaluation metrics, or a mix of both. In either case, the metric from the model parameters will be evaluated and used as well. Default: ‘l2’ for LGBMRegressor, ‘logloss’ for LGBMClassifier, ‘ndcg’ for LGBMRanker.

early_stopping_rounds (int or None, optional (default=None)) – Activates early stopping. The model will train until the validation score stops improving. Validation score needs to improve at least every early_stopping_rounds round(s) to continue training. Requires at least one validation data and one metric. If there’s more than one, will check all of them. But the training data is ignored anyway. To check only the first metric, set the first_metric_only parameter to True in additional parameters **kwargs of the model constructor.

verbose (bool or int, optional (default=True)) – Requires at least one evaluation data. If True, the eval metric on the eval set is printed at each boosting stage. If int, the eval metric on the eval set is printed at every verbose boosting stage. The last boosting stage or the boosting stage found by using early_stopping_rounds is also printed. Example: with verbose = 4 and at least one item in eval_set, an evaluation metric is printed every 4 (instead of 1) boosting stages.

feature_name (list of str, or 'auto', optional (default='auto')) – Feature names. If ‘auto’ and data is pandas DataFrame, data columns names are used.

categorical_feature (list of str or int, or 'auto', optional (default='auto')) – Categorical features. If list of int, interpreted as indices. If list of str, interpreted as feature names (need to specify feature_name as well). If ‘auto’ and data is pandas DataFrame, pandas unordered categorical columns are used. All values in categorical features should be less than int32 max value (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero. All negative values in categorical features will be treated as missing values. The output cannot be monotonically constrained with respect to a categorical feature.

callbacks (list of callable, or None, optional (default=None)) – List of callback functions that are applied at each iteration. See Callbacks in Python API for more information.

init_model (str, pathlib.Path, Booster, LGBMModel or None, optional (default=None)) – Filename of LightGBM model, Booster instance or LGBMModel instance used for continue training.

ray_params (lightgbm_ray.RayParams or dict, optional (default=None)) – Parameters to configure Ray-specific behavior. See lightgbm_ray.RayParams for a list of valid configuration parameters. Will override the n_jobs attribute with its own num_actors parameter.

_remote (bool, optional (default=False)) – Whether to run the driver process in a remote function. This is enabled by default in Ray client mode.
ray_dmatrix_params (dict, optional (default=None)) – Dict of parameters (such as sharding mode) passed to the internal RayDMatrix initialization.
- Returns
self – Returns self.
- Return type
object
Note

A custom eval function expects a callable with one of the following signatures: func(y_true, y_pred), func(y_true, y_pred, weight) or func(y_true, y_pred, weight, group), and returns (eval_name, eval_result, is_higher_better) or a list of (eval_name, eval_result, is_higher_better):

- y_true : array-like of shape = [n_samples]
  The target values.
- y_pred : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
  The predicted values. In case of custom objective, predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task in this case.
- weight : array-like of shape = [n_samples]
  The weight of samples.
- group : array-like
  Group/query data. Only used in the learning-to-rank task. sum(group) = n_samples. For example, if you have a 100-document dataset with group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.
- eval_name : str
  The name of the evaluation function (without whitespace).
- eval_result : float
  The eval result.
- is_higher_better : bool
  Is eval result higher better, e.g. AUC is is_higher_better.

For the multi-class task, y_pred is grouped by class_id first, then grouped by row_id. If you want to get the i-th row y_pred in the j-th class, the access way is y_pred[j * num_data + i].
- predict_proba(X, *, ray_params: Union[None, lightgbm_ray.main.RayParams, Dict] = None, _remote: Optional[bool] = None, ray_dmatrix_params: Optional[Dict] = None, **kwargs)[source]¶
Return the predicted probability for each class for each sample.
- Parameters
X (array-like or sparse matrix of shape = [n_samples, n_features]) – Input features matrix.
raw_score (bool, optional (default=False)) – Whether to predict raw scores.
start_iteration (int, optional (default=0)) – Start index of the iteration to predict. If <= 0, starts from the first iteration.
num_iteration (int or None, optional (default=None)) – Total number of iterations used in the prediction. If None, if the best iteration exists and start_iteration <= 0, the best iteration is used; otherwise, all iterations from start_iteration are used (no limits). If <= 0, all iterations from start_iteration are used (no limits).

pred_leaf (bool, optional (default=False)) – Whether to predict leaf index.

pred_contrib (bool, optional (default=False)) – Whether to predict feature contributions.

Note

If you want to get more explanations for your model’s predictions using SHAP values, like SHAP interaction values, you can install the shap package (https://github.com/slundberg/shap). Note that unlike the shap package, with pred_contrib we return a matrix with an extra column, where the last column is the expected value.

ray_params (lightgbm_ray.RayParams or dict, optional (default=None)) – Parameters to configure Ray-specific behavior. See lightgbm_ray.RayParams for a list of valid configuration parameters. Will override the n_jobs attribute with its own num_actors parameter.

_remote (bool, optional (default=False)) – Whether to run the driver process in a remote function. This is enabled by default in Ray client mode.
ray_dmatrix_params (dict, optional (default=None)) – Dict of parameters (such as sharding mode) passed to the internal RayDMatrix initialization.
**kwargs – Other parameters for the prediction.
- Returns
predicted_probability (array-like of shape = [n_samples] or shape = [n_samples, n_classes]) – The predicted values.
X_leaves (array-like of shape = [n_samples, n_trees] or shape = [n_samples, n_trees * n_classes]) – If pred_leaf=True, the predicted leaf of every tree for each sample.

X_SHAP_values (array-like of shape = [n_samples, n_features + 1] or shape = [n_samples, (n_features + 1) * n_classes] or list with n_classes length of such objects) – If pred_contrib=True, the feature contributions for each sample.
- predict(X, *, ray_params: Union[None, lightgbm_ray.main.RayParams, Dict] = None, _remote: Optional[bool] = None, ray_dmatrix_params: Optional[Dict] = None, **kwargs)[source]¶
Return the predicted value for each sample.
- Parameters
X (array-like or sparse matrix of shape = [n_samples, n_features]) – Input features matrix.
raw_score (bool, optional (default=False)) – Whether to predict raw scores.
start_iteration (int, optional (default=0)) – Start index of the iteration to predict. If <= 0, starts from the first iteration.
num_iteration (int or None, optional (default=None)) – Total number of iterations used in the prediction. If None, if the best iteration exists and start_iteration <= 0, the best iteration is used; otherwise, all iterations from start_iteration are used (no limits). If <= 0, all iterations from start_iteration are used (no limits).

pred_leaf (bool, optional (default=False)) – Whether to predict leaf index.

pred_contrib (bool, optional (default=False)) – Whether to predict feature contributions.

Note

If you want to get more explanations for your model’s predictions using SHAP values, like SHAP interaction values, you can install the shap package (https://github.com/slundberg/shap). Note that unlike the shap package, with pred_contrib we return a matrix with an extra column, where the last column is the expected value.

ray_params (lightgbm_ray.RayParams or dict, optional (default=None)) – Parameters to configure Ray-specific behavior. See lightgbm_ray.RayParams for a list of valid configuration parameters. Will override the n_jobs attribute with its own num_actors parameter.

_remote (bool, optional (default=False)) – Whether to run the driver process in a remote function. This is enabled by default in Ray client mode.
ray_dmatrix_params (dict, optional (default=None)) – Dict of parameters (such as sharding mode) passed to the internal RayDMatrix initialization.
**kwargs – Other parameters for the prediction.
- Returns
predicted_result (array-like of shape = [n_samples] or shape = [n_samples, n_classes]) – The predicted values.
X_leaves (array-like of shape = [n_samples, n_trees] or shape = [n_samples, n_trees * n_classes]) – If pred_leaf=True, the predicted leaf of every tree for each sample.

X_SHAP_values (array-like of shape = [n_samples, n_features + 1] or shape = [n_samples, (n_features + 1) * n_classes] or list with n_classes length of such objects) – If pred_contrib=True, the feature contributions for each sample.
- class lightgbm_ray.RayLGBMRegressor(boosting_type: str = 'gbdt', num_leaves: int = 31, max_depth: int = -1, learning_rate: float = 0.1, n_estimators: int = 100, subsample_for_bin: int = 200000, objective: Optional[Union[str, Callable]] = None, class_weight: Optional[Union[Dict, str]] = None, min_split_gain: float = 0.0, min_child_weight: float = 0.001, min_child_samples: int = 20, subsample: float = 1.0, subsample_freq: int = 0, colsample_bytree: float = 1.0, reg_alpha: float = 0.0, reg_lambda: float = 0.0, random_state: Optional[Union[int, numpy.random.mtrand.RandomState]] = None, n_jobs: int = -1, silent: Union[bool, str] = 'warn', importance_type: str = 'split', **kwargs)[source]¶
PublicAPI (beta): This API is in beta and may change before becoming stable.
- fit(X, y, sample_weight=None, init_score=None, eval_set=None, eval_names: Optional[List[str]] = None, eval_sample_weight=None, eval_init_score=None, eval_metric: Optional[Union[Callable, str, List[Union[Callable, str]]]] = None, ray_params: Union[None, lightgbm_ray.main.RayParams, Dict] = None, _remote: Optional[bool] = None, ray_dmatrix_params: Optional[Dict] = None, **kwargs: Any) lightgbm_ray.sklearn.RayLGBMRegressor [source]¶
Build a gradient boosting model from the training set (X, y).
- Parameters
X (array-like or sparse matrix of shape = [n_samples, n_features]) – Input feature matrix.
y (array-like of shape = [n_samples]) – The target values (class labels in classification, real numbers in regression).
sample_weight (array-like of shape = [n_samples] or None, optional (default=None)) – Weights of training data.
init_score (array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task) or shape = [n_samples, n_classes] (for multi-class task) or None, optional (default=None)) – Init score of training data.
eval_set (list or None, optional (default=None)) – A list of (X, y) tuple pairs to use as validation sets.
eval_names (list of str, or None, optional (default=None)) – Names of eval_set.
eval_sample_weight (list of array, or None, optional (default=None)) – Weights of eval data.
eval_init_score (list of array, or None, optional (default=None)) – Init score of eval data.
eval_metric (str, callable, list or None, optional (default=None)) – If str, it should be a built-in evaluation metric to use. If callable, it should be a custom evaluation metric, see note below for more details. If list, it can be a list of built-in metrics, a list of custom evaluation metrics, or a mix of both. In either case, the metric from the model parameters will be evaluated and used as well. Default: ‘l2’ for LGBMRegressor, ‘logloss’ for LGBMClassifier, ‘ndcg’ for LGBMRanker.

early_stopping_rounds (int or None, optional (default=None)) – Activates early stopping. The model will train until the validation score stops improving. Validation score needs to improve at least every early_stopping_rounds round(s) to continue training. Requires at least one validation data and one metric. If there’s more than one, will check all of them. But the training data is ignored anyway. To check only the first metric, set the first_metric_only parameter to True in additional parameters **kwargs of the model constructor.

verbose (bool or int, optional (default=True)) – Requires at least one evaluation data. If True, the eval metric on the eval set is printed at each boosting stage. If int, the eval metric on the eval set is printed at every verbose boosting stage. The last boosting stage or the boosting stage found by using early_stopping_rounds is also printed. Example: with verbose = 4 and at least one item in eval_set, an evaluation metric is printed every 4 (instead of 1) boosting stages.

feature_name (list of str, or 'auto', optional (default='auto')) – Feature names. If ‘auto’ and data is pandas DataFrame, data columns names are used.

categorical_feature (list of str or int, or 'auto', optional (default='auto')) – Categorical features. If list of int, interpreted as indices. If list of str, interpreted as feature names (need to specify feature_name as well). If ‘auto’ and data is pandas DataFrame, pandas unordered categorical columns are used. All values in categorical features should be less than int32 max value (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero. All negative values in categorical features will be treated as missing values. The output cannot be monotonically constrained with respect to a categorical feature.

callbacks (list of callable, or None, optional (default=None)) – List of callback functions that are applied at each iteration. See Callbacks in Python API for more information.

init_model (str, pathlib.Path, Booster, LGBMModel or None, optional (default=None)) – Filename of LightGBM model, Booster instance or LGBMModel instance used for continue training.

ray_params (lightgbm_ray.RayParams or dict, optional (default=None)) – Parameters to configure Ray-specific behavior. See lightgbm_ray.RayParams for a list of valid configuration parameters. Will override the n_jobs attribute with its own num_actors parameter.

_remote (bool, optional (default=False)) – Whether to run the driver process in a remote function. This is enabled by default in Ray client mode.
ray_dmatrix_params (dict, optional (default=None)) – Dict of parameters (such as sharding mode) passed to the internal RayDMatrix initialization.
- Returns
self – Returns self.
- Return type
object
Note

A custom eval function expects a callable with one of the following signatures: func(y_true, y_pred), func(y_true, y_pred, weight) or func(y_true, y_pred, weight, group), and returns (eval_name, eval_result, is_higher_better) or a list of (eval_name, eval_result, is_higher_better):

- y_true : array-like of shape = [n_samples]
  The target values.
- y_pred : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
  The predicted values. In case of custom objective, predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task in this case.
- weight : array-like of shape = [n_samples]
  The weight of samples.
- group : array-like
  Group/query data. Only used in the learning-to-rank task. sum(group) = n_samples. For example, if you have a 100-document dataset with group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.
- eval_name : str
  The name of the evaluation function (without whitespace).
- eval_result : float
  The eval result.
- is_higher_better : bool
  Is eval result higher better, e.g. AUC is is_higher_better.

For the multi-class task, y_pred is grouped by class_id first, then grouped by row_id. If you want to get the i-th row y_pred in the j-th class, the access way is y_pred[j * num_data + i].
- predict(X, *, ray_params: Union[None, lightgbm_ray.main.RayParams, Dict] = None, _remote: Optional[bool] = None, ray_dmatrix_params: Optional[Dict] = None, **kwargs)[source]¶
Return the predicted value for each sample.
- Parameters
X (array-like or sparse matrix of shape = [n_samples, n_features]) – Input features matrix.
raw_score (bool, optional (default=False)) – Whether to predict raw scores.
start_iteration (int, optional (default=0)) – Start index of the iteration to predict. If <= 0, starts from the first iteration.
num_iteration (int or None, optional (default=None)) – Total number of iterations used in the prediction. If None, if the best iteration exists and start_iteration <= 0, the best iteration is used; otherwise, all iterations from start_iteration are used (no limits). If <= 0, all iterations from start_iteration are used (no limits).

pred_leaf (bool, optional (default=False)) – Whether to predict leaf index.

pred_contrib (bool, optional (default=False)) – Whether to predict feature contributions.

Note

If you want to get more explanations for your model’s predictions using SHAP values, like SHAP interaction values, you can install the shap package (https://github.com/slundberg/shap). Note that unlike the shap package, with pred_contrib we return a matrix with an extra column, where the last column is the expected value.

ray_params (lightgbm_ray.RayParams or dict, optional (default=None)) – Parameters to configure Ray-specific behavior. See lightgbm_ray.RayParams for a list of valid configuration parameters. Will override the n_jobs attribute with its own num_actors parameter.

_remote (bool, optional (default=False)) – Whether to run the driver process in a remote function. This is enabled by default in Ray client mode.
ray_dmatrix_params (dict, optional (default=None)) – Dict of parameters (such as sharding mode) passed to the internal RayDMatrix initialization.
**kwargs – Other parameters for the prediction.
- Returns
predicted_result (array-like of shape = [n_samples] or shape = [n_samples, n_classes]) – The predicted values.
X_leaves (array-like of shape = [n_samples, n_trees] or shape = [n_samples, n_trees * n_classes]) – If pred_leaf=True, the predicted leaf of every tree for each sample.

X_SHAP_values (array-like of shape = [n_samples, n_features + 1] or shape = [n_samples, (n_features + 1) * n_classes] or list with n_classes length of such objects) – If pred_contrib=True, the feature contributions for each sample.