A side-by-side reference of decision trees, random forests, gradient boosting (sklearn), and XGBoost — grouped by functional category, with a separate tuning-priority view that shows which knobs actually move the needle for each model.
Every common hyperparameter, grouped by what it does. A dash (—) means the parameter doesn't exist in that model. Where names differ across libraries, the model-specific name is shown.
| Parameter | Purpose | DT | RF | GBM | XGB |
|---|---|---|---|---|---|
| Core / ensemble structure | |||||
| n_estimators | Number of trees in the ensemble | — | n_estimators | n_estimators | n_estimators |
| learning_rate | Step size scaling each tree's contribution | — | — | learning_rate | learning_rate / eta |
| booster type | Base learner family (tree / linear / dart) | — | — | — | booster |
| init estimator / F₀ | Initial constant prediction baseline | — | — | init | base_score |
| Splitting criterion | |||||
| split criterion | Impurity / quality measure for picking splits | criterion | criterion | criterion | objective (built into loss) |
| splitter strategy | Best vs random feature/threshold search | splitter | — | — | tree_method (hist/exact/approx) |
| grow policy | Depth-wise vs best-first leaf expansion | — | — | — | grow_policy |
| Pre-pruning / tree shape | |||||
| max_depth | Maximum tree depth | max_depth | max_depth | max_depth | max_depth |
| max_leaf_nodes / max_leaves | Cap on total leaves (alternative to depth) | max_leaf_nodes | max_leaf_nodes | max_leaf_nodes | max_leaves |
| min_samples_split | Min samples required to consider a split | min_samples_split | min_samples_split | min_samples_split | — |
| min_samples_leaf / min_child_weight | Min samples / Hessian sum required per leaf | min_samples_leaf | min_samples_leaf | min_samples_leaf | min_child_weight |
| min_weight_fraction_leaf | Weight-based leaf size constraint | min_weight_fraction_leaf | min_weight_fraction_leaf | min_weight_fraction_leaf | — |
| min_impurity_decrease / gamma | Min gain required to allow a split | min_impurity_decrease | min_impurity_decrease | min_impurity_decrease | gamma |
| Feature subsampling | |||||
| max_features | Features considered per split | max_features | max_features | max_features | colsample_bynode |
| features per tree | Features sampled once per tree | — | — | — | colsample_bytree |
| features per level | Features sampled per tree depth level | — | — | — | colsample_bylevel |
| Row subsampling (stochasticity) | |||||
| bootstrap | Sample rows with replacement per tree | — | bootstrap | — | — |
| max_samples / subsample | Fraction of rows used per tree | — | max_samples | subsample | subsample |
| oob_score | Use out-of-bag samples for free validation | — | oob_score | — | — |
| Regularization on leaf values | |||||
| L2 on leaf weights | Shrinks leaf magnitudes toward zero | — | — | — | reg_lambda |
| L1 on leaf weights | Sparsity on leaf values | — | — | — | reg_alpha |
| cost-complexity pruning | Post-pruning by leaf cost | ccp_alpha | ccp_alpha | ccp_alpha | — |
| max_delta_step | Caps leaf weight change per update (imbalance) | — | — | — | max_delta_step |
| Loss / objective | |||||
| loss function | What's being optimized | — | — | loss | objective |
| eval metric | Metric monitored during training | — | — | — | eval_metric |
| Class balance | |||||
| class_weight | Per-class weighting in impurity | class_weight | class_weight | — | — |
| scale_pos_weight | Gradient scaling for positive class (binary) | — | — | — | scale_pos_weight |
| sample_weight | Per-row weights passed to fit() | ✓ (fit arg) | ✓ (fit arg) | ✓ (fit arg) | ✓ (fit arg) |
| Early stopping | |||||
| validation_fraction | Hold-out fraction for early-stop monitoring | — | — | validation_fraction | — |
| n_iter_no_change / early_stopping_rounds | Patience before stopping | — | — | n_iter_no_change | early_stopping_rounds |
| tol | Min improvement to count as progress | — | — | tol | — |
| Constraints (domain knowledge) | |||||
| monotonic_cst | Force monotonic increase/decrease per feature | monotonic_cst | monotonic_cst | — | monotone_constraints |
| interaction_constraints | Restrict which features can interact | — | — | — | interaction_constraints |
| Compute / system | |||||
| n_jobs | Parallel CPU threads | — | n_jobs | — | n_jobs / nthread |
| device / GPU | CPU vs GPU training | — | — | — | device |
| tree_method | Split-finding algorithm (exact / hist / approx) | — | — | — | tree_method |
| random_state | Reproducibility seed | random_state | random_state | random_state | random_state |
| warm_start | Incremental fitting (add more estimators) | — | warm_start | warm_start | — |
| verbose | Training progress output | — | verbose | verbose | verbosity |
| Missing values | |||||
| missing handling | Native NaN handling at split time | manual | manual | manual | missing (native) |
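The "doesn't exist in that model" cells can be checked directly against an installed scikit-learn. The sketch below is only an illustration, not part of the original reference: it compares the constructor parameters reported by `get_params()` for the three scikit-learn models (XGBoost is left out so the snippet needs only scikit-learn), and the helper name `availability` is made up for this example.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# One default-constructed estimator per table column (scikit-learn models only).
models = {
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(),
    "GBM": GradientBoostingClassifier(),
}

# get_params() returns a dict keyed by constructor parameter name.
param_names = {name: set(model.get_params()) for name, model in models.items()}


def availability(param):
    """Hypothetical helper: which models accept `param` in their constructor."""
    return {name: param in names for name, names in param_names.items()}


for param in ["n_estimators", "learning_rate", "splitter",
              "bootstrap", "subsample", "ccp_alpha"]:
    print(f"{param:<15}", availability(param))
```

Parameters that are passed to `fit()` rather than the constructor, such as sample_weight, will not show up in this check.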
Not all parameters matter equally — and a parameter that's critical for one model may be irrelevant for another. This table ranks each parameter by how much it actually affects performance in each model, with typical tuning ranges.
| Parameter | DT | RF | GBM | XGB | Typical range | Why it matters where it matters |
|---|---|---|---|---|---|---|
| n_estimators | — | Medium | Critical | Critical | RF: 200–500; Boost: 1000–5000 + ES | In RF, more trees are better, with diminishing returns. In boosting it acts as a regularizer: too many trees overfit, so use early stopping (ES). |
| learning_rate | — | — | Critical | Critical | 0.01–0.1 | Lower lr + more trees nearly always beats higher lr + fewer trees. Pairs with n_estimators. |
| max_depth | Critical | Low | High | Critical | DT: 3–20 (CV); RF: None; Boost: 3–8 | DT's main overfit lever. RF wants deep trees (averaging reduces their variance). Boosting wants shallow trees (each tree is only a small correction to the current ensemble). |
| min_samples_leaf / min_child_weight | High | Medium | Medium | High | 1, 5, 10, 20+ | Often more effective than max_depth for DT. In XGB it's in Hessian-space — the second tuning lever after depth. |
| min_samples_split | Medium | Low | Low | — | 2, 10, 20 | min_samples_leaf is usually the better lever; this one is redundant most of the time. |
| min_impurity_decrease / gamma | Low | Low | Low | High | XGB gamma: 0–5 | In XGB this is built into the split-gain math and meaningfully prunes weak splits. Elsewhere mostly ignored. |
| max_features (per split) | Low | Critical | Medium | Medium | RF: sqrt, log2, 0.3–0.5; XGB: colsample_bynode | Reducing it hurts a single tree but is the core of RF's value (decorrelation). Useful as light regularization in boosting. |
| colsample_bytree | — | — | — | High | 0.6–1.0 | XGB's per-tree feature sampling. Standard regularization knob for boosting on wider feature sets. |
| subsample / max_samples | — | Low | High | High | 0.6–0.9 | Stochastic GBM trick — adds row-level randomness, reduces overfitting, often improves generalization at low cost. |
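| bootstrap | — | Low | — | — | True (default) | Almost always leave True. Turning it off trains every tree on the full dataset, leaving max_features as the only source of randomness; Extra Trees goes further by also randomizing split thresholds. |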
| reg_lambda (L2) | — | — | — | High | 0.1–10 (log scale) | XGB-specific. Shrinks leaf weights directly. Effective when overfitting persists despite depth/subsampling. |
| reg_alpha (L1) | — | — | — | Medium | 0, 0.01–1 | Useful for very high-dim data or when you want sparsity in leaf magnitudes. Usually leave at 0. |
| ccp_alpha | Critical | Low | Low | — | Use pruning path | For a single DT this is the cleanest principled way to size the tree. Redundant in ensembles. |
| criterion | Low | Low | Low | Low | Default fine | Gini vs entropy, squared_error vs friedman_mse: these rarely move the needle. Spend cycles elsewhere. |
| splitter | Low | — | — | — | "best" | "random" only useful when hand-rolling ensembles; for a single DT always "best". |
| loss / objective | — | — | Medium | Medium | Match task | Defaults handle most cases. Choose carefully when you care about robust regression (huber, mae) or ranking. |
| eval_metric | — | — | — | High | Match business metric | Drives early stopping — set this to what you actually care about (AUC vs logloss matters). |
| class_weight / scale_pos_weight | High | High | — | High | "balanced" or neg/pos ratio | Always relevant on imbalanced data. GBM uses sample_weight instead. |
| early_stopping | — | — | Critical | Critical | patience 20–50 | Essentially tunes n_estimators for free. Always use in boosting; doesn't apply to bagging. |
| tree_method | — | — | — | Medium | "hist" for large data | Not accuracy tuning per se, but "hist" is 5–10× faster on large data with almost no accuracy loss. |
| oob_score | — | Medium | — | — | True (when useful) | Not performance tuning — but a free validation estimate, often replaces a CV loop. |
| monotonic constraints | Low | Low | — | Low | Domain-driven | Not about accuracy — about model trust, fairness, and regulatory compliance. Set when domain knowledge demands it. |
| random_state | Low | Low | Low | Low | Any fixed int | Set for reproducibility, but don't search over it — that's overfitting to seed. |
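The early_stopping row claims that patience effectively tunes n_estimators. A minimal sketch of that idea with scikit-learn's `GradientBoostingClassifier` follows; the synthetic dataset and the specific values (500 trees as an upper bound, patience 20) are assumptions chosen only to keep the example fast, not recommendations from the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data, purely for illustration.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)

gbm = GradientBoostingClassifier(
    learning_rate=0.05,
    n_estimators=500,          # treated as an upper bound, not a tuned value
    n_iter_no_change=20,       # patience: stop after 20 rounds without improvement
    validation_fraction=0.15,  # rows held out internally to monitor the loss
    max_depth=3,
    subsample=0.8,
    random_state=0,
)
gbm.fit(X, y)

# n_estimators_ is the number of trees kept after early stopping.
print("trees requested:", gbm.n_estimators)
print("trees kept:     ", gbm.n_estimators_)
```

If `n_estimators_` comes back equal to the upper bound, early stopping never triggered and the bound is probably too low for the chosen learning rate.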
A practical sequence to follow when tuning each model — what to set first, second, third. Stop when accuracy plateaus.
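The decision-tree sequence starts from the cost-complexity pruning path. As a rough sketch of what that looks like in scikit-learn (the synthetic data, the cap of about 20 candidates, and the variable names are assumptions made for this example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, purely for illustration.
X, y = make_classification(
    n_samples=600, n_features=15, n_informative=5, random_state=0
)

# The pruning path lists the alphas at which the fully grown tree loses a subtree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Drop the last alpha (it prunes the tree down to the root), clip tiny negative
# values caused by floating-point error, and remove duplicates.
alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0.0, None))

# Thin the candidates so the search stays cheap (assumed budget of about 20).
step = max(1, len(alphas) // 20)
candidate_alphas = alphas[::step]

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": candidate_alphas},
    cv=5,
)
search.fit(X, y)

print("candidates tried:", len(candidate_alphas))
print("best ccp_alpha:  ", search.best_params_["ccp_alpha"])
print("CV accuracy:     ", round(search.best_score_, 3))
```

Searching only the alphas on the path avoids guessing a grid: every other value of ccp_alpha yields the same tree as one of these candidates.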
Decision tree:

1. Use cost_complexity_pruning_path() to get candidate ccp_alphas
2. Search over ccp_alpha; this principled axis is usually enough
3. Grid over max_depth × min_samples_leaf
4. Set class_weight="balanced" if imbalanced
5. Leave criterion, splitter, max_features at defaults

Random forest:

1. n_estimators: set to 300–500, don't tune
2. max_features: actually tune; try sqrt, log2, 0.3, 0.5
3. min_samples_leaf: try 1, 2, 5, 10
4. max_depth: usually leave at None; constrain only if overfitting
5. Set class_weight="balanced_subsample" if imbalanced
6. Set oob_score=True to skip a CV loop

Gradient boosting (sklearn):

1. Start with learning_rate=0.05, n_estimators=1000+
2. Set n_iter_no_change=20, validation_fraction=0.15
3. max_depth: try 3, 5, 7
4. min_samples_leaf: try 1, 5, 20
5. Set subsample=0.8, max_features="sqrt"
6. Match loss to your task (huber for outliers, etc.)

XGBoost:

1. Start with learning_rate=0.05, n_estimators=2000, early_stopping_rounds=50
2. Set eval_metric to your true target (e.g. "auc")
3. Tune max_depth (3–9) and min_child_weight (1–20)
4. Tune subsample and colsample_bytree (both 0.6–1.0)
5. Tune reg_lambda (0.1–10, log scale) and gamma (0–5)
6. Set tree_method="hist" for speed on large data
7. Set scale_pos_weight if imbalanced binary

Extrapolation is the most misunderstood failure mode of tree-based models, random forests and gradient boosting included. Outside the training range a tree does not extend the trend linearly, and it does not degrade gracefully. It does something cruder, and you need to know exactly what.
| discount | training lift | note |
|---|---|---|
| 10 % | 5 % | seen |
| 15 % | 12 % | seen |
| 20 % | 18 % | seen |
| 25 % | 25 % | seen — training max |
| 40 % | ? | extrapolation request |
- discount < 12.5 → predict 5
- 12.5 ≤ discount < 17.5 → predict 12
- 17.5 ≤ discount < 22.5 → predict 18
- discount ≥ 22.5 → predict 25 ← boundary leaf

The tree predicts exactly the same lift for a 40 % promo as for a 25 % promo, because 40 % falls into the same boundary leaf. Linear regression on the same four points (slope ≈ 1.32, intercept ≈ −8.1) at least scales with the input: 1.32 × 40 − 8.1 ≈ 44.7.
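The four-point example can be reproduced in a few lines. This is a sketch using scikit-learn's `DecisionTreeRegressor` and `LinearRegression` on the table's data; the variable names are chosen for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# The four training rows from the table: discount (%) -> lift (%).
discount = np.array([[10.0], [15.0], [20.0], [25.0]])
lift = np.array([5.0, 12.0, 18.0, 25.0])

tree = DecisionTreeRegressor(random_state=0).fit(discount, lift)
linear = LinearRegression().fit(discount, lift)

# Split thresholds of the fitted tree (internal nodes have feature index >= 0).
internal = tree.tree_.feature >= 0
thresholds = sorted(tree.tree_.threshold[internal].tolist())
print("tree thresholds:", thresholds)

# Compare the training maximum (25 %) with the extrapolation request (40 %).
query = np.array([[25.0], [40.0]])
tree_pred = tree.predict(query)
linear_pred = linear.predict(query)

print("tree   at 25 and 40:", tree_pred)               # identical: same boundary leaf
print("linear at 25 and 40:", linear_pred.round(1))    # keeps growing with the input
```

With unrounded coefficients (slope 1.32, intercept −8.1) the linear model lands at about 44.7 for the 40 % promo, while the tree stays at 25.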
- … lift / baseline_traffic rather than absolute lift.)
- … [min, max] per feature and refuse to predict (or fall back to a simpler model).
- … monotone_constraints). Forces the tree to be monotone in chosen features. Doesn't fix flat extrapolation, but at least prevents the absurd cases where a deeper boundary leaf decreases the prediction.

RandomForestRegressor, GradientBoostingRegressor, XGBRegressor, LGBMRegressor: averaging or boosting a bunch of step-function predictors still gives you a step-function predictor.
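The range-guard idea (record `[min, max]` per feature, then fall back to a simpler model outside it) can be sketched as a thin wrapper. `RangeGuardedRegressor` is a hypothetical class written for this illustration, not an existing scikit-learn API, and the choice of a linear fallback is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression


class RangeGuardedRegressor:
    """Hypothetical wrapper: tree model inside the training range, fallback outside."""

    def __init__(self, tree_model, fallback_model):
        self.tree_model = tree_model
        self.fallback_model = fallback_model

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        # Remember the per-feature range seen during training.
        self.min_ = X.min(axis=0)
        self.max_ = X.max(axis=0)
        self.tree_model.fit(X, y)
        self.fallback_model.fit(X, y)
        return self

    def out_of_range(self, X):
        X = np.asarray(X, dtype=float)
        # A row is flagged if any feature leaves its training range.
        return ((X < self.min_) | (X > self.max_)).any(axis=1)

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        mask = self.out_of_range(X)
        pred = self.tree_model.predict(X)
        if mask.any():
            # Replace flat tree predictions with the fallback for flagged rows.
            pred[mask] = self.fallback_model.predict(X[mask])
        return pred


discount = np.array([[10.0], [15.0], [20.0], [25.0]])
lift = np.array([5.0, 12.0, 18.0, 25.0])

model = RangeGuardedRegressor(
    RandomForestRegressor(n_estimators=50, random_state=0),
    LinearRegression(),
).fit(discount, lift)

query = np.array([[20.0], [40.0]])
flags = model.out_of_range(query)
pred = model.predict(query)
forest_only = model.tree_model.predict(np.array([[40.0]]))[0]

print("out of range:", flags)
print("guarded prediction:", pred.round(1))
print("forest alone at 40 %:", round(forest_only, 1))  # never exceeds the training max
```

Whether to fall back, raise an error, or merely log the flagged rows is a product decision; the wrapper only makes the out-of-range case visible.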