Tree-based model hyperparameters

A side-by-side reference of decision trees, random forests, gradient boosting (sklearn), and XGBoost — grouped by functional category, with a separate tuning-priority view that shows which knobs actually move the needle for each model.

Legend: DT = Decision Tree · RF = Random Forest · GBM = sklearn Gradient Boosting · XGB = XGBoost

1. Hyperparameter map across models

Every common hyperparameter, grouped by what it actually does. Empty cells mean the parameter doesn't exist in that model. Where names differ across libraries, the model-specific name is shown.

Parameter	Purpose	DT	RF	GBM	XGB
Core / ensemble structure
n_estimators	Number of trees in the ensemble	—	`n_estimators`	`n_estimators`	`n_estimators`
learning_rate	Step size scaling each tree's contribution	—	—	`learning_rate`	`learning_rate` / `eta`
booster type	Base learner family (tree / linear / dart)	—	—	—	`booster`
init estimator / F₀	Initial constant prediction baseline	—	—	`init`	`base_score`
Splitting criterion
split criterion	Impurity / quality measure for picking splits	`criterion`	`criterion`	`criterion`	`objective` (built into loss)
splitter strategy	Best vs random feature/threshold search	`splitter`	—	—	`tree_method` (hist/exact/approx)
grow policy	Depth-wise vs best-first leaf expansion	—	—	—	`grow_policy`
Pre-pruning / tree shape
max_depth	Maximum tree depth	`max_depth`	`max_depth`	`max_depth`	`max_depth`
max_leaf_nodes / max_leaves	Cap on total leaves (alternative to depth)	`max_leaf_nodes`	`max_leaf_nodes`	`max_leaf_nodes`	`max_leaves`
min_samples_split	Min samples required to consider a split	`min_samples_split`	`min_samples_split`	`min_samples_split`	—
min_samples_leaf / min_child_weight	Min samples / Hessian sum required per leaf	`min_samples_leaf`	`min_samples_leaf`	`min_samples_leaf`	`min_child_weight`
min_weight_fraction_leaf	Weight-based leaf size constraint	`min_weight_fraction_leaf`	`min_weight_fraction_leaf`	`min_weight_fraction_leaf`	—
min_impurity_decrease / gamma	Min gain required to allow a split	`min_impurity_decrease`	`min_impurity_decrease`	`min_impurity_decrease`	`gamma`
Feature subsampling
max_features	Features considered per split	`max_features`	`max_features`	`max_features`	`colsample_bynode`
features per tree	Features sampled once per tree	—	—	—	`colsample_bytree`
features per level	Features sampled per tree depth level	—	—	—	`colsample_bylevel`
Row subsampling (stochasticity)
bootstrap	Sample rows with replacement per tree	—	`bootstrap`	—	—
max_samples / subsample	Fraction of rows used per tree	—	`max_samples`	`subsample`	`subsample`
oob_score	Use out-of-bag samples for free validation	—	`oob_score`	—	—
Regularization on leaf values
L2 on leaf weights	Shrinks leaf magnitudes toward zero	—	—	—	`reg_lambda`
L1 on leaf weights	Sparsity on leaf values	—	—	—	`reg_alpha`
cost-complexity pruning	Post-pruning by leaf cost	`ccp_alpha`	`ccp_alpha`	`ccp_alpha`	—
max_delta_step	Caps leaf weight change per update (imbalance)	—	—	—	`max_delta_step`
Loss / objective
loss function	What's being optimized	—	—	`loss`	`objective`
eval metric	Metric monitored during training	—	—	—	`eval_metric`
Class balance
class_weight	Per-class weighting in impurity	`class_weight`	`class_weight`	—	—
scale_pos_weight	Gradient scaling for positive class (binary)	—	—	—	`scale_pos_weight`
sample_weight	Per-row weights passed to `fit()`	✓ (fit arg)	✓ (fit arg)	✓ (fit arg)	✓ (fit arg)
Early stopping
validation_fraction	Hold-out fraction for early-stop monitoring	—	—	`validation_fraction`	—
n_iter_no_change / early_stopping_rounds	Patience before stopping	—	—	`n_iter_no_change`	`early_stopping_rounds`
tol	Min improvement to count as progress	—	—	`tol`	—
Constraints (domain knowledge)
monotonic_cst	Force monotonic increase/decrease per feature	`monotonic_cst`	`monotonic_cst`	—	`monotone_constraints`
interaction_constraints	Restrict which features can interact	—	—	—	`interaction_constraints`
Compute / system
n_jobs	Parallel CPU threads	—	`n_jobs`	—	`n_jobs` / `nthread`
device / GPU	CPU vs GPU training	—	—	—	`device`
tree_method	Split-finding algorithm (exact / hist / approx)	—	—	—	`tree_method`
random_state	Reproducibility seed	`random_state`	`random_state`	`random_state`	`random_state`
warm_start	Incremental fitting (add more estimators)	—	`warm_start`	`warm_start`	—
verbose	Training progress output	—	`verbose`	`verbose`	`verbosity`
Missing values
missing handling	Native NaN handling at split time	manual (native in sklearn ≥ 1.3)	manual (native in sklearn ≥ 1.4)	manual	`missing` (native)
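The sklearn columns of this map can be spot-checked against whatever version is installed. The sketch below is a minimal illustration: the `supports` helper and the `models` dictionary are hypothetical names introduced here, and XGBoost is left out so the snippet depends only on scikit-learn.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Default-constructed estimators; get_params() lists every constructor argument.
models = {
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(),
    "GBM": GradientBoostingClassifier(),
}


def supports(param):
    """Hypothetical helper: report which models accept `param` in their constructor."""
    return {label: param in est.get_params() for label, est in models.items()}


for name in ["ccp_alpha", "class_weight", "learning_rate", "n_jobs", "subsample"]:
    print(name, supports(name))
```

Parameters that are passed to `fit()` instead, such as `sample_weight`, will not show up in `get_params()`.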

2. Tuning priority by model

Not all parameters matter equally — and a parameter that's critical for one model may be irrelevant for another. This table ranks each parameter by how much it actually affects performance in each model, with typical tuning ranges.

Legend: Critical = tune first, biggest impact · High = worth tuning, real gains · Medium = minor tuning, edge cases · Low = defaults usually fine · — = not applicable

Parameter	DT	RF	GBM	XGB	Typical range	Why it matters where it matters
n_estimators	—	Medium	Critical	Critical	RF: 200–500; boosting: 1000–5000 with early stopping	In RF, more trees help with diminishing returns. In boosting, the tree count acts as a regularizer: too many trees overfit, so use early stopping.
learning_rate	—	—	Critical	Critical	0.01–0.1	Lower lr + more trees nearly always beats higher lr + fewer trees. Pairs with n_estimators.
max_depth	Critical	Low	High	Critical	DT: 3–20 (CV) RF: None Boost: 3–8	DT's main overfit lever. RF wants deep trees (averaging fixes it). Boosting wants shallow trees (correction model).
min_samples_leaf / min_child_weight	High	Medium	Medium	High	1, 5, 10, 20+	Often more effective than max_depth for DT. In XGB it's in Hessian-space — the second tuning lever after depth.
min_samples_split	Medium	Low	Low	—	2, 10, 20	min_samples_leaf is usually the better lever; this one is redundant most of the time.
min_impurity_decrease / gamma	Low	Low	Low	High	XGB gamma: 0–5	In XGB this is built into the split-gain math and meaningfully prunes weak splits. Elsewhere mostly ignored.
max_features (per split)	Low	Critical	Medium	Medium	RF: sqrt, log2, 0.3–0.5 XGB colsample_bynode	Reducing it hurts a single tree but is the core of RF's value (decorrelation). Useful as light regularization in boosting.
colsample_bytree	—	—	—	High	0.6–1.0	XGB's per-tree feature sampling. Standard regularization knob for boosting on wider feature sets.
subsample / max_samples	—	Low	High	High	0.6–0.9	Stochastic GBM trick — adds row-level randomness, reduces overfitting, often improves generalization at low cost.
bootstrap	—	Low	—	—	True (default)	Almost always leave True. Turning it off trains every tree on the full dataset, so max_features becomes the only source of randomness and OOB scoring is unavailable.
reg_lambda (L2)	—	—	—	High	0.1–10 (log scale)	XGB-specific. Shrinks leaf weights directly. Effective when overfitting persists despite depth/subsampling.
reg_alpha (L1)	—	—	—	Medium	0, 0.01–1	Useful for very high-dim data or when you want sparsity in leaf magnitudes. Usually leave at 0.
ccp_alpha	Critical	Low	Low	—	Use pruning path	For a single DT this is the cleanest principled way to size the tree. Redundant in ensembles.
criterion	Low	Low	Low	Low	Default fine	Gini vs entropy, or squared_error vs friedman_mse: the choice rarely moves the needle. Spend cycles elsewhere.
splitter	Low	—	—	—	"best"	"random" only useful when hand-rolling ensembles; for a single DT always "best".
loss / objective	—	—	Medium	Medium	Match task	Defaults handle most cases. Choose carefully when you care about robust regression (Huber, absolute error) or ranking.
eval_metric	—	—	—	High	Match business metric	Drives early stopping — set this to what you actually care about (AUC vs logloss matters).
class_weight / scale_pos_weight	High	High	—	High	"balanced" or neg/pos ratio	Always relevant on imbalanced data. GBM uses sample_weight instead.
early_stopping	—	—	Critical	Critical	patience 20–50	Essentially tunes n_estimators for free. Always use in boosting; doesn't apply to bagging.
tree_method	—	—	—	Medium	"hist" for large data	Not accuracy tuning per se, but "hist" is 5–10× faster on large data with almost no accuracy loss.
oob_score	—	Medium	—	—	True (when useful)	Not performance tuning — but a free validation estimate, often replaces a CV loop.
monotonic constraints	Low	Low	—	Low	Domain-driven	Not about accuracy — about model trust, fairness, and regulatory compliance. Set when domain knowledge demands it.
random_state	Low	Low	Low	Low	Any fixed int	Set for reproducibility, but don't search over it — that's overfitting to seed.

3. Recommended tuning order per model

A practical sequence to follow when tuning each model — what to set first, second, third. Stop when accuracy plateaus.

DT Decision Tree

Grow tree fully, then use cost_complexity_pruning_path() to get candidate ccp_alphas
Cross-validate over ccp_alpha — this principled axis is usually enough
If you want a multi-dim search instead: max_depth × min_samples_leaf
Set class_weight="balanced" if imbalanced
Leave criterion, splitter, max_features at defaults
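The first two steps above can be sketched as follows. This is a minimal illustration on synthetic data (the `make_classification` settings and the 5-fold CV are assumptions, not recommendations); it grows a full tree, reads the candidate alphas off the pruning path, and cross-validates over them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with your own X, y.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)

# Step 1: the pruning path of a fully grown tree yields the candidate alphas.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
# Clip tiny negative values caused by floating-point error and drop duplicates.
alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))

# Step 2: cross-validate over ccp_alpha only.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": alphas},
    cv=5,
)
search.fit(X, y)

best_alpha = search.best_params_["ccp_alpha"]
print("best ccp_alpha:", best_alpha)
print("leaves after pruning:", search.best_estimator_.get_n_leaves())
```

On larger datasets the path can contain many alphas; subsampling them before the search is a reasonable shortcut.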

RF Random Forest

n_estimators: set to 300–500, don't tune
max_features: actually tune — try sqrt, log2, 0.3, 0.5
min_samples_leaf: try 1, 2, 5, 10
max_depth: usually leave at None; constrain only if overfitting
class_weight="balanced_subsample" if imbalanced
Use oob_score=True to skip a CV loop
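As a rough sketch of this sequence, the loop below fixes `n_estimators`, sweeps the `max_features` candidates listed above, and compares them by out-of-bag accuracy instead of a CV loop. The synthetic dataset and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; replace with your own X, y.
X, y = make_classification(n_samples=500, n_features=20, n_informative=6, random_state=0)

oob_by_max_features = {}
for max_features in ["sqrt", "log2", 0.3, 0.5]:
    rf = RandomForestClassifier(
        n_estimators=300,        # set once, not tuned
        max_features=max_features,
        oob_score=True,          # requires bootstrap=True, which is the default
        random_state=0,
    )
    rf.fit(X, y)
    oob_by_max_features[max_features] = rf.oob_score_

best_max_features = max(oob_by_max_features, key=oob_by_max_features.get)
print(oob_by_max_features)
print("best max_features by OOB accuracy:", best_max_features)
```

The same loop extends to `min_samples_leaf` by nesting a second sweep or switching to a randomized search.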

GBM sklearn Gradient Boosting

Set learning_rate=0.05, n_estimators=1000+
Enable early stopping: n_iter_no_change=20, validation_fraction=0.15
Tune max_depth: try 3, 5, 7
Tune min_samples_leaf: 1, 5, 20
Add stochasticity: subsample=0.8, max_features="sqrt"
Match loss to your task (huber for outliers, etc.)
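A minimal sketch of the settings above on synthetic regression data follows. The dataset and the specific `max_depth` and `min_samples_leaf` values are assumptions picked from the ranges listed; in practice those two would be searched.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data; replace with your own X, y.
X, y = make_regression(n_samples=600, n_features=10, noise=10.0, random_state=0)

gbm = GradientBoostingRegressor(
    learning_rate=0.05,
    n_estimators=1000,         # upper bound; early stopping picks the actual count
    n_iter_no_change=20,       # patience
    validation_fraction=0.15,  # internal hold-out used for early stopping
    max_depth=3,
    min_samples_leaf=5,
    subsample=0.8,
    max_features="sqrt",
    loss="huber",              # robust to outliers in the target
    random_state=0,
)
gbm.fit(X, y)

trees_used = gbm.n_estimators_
print("trees actually fitted:", trees_used)
```

If `trees_used` equals the `n_estimators` ceiling, early stopping never triggered and the ceiling is worth raising.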

XGB XGBoost

Set learning_rate=0.05, n_estimators=2000, early_stopping_rounds=50
Set eval_metric to your true target (e.g. "auc")
Tune max_depth (3–9) and min_child_weight (1–20)
Tune subsample and colsample_bytree (both 0.6–1.0)
Tune reg_lambda (0.1–10, log scale) and gamma (0–5)
Use tree_method="hist" for speed on large data
Set scale_pos_weight if imbalanced binary
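The starting configuration above can be collected into a plain dictionary, as in the sketch below. The helper name `xgb_starting_params` is hypothetical, and the mid-range values for `max_depth`, `subsample`, `colsample_bytree`, `reg_lambda`, and `gamma` are assumed starting points taken from the ranges listed, not tuned results. The snippet deliberately avoids importing XGBoost so it runs anywhere.

```python
from collections import Counter


def xgb_starting_params(y_binary):
    """Hypothetical helper: baseline XGBoost settings for an imbalanced binary task."""
    counts = Counter(y_binary)
    pos, neg = counts[1], counts[0]
    if pos == 0:
        raise ValueError("need at least one positive example")
    return {
        "learning_rate": 0.05,
        "n_estimators": 2000,          # ceiling; early stopping decides the real count
        "early_stopping_rounds": 50,
        "eval_metric": "auc",
        "max_depth": 6,                # assumed start; search 3 to 9
        "min_child_weight": 1,         # search 1 to 20
        "subsample": 0.8,              # search 0.6 to 1.0
        "colsample_bytree": 0.8,       # search 0.6 to 1.0
        "reg_lambda": 1.0,             # search 0.1 to 10 on a log scale
        "gamma": 0.0,                  # search 0 to 5
        "tree_method": "hist",
        "scale_pos_weight": neg / pos, # negative-to-positive ratio
    }


params = xgb_starting_params([0] * 90 + [1] * 10)
print(params["scale_pos_weight"])
```

Assuming a recent XGBoost release in which `early_stopping_rounds` and `eval_metric` are accepted by the estimator constructor, the dictionary would be unpacked as `XGBClassifier(**params)` and an `eval_set` passed to `fit()`.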

4. Trees flat-line beyond the training range

The single most misunderstood failure mode of tree-based models, random forests and gradient boosting included. It is not linear extrapolation. It is not graceful degradation. It is something cruder, and you need to know exactly what.

The principle. A decision tree predicts a constant within each leaf. Any input whose feature value lies beyond the training range falls into the boundary leaf (the leftmost or rightmost leaf for that feature) and receives that leaf's constant — not a scaled value, not an interpolation. The prediction plateaus at the edge of what the tree saw and stays flat forever after.

Worked example — promotion lift

A promo team has historical data on discount % vs sales lift %. The discounts have always been between 10–25 %. Marketing wants to try a 40 % flash sale and asks the model what to expect.

discount	training lift	note
10 %	5 %	seen
15 %	12 %	seen
20 %	18 %	seen
25 %	25 %	seen — training max
40 %	?	extrapolation request

What the tree learned:

discount < 12.5 → predict 5
12.5 ≤ discount < 17.5 → predict 12
17.5 ≤ discount < 22.5 → predict 18
discount ≥ 22.5 → predict 25 ← boundary leaf

Tree prediction at 40 %: 25 % lift

Linear regression at 40 %: ≈ 45 % lift

The tree predicts exactly the same lift for a 40 % promo as for a 25 % promo, because 40 % falls into the same boundary leaf. Linear regression on the same four points (slope = 1.32, intercept = −8.1) at least scales with the input: 1.32 × 40 − 8.1 ≈ 44.7.
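The worked example can be reproduced in a few lines. This sketch fits an unconstrained `DecisionTreeRegressor` and a `LinearRegression` on the four training points from the table and queries both at the training maximum and well beyond it; the extra query at 400 % is an assumption added to show that the plateau never ends.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# The four (discount %, lift %) points from the table above.
discount = np.array([[10.0], [15.0], [20.0], [25.0]])
lift = np.array([5.0, 12.0, 18.0, 25.0])

tree = DecisionTreeRegressor(random_state=0).fit(discount, lift)
line = LinearRegression().fit(discount, lift)

# Training maximum, the requested 40 % promo, and an absurdly large value.
query = np.array([[25.0], [40.0], [400.0]])
tree_pred = tree.predict(query)
line_pred = line.predict(query)

print("tree:  ", tree_pred)   # identical for every query at or past the boundary
print("linear:", line_pred)   # keeps scaling with the input
print("slope, intercept:", line.coef_[0], line.intercept_)
```

The unrounded least-squares fit has slope 1.32 and intercept −8.1, so the linear prediction at 40 % comes out at about 44.7.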

Visual — the step function flat-lines

Figure (not reproduced here): training data sits in the unshaded region (discount 10–25 %); the shaded zone is everything the tree never saw. Past the boundary, the tree's step function stays flat while the linear fit keeps rising.

Where this actually hurts you

Pricing & promotions. Trained on discounts up to 25 %; product team launches 40 %. Tree predicts the 25 %-leaf lift exactly. Marketing under-staffs the launch.
Trended time-series. Time-as-feature for a tree means time-index gets bucketed into leaves. Forecasts beyond the training window plateau at the last leaf — popularly described as "the model can't see growth." That's the same flat-extrapolation bug.
Demand forecasting beyond historical peaks. A retail tree trained on 2019–2023 caps its predictions at the 2023 peak — useless for modeling the 2024 holiday surge.
Sensor/regime shifts. Industrial sensor reading drifts past its training range (e.g., temperature exceeds historical maxima). Tree returns the boundary-leaf value as if nothing changed.
RUL / survival models. Asking a tree-based regressor "how many more cycles will this part survive?" when it has already outlasted all training examples — answer pegs at the training maximum.

Mitigations (in order of cleanliness)

Engineer features that don't drift. Use ratios, differences, percent-of-baseline, normalized values — anything bounded — so test inputs land inside training space. (Example: predict lift / baseline_traffic rather than absolute lift.)
Out-of-distribution detection. Track feature ranges in training; flag inputs outside [min, max] per feature and refuse to predict (or fall back to a simpler model).
Ensemble with a linear or monotonic component. Predict the linear trend + tree residual. The linear part extrapolates; the tree captures non-linear structure inside the training region.
XGBoost monotonic constraints (monotone_constraints). Forces the tree to be monotone in chosen features. Doesn't fix flat-extrapolation but at least prevents the absurd cases where a deeper boundary leaf decreases the prediction.
Retrain when feature distributions shift. If 40 %-off promos are now real, your training set must include them. Trees only know what they've literally seen.
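Two of the mitigations above, range-based out-of-distribution flagging and a linear trend plus tree residual, can be combined in one small wrapper. The `TrendPlusTree` class below is a hypothetical helper written for illustration, not a library API, and `max_depth=3` for the residual tree is an assumed setting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor


class TrendPlusTree:
    """Hypothetical helper: linear trend plus a tree fitted on the trend's residuals."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # Remember per-feature training ranges for out-of-range flagging.
        self.lo_, self.hi_ = X.min(axis=0), X.max(axis=0)
        self.trend_ = LinearRegression().fit(X, y)
        residual = y - self.trend_.predict(X)
        self.tree_ = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residual)
        return self

    def out_of_range(self, X):
        """True for rows where any feature lies outside its training [min, max]."""
        X = np.asarray(X, dtype=float)
        return ((X < self.lo_) | (X > self.hi_)).any(axis=1)

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # The linear part extrapolates; the tree part goes flat outside the range.
        return self.trend_.predict(X) + self.tree_.predict(X)


model = TrendPlusTree().fit([[10], [15], [20], [25]], [5, 12, 18, 25])
flags = model.out_of_range([[20], [40]])
preds = model.predict([[25], [40]])
print(flags)  # only the 40 % row is flagged
print(preds)
```

The flag is what lets a caller refuse to predict or fall back to a simpler model; the hybrid only makes the out-of-range answer less wrong, it does not make it trustworthy.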

Interview phrasing — get this exactly right. Trees do not "do linear interpolation outside the training range." They do flat extrapolation: each leaf is a constant, and the boundary leaf returns its constant value for any input past the boundary. Linear regression at least scales with the input; trees pin to the edge value forever. The distinction is small but it's a fast signal for whether the candidate has only read about trees or has actually had to debug one in production. The same pitfall applies to RandomForestRegressor, GradientBoostingRegressor, XGBRegressor, LGBMRegressor — averaging or boosting a bunch of step-function predictors still gives you a step-function predictor.
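The claim about ensembles is easy to check empirically. The sketch below trains a `RandomForestRegressor` and a `GradientBoostingRegressor` on synthetic data with a linear trend (the data-generating formula and noise level are assumptions) and compares predictions at the training boundary with predictions far beyond it.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Synthetic data: discounts in [10, 25) with a roughly linear lift plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(10, 25, size=(200, 1))
y = 1.3 * X[:, 0] - 8 + rng.normal(0, 1, size=200)

# One point at the upper edge of the range and two far outside it.
query = np.array([[25.0], [40.0], [100.0]])

ensembles = {
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    "GBM": GradientBoostingRegressor(random_state=0),
}
predictions = {}
for label, model in ensembles.items():
    model.fit(X, y)
    predictions[label] = model.predict(query)
    print(label, predictions[label])
```

Every split threshold lies below the largest training value, so all three queries follow the same path through every tree and each ensemble returns one repeated number.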