Tree-based model hyperparameters

A side-by-side reference of decision trees, random forests, gradient boosting (sklearn), and XGBoost — grouped by functional category, with a separate tuning-priority view that shows which knobs actually move the needle for each model.

DT = Decision Tree · RF = Random Forest · GBM = sklearn Gradient Boosting · XGB = XGBoost

1. Hyperparameter map across models

Every common hyperparameter, grouped by what it actually does. Empty cells mean the parameter doesn't exist in that model. Where names differ across libraries, the model-specific name is shown.

Parameter | Purpose | DT | RF | GBM | XGB

Core / ensemble structure
n_estimators | Number of trees in the ensemble | | n_estimators | n_estimators | n_estimators
learning_rate | Step size scaling each tree's contribution | | | learning_rate | learning_rate / eta
booster type | Base learner family (tree / linear / dart) | | | | booster
init estimator / F₀ | Initial constant prediction baseline | | | init | base_score

Splitting criterion
split criterion | Impurity / quality measure for picking splits | criterion | criterion | criterion | objective (built into loss)
splitter strategy | Best vs random feature/threshold search | splitter | | | tree_method (hist/exact/approx)
grow policy | Depth-wise vs best-first leaf expansion | | | | grow_policy

Pre-pruning / tree shape
max_depth | Maximum tree depth | max_depth | max_depth | max_depth | max_depth
max_leaf_nodes / max_leaves | Cap on total leaves (alternative to depth) | max_leaf_nodes | max_leaf_nodes | max_leaf_nodes | max_leaves
min_samples_split | Min samples required to consider a split | min_samples_split | min_samples_split | min_samples_split |
min_samples_leaf / min_child_weight | Min samples / Hessian sum required per leaf | min_samples_leaf | min_samples_leaf | min_samples_leaf | min_child_weight
min_weight_fraction_leaf | Weight-based leaf size constraint | min_weight_fraction_leaf | min_weight_fraction_leaf | min_weight_fraction_leaf |
min_impurity_decrease / gamma | Min gain required to allow a split | min_impurity_decrease | min_impurity_decrease | min_impurity_decrease | gamma

Feature subsampling
max_features | Features considered per split | max_features | max_features | max_features | colsample_bynode
features per tree | Features sampled once per tree | | | | colsample_bytree
features per level | Features sampled per tree depth level | | | | colsample_bylevel

Row subsampling (stochasticity)
bootstrap | Sample rows with replacement per tree | | bootstrap | |
max_samples / subsample | Fraction of rows used per tree | | max_samples | subsample | subsample
oob_score | Use out-of-bag samples for free validation | | oob_score | |

Regularization on leaf values
L2 on leaf weights | Shrinks leaf magnitudes toward zero | | | | reg_lambda
L1 on leaf weights | Sparsity on leaf values | | | | reg_alpha
cost-complexity pruning | Post-pruning by leaf cost | ccp_alpha | ccp_alpha | ccp_alpha |
max_delta_step | Caps leaf weight change per update (imbalance) | | | | max_delta_step

Loss / objective
loss function | What's being optimized | | | loss | objective
eval metric | Metric monitored during training | | | | eval_metric

Class balance
class_weight | Per-class weighting in impurity | class_weight | class_weight | |
scale_pos_weight | Gradient scaling for positive class (binary) | | | | scale_pos_weight
sample_weight | Per-row weights passed to fit() | ✓ (fit arg) | ✓ (fit arg) | ✓ (fit arg) | ✓ (fit arg)

Early stopping
validation_fraction | Hold-out fraction for early-stop monitoring | | | validation_fraction |
n_iter_no_change / early_stopping_rounds | Patience before stopping | | | n_iter_no_change | early_stopping_rounds
tol | Min improvement to count as progress | | | tol |

Constraints (domain knowledge)
monotonic_cst | Force monotonic increase/decrease per feature | monotonic_cst | monotonic_cst | | monotone_constraints
interaction_constraints | Restrict which features can interact | | | | interaction_constraints

Compute / system
n_jobs | Parallel CPU threads | | n_jobs | | n_jobs / nthread
device / GPU | CPU vs GPU training | | | | device
tree_method | Split-finding algorithm (exact / hist / approx) | | | | tree_method
random_state | Reproducibility seed | random_state | random_state | random_state | random_state
warm_start | Incremental fitting (add more estimators) | | warm_start | warm_start |
verbose | Training progress output | | verbose | verbose | verbosity

Missing values
missing handling | Native NaN handling at split time | native since scikit-learn 1.3; manual before | native since scikit-learn 1.4; manual before | manual | missing (native)

2. Tuning priority by model

Not all parameters matter equally — and a parameter that's critical for one model may be irrelevant for another. This table ranks each parameter by how much it actually affects performance in each model, with typical tuning ranges.

Critical: tune first, biggest impact · High: worth tuning, real gains · Medium: minor tuning, edge cases · Low: defaults usually fine · Not applicable
Parameter | DT | RF | GBM | XGB | Typical range | Why it matters where it matters
n_estimators | | Medium | Critical | Critical | RF: 200–500; Boost: 1000–5000 + ES (early stopping) | In RF, more is better with diminishing returns. In boosting it acts as the regularizer: too many trees overfit, so use early stopping.
learning_rate | | | Critical | Critical | 0.01–0.1 | Lower lr + more trees nearly always beats higher lr + fewer trees. Pairs with n_estimators.
max_depth | Critical | Low | High | Critical | DT: 3–20 (CV); RF: None; Boost: 3–8 | DT's main overfit lever. RF wants deep trees (averaging fixes it). Boosting wants shallow trees (correction model).
min_samples_leaf / min_child_weight | High | Medium | Medium | High | 1, 5, 10, 20+ | Often more effective than max_depth for DT. In XGB it is measured in Hessian space and is the second tuning lever after depth.
min_samples_split | Medium | Low | Low | | 2, 10, 20 | min_samples_leaf is usually the better lever; this one is redundant most of the time.
min_impurity_decrease / gamma | Low | Low | Low | High | XGB gamma: 0–5 | In XGB this is built into the split-gain math and meaningfully prunes weak splits. Elsewhere mostly ignored.
max_features (per split) | Low | Critical | Medium | Medium | RF: sqrt, log2, 0.3–0.5; XGB: colsample_bynode | Reducing it hurts a single tree but is the core of RF's value (decorrelation). Useful as light regularization in boosting.
colsample_bytree | | | | High | 0.6–1.0 | XGB's per-tree feature sampling. Standard regularization knob for boosting on wider feature sets.
subsample / max_samples | | Low | High | High | 0.6–0.9 | The stochastic GBM trick: adds row-level randomness, reduces overfitting, often improves generalization at low cost.
bootstrap | | Low | | | True (default) | Almost always leave True. Turning it off trains every tree on the full dataset, so feature subsampling becomes the only source of diversity. (Extra Trees also defaults to no bootstrap, but additionally randomizes split thresholds.)
reg_lambda (L2) | | | | High | 0.1–10 (log scale) | XGB-specific. Shrinks leaf weights directly. Effective when overfitting persists despite depth/subsampling.
reg_alpha (L1) | | | | Medium | 0, 0.01–1 | Useful for very high-dim data or when you want sparsity in leaf magnitudes. Usually leave at 0.
ccp_alpha | Critical | Low | Low | | Use pruning path | For a single DT this is the cleanest principled way to size the tree. Redundant in ensembles.
criterion | Low | Low | Low | Low | Default fine | Gini vs entropy / squared_error vs friedman_mse: rarely move the needle. Spend cycles elsewhere.
splitter | Low | | | | "best" | "random" is only useful when hand-rolling ensembles; for a single DT always use "best".
loss / objective | | | Medium | Medium | Match task | Defaults handle most cases. Choose carefully when you care about robust regression (huber, absolute error) or ranking.
eval_metric | | | | High | Match business metric | Drives early stopping, so set this to what you actually care about (AUC vs logloss matters).
class_weight / scale_pos_weight | High | High | | High | "balanced" or neg/pos ratio | Always relevant on imbalanced data. GBM uses sample_weight instead.
early_stopping | | | Critical | Critical | patience 20–50 | Essentially tunes n_estimators for free. Always use in boosting; doesn't apply to bagging.
tree_method | | | | Medium | "hist" for large data | Not accuracy tuning per se, but "hist" is often 5–10× faster on large data with almost no accuracy loss.
oob_score | | Medium | | | True (when useful) | Not performance tuning, but a free validation estimate that often replaces a CV loop.
monotonic constraints | Low | Low | | Low | Domain-driven | Not about accuracy but about model trust, fairness, and regulatory compliance. Set when domain knowledge demands it.
random_state | Low | Low | Low | Low | Any fixed int | Set for reproducibility, but don't search over it: that's overfitting to the seed.
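
The "balanced" and neg/pos ratio entries in the class_weight / scale_pos_weight row can be computed by hand. The sketch below is a minimal illustration in plain Python; the helper names and the 90/10 toy labels are assumptions made for this example. The "balanced" formula is n_samples / (n_classes × class count), which is the heuristic scikit-learn documents for class_weight="balanced".

```python
from collections import Counter


def balanced_class_weights(labels):
    """Per-class weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def neg_pos_ratio(labels):
    """Ratio of negatives (0) to positives (1), a common scale_pos_weight value."""
    counts = Counter(labels)
    return counts[0] / counts[1]


# Toy imbalanced labels: 90 negatives, 10 positives (illustrative only).
y = [0] * 90 + [1] * 10

class_weights = balanced_class_weights(y)
ratio = neg_pos_ratio(y)

# For GBM, which has no class_weight, the same weights can be expanded
# into a per-row list and passed as sample_weight to fit().
row_weights = [class_weights[label] for label in y]

print(class_weights)  # {0: 0.555..., 1: 5.0}
print(ratio)          # 9.0
```

With these weights each class contributes the same total weight (here 50 each), which is what "balanced" aims for.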

3. Recommended tuning order per model

A practical sequence to follow when tuning each model — what to set first, second, third. Stop when accuracy plateaus.

DT Decision Tree

  1. Grow tree fully, then use cost_complexity_pruning_path() to get candidate ccp_alphas
  2. Cross-validate over ccp_alpha — this principled axis is usually enough
  3. If you want a multi-dim search instead: max_depth × min_samples_leaf
  4. Set class_weight="balanced" if imbalanced
  5. Leave criterion, splitter, max_features at defaults
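
Steps 1 and 2 can be sketched with scikit-learn as follows. This is a minimal illustration, not a definitive recipe: the synthetic dataset from make_classification, the 5-fold setting, and the variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=5, random_state=0
)

# Step 1: the pruning path grows the full tree internally and returns the
# effective alphas at which subtrees would be pruned away.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Drop the last alpha (it prunes down to the root) and clip tiny negative
# values that can appear from floating-point error.
candidate_alphas = sorted(set(path.ccp_alphas[:-1].clip(min=0.0).tolist()))

# Step 2: cross-validate over ccp_alpha alone.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": candidate_alphas},
    cv=5,
)
search.fit(X, y)

best_alpha = search.best_params_["ccp_alpha"]
best_tree = search.best_estimator_
print(best_alpha, best_tree.get_n_leaves(), round(search.best_score_, 3))
```

On large datasets the path can contain many alphas; subsampling the candidate list before the grid search is a reasonable shortcut.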

RF Random Forest

  1. n_estimators: set to 300–500, don't tune
  2. max_features: actually tune — try sqrt, log2, 0.3, 0.5
  3. min_samples_leaf: try 1, 2, 5, 10
  4. max_depth: usually leave at None; constrain only if overfitting
  5. class_weight="balanced_subsample" if imbalanced
  6. Use oob_score=True to skip a CV loop
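
A minimal sketch of steps 1, 2, and 6 together: fix n_estimators, try the listed max_features values, and compare them on the out-of-bag score instead of a CV loop. The synthetic data and the dictionary name are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(
    n_samples=800, n_features=20, n_informative=6, random_state=0
)

oob_by_setting = {}
for max_features in ["sqrt", "log2", 0.3, 0.5]:
    forest = RandomForestClassifier(
        n_estimators=300,        # fixed, not tuned
        max_features=max_features,
        bootstrap=True,          # required for out-of-bag scoring
        oob_score=True,
        random_state=0,
    )
    forest.fit(X, y)
    # oob_score_ is accuracy on rows each tree did not see during training.
    oob_by_setting[max_features] = forest.oob_score_

best_max_features = max(oob_by_setting, key=oob_by_setting.get)
print(oob_by_setting)
print(best_max_features)
```

The OOB estimate is only available with bootstrap=True, so this shortcut does not carry over to configurations that disable bootstrapping.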

GBM sklearn Gradient Boosting

  1. Set learning_rate=0.05, n_estimators=1000+
  2. Enable early stopping: n_iter_no_change=20, validation_fraction=0.15
  3. Tune max_depth: try 3, 5, 7
  4. Tune min_samples_leaf: 1, 5, 20
  5. Add stochasticity: subsample=0.8, max_features="sqrt"
  6. Match loss to your task (huber for outliers, etc.)
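
Steps 1, 2, and 5 can be sketched as below. The dataset is synthetic and the specific max_depth is one of the listed candidates, both chosen only for illustration. With n_iter_no_change set, scikit-learn holds out validation_fraction of the training rows internally and records the number of trees actually fitted in n_estimators_.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=6, random_state=0
)

gbm = GradientBoostingClassifier(
    learning_rate=0.05,
    n_estimators=1000,         # upper bound; early stopping picks the real count
    n_iter_no_change=20,       # patience
    validation_fraction=0.15,  # internal hold-out used for monitoring
    max_depth=3,
    subsample=0.8,             # row-level stochasticity
    max_features="sqrt",       # feature-level stochasticity
    random_state=0,
)
gbm.fit(X, y)

trees_used = gbm.n_estimators_
print(trees_used)
```

If trees_used equals the n_estimators cap, early stopping never triggered and the cap is worth raising.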

XGB XGBoost

  1. Set learning_rate=0.05, n_estimators=2000, early_stopping_rounds=50
  2. Set eval_metric to your true target (e.g. "auc")
  3. Tune max_depth (3–9) and min_child_weight (1–20)
  4. Tune subsample and colsample_bytree (both 0.6–1.0)
  5. Tune reg_lambda (0.1–10, log scale) and gamma (0–5)
  6. Use tree_method="hist" for speed on large data
  7. Set scale_pos_weight if imbalanced binary
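
One way to follow this order in code is a greedy, stage-by-stage grid search: fix the step 1 and 2 settings, then tune each pair of parameters in turn while keeping earlier winners. The sketch below is plain Python and does not call XGBoost itself; staged_search, STAGES, and toy_score are hypothetical names for this example, and the grid values are picked from the ranges above. In practice score_fn would fit a model with the given parameters on a training split and return the validation metric.

```python
from itertools import product

# Steps 1, 2, and 6: fixed settings, not searched.
BASE_PARAMS = {
    "learning_rate": 0.05,
    "n_estimators": 2000,
    "early_stopping_rounds": 50,
    "eval_metric": "auc",
    "tree_method": "hist",
}

# Steps 3 to 5: one small grid per stage, tuned in this order.
STAGES = [
    {"max_depth": [3, 5, 7, 9], "min_child_weight": [1, 5, 10, 20]},
    {"subsample": [0.6, 0.8, 1.0], "colsample_bytree": [0.6, 0.8, 1.0]},
    {"reg_lambda": [0.1, 1.0, 10.0], "gamma": [0.0, 1.0, 5.0]},
]


def staged_search(score_fn, base=BASE_PARAMS, stages=STAGES):
    """Greedy search: tune one stage at a time, keeping earlier winners.

    score_fn(params) must return a number where higher is better.
    """
    best = dict(base)
    for grid in stages:
        names = list(grid)
        candidates = [
            dict(zip(names, values)) for values in product(*grid.values())
        ]
        winner = max(candidates, key=lambda cand: score_fn({**best, **cand}))
        best.update(winner)
    return best


def toy_score(params):
    """Stand-in for a validation score (hypothetical, for demonstration)."""
    depth_penalty = abs(params.get("max_depth", 6) - 5)
    row_penalty = abs(params.get("subsample", 1.0) - 0.8)
    return -depth_penalty - row_penalty


tuned = staged_search(toy_score)
print(tuned)
```

A staged search is cheaper than a full joint grid but can miss interactions between stages, so a final joint check around the winning values is a sensible follow-up.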

4. Trees flat-line beyond the training range

The single most-misunderstood failure mode of tree-based models — including random forests and gradient boosting. It is not linear interpolation. It is not graceful degradation. It is something cruder, and you need to know exactly what.

The principle. A decision tree predicts a constant within each leaf. Any input whose feature value lies beyond the training range falls into the boundary leaf (the leftmost or rightmost leaf for that feature) and receives that leaf's constant — not a scaled value, not an interpolation. The prediction plateaus at the edge of what the tree saw and stays flat forever after.

Worked example — promotion lift

A promo team has historical data on discount % vs sales lift %. Discounts have always been between 10 % and 25 %. Marketing wants to try a 40 % flash sale and asks the model what to expect.
discount | training lift | note
10 % | 5 % | seen
15 % | 12 % | seen
20 % | 18 % | seen
25 % | 25 % | seen (training max)
40 % | ? | extrapolation request
What the tree learned:
  • discount < 12.5 → predict 5
  • 12.5 ≤ discount < 17.5 → predict 12
  • 17.5 ≤ discount < 22.5 → predict 18
  • discount ≥ 22.5 → predict 25  ← boundary leaf
Tree prediction at 40 %: 25 % lift
Linear regression at 40 %: 44 % lift

The tree predicts exactly the same lift for a 40 % promo as for a 25 % promo — because 40 % falls into the same boundary leaf. Linear regression on the same four points (slope ≈ 1.3, intercept ≈ −8) at least scales with the input: 1.3 × 40 − 8 ≈ 44.
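
The arithmetic can be checked with a few lines of plain Python. This is a minimal sketch: tree_predict hard-codes the four leaves listed above, and fit_line is an ordinary least-squares fit written out by hand; both function names are made up for this example. With unrounded coefficients the line has slope 1.32 and intercept −8.1, so the exact prediction at 40 % is about 44.7, which the text rounds to 44.

```python
discounts = [10.0, 15.0, 20.0, 25.0]
lifts = [5.0, 12.0, 18.0, 25.0]


def tree_predict(discount):
    """The four-leaf step function from the worked example."""
    if discount < 12.5:
        return 5.0
    if discount < 17.5:
        return 12.0
    if discount < 22.5:
        return 18.0
    return 25.0  # boundary leaf: everything at or above 22.5 lands here


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(discounts, lifts)
linear_at_40 = slope * 40 + intercept

print(tree_predict(25), tree_predict(40), tree_predict(90))  # 25.0 25.0 25.0
print(round(slope, 2), round(intercept, 1), round(linear_at_40, 1))  # 1.32 -8.1 44.7
```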

Visual — the step function flat-lines

Training data is in the unshaded region (discount 10–25 %). The shaded zone is everything the tree never saw. Watch what each model does past the boundary.
[Figure: sales lift % (y-axis) vs discount % (x-axis). Past the training max at 25 % discount, the tree's step function stays flat at 25 % lift while the linear regression line keeps rising to 44 % at a 40 % discount, a 19-point gap inside the extrapolation zone the tree never saw.]

Where this actually hurts you

Mitigations (in order of cleanliness)

Interview phrasing — get this exactly right. Trees do not "do linear interpolation outside the training range." They do flat extrapolation: each leaf is a constant, and the boundary leaf returns its constant value for any input past the boundary. Linear regression at least scales with the input; trees pin to the edge value forever. The distinction is small but it's a fast signal for whether the candidate has only read about trees or has actually had to debug one in production. The same pitfall applies to RandomForestRegressor, GradientBoostingRegressor, XGBRegressor, LGBMRegressor — averaging or boosting a bunch of step-function predictors still gives you a step-function predictor.
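
The claim about ensembles is easy to verify empirically. The sketch below fits three scikit-learn regressors on a single feature restricted to 10–25 and then predicts at 25, 40, and 400. The synthetic linear data, noise level, and model settings are assumptions chosen for illustration. Because every split threshold lies inside the training range, all three query points fall into the same leaves and receive the same prediction.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data: one feature in [10, 25), roughly linear target
# (assumption for illustration; it mirrors the promo example's range).
rng = np.random.default_rng(0)
X_train = rng.uniform(10, 25, size=(200, 1))
y_train = 1.3 * X_train[:, 0] - 8 + rng.normal(0, 1.0, size=200)

# Query points at and far beyond the edge of the training range.
X_beyond = np.array([[25.0], [40.0], [400.0]])

models = {
    "tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gbm": GradientBoostingRegressor(n_estimators=200, random_state=0),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_beyond)
    print(name, predictions[name])
```

Each printed row contains three identical values: the prediction at 400 is the same as at 25, regardless of whether the model is a single tree, a bagged ensemble, or a boosted one.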