XGBoost vs LightGBM vs CatBoost

Gradient boosted decision trees still win Kaggle and still win most tabular ML problems we encounter in production. After deep learning ate everything else, GBDTs quietly remained the right answer for “I have a CSV of features and a target column.” The three credible libraries — XGBoost, LightGBM, and CatBoost — are all genuinely excellent. The choice is less about accuracy (similar within noise on most problems) and more about training speed, categorical feature handling, and ecosystem fit.

We’ve shipped all three on banking fraud detection, healthcare risk stratification, and supply chain demand forecasting. Here’s how the choice plays out when you’re past benchmark plots.

The thirty-second framing

XGBoost is the original. Battle-tested, broad ecosystem, good defaults. The reference implementation everyone benchmarks against.
LightGBM is Microsoft’s faster gradient booster. Histogram-based, leaf-wise tree growth. Often 2-5x faster training than XGBoost on the same data.
CatBoost is Yandex’s gradient booster optimized for categorical features. Handles categoricals natively without preprocessing, using ordered target statistics to avoid target leakage.

All three implement the same fundamental algorithm (gradient boosted trees). The differences are in: how they bin features, how they grow trees, how they handle missing values, and how they handle categoricals.

What’s actually different

Dimension	XGBoost	LightGBM	CatBoost
Training speed (1M rows)	Baseline	Often 2-5x faster	Slightly slower than LightGBM
Inference speed	Fast	Fast	Fast (symmetric trees evaluate quickly)
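Categorical features	One-hot or label encode yourself (native support via enable_categorical is newer)	Native (integer-coded categories); adequate, but can overfit high-cardinality columns	Native + ordered target statistics (best in class)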
Missing values	Native	Native	Native
Tree growth	Level-wise (depth-wise, balanced)	Leaf-wise (best-first, deeper trees)	Symmetric (oblivious trees)
GPU support	Mature	Mature	Mature
Distributed training	Spark/Dask/Ray	Spark/Dask	Spark
Default overfit behavior	Conservative	Aggressive (needs `num_leaves` tuning)	Conservative
Hyperparameters to tune	~6-8	~6-8	~4-5 (fewer needed)
Model size on disk	Medium	Medium	Larger (CatBoost stores categorical mappings)
Inference deployment	XGBoost C API, ONNX, Treelite	LightGBM C API, ONNX, Treelite	CatBoost C API, ONNX, CoreML, custom

Where XGBoost wins

Ecosystem breadth. The biggest body of tutorials, the most Stack Overflow answers, the most production deployments at scale. If your team needs to hire, XGBoost is the most likely existing skill.

Sklearn compatibility. XGBoost’s sklearn interface (XGBClassifier, XGBRegressor) is the cleanest of the three. Slots into existing sklearn pipelines without friction.

Conservative defaults. XGBoost is hard to overfit badly out of the box. LightGBM with default num_leaves=31 will overfit on small datasets if you’re not paying attention. For “first model on this dataset, just want a baseline,” XGBoost is forgiving.

SHAP integration. All three have SHAP, but XGBoost’s SHAP integration is most mature and best documented.

Where LightGBM wins

Training speed. On most datasets over 100k rows, LightGBM is meaningfully faster. For datasets where you’re iterating (hyperparameter search, feature engineering, time-budget constraints), the wall-clock difference compounds.

Memory efficiency. Histogram-based splits use less memory than XGBoost’s exact split-finding, which lets you fit larger datasets on the same hardware. XGBoost’s own histogram method (tree_method="hist") narrows this gap.
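
To make the histogram idea concrete, here is a minimal pure-Python sketch. It is not any library’s actual implementation: it buckets one feature into equal-width bins (real libraries bin more cleverly), accumulates gradient and hessian sums per bin, and scans the bin boundaries for the best split using the usual second-order gain. The function names and the equal-width binning are our own illustrative assumptions.

```python
def build_histogram(values, grads, hess, n_bins=16):
    """Accumulate gradient/hessian sums per equal-width bin (illustrative binning)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature
    g_hist = [0.0] * n_bins
    h_hist = [0.0] * n_bins
    for v, g, h in zip(values, grads, hess):
        b = min(int((v - lo) / width), n_bins - 1)
        g_hist[b] += g
        h_hist[b] += h
    edges = [lo + width * (i + 1) for i in range(n_bins)]  # upper edge of each bin
    return g_hist, h_hist, edges


def best_split(g_hist, h_hist, edges, reg_lambda=1.0):
    """Scan bin boundaries; return (gain, threshold), or (0.0, None) if nothing helps."""
    g_total, h_total = sum(g_hist), sum(h_hist)

    def score(g, h):
        return g * g / (h + reg_lambda)

    best = (0.0, None)
    g_left = h_left = 0.0
    # The last boundary is skipped because it would leave the right side empty.
    for b in range(len(g_hist) - 1):
        g_left += g_hist[b]
        h_left += h_hist[b]
        g_right, h_right = g_total - g_left, h_total - h_left
        gain = score(g_left, h_left) + score(g_right, h_right) - score(g_total, h_total)
        if gain > best[0]:
            best = (gain, edges[b])
    return best
```

The memory point falls out of the structure: once the histogram is built, the split search touches n_bins entries per feature rather than every sorted row.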

Large dataset performance. For >10M rows, LightGBM is usually the practical default. XGBoost works but takes meaningfully longer.

LightGBM’s bagging_fraction and feature_fraction. Built-in stochastic boosting: row subsampling equivalent to XGBoost’s subsample (it only takes effect when bagging_freq is greater than 0) plus per-tree column subsampling. feature_fraction_bynode covers the per-node case.

The catch: LightGBM’s defaults overfit small datasets. Tune num_leaves, min_data_in_leaf, and lambda_l2 more carefully than you would XGBoost.
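
Why num_leaves needs care is easier to see with the arithmetic written out. A minimal sketch, with helper names that are ours rather than LightGBM’s: a level-wise tree of depth d has at most 2^d leaves, while a leaf-wise tree with the same leaf budget can, in the worst case, grow into a single deep chain.

```python
def max_leaves_for_depth(max_depth):
    """A binary tree of depth `max_depth` has at most 2**max_depth leaves."""
    return 2 ** max_depth


def capped_num_leaves(num_leaves, max_depth):
    """Clamp a leaf budget so it stays below the full-tree leaf count at max_depth."""
    return min(num_leaves, max_leaves_for_depth(max_depth) - 1)


def worst_case_depth(num_leaves):
    """Each split adds one leaf, so a chain with `num_leaves` leaves is num_leaves - 1 deep."""
    return num_leaves - 1
```

With the default of 31 leaves, the worst-case depth is 30, far deeper than a depth-6 level-wise tree. On small datasets, pairing a lower num_leaves with an explicit max_depth keeps that in check.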

Where CatBoost wins

Categorical features. This is the real differentiator. CatBoost handles categorical features via ordered target statistics, which avoids the target leakage that naive target encoding causes. For datasets with high-cardinality categoricals (postal codes, user IDs, product IDs), CatBoost out-of-the-box often beats XGBoost or LightGBM with hand-crafted target encoding.
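
The core trick behind ordered target statistics fits in a few lines. The sketch below is a simplified illustration, not CatBoost’s implementation (which, among other things, uses several permutations): each row is encoded using only the targets of rows that come before it in a random order, smoothed toward a prior. The function name, prior_weight, and seed are assumptions made for the example.

```python
import random
from collections import defaultdict


def ordered_target_stats(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode each row from the targets of rows that precede it in a random order."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)

    sums = defaultdict(float)   # running target sum per category
    counts = defaultdict(int)   # running row count per category
    encoded = [0.0] * len(categories)
    for i in order:
        c = categories[i]
        # The row's own target is not used here; that is what avoids the leakage.
        encoded[i] = (sums[c] + prior_weight * prior) / (counts[c] + prior_weight)
        sums[c] += targets[i]
        counts[c] += 1
    return encoded
```

Naive target encoding would give a category that appears once its own label as the feature value. Here that row only ever sees the prior.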

Less hyperparameter tuning. CatBoost’s defaults are designed to “just work” without much tuning. For teams that don’t want to run extensive HPO, this is real value.

Symmetric (oblivious) trees. All splits at a given depth use the same feature/threshold. Faster inference; sometimes slightly worse accuracy. Trade-off favors latency-sensitive serving.
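
Because every level shares one split, evaluating a symmetric tree reduces to building a bit index into a flat table of leaf values. A minimal sketch follows; the dictionary layout is a hypothetical one chosen for the example, not CatBoost’s model format.

```python
def predict_oblivious(tree, x):
    """Evaluate one oblivious tree: one comparison per level builds the leaf index."""
    index = 0
    for feature, threshold in tree["splits"]:  # one (feature, threshold) pair per level
        index = (index << 1) | int(x[feature] > threshold)
    return tree["leaf_values"][index]


def predict_ensemble(trees, x):
    """Sum the raw outputs of all trees (before any link function)."""
    return sum(predict_oblivious(t, x) for t in trees)
```

A depth-2 tree has four leaves, so `tree = {"splits": [(0, 0.5), (1, 10.0)], "leaf_values": [0.1, 0.2, 0.3, 0.4]}` describes it completely. There is no pointer chasing and no data-dependent branching on tree structure, which is where the inference speed comes from.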

Built-in text features (rough). CatBoost has some text-feature support that XGBoost/LightGBM lack. Useful for “I have a free-form description column and don’t want to wire up a separate NLP pipeline.”

The catch: CatBoost has a smaller ecosystem. Fewer tutorials, fewer engineers know it deeply. The library’s API drifts more often.

Honest accuracy comparison

On standard benchmarks (tabular UCI / Kaggle datasets), the three are within ±1-2% of each other after careful hyperparameter tuning. The “which is most accurate” question is mostly noise.

What DOES differ accuracy-wise:

Datasets with many high-cardinality categoricals: CatBoost wins without preprocessing. With careful target encoding, XGBoost/LightGBM catch up.
Very large datasets with lots of features: LightGBM’s speed advantage means you can do more HPO in the same time budget, which often translates to slightly better tuned models.
Datasets with lots of missing values: All three handle them natively; differences are small.

For most production problems, the choice doesn’t move accuracy meaningfully. It moves training time, deployment ergonomics, and team productivity.

Production deployment patterns

All three have similar production paths:

Native inference: each library has a fast C inference API. Load model, predict. Sub-millisecond per prediction.
ONNX export: convert to ONNX, serve via ONNX Runtime. Comparable throughput, decoupled from training library version.
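Treelite + treelite-runtime: compile the model to a shared library for maximum performance. ~10x faster than native for some workloads. Underused. Applies to XGBoost and LightGBM models only, since Treelite does not import CatBoost models. From Treelite 4.0 onward the compiler lives in the separate TL2cgen project.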
MLflow / BentoML / SageMaker: wrap any of the above with a serving framework. See our piece on MLOps experiment tracking for tooling.

In production, we usually deploy as a containerized FastAPI service with the model loaded into memory. For latency-critical workloads (real-time fraud scoring at a bank), Treelite compilation is worth the effort.
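
Whether a given model actually hits sub-millisecond latency depends on hardware and ensemble size, so it is worth measuring before committing to a serving path. Below is a rough stdlib-only harness. `predict_fn` is a hypothetical stand-in for whichever single-row predict call your service wraps, and the warmup count is an arbitrary assumption.

```python
import statistics
import time


def measure_latency(predict_fn, rows, warmup=100):
    """Time single-row predictions; return p50/p99 latency in milliseconds."""
    for row in rows[:warmup]:  # warm caches before timing
        predict_fn(row)
    samples = []
    for row in rows:
        start = time.perf_counter()
        predict_fn(row)
        samples.append((time.perf_counter() - start) * 1000.0)
    cuts = statistics.quantiles(samples, n=100)  # 99 percentile cut points
    return {"p50_ms": cuts[49], "p99_ms": cuts[98], "n": len(samples)}
```

Run it against the native predictor and the compiled one on the same rows. The p99 is usually the number that decides whether the extra compilation step pays off.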

What we deploy by default

For a new tabular ML problem on a real production workload:

First model: XGBoost. Sklearn API, get a baseline in 30 lines, tune later. ✓ Default for fraud detection, hospital risk scoring, churn prediction.
Switch to LightGBM if: dataset is >5M rows, training time is in the critical path, or you’re doing extensive HPO and need to fit more experiments per hour.
Switch to CatBoost if: the dataset has high-cardinality categoricals (postal codes, customer IDs, product SKUs) and you don’t want to engineer target-encoded features yourself.

We rarely use all three on the same project. The migration cost (test set parity, monitoring integration, etc.) usually isn’t worth the marginal accuracy gain.

What you should tune for any of them

If you’re hyperparameter-tuning a GBDT, these are the levers that matter most:

Tree depth / num_leaves: shallow trees overfit less, deep trees fit harder. Start at depth 6-8 (XGBoost) or num_leaves 31-63 (LightGBM).
Learning rate: a lower learning rate with more trees usually generalizes better, at the cost of training time. Start at 0.05-0.1 and let early stopping pick the number of trees.
Subsample / colsample_bytree: stochastic boosting helps generalization. 0.7-0.9 range is the sweet spot.
Regularization (lambda, alpha): L2 and L1 penalties on leaf weights, respectively. Most underused lever.
Min child weight / min data in leaf: the minimum hessian sum (XGBoost’s min_child_weight) or row count (LightGBM’s min_data_in_leaf) required in a leaf. Raise this if overfitting.
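
As a rough map from the levers above to each library’s scikit-learn-style parameter names, here is a set of starting-point dictionaries. The parameter names are the libraries’ own. The specific values are illustrative assumptions: some are taken from the ranges above, the rest (tree counts, regularization strengths, minimum leaf sizes) are placeholders to tune.

```python
# Illustrative starting points only; tune against a validation set.
XGBOOST_PARAMS = {
    "max_depth": 6,
    "learning_rate": 0.05,
    "n_estimators": 2000,      # upper bound; early stopping should pick the real count
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "reg_lambda": 1.0,         # L2 penalty on leaf weights
    "reg_alpha": 0.0,          # L1 penalty on leaf weights
    "min_child_weight": 1.0,   # minimum hessian sum in a leaf
}

LIGHTGBM_PARAMS = {
    "num_leaves": 31,
    "learning_rate": 0.05,
    "n_estimators": 2000,
    "subsample": 0.8,          # alias of bagging_fraction
    "subsample_freq": 1,       # alias of bagging_freq; bagging is off when this is 0
    "colsample_bytree": 0.8,   # alias of feature_fraction
    "reg_lambda": 1.0,         # alias of lambda_l2
    "reg_alpha": 0.0,          # alias of lambda_l1
    "min_child_samples": 20,   # alias of min_data_in_leaf
}

CATBOOST_PARAMS = {
    "depth": 6,
    "learning_rate": 0.05,
    "iterations": 2000,
    "rsm": 0.8,                # fraction of features considered at each split selection
    "l2_leaf_reg": 3.0,        # L2 penalty; CatBoost has no L1 equivalent
}
```

Each dictionary can be unpacked into the matching estimator, for example `XGBClassifier(**XGBOOST_PARAMS)`, `LGBMClassifier(**LIGHTGBM_PARAMS)`, or `CatBoostClassifier(**CATBOOST_PARAMS)`.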

Skip the rest unless you have time. The defaults are good.
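
The early stopping mentioned under learning rate follows the same basic logic in all three libraries: track the validation loss per boosting round and stop once it has failed to improve for a set number of rounds. A library-independent sketch, where the function name and the default patience are our assumptions:

```python
def early_stopping_round(val_losses, patience=50):
    """Return the 0-based round with the best validation loss, stopping once
    `patience` consecutive rounds have failed to improve on it."""
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round
```

Patience trades training time against the risk of stopping on a noisy plateau. With a low learning rate, a larger patience is the safer default.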

The thing none of them solve

All three give you a model. None of them solve:

Feature engineering (still where most of the lift comes from)
Train/test split methodology (still where most teams fail subtly)
Concept drift detection in production (still your job; see our LLM observability piece for the tooling pattern, which applies here too)
Calibration (for probability outputs, you often need post-hoc calibration via Platt scaling or isotonic regression)
Threshold selection (depends on cost matrix, not on the model)
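
The last two items are linked: once probabilities are calibrated, the threshold follows from the cost matrix rather than from the model. A minimal sketch, assuming correct decisions cost nothing and that a false positive and a false negative each have one fixed cost. The function names are ours.

```python
def cost_optimal_threshold(cost_fp, cost_fn):
    """Flag positive when p * cost_fn > (1 - p) * cost_fp, i.e. when p exceeds
    cost_fp / (cost_fp + cost_fn). Only valid for calibrated probabilities."""
    return cost_fp / (cost_fp + cost_fn)


def empirical_best_threshold(probs, labels, cost_fp, cost_fn, steps=100):
    """Sweep thresholds on held-out data; return (threshold, total_cost) with the lowest cost."""
    best = (None, float("inf"))
    for i in range(steps + 1):
        t = i / steps
        cost = 0.0
        for p, y in zip(probs, labels):
            predicted_positive = p >= t
            if predicted_positive and y == 0:
                cost += cost_fp          # false positive
            elif not predicted_positive and y == 1:
                cost += cost_fn          # false negative
        if cost < best[1]:
            best = (t, cost)
    return best
```

If a missed fraud case costs nine times as much as a false alarm, the analytic threshold is 0.1, not 0.5. When the empirical sweep on held-out data disagrees sharply with the analytic value, that is often a sign the probabilities need calibrating first.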

Pick the library that minimizes friction for your team and dataset. Spend the saved time on the problem-shaped work above.

The pattern of patterns

Gradient boosters are a commodity. The differences between the three matter, but they don’t matter as much as the marketing suggests. Pick one your team is comfortable operating, stick with it for most problems, and switch only when one of the specific advantages (LightGBM speed, CatBoost categoricals) actually matters for the project.

The teams that ship reliable ML systems aren’t the ones obsessed with picking the optimal library. They’re the ones obsessed with the data quality, the train/test methodology, and the production monitoring.

The library is the easy part. The hard parts are data quality and drift detection. If you’re building a production ML system and want a second pair of eyes, our ML & MLOps team has shipped all three in production. Tell us about the workload.