878 Features and Still Learning: Building an ML Fight Predictor
Predicting UFC fight outcomes sounds like a fun weekend project until you actually try it. The model training part — the part most people think is the hard part — took maybe two days. The data pipeline, feature engineering, and calibration work that makes the predictions actually useful? That's been months of iterating, and we're still finding bugs that matter.
Here's what we've learned building a system that scrapes fighter data from seven sources, engineers 878 features per fight, trains an ensemble of three gradient boosting models, and tries to find betting value in prop markets.
The Data Problem
Before you can predict anything, you need data. UFC fight data is scattered across half a dozen websites, each with different coverage, different formats, and different ideas about what constitutes a "fight record."
The scraper system hits seven sources: ufcstats.com for official fight stats, Sherdog for career records, BestFightOdds for betting lines (moneyline and props), Tapology for fight-night details, ufc.com for official bios, FightMatrix for rankings history, and MMA Decisions for scorecards. Each source gets its own scraper module, and all of them feed into a SQLite database that currently holds around 75,000 fights and 29,000 fighters.
The scraper runs on a schedule — daily full scrapes, with a card monitor checking every ten minutes during fight week for late changes (last-minute replacements happen constantly in MMA). An APScheduler instance manages the timing, and the whole thing deploys on Railway with a persistent SQLite volume.
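The cadence logic can be sketched with the stdlib alone (the production system uses APScheduler; the names and intervals below are illustrative, not the project's actual configuration):

```python
from datetime import datetime, timedelta

# Hypothetical cadence constants -- illustrative, not the real config.
FULL_SCRAPE_INTERVAL = timedelta(days=1)
CARD_MONITOR_INTERVAL = timedelta(minutes=10)

def is_fight_week(now: datetime, event_date: datetime) -> bool:
    """During the 7 days before an event, the fast card monitor takes over."""
    return timedelta(0) <= event_date - now <= timedelta(days=7)

def next_monitor_interval(now: datetime, event_date: datetime) -> timedelta:
    # Outside fight week, fall back to the daily full-scrape cadence.
    if is_fight_week(now, event_date):
        return CARD_MONITOR_INTERVAL
    return FULL_SCRAPE_INTERVAL
```

In APScheduler terms, this maps to one daily cron job plus a 10-minute interval job that is only armed during fight week.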
We chose SQLite over PostgreSQL deliberately. The dataset fits comfortably in a single file, WAL mode handles concurrent reads from the API while the scraper writes, and Railway's persistent volumes make it work in production. No connection pooling, no separate database service, no credentials to manage. For this scale, it's the right tool.
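Enabling that setup is two pragmas at connection time (a minimal sketch; the filename is illustrative):

```python
import sqlite3

# WAL lets the API serve reads while the scraper writes: one writer,
# many concurrent readers, all against a single file.
conn = sqlite3.connect("ufc.db")           # "ufc.db" is an illustrative name
conn.execute("PRAGMA journal_mode=WAL;")   # persists across connections
conn.execute("PRAGMA busy_timeout=5000;")  # wait up to 5s on a locked write
```

The `busy_timeout` matters in practice: without it, a read that collides with a write commit fails immediately with "database is locked" instead of briefly waiting.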
Why 878 Features?
The feature count grew organically as we kept finding new signals. It started simple — win percentage, knockout rate, age, reach advantage — maybe 30 features. Then we added fight-by-fight stats: strikes landed per minute, takedown accuracy, submission attempts. That got us to about 100.
Then the real engineering started. Derived features compound the basic stats in ways that capture fighting dynamics:
```python
# Example: momentum features capture recent trajectory.
# Helpers like count_consecutive_wins() and finish_rate() live elsewhere
# in the feature module; fights are ordered most recent first.
from datetime import date
from statistics import mean

today = date.today()

def compute_momentum_features(fighter_fights: list[Fight]) -> dict:
    recent = fighter_fights[:5]  # last 5 fights
    return {
        "win_streak": count_consecutive_wins(fighter_fights),
        "recent_ko_rate": sum(1 for f in recent if f.method == "KO") / max(len(recent), 1),
        "avg_fight_time_recent": mean(f.total_time for f in recent) if recent else 0.0,
        "finish_rate_trend": finish_rate(recent) - finish_rate(fighter_fights),
        "days_since_last_fight": (today - recent[0].date).days if recent else 999,
    }
```

Each of the 67 feature modules computes a slice of the full feature vector. There are geographic features (travel distance, altitude change, timezone shift), stylistic matchup features (striker vs. grappler interactions), situational flags (debut, long layoff, short notice replacement), and historical calibration features that encode how well the fighter performs against different archetypes.
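The module pattern itself is simple. Here is a minimal sketch of a registry that assembles the full vector from independent modules, with collision checks so two modules can't silently claim the same feature name (the module names and features below are hypothetical, not the project's real ones):

```python
from typing import Callable

# Hypothetical registry: each module returns a named slice of the vector.
FEATURE_MODULES: list[Callable[[dict], dict]] = []

def feature_module(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
    FEATURE_MODULES.append(fn)
    return fn

@feature_module
def geographic_features(ctx: dict) -> dict:
    return {"travel_km": ctx.get("travel_km", 0.0)}

@feature_module
def situational_flags(ctx: dict) -> dict:
    return {"is_debut": int(ctx.get("prior_fights", 0) == 0)}

def build_feature_vector(ctx: dict) -> dict:
    vec: dict = {}
    for fn in FEATURE_MODULES:
        chunk = fn(ctx)
        # Fail loudly if two modules emit the same feature name.
        assert not vec.keys() & chunk.keys(), f"duplicate features in {fn.__name__}"
        vec.update(chunk)
    return vec
```

The collision assert is the important part: with 67 modules contributing to one 878-wide vector, a silent name clash would overwrite a feature and be nearly impossible to spot downstream.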
The feature count hit 878 and we stopped adding more — not because we ran out of ideas, but because we started hitting diminishing returns. The models' feature importance scores showed that maybe 200 features do most of the heavy lifting. The rest add noise that the ensemble averages out.
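One way to quantify that "200 features do most of the work" claim is cumulative importance coverage: rank features and keep the smallest prefix whose importances sum to some fraction of the total. A sketch (the importance dict would come from the trained models' feature importances):

```python
def features_covering(importances: dict[str, float], fraction: float = 0.95) -> list[str]:
    """Smallest prefix of importance-ranked features covering `fraction` of total."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    kept, acc = [], 0.0
    for name, imp in ranked:
        kept.append(name)
        acc += imp
        if acc >= fraction * total:
            break
    return kept
```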
The Ensemble: Three Models, One Vote
We train three gradient boosting models: XGBoost, LightGBM, and CatBoost. Each handles the same feature set but with different internal mechanics — different tree-building algorithms, different handling of categorical features, different regularization approaches.
The ensemble prediction is a weighted average of the three, with weights tuned via grid search on a held-out validation set. The win prediction model achieves about 69% accuracy on test data, with a Brier score of 0.208. For context, the betting market implied probabilities tend to sit around 70-72% accuracy, so the model is competitive but not beating the market on straight win/loss predictions.
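The blending and scoring steps are both one-liners; a minimal sketch (the weights shown are made-up, not the tuned values):

```python
def ensemble_predict(probs: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-model win probabilities; weights sum to 1."""
    return sum(weights[m] * probs[m] for m in probs)

def brier_score(preds: list[float], outcomes: list[int]) -> float:
    """Mean squared error between predicted probability and 0/1 outcome.

    Lower is better; 0.25 is what always predicting 0.5 scores."""
    return sum((p - y) ** 2 for p, y in zip(preds, outcomes)) / len(preds)
```

Grid search over the weight simplex then just means evaluating `brier_score` on held-out fights for each candidate weight triple and keeping the best.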
That's actually the key insight that shaped the whole project's direction.
The Moneyline Is (Mostly) Efficient
The win/loss market — who wins the fight — is priced efficiently enough that consistent profit is extremely difficult. Vegas employs sharp oddsmakers, and the public betting volume on main events pushes the lines close to true probability.
So we stopped trying to beat the moneyline and focused on prop markets instead.
Prop bets are wagers on specific outcomes within a fight: Will it end by knockout? Will it go the distance? Over/under 2.5 rounds? These markets are less liquid, less scrutinized, and — as it turns out — less efficiently priced.
The prop betting system trains separate models for method of victory (KO, submission, decision) and over/under round totals. Each model type has its own calibration pipeline:
- Method props: Per-class calibrated probabilities with optimized edge thresholds (5% for KO/SUB, 10% for decision). Backtests show +617 units of profit.
- Over/under props: Three-model ensemble with temperature scaling and logit offset. After finding and fixing a judge feature leak (features that used post-fight judge data that wouldn't be available pre-fight), this hit +56 units.
- Decision props: Re-enabled after threshold optimization, running at +36 units.
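The edge-threshold logic behind those numbers reduces to comparing the model's calibrated probability against the market's implied probability. A sketch, using standard American-odds conversion (the threshold values mirror the ones above):

```python
def implied_prob(american_odds: int) -> float:
    """Convert American odds to implied probability (vig included)."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)

def has_edge(model_prob: float, american_odds: int, threshold: float) -> bool:
    # Bet only when the model beats the market's implied probability
    # by at least the per-market threshold (e.g. 0.05 KO/SUB, 0.10 decision).
    return model_prob - implied_prob(american_odds) >= threshold
```

The per-market thresholds exist because a small edge in a noisy, poorly calibrated market is more likely to be model error than real value, so decision props demand a wider margin.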
The judge feature leak is worth highlighting because it's a classic ML trap. Two features — diff_judge_bias_x_closeness and diff_judge_consistency_x_style — had 34% and 32% feature importance respectively in the over/under model. They looked great in training. Problem: judges aren't assigned until fight night, so these features are zero at prediction time. The model was essentially memorizing training data through information that wouldn't exist when it mattered. Removing them dropped the Brier score by 3.1% and turned the backtest from -15 units to +56 units.
Why 763 Tests Matter in ML
Testing ML systems is different from testing web applications. You're not just checking that functions return expected values — you're validating that data transformations are correct, that feature computations handle edge cases, and that the pipeline produces consistent results across runs.
Our test suite covers:
- Scraper parsing: Does each source's HTML get parsed into the right schema? When a fighter has no photo or a missing record, does it degrade gracefully?
- Feature computation: For known fighter matchups with known stats, do the 878 features compute to expected values? This catches regressions when we refactor feature modules.
- Pipeline integrity: Does a full run from raw data to predictions produce the same output deterministically? Floating point tolerance matters here.
- Edge cases: Fighters with one fight. Fighters who changed weight classes. Cards that got reshuffled at the last minute. Fights that ended in no-contest.
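The determinism check is worth showing concretely, because naive equality fails on floats: tree ensembles can produce bit-level differences across runs from summation order alone. A minimal sketch of the tolerance-aware comparison:

```python
import math

def assert_predictions_stable(run_a: list[float], run_b: list[float],
                              rel_tol: float = 1e-9) -> None:
    """Two pipeline runs on identical inputs must agree within float tolerance."""
    assert len(run_a) == len(run_b), "prediction counts differ between runs"
    for i, (a, b) in enumerate(zip(run_a, run_b)):
        assert math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-12), \
            f"prediction {i} drifted: {a} vs {b}"
```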
Without this test coverage, we'd have no confidence that a refactor to the feature pipeline didn't subtly break predictions. ML bugs don't throw exceptions — they just make your model slightly worse in ways that are hard to notice until you've lost money.
Honest About the Limits
The backtests look great: +555 units across all prop types at optimized thresholds. Our first eight settled live bets went 4-4 for +46 units of profit (one big underdog bet at +2600 carried the day).
But here's what we tell anyone who asks: backtests are not live results. Optimized thresholds are fit to historical data. The live results are a sample size of eight, which is statistically meaningless. We're tracking everything — every bet logged to the database with the model's predicted probability, the market odds, and the actual outcome — so that over hundreds of bets, we'll know whether the edge is real or an artifact of overfitting.
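The bet log itself is just another table in the same SQLite database. A sketch of the shape (the column names here are assumptions for illustration, not the project's actual schema):

```python
import sqlite3

# Illustrative schema -- column names are assumptions, not the real table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bet_log (
        fight_id      TEXT,
        market        TEXT,     -- e.g. 'ko', 'over_2.5'
        model_prob    REAL,     -- calibrated probability at bet time
        american_odds INTEGER,  -- market price actually taken
        stake_units   REAL,
        outcome       INTEGER   -- 1 = won, 0 = lost, NULL until settled
    )
""")
conn.execute(
    "INSERT INTO bet_log VALUES (?, ?, ?, ?, ?, ?)",
    ("hypothetical_fight_id", "ko", 0.41, 260, 1.0, None),
)
```

Logging the model probability and the odds taken at bet time, not just the result, is what makes later calibration analysis possible: over hundreds of bets you can check whether outcomes match the probabilities the model claimed.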
ML is probabilistic, not certain. A 70% accurate model is wrong three times out of ten. A profitable betting system can have losing months. The discipline is in the process: consistent methodology, honest evaluation, and the willingness to shut it down if the live numbers don't match the backtests.
The most valuable thing this project taught us isn't about machine learning — it's about the difference between building something that works in a notebook and building something that works in production, on a schedule, with real money on the line. Those are very different engineering problems.
What's Next
The method prediction model has a Brier score of 0.402 with significant calibration gaps — the model under-predicts finishes by 13% for knockouts and 8% for submissions. That's the biggest opportunity for improvement. We're running autoresearch loops to tune the calibration, and if we can close those gaps, the prop betting edge should widen.
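Those gap numbers come from comparing observed frequency against mean predicted probability per class. A minimal sketch of that measurement:

```python
def calibration_gap(preds: list[float], outcomes: list[int]) -> float:
    """Observed frequency minus mean predicted probability for one class.

    Positive means the model under-predicts the class -- e.g. knockouts
    happening more often than the model forecasts."""
    observed = sum(outcomes) / len(outcomes)
    predicted = sum(preds) / len(preds)
    return observed - predicted
```

Closing the gap is then a matter of post-hoc calibration (temperature scaling, per-class offsets) rather than retraining, since the ranking of predictions is already decent.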
We're also exploring per-fight adaptive blending — using fight-specific features to adjust the ensemble weights dynamically rather than using static weights across all predictions. A heavyweight slugfest and a flyweight technical match probably shouldn't weight the three models identically.
878 features, and we're still learning.
Building an ML product? See our UFC Predictor case study or book a free call to discuss your project.