The Problem
Predicting fight outcomes sounds simple until you actually try it. The moneyline market — who wins — is near-efficient. Vegas is good at setting lines for the winner, and trying to beat them consistently on moneylines is a losing game for most models. But prop markets — how someone wins (KO, submission, decision), when the fight ends, and over/under rounds — are much less efficient. The data is messier, the markets are thinner, and sportsbooks spend less effort pricing them precisely.
We wanted to build a system that could find value in those prop markets by engineering more features from more sources than a typical bettor has access to, and combining multiple model architectures to avoid the weaknesses of any single one.
The Approach
The system scrapes 7 data sources: ufcstats.com (official fight stats), Sherdog (fighter records and history), BestFightOdds (moneyline and prop lines), Tapology (rankings and event data), ufc.com (official profiles), FightMatrix (algorithmic rankings), and MMADecisions (judge scorecards). All of this feeds into a SQLite database with WAL mode — 75,200 fights and 29,000 fighters at last count.
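Much of the scraping work is normalizing inconsistent formats across those 7 sources. As a small illustrative sketch (this is a hypothetical helper, not the actual scraper code), parsing a Sherdog-style record string into structured fields might look like:

```python
import re

def parse_record(record: str) -> dict:
    """Parse a record string like '24-3-1' or '24-3-1 (1 NC)' into
    wins/losses/draws. Hypothetical helper for illustration only."""
    match = re.match(r"(\d+)-(\d+)-(\d+)", record.strip())
    if not match:
        raise ValueError(f"unrecognized record format: {record!r}")
    wins, losses, draws = (int(g) for g in match.groups())
    return {"wins": wins, "losses": losses, "draws": draws}
```

Dozens of small normalizers like this sit between the raw HTML and the SQLite tables.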
From that raw data, 67 feature modules generate 878 features per fight. This is where the engineering challenge lives. Basic stats like win rate and finish rate are table stakes. The features that actually matter are derived: reach differential adjusted for stance matchup, takedown defense trend over the last 5 fights, striker-vs-grappler style clash indicators, travel distance to the venue (yes, a fighter crossing 8 time zones to fight performs differently), layoff duration, debut flags, and dozens of rolling statistical aggregates.
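To make the "rolling aggregate" idea concrete, here is a minimal sketch of one such derived feature, the takedown defense trend over the last 5 fights (the function name and return shape are illustrative assumptions, not the project's actual API):

```python
def takedown_defense_trend(td_def_rates, window=5):
    """Rolling mean and linear trend (least-squares slope) of per-fight
    takedown defense rate over the last `window` fights, in
    chronological order. A sketch of one derived feature."""
    recent = td_def_rates[-window:]
    n = len(recent)
    if n == 0:
        return {"td_def_mean": None, "td_def_slope": None}
    mean = sum(recent) / n
    if n == 1:
        return {"td_def_mean": mean, "td_def_slope": 0.0}
    # Slope of rate regressed on fight index 0..n-1.
    x_mean = (n - 1) / 2
    num = sum((i - x_mean) * (r - mean) for i, r in enumerate(recent))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return {"td_def_mean": mean, "td_def_slope": num / den}
```

A positive slope flags a fighter whose wrestling defense is improving, which a raw career average would hide.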
APScheduler runs the automation: daily scrapes to keep the database current, a card monitor that checks every 10 minutes for newly announced fights, pre-event auto-logging of predictions, and a live fight-night check every 2 minutes to settle results.
Feature engineering is 80% of the work and 95% of the value. The model architecture matters far less than the quality and diversity of what you feed it. We spent weeks on the XGBoost/LightGBM/CatBoost ensemble, but the biggest accuracy improvements always came from adding a new feature module — not from tuning hyperparameters.
Technical Decisions
Three-model ensemble (XGBoost + LightGBM + CatBoost). Each handles different data characteristics well. XGBoost is strong on structured tabular data. LightGBM is fast and handles categorical features natively. CatBoost is robust to overfitting with small datasets (some prop markets have limited historical data). The ensemble averages their probability outputs, and per-class calibration ensures the probabilities actually mean what they say — a 70% predicted KO probability should hit roughly 70% of the time.
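The averaging-plus-calibration step can be sketched as follows, assuming each model emits a per-class probability matrix and using scikit-learn's isotonic regression fit one-vs-rest on held-out data (a minimal sketch, not the production pipeline):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def ensemble_probs(prob_list):
    """Average the class-probability matrices from the three models."""
    return np.mean(prob_list, axis=0)

def fit_per_class_calibrators(probs, y, n_classes):
    """Fit one isotonic regressor per class on held-out predictions."""
    cals = []
    for k in range(n_classes):
        iso = IsotonicRegression(out_of_bounds="clip")
        iso.fit(probs[:, k], (y == k).astype(float))
        cals.append(iso)
    return cals

def calibrate(probs, cals):
    """Apply per-class calibrators, then renormalize rows to sum to 1."""
    out = np.column_stack([c.predict(probs[:, k]) for k, c in enumerate(cals)])
    out = np.clip(out, 1e-6, 1.0)
    return out / out.sum(axis=1, keepdims=True)
```

The renormalization matters: calibrating each class independently breaks the sum-to-one constraint, so the rows must be rescaled before the probabilities are compared to market prices.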
Separate models for each prop type. Method props (KO, submission, decision), decision props, and over/under rounds each get their own model pipeline with their own feature importance rankings and thresholds. The optimal edge threshold differs by market: 5% for KO and submission props, 10% for decisions, 8% for over/under. These thresholds came from grid search optimization and historical backtesting.
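The value check itself is simple once the thresholds are fixed. A sketch using the thresholds above (the function and dictionary names are illustrative; the implied-probability conversion is the standard American-odds formula, vig included):

```python
EDGE_THRESHOLDS = {"ko": 0.05, "submission": 0.05,
                   "decision": 0.10, "over_under": 0.08}

def implied_prob(american_odds):
    """Market's implied probability from American odds (vig included)."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)

def has_value(market, model_prob, american_odds):
    """Flag a bet when the model's edge clears that market's threshold."""
    edge = model_prob - implied_prob(american_odds)
    return edge > EDGE_THRESHOLDS[market], edge
```

For example, a KO line at +150 implies 40%; a model probability of 48% gives an 8% edge, which clears the 5% KO threshold.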
SQLite over PostgreSQL for the core database. This might be surprising for a data-heavy application, but the workload is single-writer with analytical reads. WAL mode gives concurrent read access during scrapes. The entire database lives in a single file, which makes deployment to Railway trivial — mount a persistent volume, point at the file. No connection pooling, no ORM, no migration framework beyond the schema file.
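The entire database layer reduces to a connection helper along these lines (the file path and pragma choices beyond WAL are illustrative assumptions):

```python
import sqlite3

def connect(db_path="mma.db"):
    """Open the single-file database with WAL so the one writer (scrapes)
    and concurrent readers (dashboard, analysis) don't block each other."""
    conn = sqlite3.connect(db_path, timeout=30)
    conn.execute("PRAGMA journal_mode=WAL;")
    conn.execute("PRAGMA synchronous=NORMAL;")  # common pairing with WAL
    conn.execute("PRAGMA foreign_keys=ON;")
    return conn
```

WAL persists in the database file itself, so it only needs to be set once, but issuing the pragma on every connect is harmless and self-documenting.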
FastAPI with Jinja2 templates, not a full React SPA. The web interface serves predictions, prop value analysis, and historical accuracy tracking. A small React portion handles the interactive prediction dashboard, but most pages are server-rendered Jinja2 templates. For an internal tool, server rendering is simpler and faster to iterate on.
Prop value detection with Kelly sizing. The find_prop_value script identifies bets where the model's predicted probability exceeds the implied odds by more than the edge threshold. Kelly criterion determines suggested bet sizing based on the estimated edge. Historical backtesting showed +617 units on method props, +56 units on over/under, and +36 units on decisions. These are backtested numbers — real betting results would differ due to execution factors.
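The sizing formula is the standard Kelly criterion, f* = (b·p − q) / b, where b is the net payout per unit staked, p is the model's probability, and q = 1 − p. A sketch (the bankroll cap value is an illustrative assumption; capping Kelly is standard practice because the raw fraction is aggressive when the edge estimate is noisy):

```python
def kelly_fraction(p, american_odds, cap=0.05):
    """Suggested fraction of bankroll via Kelly: f* = (b*p - q) / b.
    Never negative; capped at `cap` of bankroll (illustrative value)."""
    if american_odds < 0:
        b = 100 / -american_odds   # e.g. -150 pays 2/3 per unit
    else:
        b = american_odds / 100    # e.g. +150 pays 1.5 per unit
    q = 1 - p
    f = (b * p - q) / b
    return max(0.0, min(f, cap))
```

At p = 0.5 on a +150 line, raw Kelly is 1/6 of bankroll, which is exactly the kind of number the cap exists to rein in.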
What We Learned
The biggest lesson was about data leakage. Our initial over/under model looked unreasonably accurate until we discovered that one of the judge-related features was leaking post-fight information into the training set. Removing it dropped the accuracy meaningfully but made the model actually predictive. Every ML practitioner knows about data leakage in theory. Finding it in your own code, after you have been excited about your results for a week, is a different experience.
The second lesson was about calibration vs. accuracy. A model can be accurate in aggregate but poorly calibrated — meaning its 60% predictions hit 45% of the time while its 80% predictions hit 90%. Per-class calibration using isotonic regression fixed this and made the prop value calculations actually trustworthy. In sports betting, calibration matters more than raw accuracy because your edge depends on the gap between your probability and the market's implied probability.
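The diagnostic behind that lesson is a simple reliability check: bucket predictions by confidence and compare the predicted rate to the observed hit rate in each bucket. A minimal sketch (function name and return shape are illustrative):

```python
def calibration_bins(probs, outcomes, n_bins=10):
    """Group predictions into confidence bins and report predicted vs
    observed hit rate per bin. Large gaps mean poor calibration."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, y))
    report = []
    for b in bins:
        if not b:
            continue
        avg_p = sum(p for p, _ in b) / len(b)
        hit_rate = sum(y for _, y in b) / len(b)
        report.append({"predicted": avg_p, "observed": hit_rate, "n": len(b)})
    return report
```

A well-calibrated model produces bins where "predicted" and "observed" track each other; the 60%-predicts-45% failure mode shows up immediately in this table.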
763 tests keep all of this honest. When we add a new feature module, the test suite verifies that feature extraction does not crash on missing data, that the model pipeline still trains, and that prediction outputs stay within expected ranges. For a system touching 7 external data sources with 67 feature modules, that safety net is not optional.
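The missing-data tests follow a simple pattern: feed a feature module an empty or partial row and assert it degrades gracefully. A toy stand-in (the function, field names, and fallback value are all illustrative, not the project's actual code):

```python
def extract_features(fight_row: dict) -> dict:
    """Toy feature module: must tolerate missing fields from scrapers."""
    reach_a = fight_row.get("reach_a")
    reach_b = fight_row.get("reach_b")
    if reach_a is not None and reach_b is not None:
        reach_diff = reach_a - reach_b
    else:
        reach_diff = 0.0  # neutral fallback when a source lacked the data
    return {"reach_diff": reach_diff}

def test_handles_missing_data():
    feats = extract_features({})  # scraper returned nothing for this fight
    assert feats["reach_diff"] == 0.0

def test_output_in_expected_range():
    feats = extract_features({"reach_a": 76.0, "reach_b": 72.0})
    assert -30 <= feats["reach_diff"] <= 30
```

Multiplied across 67 modules, this pattern is what lets a scraper change on one of the 7 sources fail loudly in CI instead of silently corrupting features.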
Read about our feature engineering approach — 878 features and still learning.
Building a prediction model? Book a free call — we've shipped 878 features in production ML.