The Problem
Financial sentiment analysis sounds like a solved problem until you try to build a system that actually works. Most sentiment tools slap a pretrained model on tweets and call it a day. The result is noise — sarcasm reads as bullish, memes read as negative, and the signal-to-noise ratio makes the output useless for any real trading decision.
We wanted something different: a pipeline that pulls from multiple sources (Reddit, Twitter, and financial news), applies domain-specific scoring tuned for market language, and delivers aggregated sentiment scores through an API fast enough to be useful during market hours.
The Approach
The pipeline runs three parallel ingestion tracks. Reddit scraping pulls from finance-specific subreddits — r/wallstreetbets, r/stocks, r/cryptocurrency — using PRAW. Twitter ingestion via Tweepy filters for cashtags ($AAPL, $BTC) and financial accounts. The news track scrapes financial headlines from multiple outlets using BeautifulSoup.
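The cashtag filter on the Twitter track reduces to a small amount of regex. A minimal sketch (the helper names and the symbol whitelist are illustrative, not the pipeline's actual code):

```python
import re

# Cashtags: "$" followed by 1-5 letters, covering stock tickers and
# common crypto symbols like $BTC. "$100" will not match (digits only).
CASHTAG_RE = re.compile(r"\$([A-Za-z]{1,5})\b")

def extract_cashtags(text: str) -> list[str]:
    """Return upper-cased symbols mentioned via cashtags, in order."""
    return [m.group(1).upper() for m in CASHTAG_RE.finditer(text)]

def mentions_tracked_asset(text: str, tracked: set[str]) -> bool:
    """True if the text mentions any symbol on the tracked whitelist."""
    return any(tag in tracked for tag in extract_cashtags(text))
```

Filtering at the regex level before any NLP runs keeps the ingestion track cheap: most tweets are discarded without ever touching the scoring pipeline.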
Raw text goes through a preprocessing pipeline: cleaning HTML artifacts, normalizing ticker mentions, expanding common abbreviations (DD = due diligence, not a stock symbol), and filtering spam. This step matters more than the model itself — garbage text produces garbage sentiment regardless of how good your NLP is.
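A condensed sketch of that preprocessing stage — the abbreviation map and spam heuristic below are illustrative stand-ins for the fuller versions, not the production tables:

```python
import html
import re

# Illustrative subset of the abbreviation map: expanded before scoring
# so "DD" reads as "due diligence" rather than a ticker.
ABBREVIATIONS = {
    "DD": "due diligence",
    "ATH": "all-time high",
    "FUD": "fear uncertainty doubt",
}

TAG_RE = re.compile(r"<[^>]+>")
TICKER_RE = re.compile(r"\$([A-Za-z]{1,5})\b")

def preprocess(text: str) -> str:
    text = html.unescape(text)                  # &amp; -> &, etc.
    text = TAG_RE.sub(" ", text)                # strip leftover HTML tags
    text = TICKER_RE.sub(lambda m: m.group(1).upper(), text)  # $aapl -> AAPL
    words = [ABBREVIATIONS.get(w.upper(), w) for w in text.split()]
    return " ".join(words)

def looks_like_spam(text: str) -> bool:
    """Crude heuristic: mostly links, or the same tokens repeated."""
    words = text.lower().split()
    if not words:
        return True
    if sum(w.startswith("http") for w in words) / len(words) > 0.5:
        return True
    return len(set(words)) / len(words) < 0.3
```

Each step is cheap string work, which is why this stage can run on every incoming post without becoming the bottleneck.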
The hardest part of financial NLP is not the models — it is the preprocessing. "AAPL to the moon 🚀🚀🚀" is obviously bullish, but standard sentiment models score it as neutral because they do not understand financial slang. Domain-specific tokenization and a custom lexicon for market language made a bigger accuracy difference than switching model architectures.
Technical Decisions
NLTK with a custom financial lexicon over transformer models. Transformers give marginally better accuracy on benchmarks, but at 10-100x the latency. For a real-time scoring pipeline processing hundreds of posts per minute during market hours, NLTK with VADER plus a custom financial lexicon (200+ domain-specific terms with manual sentiment weights) hits the right tradeoff. The lexicon includes terms like "moon" (positive), "bag holding" (negative), "diamond hands" (positive), and "rug pull" (strongly negative) that generic models miss entirely.
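At its core the lexicon layer is a token-to-weight lookup. A pure-Python sketch of the idea — the terms and weights below are a few illustrative entries, not the actual 200-term lexicon, and the real pipeline layers them on top of VADER rather than replacing it:

```python
# Illustrative slice of the financial lexicon: term -> sentiment weight.
# Multi-word terms are kept intact so "rug pull" scores as one unit.
FIN_LEXICON = {
    "moon": 2.5,
    "diamond hands": 2.0,
    "bag holding": -2.0,
    "rug pull": -3.5,   # strongly negative
    "🚀": 2.0,
}

def lexicon_score(text: str) -> float:
    """Average lexicon weight over matched terms; 0.0 if nothing matches.
    Terms are matched as substrings of the lowercased text."""
    lowered = text.lower()
    hits = [weight for term, weight in FIN_LEXICON.items() if term in lowered]
    return sum(hits) / len(hits) if hits else 0.0
```

This is exactly the kind of scoring that catches "AAPL to the moon 🚀🚀🚀" as bullish where a generic model scores it neutral.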
Redis for caching and rate limiting. Each source has different rate limits — Twitter is the strictest. Redis stores cached scores with TTL-based expiration, deduplicates posts across ingestion cycles, and tracks API rate limit windows. The cache also serves as the primary read layer for the FastAPI endpoints, so API responses come from Redis rather than hitting the scoring pipeline directly.
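The caching and dedup patterns here are standard Redis idioms (a TTL'd SETEX for scores, set membership for dedup, INCR plus EXPIRE for rate windows). A self-contained stand-in with an injected clock — so the logic is testable without a running Redis server — sketching the first two:

```python
class ScoreCache:
    """In-memory stand-in for the Redis layer; comments note the Redis
    command each method corresponds to. Not the production client."""

    def __init__(self, clock):
        self._clock = clock      # callable returning current time in seconds
        self._scores = {}        # asset -> (expires_at, score)
        self._seen = set()       # post-id dedup set

    def set_score(self, asset: str, score: float, ttl_s: int) -> None:
        # SETEX sentiment:{asset} {ttl_s} {score}
        self._scores[asset] = (self._clock() + ttl_s, score)

    def get_score(self, asset: str):
        # GET sentiment:{asset} -- None once the TTL has lapsed
        entry = self._scores.get(asset)
        if entry is None or entry[0] <= self._clock():
            return None
        return entry[1]

    def is_new_post(self, post_id: str) -> bool:
        # SADD posts:seen {post_id} -- returns 1 only on first insert
        if post_id in self._seen:
            return False
        self._seen.add(post_id)
        return True
```

The injected clock is the testing trick: expiry can be exercised by advancing a fake timestamp instead of sleeping through real TTLs.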
Per-source confidence weighting. Not all sources are equal. A detailed Reddit DD post with analysis carries more signal than a one-line tweet. The scoring engine weights sentiment by source type, post length, author karma/followers, and engagement metrics. A heavily upvoted bearish analysis on r/stocks moves the aggregate score more than a hundred low-engagement tweets.
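That weighting scheme amounts to a weighted mean. A sketch of the shape of the formula — the base weights, length cap, and log-engagement factor are assumptions for illustration, not the production constants:

```python
import math

# Assumed base weights per source type (illustrative values).
SOURCE_WEIGHT = {"news": 2.0, "reddit": 1.5, "twitter": 1.0}

def post_weight(source: str, length_chars: int, engagement: int) -> float:
    """Longer, higher-engagement posts from higher-signal sources count
    more. log1p damps engagement so one viral post cannot dominate."""
    length_factor = min(length_chars / 500.0, 2.0)   # cap the long-post bonus
    return (SOURCE_WEIGHT.get(source, 1.0)
            * (1.0 + length_factor)
            * (1.0 + math.log1p(engagement)))

def aggregate(posts) -> float:
    """posts: iterable of (sentiment, source, length_chars, engagement).
    Weighted mean of per-post sentiment; 0.0 when there is no input."""
    num = den = 0.0
    for sentiment, source, length_chars, engagement in posts:
        w = post_weight(source, length_chars, engagement)
        num += sentiment * w
        den += w
    return num / den if den else 0.0
```

With these constants, one long, 500-upvote bearish Reddit post outweighs ten short low-engagement bullish tweets, which is the behavior the paragraph above describes.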
FastAPI with WebSocket support for streaming scores. The REST API serves current aggregated sentiment per asset, historical sentiment trends, and source breakdown. A WebSocket endpoint streams live score updates as new data arrives during market hours. Pydantic models validate everything in and out.
What We Learned
The biggest takeaway was that aggregation strategy matters more than individual post accuracy. Even with a mediocre per-post sentiment classifier, aggregating across hundreds of posts per asset per hour produces a surprisingly stable and directionally useful signal. The noise cancels out when you have enough volume — which is the whole point of pulling from three different sources.
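That claim is easy to sanity-check with a toy simulation: a per-post classifier that reads the true direction only 60% of the time still recovers it almost surely once a thousand posts are averaged. The numbers here are illustrative, not measurements from the pipeline:

```python
import random

def simulate_aggregate(true_direction: int, n_posts: int,
                       per_post_accuracy: float, rng: random.Random) -> float:
    """Mean of noisy per-post labels: each post reads the true direction
    (+1 or -1) with probability per_post_accuracy, otherwise flips it."""
    votes = [
        true_direction if rng.random() < per_post_accuracy else -true_direction
        for _ in range(n_posts)
    ]
    return sum(votes) / n_posts

# A mediocre 60%-accurate classifier aggregated over 1,000 posts: the
# expected aggregate is +0.2 when the true direction is +1, and the
# chance of the wrong sign is vanishingly small at this volume.
rng = random.Random(0)
aggregate = simulate_aggregate(+1, 1000, 0.60, rng)
```

The standard error of the mean shrinks with the square root of the post count, which is exactly why pulling from three sources — and therefore more volume — beats polishing the per-post classifier.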
The second lesson was about temporal decay. A bearish Reddit post from 6 hours ago should weigh less than a bullish one from 10 minutes ago, especially during volatile market events. Implementing exponential time decay on sentiment scores dramatically improved the correlation between the pipeline output and actual short-term price movements.
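The decay weighting can be sketched in a few lines — the two-hour half-life is an assumed tuning constant for illustration, not the value used in production:

```python
import math

HALF_LIFE_HOURS = 2.0  # assumed tuning constant; shorter during volatile events

def decay_weight(age_hours: float) -> float:
    """Exponential decay: a post's weight halves every HALF_LIFE_HOURS."""
    return math.exp(-math.log(2) * age_hours / HALF_LIFE_HOURS)

def decayed_aggregate(posts) -> float:
    """posts: iterable of (sentiment, age_hours). Time-decayed mean."""
    num = sum(score * decay_weight(age) for score, age in posts)
    den = sum(decay_weight(age) for _, age in posts)
    return num / den if den else 0.0
```

With a two-hour half-life, the example above works out as expected: a bearish post from 6 hours ago carries weight 0.125, a bullish one from 10 minutes ago carries roughly 0.94, and the aggregate comes out bullish.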
Need real-time NLP analysis? Let's talk about your data pipeline.