Streaming Algorithms
Algorithms designed to process massive data streams in a single pass (or few passes) using memory that is sublinear—often logarithmic or even constant—in the input size. When data is too large to store, streaming algorithms compute approximate answers using probabilistic data structures called “sketches.” This field represents a paradigm shift: from exact computation with unlimited memory to approximate computation with radical memory constraints.
Advanced
- Basic concepts (sampling, sketching): Intermediate to Advanced
- Probabilistic analysis and tail bounds: Advanced
- Communication complexity lower bounds: Advanced
- Linear sketching and compressed sensing: Advanced
- Graph streaming algorithms: Advanced
The scale problem is real: Modern data volumes fundamentally exceed what fits in memory:
- Google processes 20+ billion web pages (crawling and indexing)
- Facebook sees 4+ petabytes of new data daily
- Network routers handle 100+ Gbps traffic (can’t store, must process on-the-fly)
- Sensor networks, IoT devices, financial trading—all generate unbounded streams
Traditional algorithms fail catastrophically at this scale. You physically cannot store the input, let alone make multiple passes over it. Streaming algorithms provide the only path forward.
War Story - The Birth of Streaming Algorithms: In 1985, Flajolet and Martin proved something remarkable: you can approximate the number of distinct elements in a stream (the 0th frequency moment, F₀) using only sublinear space via a randomized sketching algorithm. Alon, Matias, and Szegedy generalized this to arbitrary frequency moments in 1996; their paper earned the 2005 Gödel Prize and launched the field of streaming algorithms.
Before Flajolet-Martin, processing massive datasets meant parallelizing across clusters. After Flajolet-Martin, we realized fundamentally different algorithmic approaches were possible—process data once, keep tiny summaries, maintain provable accuracy guarantees.
War Story - HyperLogLog at Scale: Counting distinct elements in a massive stream is fundamental: unique visitors to a website, distinct IP addresses, unique search queries. Exact counting requires storing every element seen—impossible at scale.
HyperLogLog counts distinct elements with roughly 2% standard error using only 1.5 KB of memory—regardless of stream size! Google, Reddit, and Amazon deploy HyperLogLog for real-time analytics on billions of events. The algorithm is mathematically beautiful: it estimates cardinality from the pattern of leading zeros in hashed values.
This single algorithm transformed big data analytics, enabling real-time dashboards that were previously impossible.
- Reservoir Sampling - Uniform random sample from stream
- Count-Min Sketch - Frequency estimation for heavy hitters
- HyperLogLog - Distinct element counting (cardinality estimation)
- Bloom Filters - Probabilistic set membership testing
- AMS Sketch - Frequency moments and self-join size estimation
- Misra-Gries - Deterministic frequent elements
- BJKST Algorithm - Distinct elements in polylogarithmic space
- Count Sketch - Point queries with better accuracy
- Graph Streaming - Connectivity, matchings, spanners in streams
- Linear Sketching - Johnson-Lindenstrauss lemma, compressed sensing
- Sliding Window Algorithms - Statistics over recent elements only
Required:
- Randomized algorithms and probabilistic analysis
- Tail bounds (Markov, Chebyshev, Chernoff)
- Hash functions and their properties
- Basic linear algebra (for linear sketches)
Helpful:
- Communication complexity basics
- Information theory fundamentals
- Random matrix theory (for advanced topics)
Setup: You’re running a major website. You want to count unique visitors daily. Your site gets 100 million visits, but only 20 million are unique (many people visit multiple times). How do you count uniques?
Naive approach: Store every visitor ID in a hash set. At 100M visitors with 16-byte IDs, that’s 1.6 GB of memory. Multiply across thousands of pages being tracked—infeasible.
First insight: You don’t need exact counts; 1% error is fine. Can approximation save memory?
Breakthrough - HyperLogLog: Hash each visitor ID and look at the leading zeros in the binary representation. If the longest run of leading zeros observed is k, there are roughly 2^k distinct elements (probabilistically). Combine many such estimators (one per register) for robustness.
The magic: 1.5 KB of memory, roughly 2% error, works for billions of elements. Memory usage is independent of stream size!
Generalization: This teaches the fundamental streaming insight—clever random projections can compress massive data into tiny sketches while preserving statistical properties.
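To make the leading-zeros idea concrete, here is a minimal Python sketch of a HyperLogLog-style estimator. Class and parameter names are illustrative, and it deliberately omits the small- and large-range corrections of the full algorithm:

```python
import hashlib
import math

class ToyHyperLogLog:
    """Minimal HyperLogLog-style estimator (no range corrections)."""

    def __init__(self, b=10):
        self.b = b                  # 2^b registers
        self.m = 1 << b
        self.registers = [0] * self.m
        # bias-correction constant for m >= 128 (from the 2007 paper)
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash of the item
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        j = h & (self.m - 1)        # low b bits select a register
        w = h >> self.b             # remaining bits feed the rank
        # rank = position of the leftmost 1-bit in the (64-b)-bit word w
        rank = (64 - self.b) - w.bit_length() + 1
        self.registers[j] = max(self.registers[j], rank)

    def estimate(self):
        # harmonic mean of 2^registers, scaled by alpha * m^2
        z = sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m / z
```

With b=10 (1024 registers, well under 1.5 KB at a few bits per register), the standard error is about 1.04/sqrt(1024) ≈ 3%; duplicates never inflate the estimate because registers only take maxima.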
Setup: Network security needs to detect DDoS attacks by finding IP addresses sending abnormally many packets. Millions of packets per second from millions of IPs. You need to find IPs responsible for >1% of traffic. You can’t store per-IP counts—that’s millions of counters.
Naive approach: Hash table with counters. But with millions of IPs, this requires huge memory.
Streaming insight: You don’t care about IPs with few packets; you only care about heavy hitters (>1% of traffic). Can we track only the important ones?
Count-Min Sketch solution:
- Use multiple hash functions mapping IPs to small arrays of counters
- Increment all corresponding counters for each packet
- Estimate count by taking minimum across hash functions
- Collisions cause overestimates, never underestimates
Memory usage: O((1/ε) log(1/δ)) counters for ε-approximation with probability 1-δ. Typically kilobytes instead of megabytes.
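The four steps above fit in a few lines of Python. This is a minimal sketch in which hashlib-derived hashes stand in for the pairwise-independent hash families used in the formal analysis; choose width ≈ e/ε and depth ≈ ln(1/δ) for the stated guarantee:

```python
import hashlib

class CountMinSketch:
    """Count-Min Sketch: depth rows of width counters."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # one derived hash function per row (illustrative construction)
        data = f"{row}:{item}".encode()
        return int(hashlib.md5(data).hexdigest(), 16) % self.width

    def add(self, item, count=1):
        # increment the corresponding counter in every row
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def query(self, item):
        # minimum across rows: collisions only inflate, never deflate
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))
```

Because every counter an item touches receives at least that item's full count, `query` can overestimate (due to collisions) but never underestimate.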
Real-world impact: Deployed in network routers, intrusion detection systems, DDoS mitigation. Enables real-time monitoring at line speed.
Setup: Edges of a massive graph arrive as a stream. You need to determine if the graph is connected. Traditional algorithms require storing the entire graph—impossible for billion-edge graphs.
Challenge: With small memory, you can’t store the graph. But connectivity requires understanding global structure.
Streaming approach:
- Sample edges uniformly at random (reservoir sampling)
- Use union-find on sampled edges
- If the sampled subgraph is connected, the full graph certainly is; if not, the graph may still be connected, so sampling gives a one-sided test (careful analysis needed)
Deeper challenge - semi-streaming: With O(n polylog n) space (where n = vertices), you can do much more:
- Connectivity exactly (via spanning forest)
- Approximate maximum matching
- Spanners
The insight: Even with severe memory constraints, graph algorithms are possible using clever sampling and sketching techniques.
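The exact semi-streaming connectivity result mentioned above is simple to sketch: maintain a union-find structure over the n vertices and process each edge once, in O(n) words of memory. A minimal illustration, assuming an insertion-only edge stream:

```python
class StreamingConnectivity:
    """Exact connectivity over an insertion-only edge stream.

    Uses O(n) space: a union-find spanning forest over the vertices
    (the semi-streaming model), reading each edge exactly once.
    """

    def __init__(self, n):
        self.parent = list(range(n))
        self.components = n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def add_edge(self, u, v):
        # keep the edge only if it merges two components
        ru, rv = self.find(u), self.find(v)
        if ru != rv:
            self.parent[ru] = rv
            self.components -= 1

    def is_connected(self):
        return self.components == 1
```

Edges inside an existing component are discarded on the fly, which is exactly why the memory bound depends on n (vertices) rather than the potentially enormous number of edges.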
Information diet management: We live in an era of information overload—news feeds, emails, social media streams. Streaming algorithms suggest strategies:
Reservoir sampling for decision-making: When facing too many options, maintain a small random sample. This gives unbiased representation without overwhelming cognitive load.
Heavy hitters for attention: Focus on high-frequency patterns. Don’t track every detail; identify what matters most.
Sliding windows for recency: Recent information is often most relevant. Exponential decay in memory mirrors how streaming sliding window algorithms weigh recent events more heavily.
Maintain uniform random sample of fixed size from unbounded stream.
- Reservoir sampling: Classic algorithm, beautiful proof
- Applications: A/B testing, data profiling, debugging
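A minimal implementation of the classic reservoir algorithm (Vitter's Algorithm R): keep the first k items, then replace a uniformly random slot with probability k/(i+1) when the (i+1)-th item arrives:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Uniform random sample of k items from an iterable of unknown
    length, using O(k) memory and a single pass."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = rng.randrange(i + 1)      # uniform in 0..i
            if j < k:                     # i.e. with probability k/(i+1)
                reservoir[j] = item
    return reservoir
```

The beautiful inductive proof: after processing i items, each item is in the reservoir with probability exactly k/i, and the replacement step preserves this invariant.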
Represent data as vector; multiply by random matrix to compress.
- Johnson-Lindenstrauss: Random projection preserves distances
- Count-Min Sketch: Multiple hash functions create linear sketch
- Applications: Dimensionality reduction, compressed sensing
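A toy illustration of the random-projection idea behind Johnson-Lindenstrauss, in pure Python for clarity (real implementations use optimized matrix libraries and sparser projection matrices):

```python
import math
import random

def jl_project(vectors, d_out, seed=0):
    """Project vectors in R^d down to R^d_out with one shared random
    Gaussian matrix, scaled by 1/sqrt(d_out) so squared distances are
    preserved in expectation (Johnson-Lindenstrauss)."""
    rng = random.Random(seed)
    d_in = len(vectors[0])
    # one shared random matrix: the sketch is linear in its input
    R = [[rng.gauss(0, 1) / math.sqrt(d_out) for _ in range(d_in)]
         for _ in range(d_out)]
    return [[sum(row[j] * v[j] for j in range(d_in)) for row in R]
            for v in vectors]
```

Linearity is the key streaming property: the sketch of a sum of updates equals the sum of the sketches, so updates can arrive in any order (and even be negative).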
Use hash functions to map elements to small space; aggregate in small space.
- HyperLogLog: Hash to extract randomness, aggregate statistics
- Bloom filters: Multiple hash functions for membership
- Applications: Duplicate detection, caching, databases
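A minimal Bloom filter illustrating the hash-based paradigm. Parameter defaults are illustrative, and derived hashes stand in for independent hash functions; queries can return false positives but never false negatives:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k derived hash functions set bits in an
    m-bit array."""

    def __init__(self, m=8192, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8 + 1)

    def _positions(self, item):
        # derive k hash positions from one base hash (illustrative)
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # present only if all k bits are set; collisions can lie "yes"
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

Sizing rule of thumb: with n inserted items, the false-positive rate is roughly (1 - e^(-kn/m))^k, so m ≈ 10n bits with k ≈ 7 gives about a 1% rate.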
Detect important elements by sampling; maintain only elements above threshold.
- Misra-Gries: Deterministic heavy hitters via counters
- Sticky sampling: Probabilistic sampling with threshold
- Applications: Network monitoring, data mining
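Misra-Gries fits in a dozen lines: with k-1 counters, every element occurring more than n/k times in a length-n stream is guaranteed to survive in the summary (the converse may fail, so a second pass can verify candidates):

```python
def misra_gries(stream, k):
    """Misra-Gries summary with at most k-1 counters.

    Every element with frequency > n/k appears in the returned dict;
    counter values are lower bounds on true frequencies."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # "cancel" x against one occurrence of every tracked element
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

The charging argument: each decrement step destroys k occurrences at once (x plus one from each of the k-1 counters), so at most n/k decrements can hit any single element.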
Why streaming is hard: Communication complexity provides streaming algorithm lower bounds. If two parties need to exchange many bits to solve a problem, a streaming algorithm needs similar space.
Classic results:
- Distinct elements: Ω(1/ε² + log n) space required
- Frequency moments: Ω(n^(1-2/k)) space for k-th moment
- Graph connectivity: Ω(n) space in adversarial order
Why this matters: Lower bounds prove our algorithms are (nearly) optimal. We can’t do fundamentally better without stronger assumptions.
- Muthukrishnan “Data Streams: Algorithms and Applications” - Classic survey/monograph
- Woodruff “Sketching as a Tool for Numerical Linear Algebra” - Comprehensive modern treatment
- Cormode “Sketch Techniques for Approximate Query Processing” - Practical perspective
- CMU 15-859 “Algorithms for Big Data” - David Woodruff’s course (excellent, comprehensive)
- Dartmouth CS 49/149 - Amit Chakrabarti’s widely-used notes
- Harvard CS 225 - Jelani Nelson’s course
- MIT 6.890 - Various topics courses on streaming
- Alon, Matias, Szegedy “The Space Complexity of Approximating the Frequency Moments” (1996) - The founding paper (Gödel Prize)
- Flajolet et al. “HyperLogLog” (2007) - The practical cardinality estimator
- Cormode & Muthukrishnan “An Improved Data Stream Summary: The Count-Min Sketch” (2005)
- Chakrabarti “Communication Complexity” - For lower bounds
- Sketching Algorithms MOOC - Free online course
- DataSketches.io - Open-source library with implementations
- Randomized Algorithms: Streaming relies fundamentally on randomization for compression
- Communication Complexity: Provides lower bounds via reductions
- Approximation Algorithms: Trade accuracy for resources (memory instead of time)
- Compressed Sensing: Linear sketching connects to signal processing theory
- Machine Learning: Stochastic gradient descent is streaming optimization; reservoir sampling for training
- Databases: Approximate query processing uses streaming techniques
- Network Algorithms: Monitoring, traffic engineering, DDoS detection
- Systems: Cache management, log analysis, monitoring
Streaming Algorithm Library
- Implement core streaming algorithms (HyperLogLog, Count-Min, Bloom filters)
- Provide clean API for common tasks
- Benchmark against exact algorithms
- Visualize memory-accuracy tradeoffs
Network Traffic Monitor
- Process packet stream in real-time
- Detect heavy hitter flows (Count-Min Sketch)
- Count unique IPs (HyperLogLog)
- Identify anomalies (significant deviations)
- Dashboard showing statistics
Twitter Stream Analyzer
- Process Twitter firehose or Reddit stream
- Track trending hashtags (heavy hitters)
- Count unique users (cardinality)
- Detect bursty topics (sliding window)
- Compare streaming vs. exact computation
Graph Stream Processor
- Implement semi-streaming graph algorithms
- Test connectivity, matching on large graphs
- Compare memory usage to traditional algorithms
- Visualize sketch evolution
Cardinality Estimation Shootout
- Implement multiple cardinality estimators (HyperLogLog, BJKST, sampling)
- Test accuracy vs. memory tradeoffs
- Apply to real datasets
- Analyze performance characteristics
Sliding Window Statistics
- Implement exponentially decaying histograms
- Maintain approximate statistics over sliding windows
- Apply to time-series data
- Compare to exact windowing
2-5 lectures depending on depth:
- 2 lectures: Core concepts (sampling, frequency estimation, cardinality)
- 3 lectures: + Linear sketching, graph streaming
- 4 lectures: + Communication complexity lower bounds
- 5 lectures: + Advanced topics (sliding windows, distributed streaming)
“Streaming algorithms are just approximations”
- Subtle! Some streaming algorithms are exact (e.g., connectivity with enough space). The paradigm is about memory constraints, not necessarily approximation.
“More memory always helps”
- True but with diminishing returns. Often logarithmic space gets you close; polynomial space gets you everything. The interesting regime is sublinear space.
“Streaming algorithms are only for ‘big data’”
- False! They’re useful whenever you want low memory footprint—embedded systems, real-time processing, distributed systems with communication constraints.
“Hash collisions break everything”
- False! Streaming algorithms are designed to handle collisions gracefully. Analysis accounts for collision probability.
“Single pass means no preprocessing”
- False! You can use preprocessing (e.g., choose random hash functions, initialize data structures). Single pass refers to reading the stream once.
“Streaming algorithms are a solved problem”
- False! Active research area with open problems—graph algorithms, distributed streaming, adaptive adversaries, learning-augmented streaming.
Traditional (banking model):
“Here’s the streaming model definition. Here’s the Count-Min Sketch data structure. Memorize the update and query procedures. Here’s the error bound proof.”
Problem-posing approach:
“You’re processing billions of IP addresses in real-time. You need to detect which IPs send >1% of traffic. You only have a few kilobytes of memory. What do you do? Storing per-IP counts won’t fit… What if we use hash tables? They still grow with unique IPs… What if we use multiple hash tables and take the minimum? Why might that work? Let’s analyze the collisions…”
The difference: Students confront the impossibility of traditional approaches at scale, then discover streaming techniques as the solution. The memory constraint isn’t an academic exercise—it’s the fundamental challenge that motivates the entire field.
Curriculum placement:
- Weeks 8-10 for graduate courses (after randomized algorithms)
- Often dedicated advanced course or seminar
- Can integrate into “big data” or “data mining” courses
Prerequisites check: Students need solid randomized algorithms background and comfort with probabilistic analysis. Tail bounds (Chernoff) essential.
Why late course: Requires significant mathematical sophistication. Builds on randomization, hashing, probability. Best after foundational techniques established.
Learning-Augmented Streaming: Combining ML predictions with worst-case guarantees. If predictions are good, achieve better performance; if predictions fail, maintain worst-case bounds. Active research area bridging ML and theory.
Graph Neural Networks and Streaming: Processing graph streams for ML applications—community detection, link prediction, node classification—all in streaming setting.
Distributed and Parallel Streaming: Streaming in MapReduce-style frameworks. Algorithms must handle data partitioned across machines with limited communication.
Differential Privacy in Streams: Maintaining privacy guarantees while processing sensitive data streams. Fundamental challenge: how to release accurate statistics without revealing individual data points.
Adversarial Streams: Beyond random order or worst-case static order—adaptive adversaries that see algorithm state and choose next elements to maximize error. Models realistic adversarial settings (network attacks, adversarial ML).
Quantum Streaming: Can quantum algorithms provide exponential space savings for streaming problems? Some evidence yes (quantum sketching), but many open questions.
Industry usage:
- Google: HyperLogLog for analytics, Bloom filters for BigTable
- Facebook: Streaming aggregation for real-time metrics
- Amazon: Cardinality estimation for marketplace analytics
- Twitter: Heavy hitters for trending topics
- DataDog/Splunk: Log analysis and monitoring using sketches
Open-source libraries:
- DataSketches (Apache): Production-quality streaming algorithms
- Redis HyperLogLog: Built-in cardinality estimation
- Apache Spark: Streaming API with approximate aggregation
Ready for students? ✓ Yes - addresses critical modern computing challenge, rigorous mathematical foundations, strong practical applications, clear motivation from impossibility of traditional approaches at scale. Advanced but essential topic for data-intensive computing.
