Parallel & Distributed Algorithms
Algorithms designed to leverage multiple processors or machines working simultaneously. Parallel algorithms exploit multiple cores/processors on a single machine with shared memory, while distributed algorithms coordinate across independent machines communicating via message passing. This field addresses the fundamental question: what problems become easier with more computational resources working together, and what inherent barriers prevent perfect parallelization?
Intermediate to Advanced
- Basic parallel concepts (speedup, efficiency, Amdahl’s Law): Intermediate
- PRAM model and parallel algorithms: Intermediate to Advanced
- Distributed consensus and coordination: Advanced
- MapReduce and bulk synchronous parallel (BSP): Intermediate
- Graph algorithms in distributed settings: Advanced
- Lower bounds and communication complexity: Advanced
The serial era has ended: For decades, computers got faster via increasing clock speeds. That ended around 2005 due to physical limits (heat, power). Now performance comes from parallelism—multiple cores, GPUs, clusters, cloud computing. Modern laptops have 8+ cores. Servers have hundreds. Datacenters have millions.
Every algorithm you write will run on parallel hardware. Understanding parallelism isn’t optional—it’s essential for utilizing modern computing.
War Story - The Human Genome Project: Sequencing the first human genome (completed 2003) took 13 years and $2.7 billion. Today, sequencing a genome takes hours and costs $600. The dominant factor enabling this transformation: massively parallel algorithms processing millions of DNA fragments simultaneously.
The computational pipeline involves:
- Parallel sequence alignment across fragments
- Distributed assembly of genome from overlaps
- Parallel variant calling and analysis
Without parallel algorithms, modern genomics—personalized medicine, cancer research, evolutionary biology—wouldn’t exist at current scales.
War Story - Google’s PageRank: Google’s original breakthrough was PageRank, an eigenvector computation on a graph of billions of web pages. Computing this sequentially would take weeks. Google’s solution: MapReduce, a parallel programming model that automatically distributes computation across thousands of machines.
MapReduce transformed big data processing. Hadoop, Spark, and cloud computing all descend from Google’s parallel computation framework. The insight: structure algorithms around parallel map and reduce operations that can execute independently.
War Story - AlphaGo and Deep Learning: AlphaGo’s victory over the world Go champion required training neural networks on millions of games processed across 50 GPUs and 1,200 CPUs. Training time: weeks. Without parallelism: decades.
Modern deep learning is fundamentally parallel:
- Training parallelizes across data (mini-batches)
- Model parallelism distributes network layers
- GPU acceleration provides 100x speedups
AI’s recent renaissance is largely a story of parallel computing enabling previously infeasible computations.
- Parallel Prefix Sum (Scan) - Fundamental building block for parallel algorithms
- Parallel Sorting (Merge Sort, Sample Sort) - Divide-and-conquer parallelization
- Parallel Matrix Multiplication - Strassen with parallelism
- Parallel Graph BFS/DFS - Level-synchronous BFS
- Work-Span Analysis - Understanding parallelism and critical path
- MapReduce - Google’s parallel programming model
- Distributed Consensus (Paxos, Raft) - Agreement among distributed nodes
- Distributed Shortest Paths (Bellman-Ford parallelization)
- Distributed Graph Algorithms - Connected components, MST
- Gossip Protocols - Randomized information dissemination
- Byzantine Agreement - Consensus with adversarial failures
- Distributed Hash Tables (DHT) - Scalable key-value storage (Chord, Kademlia)
- Pregel/Graph Processing - Vertex-centric distributed graph computation
- Bulk Synchronous Parallel (BSP) Model - Structured parallel computation
- Communication-Avoiding Algorithms - Minimizing data movement
Required:
- Basic algorithms (sorting, graph algorithms)
- Complexity analysis and Big-O notation
- Basic probability (for randomized algorithms)
Helpful:
- Understanding of computer architecture (caches, memory hierarchy)
- Concurrent programming concepts
- Graph theory
- Linear algebra (for matrix algorithms)
Setup: You have 1 billion records to sort and 1,000 machines. Can you sort 1,000× faster?
Naive parallelization: Divide data into 1,000 chunks, sort each chunk independently, then merge. But merging 1,000 sorted sequences is expensive!
Better approach - Sample Sort:
- Sample elements from entire dataset
- Use samples to choose splitters that partition data evenly
- Each machine gets one partition, sorts locally
- Concatenate results (already in global order!)
Analysis:
- Work: O(n log n) total (same as sequential)
- Parallel time: O((n/p) log(n/p) + (n/p) log p) = O((n/p) log n) per processor
- Speedup: Nearly linear for large n!
The insight: Efficient parallel algorithms require rethinking sequential approaches. The bottleneck shifts from computation to communication and coordination.
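The four steps above can be sketched in Python. This is a single-process simulation, not a distributed implementation: the partitions that would live on separate machines are just lists, and the oversampling factor is a common heuristic assumed here, not something specified above.

```python
import bisect
import random

def sample_sort(data, p, oversample=8):
    """Simulate sample sort on p 'machines' (single-process, illustrative).

    In a real deployment each partition would be sorted on its own machine;
    here the local sorts run one after another to show the data flow.
    """
    # Step 1: sample elements from the entire dataset.
    sample = sorted(random.sample(data, min(len(data), p * oversample)))
    # Step 2: pick p-1 splitters that approximately partition the data evenly.
    splitters = [sample[i * len(sample) // p] for i in range(1, p)]
    # Step 3: route each element to its partition ("machine").
    partitions = [[] for _ in range(p)]
    for x in data:
        partitions[bisect.bisect_right(splitters, x)].append(x)
    # Step 4: each machine sorts locally; concatenation is already in global order.
    return [x for part in partitions for x in sorted(part)]

data = [random.randrange(10**6) for _ in range(10000)]
assert sample_sort(data, p=10) == sorted(data)
```

Note that no merge step is needed: because the splitters define disjoint, ordered ranges, concatenating the locally sorted partitions yields the globally sorted result.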
Setup: A bank account starts with $1,000. Two ATMs simultaneously process withdrawals of $900. Both check the balance (each sees $1,000), both approve, and both write back $1,000 - $900 = $100. The account ends at $100, yet $1,800 was dispensed: the second write silently overwrote the first (a lost update).
The challenge: With distributed systems, simple operations become complex:
- How do we ensure consistency?
- How do we coordinate without central authority?
- What if machines crash during operations?
- What if network messages are lost or delayed?
Solutions:
- Two-phase commit: Coordinate via prepare/commit protocol
- Consensus algorithms: Paxos, Raft for fault-tolerant agreement
- Eventual consistency: Relax immediate consistency for availability
The insight: Distributed algorithms must handle asynchrony, failures, and network partitions. The CAP theorem formalizes fundamental tradeoffs: you can’t have Consistency, Availability, and Partition tolerance simultaneously.
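The ATM race can be reproduced with two threads on a single machine. A deliberate sleep widens the race window so the lost update shows up reliably; the lock shown afterwards is the single-machine fix, and it deliberately sidesteps the genuinely distributed problems (crashes, lost messages, partitions) listed above.

```python
import threading
import time

balance = 1000

def withdraw_unsafe(amount):
    """Read-check-write without coordination: the lost-update race."""
    global balance
    current = balance               # both ATMs read $1,000
    if current >= amount:
        time.sleep(0.05)            # widen the race window for the demo
        balance = current - amount  # both write back $100

threads = [threading.Thread(target=withdraw_unsafe, args=(900,)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(balance)  # 100 on virtually every run: $1,800 dispensed, only $900 deducted

# Fix (single machine): make read-check-write atomic with a lock.
balance = 1000
lock = threading.Lock()

def withdraw_safe(amount):
    global balance
    with lock:                      # only one ATM inside the critical section
        if balance >= amount:
            balance -= amount

threads = [threading.Thread(target=withdraw_safe, args=(900,)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(balance)  # 100, and only one $900 withdrawal was approved
```

With multiple machines there is no shared lock to grab, which is exactly why the protocols above (two-phase commit, Paxos/Raft) exist.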
Setup: Given a massive graph with billions of vertices and edges, find all connected components.
Sequential: BFS/DFS from each unvisited vertex. Time: O(V + E).
Parallel challenge: Multiple BFS traversals interfere—how do we coordinate?
Parallel approaches:
- Label propagation: Each vertex maintains component ID, propagates to neighbors, repeats until convergence
- Edge contraction: Parallel edge contractions (like Karger but deterministic)
- MapReduce: Iterative propagation via map-reduce rounds
The challenge: Graph problems have inherent sequential dependencies. How much parallelism is possible?
Analysis: With p processors, speedup is often an O(log n) factor worse than ideal due to synchronization. Some problems (e.g., matrix multiplication, graph connectivity) have NC (fast parallel) algorithms; others (e.g., circuit value evaluation, lexicographically first DFS) are P-complete and seem inherently sequential.
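Label propagation from the list above can be sketched as follows. In a Pregel or MapReduce setting each round would be one synchronized superstep across machines; this single-process version simply loops over the edges until labels stop changing.

```python
def connected_components(n, edges):
    """Min-label propagation: each vertex repeatedly adopts the smallest
    component ID among itself and its neighbors, until convergence.
    Single-process simulation of what Pregel would run as supersteps.
    """
    label = list(range(n))          # every vertex starts in its own component
    changed = True
    while changed:                  # one pass over the edges = one "round"
        changed = False
        for u, v in edges:
            low = min(label[u], label[v])
            if label[u] != low or label[v] != low:
                label[u] = label[v] = low
                changed = True
    return label

# Two components: {0, 1, 2} and {3, 4}
assert connected_components(5, [(0, 1), (1, 2), (3, 4)]) == [0, 0, 0, 3, 3]
```

The number of rounds is governed by how far labels must travel, roughly the graph diameter, which is one concrete source of the synchronization overhead mentioned in the analysis.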
Parallel thinking in daily life: Understanding parallelism changes how we approach tasks:
Identifying parallelizable work:
- Cooking multiple dishes: What can be prepared simultaneously?
- Project management: Which tasks can team members do in parallel?
- Learning: Can you study multiple subjects concurrently?
Recognizing sequential bottlenecks (Amdahl’s Law): Even if 90% of work parallelizes perfectly, the remaining 10% sequential portion limits speedup to 10×. This explains:
- Why adding more people to late projects makes them later (Brooks’s Law)
- Why some tasks can’t be rushed regardless of resources
- The importance of identifying and minimizing critical path
Coordination overhead: More workers means more coordination cost:
- Large teams spend more time in meetings
- Communication overhead scales with team size
- Sometimes smaller focused teams outperform larger ones
- NC (Nick’s Class): Problems solvable in polylog time with polynomial processors
- P-Complete: Problems believed inherently sequential (probably not in NC)
- Brent’s Theorem: T_p ≤ Work/p + Span under greedy scheduling, bounding how far from ideal speedup any schedule can fall
1. PRAM (Parallel Random Access Machine):
- Idealized model with shared memory
- Variants: EREW, CREW, CRCW (exclusive/concurrent read/write)
- Useful for algorithm design despite unrealistic assumptions
2. Work-Span Model:
- Work: Total operations across all processors
- Span: Length of longest dependency chain (critical path)
- Parallelism: Work/Span (maximum possible speedup)
3. Bulk Synchronous Parallel (BSP):
- Computation proceeds in supersteps
- Each superstep: local computation + communication + barrier
- Models communication cost explicitly
4. MapReduce Model:
- Computation structured as map and reduce phases
- Automatic distribution and fault tolerance
- Practical model for large-scale data processing
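A toy version of the MapReduce model makes the structure concrete: every map call and every per-key reduce call is independent, which is precisely what a real framework (Hadoop, Spark) distributes across machines, along with the shuffle and fault tolerance this sketch omits.

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Toy MapReduce: maps and per-key reduces are independent,
    so a real framework can run them on different machines."""
    # Map phase: each input record independently emits (key, value) pairs.
    intermediate = defaultdict(list)
    for record in inputs:
        for key, value in mapper(record):
            intermediate[key].append(value)   # the "shuffle": group by key
    # Reduce phase: each key's value list is reduced independently.
    return {key: reducer(key, values) for key, values in intermediate.items()}

# Classic word count.
docs = ["to be or not to be", "to parallelize or not"]
counts = map_reduce(
    docs,
    mapper=lambda doc: [(word, 1) for word in doc.split()],
    reducer=lambda word, ones: sum(ones),
)
assert counts["to"] == 3 and counts["or"] == 2
```

Everything outside `mapper` and `reducer` is the framework's job, which is why the model makes distribution and fault tolerance automatic from the programmer's point of view.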
Amdahl’s Law: If fraction s is inherently sequential: Maximum Speedup = 1 / (s + (1-s)/p)
Even with infinite processors (p→∞), speedup is bounded by 1/s.
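Plugging numbers into Amdahl’s Law shows how hard the 1/s ceiling bites: with a 10% sequential fraction, even a million processors cannot reach 10× speedup.

```python
def amdahl_speedup(s, p):
    """Amdahl's Law: speedup with sequential fraction s on p processors."""
    return 1.0 / (s + (1.0 - s) / p)

# 10% sequential work caps speedup at 1/s = 10, no matter how many processors:
for p in (10, 100, 1000, 10**6):
    print(p, round(amdahl_speedup(0.10, p), 2))
# prints 10 5.26, 100 9.17, 1000 9.91, 1000000 10.0 (approaching, never exceeding, 10)
assert amdahl_speedup(0.10, 10**9) < 10
```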
Work-Span Lower Bound: Running time on p processors: T_p ≥ max(Work/p, Span)
You can’t beat Work/p (not enough parallelism) or Span (sequential dependencies).
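This lower bound pairs with Brent’s matching upper bound for greedy schedulers, T_p ≤ Work/p + Span (a standard result, stated here as an addition). The merge-sort work and span figures below are illustrative assumptions, not measurements:

```python
import math

def parallel_time_bounds(work, span, p):
    """T_p is sandwiched between two bounds on p processors:
    lower: T_p >= max(work / p, span)   (too few processors / critical path)
    upper: T_p <= work / p + span       (Brent's theorem, greedy scheduling)
    """
    return max(work / p, span), work / p + span

# Illustrative figures for parallel merge sort: work = n log n, span ~ log^2 n.
n = 1 << 20
work, span = n * math.log2(n), math.log2(n) ** 2
for p in (4, 64, 1024):
    lo, hi = parallel_time_bounds(work, span, p)
    print(f"p={p}: {lo:.0f} <= T_p <= {hi:.0f}")
```

Once Work/p shrinks below Span, adding processors stops helping, which is the quantitative version of "sequential dependencies limit speedup."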
Communication Complexity: For many algorithms, communication dominates computation at scale. Communication-avoiding algorithms minimize data movement.
In asynchronous distributed systems with even one possible failure, no deterministic algorithm can guarantee consensus (Fischer, Lynch, Paterson, 1985).
Implication: Distributed consensus requires either synchrony assumptions, randomization, or eventual consistency.
A distributed system can provide at most two of:
- Consistency: All nodes see the same data
- Availability: System always responds
- Partition tolerance: System functions despite network failures
Real-world impact: This theorem explains why distributed databases make different tradeoffs (strongly consistent SQL systems vs. highly available, eventually consistent NoSQL systems). Since network partitions are unavoidable in practice, the real choice is between consistency and availability while a partition lasts.
Can we reach consensus when some nodes are actively malicious? Yes, but it requires at least 3f+1 nodes to tolerate f Byzantine failures.
Applications: Blockchain consensus, secure distributed systems, spacecraft control systems.
- JáJá “An Introduction to Parallel Algorithms” - Comprehensive PRAM algorithms
- Lynch “Distributed Algorithms” - Definitive distributed systems theory
- Attiya & Welch “Distributed Computing” - Modern treatment
- Lin & Dyer “Data-Intensive Text Processing with MapReduce” - Practical parallel programming
- Herlihy & Shavit “The Art of Multiprocessor Programming” - Shared-memory parallelism
- MIT 6.854J “Advanced Algorithms” - Includes parallel algorithms coverage
- CMU 15-418/15-618 “Parallel Computer Architecture and Programming” - Excellent practical course
- Berkeley CS 267 “Applications of Parallel Computers” - HPC perspective
- Stanford CS 347 “Parallel and Distributed Data Management”
- EPFL Distributed Algorithms MOOC - Free online course
- Dean & Ghemawat “MapReduce” (OSDI 2004) - The foundational paper
- Lamport “Paxos Made Simple” (2001) - Classic consensus algorithm
- Fischer, Lynch, Paterson “Impossibility of Distributed Consensus” (1985) - FLP result
- Valiant “A Bridging Model for Parallel Computation” (1990) - BSP model
- Malewicz et al. “Pregel: A System for Large-Scale Graph Processing” (SIGMOD 2010)
- Apache Spark - Modern successor to Hadoop MapReduce
- Apache Flink - Stream and batch processing
- Dask - Parallel computing in Python
- MPI (Message Passing Interface) - Standard for HPC
- Divide-and-Conquer: Naturally parallelizable structure
- Graph Algorithms: Many graph problems have parallel variants
- Streaming Algorithms: Distributed streaming combines both paradigms
- Randomized Algorithms: Randomization often helps in distributed settings
- Approximation Algorithms: Sometimes relaxing exactness enables better parallelism
- Machine Learning: SGD, data parallelism, model parallelism
- Systems: Operating systems, databases, distributed systems
- Cryptography: Byzantine agreement, blockchain consensus
Parallel Sorting Shootout
- Implement multiple parallel sorting algorithms
- Benchmark on multicore CPU
- Measure speedup vs. sequential versions
- Analyze scalability bottlenecks
MapReduce Framework
- Build simplified MapReduce implementation
- Support word count, PageRank, etc.
- Simulate distributed execution
- Visualize data flow and parallelism
Distributed Consensus Simulator
- Implement Paxos or Raft consensus
- Simulate network delays, failures
- Visualize message passing and state changes
- Test fault tolerance properties
Graph Processing System
- Implement vertex-centric graph processing (Pregel-style)
- Parallelize PageRank, connected components
- Compare to sequential implementations
- Analyze communication patterns
Parallel Machine Learning
- Implement data-parallel SGD
- Train model on multicore CPU or GPU
- Measure speedup vs. sequential training
- Analyze convergence with parallelism
Work-Span Analyzer
- Build tool to analyze parallel algorithms
- Visualize computation DAG
- Calculate work, span, parallelism
- Identify critical path bottlenecks
2-5 lectures depending on depth
- 2 lectures: Parallel algorithms basics, PRAM model, work-span analysis
- 3 lectures: + MapReduce, BSP, practical parallel algorithms
- 4 lectures: + Distributed consensus, coordination, CAP theorem
- 5 lectures: + Advanced topics (Byzantine agreement, communication-avoiding)
“More processors always means faster”
- False! Amdahl’s Law shows sequential bottlenecks limit speedup. Communication overhead can make parallel algorithms slower.
“Parallelizing means dividing work equally”
- Oversimplified! Load balancing, communication patterns, and dependencies matter more than equal division.
“Parallel algorithms are just sequential algorithms run in parallel”
- False! Often requires fundamentally different algorithmic approaches. See sample sort vs. sequential sorting.
“Distributed systems are just slow parallel systems”
- False! Distributed systems face fundamentally different challenges—partial failures, asynchrony, Byzantine faults, network partitions.
“Adding more machines solves all scalability problems”
- False! Communication and coordination costs eventually dominate. Some problems have inherent sequential dependencies.
“Parallel programming is just using threads”
- Oversimplified! Requires algorithmic rethinking, understanding memory models, avoiding race conditions, and managing communication.
“Consensus is easy in distributed systems”
- False! FLP impossibility shows it’s fundamentally hard. Real systems make tradeoffs (synchrony assumptions, eventual consistency).
Traditional (banking model):
“Here’s the PRAM model. Here are the EREW/CREW/CRCW variants. Memorize parallel prefix sum. Here’s work-span analysis. Apply it.”
Problem-posing approach:
“You have 1,000 machines and a billion records to sort. How can you use all those machines effectively? Try splitting the data… but now you need to merge 1,000 sorted lists—that’s expensive! What if we could avoid merging? What if we partition differently using samples? How do we ensure even load balancing? Let’s formalize what ‘speedup’ means…”
The difference: Students confront the challenge of utilizing parallel resources effectively, discover that naive parallelization often fails, and develop principled approaches to parallel algorithm design. The models emerge as tools for reasoning about parallelism, not arbitrary definitions.
Curriculum placement:
- Week 8-12 for graduate courses (after foundational algorithms)
- Often dedicated course on parallel/distributed algorithms
- Can integrate throughout (parallel divide-and-conquer early, distributed late)
Prerequisites check: Students need solid algorithm analysis background. Understanding of systems/architecture helpful but not required.
Why late course: Builds on foundational algorithms while adding complexity of coordination and communication. Requires sophisticated analysis combining algorithm design and performance modeling.
Communication-Avoiding Algorithms: Minimizing data movement is crucial for performance on modern hardware (cache hierarchies, distributed memory). Algorithms optimized for communication complexity can achieve order-of-magnitude speedups.
Learning-Augmented Parallel Algorithms: Using ML predictions to guide scheduling, load balancing, and resource allocation in parallel systems. Combines worst-case guarantees with learned optimizations.
Heterogeneous Parallelism: Modern systems combine CPUs, GPUs, FPGAs, specialized accelerators (TPUs). Algorithms must exploit diverse computational resources with different performance characteristics.
Fault Tolerance at Scale: With millions of processors, failures are common, not exceptional. Algorithms must handle transient faults, stragglers, and permanent failures gracefully.
Quantum Parallelism: Quantum algorithms exploit quantum parallelism—evaluating functions on superpositions of inputs. Fundamentally different from classical parallelism.
Green Computing: Power efficiency is critical. Parallel algorithms optimized for energy consumption, not just speed. Tradeoffs between parallelism and energy usage.
Modern computing is parallel computing:
- Smartphones: Multi-core CPUs, GPUs for graphics and ML
- Web services: Distributed across datacenters globally
- Scientific computing: Supercomputers with millions of cores
- Cloud computing: Elastic parallelism via VMs and containers
- AI/ML: Training requires GPUs and distributed systems
Every significant computational task today involves parallelism. Understanding parallel and distributed algorithms is essential for modern software engineering, data science, and systems building.
Ready for students? ✓ Yes - addresses fundamental shift in computing (end of serial performance scaling), bridges theory and practice, strong motivation from ubiquity of parallel hardware, rigorous foundations with practical applications, prepares students for modern computing reality.
