Parallel & Distributed Algorithms
Algorithms designed to leverage multiple processors or machines working simultaneously. Parallel algorithms exploit multiple cores/processors on a single machine with shared memory, while distributed algorithms coordinate across independent machines communicating via message passing. This field addresses the fundamental question: what problems become easier with more computational resources working together, and what inherent barriers prevent perfect parallelization?
Intermediate to Advanced
- Basic parallel concepts (speedup, efficiency, Amdahl’s Law): Intermediate
- PRAM model and parallel algorithms: Intermediate to Advanced
- Distributed consensus and coordination: Advanced
- MapReduce and bulk synchronous parallel (BSP): Intermediate
- Graph algorithms in distributed settings: Advanced
- Lower bounds and communication complexity: Advanced
The serial era has ended: For decades, computers got faster via increasing clock speeds. That ended around 2005 due to physical limits (heat, power). Now performance comes from parallelism—multiple cores, GPUs, clusters, cloud computing. Modern laptops have 8+ cores. Servers have hundreds. Datacenters have millions.
Every algorithm you write will run on parallel hardware. Understanding parallelism isn’t optional—it’s essential for utilizing modern computing.
War Story - The Human Genome Project: Sequencing the first human genome (completed 2003) took 13 years and $2.7 billion. Today, sequencing a genome takes hours and costs $600. The dominant factor enabling this transformation: massively parallel algorithms processing millions of DNA fragments simultaneously.
The computational pipeline involves:
- Parallel sequence alignment across fragments
- Distributed assembly of genome from overlaps
- Parallel variant calling and analysis
Without parallel algorithms, modern genomics—personalized medicine, cancer research, evolutionary biology—wouldn’t exist at current scales.
War Story - Google’s PageRank: Google’s original breakthrough was PageRank, an eigenvector computation on a graph of billions of web pages. Computing this sequentially would take weeks. Google’s solution: MapReduce, a parallel programming model that automatically distributes computation across thousands of machines.
MapReduce transformed big data processing. Hadoop, Spark, and cloud computing all descend from Google’s parallel computation framework. The insight: structure algorithms around parallel map and reduce operations that can execute independently.
War Story - AlphaGo and Deep Learning: AlphaGo’s victory over the world Go champion required training neural networks on millions of games processed across 50 GPUs and 1,200 CPUs. Training time: weeks. Without parallelism: decades.
Modern deep learning is fundamentally parallel:
- Training parallelizes across data (mini-batches)
- Model parallelism distributes network layers
- GPU acceleration provides 100x speedups
AI’s recent renaissance is largely a story of parallel computing enabling previously infeasible computations.
- Parallel Prefix Sum (Scan) - Fundamental building block for parallel algorithms
- Parallel Sorting (Merge Sort, Sample Sort) - Divide-and-conquer parallelization
- Parallel Matrix Multiplication - Strassen with parallelism
- Parallel Graph BFS/DFS - Level-synchronous BFS
- Work-Span Analysis - Understanding parallelism and critical path
- MapReduce - Google’s parallel programming model
- Distributed Consensus (Paxos, Raft) - Agreement among distributed nodes
- Distributed Shortest Paths (Bellman-Ford parallelization)
- Distributed Graph Algorithms - Connected components, MST
- Gossip Protocols - Randomized information dissemination
- Byzantine Agreement - Consensus with adversarial failures
- Distributed Hash Tables (DHT) - Scalable key-value storage (Chord, Kademlia)
- Pregel/Graph Processing - Vertex-centric distributed graph computation
- Bulk Synchronous Parallel (BSP) Model - Structured parallel computation
- Communication-Avoiding Algorithms - Minimizing data movement
Required:
- Basic algorithms (sorting, graph algorithms)
- Complexity analysis and Big-O notation
- Basic probability (for randomized algorithms)
Helpful:
- Understanding of computer architecture (caches, memory hierarchy)
- Concurrent programming concepts
- Graph theory
- Linear algebra (for matrix algorithms)
Setup: You have 1 billion records to sort and 1,000 machines. Can you sort 1,000× faster?
Naive parallelization: Divide data into 1,000 chunks, sort each chunk independently, then merge. But merging 1,000 sorted sequences is expensive!
Better approach - Sample Sort:
- Sample elements from entire dataset
- Use samples to choose splitters that partition data evenly
- Each machine gets one partition, sorts locally
- Concatenate results (already in global order!)
Analysis:
- Work: O(n log n) total (same as sequential)
- Parallel time: O((n/p) log(n/p) + (n/p) log p) = O((n/p) log n) per processor
- Speedup: Nearly linear for large n!
The insight: Efficient parallel algorithms require rethinking sequential approaches. The bottleneck shifts from computation to communication and coordination.
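The four steps above can be sketched in Python. This is a single-process simulation, not a distributed implementation: the partitions that would live on separate machines are just lists, and the oversampling factor is a common heuristic assumed here, not something specified above.

```python
import bisect
import random

def sample_sort(data, p, oversample=8):
    """Simulate sample sort on p 'machines' (single-process, illustrative).

    In a real deployment each partition would be sorted on its own machine;
    here the local sorts run one after another to show the data flow.
    """
    # Step 1: sample elements from the entire dataset.
    sample = sorted(random.sample(data, min(len(data), p * oversample)))
    # Step 2: pick p-1 splitters that approximately partition the data evenly.
    splitters = [sample[i * len(sample) // p] for i in range(1, p)]
    # Step 3: route each element to its partition ("machine").
    partitions = [[] for _ in range(p)]
    for x in data:
        partitions[bisect.bisect_right(splitters, x)].append(x)
    # Step 4: each machine sorts locally; concatenation is already in global order.
    return [x for part in partitions for x in sorted(part)]

data = [random.randrange(10**6) for _ in range(10000)]
assert sample_sort(data, p=10) == sorted(data)
```

Note that no merge step is needed: because the splitters define disjoint, ordered ranges, concatenating the locally sorted partitions yields the globally sorted result.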
Setup: A bank account starts with $1,000. Two ATMs simultaneously process withdrawals of $900. Both check the balance (each sees $1,000), both approve, and both write back $1,000 - $900 = $100. The account ends at $100, yet $1,800 was dispensed: the second write silently overwrote the first (a lost update).
The challenge: With distributed systems, simple operations become complex:
- How do we ensure consistency?
- How do we coordinate without central authority?
- What if machines crash during operations?
- What if network messages are lost or delayed?
Solutions:
- Two-phase commit: Coordinate via prepare/commit protocol
- Consensus algorithms: Paxos, Raft for fault-tolerant agreement
- Eventual consistency: Relax immediate consistency for availability
The insight: Distributed algorithms must handle asynchrony, failures, and network partitions. The CAP theorem formalizes fundamental tradeoffs: you can’t have Consistency, Availability, and Partition tolerance simultaneously.
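The ATM race can be reproduced with two threads on a single machine. A deliberate sleep widens the race window so the lost update shows up reliably; the lock shown afterwards is the single-machine fix, and it deliberately sidesteps the genuinely distributed problems (crashes, lost messages, partitions) listed above.

```python
import threading
import time

balance = 1000

def withdraw_unsafe(amount):
    """Read-check-write without coordination: the lost-update race."""
    global balance
    current = balance               # both ATMs read $1,000
    if current >= amount:
        time.sleep(0.05)            # widen the race window for the demo
        balance = current - amount  # both write back $100

threads = [threading.Thread(target=withdraw_unsafe, args=(900,)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(balance)  # 100 on virtually every run: $1,800 dispensed, only $900 deducted

# Fix (single machine): make read-check-write atomic with a lock.
balance = 1000
lock = threading.Lock()

def withdraw_safe(amount):
    global balance
    with lock:                      # only one ATM inside the critical section
        if balance >= amount:
            balance -= amount

threads = [threading.Thread(target=withdraw_safe, args=(900,)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(balance)  # 100, and only one $900 withdrawal was approved
```

With multiple machines there is no shared lock to grab, which is exactly why the protocols above (two-phase commit, Paxos/Raft) exist.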
Setup: Given a massive graph with billions of vertices and edges, find all connected components.
Sequential: BFS/DFS from each unvisited vertex. Time: O(V + E).
Parallel challenge: Multiple BFS traversals interfere—how do we coordinate?
Parallel approaches:
- Label propagation: Each vertex maintains component ID, propagates to neighbors, repeats until convergence
- Edge contraction: Parallel edge contractions (like Karger but deterministic)
- MapReduce: Iterative propagation via map-reduce rounds
The challenge: Graph problems have inherent sequential dependencies. How much parallelism is possible?
Analysis: With p processors, speedup is often an O(log n) factor worse than ideal due to synchronization. Some problems (e.g., matrix multiplication, graph connectivity) have NC (fast parallel) algorithms; others (e.g., circuit value evaluation, lexicographically first DFS) are P-complete and seem inherently sequential.
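Label propagation from the list above can be sketched as follows. In a Pregel or MapReduce setting each round would be one synchronized superstep across machines; this single-process version simply loops over the edges until labels stop changing.

```python
def connected_components(n, edges):
    """Min-label propagation: each vertex repeatedly adopts the smallest
    component ID among itself and its neighbors, until convergence.
    Single-process simulation of what Pregel would run as supersteps.
    """
    label = list(range(n))          # every vertex starts in its own component
    changed = True
    while changed:                  # one pass over the edges = one "round"
        changed = False
        for u, v in edges:
            low = min(label[u], label[v])
            if label[u] != low or label[v] != low:
                label[u] = label[v] = low
                changed = True
    return label

# Two components: {0, 1, 2} and {3, 4}
assert connected_components(5, [(0, 1), (1, 2), (3, 4)]) == [0, 0, 0, 3, 3]
```

The number of rounds is governed by how far labels must travel, roughly the graph diameter, which is one concrete source of the synchronization overhead mentioned in the analysis.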
Parallel thinking in daily life: Understanding parallelism changes how we approach tasks:
Identifying parallelizable work:
- Cooking multiple dishes: What can be prepared simultaneously?
- Project management: Which tasks can team members do in parallel?
- Learning: Can you study multiple subjects concurrently?
Recognizing sequential bottlenecks (Amdahl’s Law): Even if 90% of work parallelizes perfectly, the remaining 10% sequential portion limits speedup to 10×. This explains:
- Why adding more people to late projects makes them later (Brooks’s Law)
- Why some tasks can’t be rushed regardless of resources
- The importance of identifying and minimizing critical path
Coordination overhead: More workers means more coordination cost:
- Large teams spend more time in meetings
- Communication overhead scales with team size
- Sometimes smaller focused teams outperform larger ones
- NC (Nick’s Class): Problems solvable in polylog time with polynomial processors
- P-Complete: Problems believed inherently sequential (probably not in NC)
- Brent’s Theorem: T_p ≤ Work/p + Span under greedy scheduling, bounding how far from ideal speedup any schedule can fall
1. PRAM (Parallel Random Access Machine):
- Idealized model with shared memory
- Variants: EREW, CREW, CRCW (exclusive/concurrent read/write)
- Useful for algorithm design despite unrealistic assumptions
2. Work-Span Model:
- Work: Total operations across all processors
- Span: Length of longest dependency chain (critical path)
- Parallelism: Work/Span (maximum possible speedup)
3. Bulk Synchronous Parallel (BSP):
- Computation proceeds in supersteps
- Each superstep: local computation + communication + barrier
- Models communication cost explicitly
4. MapReduce Model:
- Computation structured as map and reduce phases
- Automatic distribution and fault tolerance
- Practical model for large-scale data processing
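A toy version of the MapReduce model makes the structure concrete: every map call and every per-key reduce call is independent, which is precisely what a real framework (Hadoop, Spark) distributes across machines, along with the shuffle and fault tolerance this sketch omits.

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Toy MapReduce: maps and per-key reduces are independent,
    so a real framework can run them on different machines."""
    # Map phase: each input record independently emits (key, value) pairs.
    intermediate = defaultdict(list)
    for record in inputs:
        for key, value in mapper(record):
            intermediate[key].append(value)   # the "shuffle": group by key
    # Reduce phase: each key's value list is reduced independently.
    return {key: reducer(key, values) for key, values in intermediate.items()}

# Classic word count.
docs = ["to be or not to be", "to parallelize or not"]
counts = map_reduce(
    docs,
    mapper=lambda doc: [(word, 1) for word in doc.split()],
    reducer=lambda word, ones: sum(ones),
)
assert counts["to"] == 3 and counts["or"] == 2
```

Everything outside `mapper` and `reducer` is the framework's job, which is why the model makes distribution and fault tolerance automatic from the programmer's point of view.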
Amdahl’s Law: If fraction s is inherently sequential: Maximum Speedup = 1 / (s + (1-s)/p)
Even with infinite processors (p→∞), speedup is bounded by 1/s.
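Plugging numbers into Amdahl’s Law shows how hard the 1/s ceiling bites: with a 10% sequential fraction, even a million processors cannot reach 10× speedup.

```python
def amdahl_speedup(s, p):
    """Amdahl's Law: speedup with sequential fraction s on p processors."""
    return 1.0 / (s + (1.0 - s) / p)

# 10% sequential work caps speedup at 1/s = 10, no matter how many processors:
for p in (10, 100, 1000, 10**6):
    print(p, round(amdahl_speedup(0.10, p), 2))
# prints 10 5.26, 100 9.17, 1000 9.91, 1000000 10.0 (approaching, never exceeding, 10)
assert amdahl_speedup(0.10, 10**9) < 10
```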
Work-Span Lower Bound: Running time on p processors: T_p ≥ max(Work/p, Span)
You can’t beat Work/p (not enough parallelism) or Span (sequential dependencies).
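This lower bound pairs with Brent’s matching upper bound for greedy schedulers, T_p ≤ Work/p + Span (a standard result, stated here as an addition). The merge-sort work and span figures below are illustrative assumptions, not measurements:

```python
import math

def parallel_time_bounds(work, span, p):
    """T_p is sandwiched between two bounds on p processors:
    lower: T_p >= max(work / p, span)   (too few processors / critical path)
    upper: T_p <= work / p + span       (Brent's theorem, greedy scheduling)
    """
    return max(work / p, span), work / p + span

# Illustrative figures for parallel merge sort: work = n log n, span ~ log^2 n.
n = 1 << 20
work, span = n * math.log2(n), math.log2(n) ** 2
for p in (4, 64, 1024):
    lo, hi = parallel_time_bounds(work, span, p)
    print(f"p={p}: {lo:.0f} <= T_p <= {hi:.0f}")
```

Once Work/p shrinks below Span, adding processors stops helping, which is the quantitative version of "sequential dependencies limit speedup."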
Communication Complexity: For many algorithms, communication dominates computation at scale. Communication-avoiding algorithms minimize data movement.
In asynchronous distributed systems with even one possible failure, no deterministic algorithm can guarantee consensus (Fischer, Lynch, Paterson, 1985).
Implication: Distributed consensus requires either synchrony assumptions, randomization, or eventual consistency.
A distributed system can provide at most two of:
- Consistency: All nodes see the same data
- Availability: System always responds
- Partition tolerance: System functions despite network failures
Real-world impact: This theorem explains why distributed databases make different tradeoffs (strongly consistent SQL systems vs. highly available, eventually consistent NoSQL systems). Since network partitions are unavoidable in practice, the real choice is between consistency and availability while a partition lasts.
Can we reach consensus when some nodes are actively malicious? Yes, but it requires at least 3f+1 nodes to tolerate f Byzantine failures.
Applications: Blockchain consensus, secure distributed systems, spacecraft control systems.
- JáJá “An Introduction to Parallel Algorithms” - Comprehensive PRAM algorithms
- Lynch “Distributed Algorithms” - Definitive distributed systems theory
- Attiya & Welch “Distributed Computing” - Modern treatment
- Lin & Dyer “Data-Intensive Text Processing with MapReduce” - Practical parallel programming
- Herlihy & Shavit “The Art of Multiprocessor Programming” - Shared-memory parallelism
- MIT 6.854J “Advanced Algorithms” - Includes parallel algorithms coverage
- CMU 15-418/15-618 “Parallel Computer Architecture and Programming” - Excellent practical course
- Berkeley CS 267 “Applications of Parallel Computers” - HPC perspective
- Stanford CS 347 “Parallel and Distributed Data Management”
- EPFL Distributed Algorithms MOOC - Free online course
- Dean & Ghemawat “MapReduce” (OSDI 2004) - The foundational paper
- Lamport “Paxos Made Simple” (2001) - Classic consensus algorithm
- Fischer, Lynch, Paterson “Impossibility of Distributed Consensus” (1985) - FLP result
- Valiant “A Bridging Model for Parallel Computation” (1990) - BSP model
- Malewicz et al. “Pregel: A System for Large-Scale Graph Processing” (SIGMOD 2010)
- Apache Spark - Modern successor to Hadoop MapReduce
- Apache Flink - Stream and batch processing
- Dask - Parallel computing in Python
- MPI (Message Passing Interface) - Standard for HPC
- Divide-and-Conquer: Naturally parallelizable structure
- Graph Algorithms: Many graph problems have parallel variants
- Streaming Algorithms: Distributed streaming combines both paradigms
- Randomized Algorithms: Randomization often helps in distributed settings
- Approximation Algorithms: Sometimes relaxing exactness enables better parallelism
- Machine Learning: SGD, data parallelism, model parallelism
- Systems: Operating systems, databases, distributed systems
- Cryptography: Byzantine agreement, blockchain consensus
Parallel Sorting Shootout
- Implement multiple parallel sorting algorithms
- Benchmark on multicore CPU
- Measure speedup vs. sequential versions
- Analyze scalability bottlenecks
MapReduce Framework
- Build simplified MapReduce implementation
- Support word count, PageRank, etc.
- Simulate distributed execution
- Visualize data flow and parallelism
Distributed Consensus Simulator
- Implement Paxos or Raft consensus
- Simulate network delays, failures
- Visualize message passing and state changes
- Test fault tolerance properties
Graph Processing System
- Implement vertex-centric graph processing (Pregel-style)
- Parallelize PageRank, connected components
- Compare to sequential implementations
- Analyze communication patterns
Parallel Machine Learning
- Implement data-parallel SGD
- Train model on multicore CPU or GPU
- Measure speedup vs. sequential training
- Analyze convergence with parallelism
Work-Span Analyzer
- Build tool to analyze parallel algorithms
- Visualize computation DAG
- Calculate work, span, parallelism
- Identify critical path bottlenecks
2-5 lectures depending on depth
- 2 lectures: Parallel algorithms basics, PRAM model, work-span analysis
- 3 lectures: + MapReduce, BSP, practical parallel algorithms
- 4 lectures: + Distributed consensus, coordination, CAP theorem
- 5 lectures: + Advanced topics (Byzantine agreement, communication-avoiding)
“More processors always means faster”
- False! Amdahl’s Law shows sequential bottlenecks limit speedup. Communication overhead can make parallel algorithms slower.
“Parallelizing means dividing work equally”
- Oversimplified! Load balancing, communication patterns, and dependencies matter more than equal division.
“Parallel algorithms are just sequential algorithms run in parallel”
- False! Often requires fundamentally different algorithmic approaches. See sample sort vs. sequential sorting.
“Distributed systems are just slow parallel systems”
- False! Distributed systems face fundamentally different challenges—partial failures, asynchrony, Byzantine faults, network partitions.
“Adding more machines solves all scalability problems”
- False! Communication and coordination costs eventually dominate. Some problems have inherent sequential dependencies.
“Parallel programming is just using threads”
- Oversimplified! Requires algorithmic rethinking, understanding memory models, avoiding race conditions, and managing communication.
“Consensus is easy in distributed systems”
- False! FLP impossibility shows it’s fundamentally hard. Real systems make tradeoffs (synchrony assumptions, eventual consistency).
Traditional (banking model):
“Here’s the PRAM model. Here are the EREW/CREW/CRCW variants. Memorize parallel prefix sum. Here’s work-span analysis. Apply it.”
Problem-posing approach:
“You have 1,000 machines and a billion records to sort. How can you use all those machines effectively? Try splitting the data… but now you need to merge 1,000 sorted lists—that’s expensive! What if we could avoid merging? What if we partition differently using samples? How do we ensure even load balancing? Let’s formalize what ‘speedup’ means…”
The difference: Students confront the challenge of utilizing parallel resources effectively, discover that naive parallelization often fails, and develop principled approaches to parallel algorithm design. The models emerge as tools for reasoning about parallelism, not arbitrary definitions.
Curriculum placement:
- Week 8-12 for graduate courses (after foundational algorithms)
- Often dedicated course on parallel/distributed algorithms
- Can integrate throughout (parallel divide-and-conquer early, distributed late)
Prerequisites check: Students need solid algorithm analysis background. Understanding of systems/architecture helpful but not required.
Why late course: Builds on foundational algorithms while adding complexity of coordination and communication. Requires sophisticated analysis combining algorithm design and performance modeling.
Communication-Avoiding Algorithms: Minimizing data movement is crucial for performance on modern hardware (cache hierarchies, distributed memory). Algorithms optimized for communication complexity can achieve order-of-magnitude speedups.
Learning-Augmented Parallel Algorithms: Using ML predictions to guide scheduling, load balancing, and resource allocation in parallel systems. Combines worst-case guarantees with learned optimizations.
Heterogeneous Parallelism: Modern systems combine CPUs, GPUs, FPGAs, specialized accelerators (TPUs). Algorithms must exploit diverse computational resources with different performance characteristics.
Fault Tolerance at Scale: With millions of processors, failures are common, not exceptional. Algorithms must handle transient faults, stragglers, and permanent failures gracefully.
Quantum Parallelism: Quantum algorithms exploit quantum parallelism—evaluating functions on superpositions of inputs. Fundamentally different from classical parallelism.
Green Computing: Power efficiency is critical. Parallel algorithms optimized for energy consumption, not just speed. Tradeoffs between parallelism and energy usage.
Modern computing is parallel computing:
- Smartphones: Multi-core CPUs, GPUs for graphics and ML
- Web services: Distributed across datacenters globally
- Scientific computing: Supercomputers with millions of cores
- Cloud computing: Elastic parallelism via VMs and containers
- AI/ML: Training requires GPUs and distributed systems
Every significant computational task today involves parallelism. Understanding parallel and distributed algorithms is essential for modern software engineering, data science, and systems building.
Ready for students? ✓ Yes - addresses fundamental shift in computing (end of serial performance scaling), bridges theory and practice, strong motivation from ubiquity of parallel hardware, rigorous foundations with practical applications, prepares students for modern computing reality.
