This project implements a scalable user feed system that processes user activities in real time while ensuring exactly-once semantics and high availability. The system handles user interactions such as follows, posts, comments, and likes through a multi-stage data pipeline.
Key Technical Components:
- Source System (PostgreSQL)
  - Primary database storing user activities and interactions
  - Configured with logical replication (`wal_level=logical`) to enable Change Data Capture
  - Handles transactional writes for user activities through REST API endpoints
  - Tables are configured with primary keys and appropriate indexes
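The PostgreSQL prerequisites above can be sketched as follows. This is an illustrative setup, not taken from this repo: the database name `feeds`, the table `user_activities`, and its columns are assumptions.

```shell
# Enable logical replication so Debezium can read the WAL
# (requires a PostgreSQL restart to take effect).
psql -U postgres -d feeds -c "ALTER SYSTEM SET wal_level = logical;"

# Example activity table: a primary key is required so change events
# can identify rows; the index supports per-user time-ordered reads.
psql -U postgres -d feeds <<'SQL'
CREATE TABLE IF NOT EXISTS user_activities (
    id          BIGSERIAL PRIMARY KEY,
    user_id     BIGINT      NOT NULL,
    action      TEXT        NOT NULL,   -- e.g. FOLLOW / COMMENT / LIKE
    target_id   BIGINT,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX IF NOT EXISTS idx_activities_user
    ON user_activities (user_id, created_at);
SQL
```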
- Change Data Capture (Debezium)
  - Monitors the PostgreSQL Write-Ahead Log (WAL) for DML changes
  - Creates change events in Avro format with before/after states
  - Ensures reliable capture of all data modifications
  - Tracks offsets so no change is lost (delivery is at-least-once; exactly-once semantics are enforced downstream)
  - Publishes changes to dedicated Kafka topics
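A connector along these lines would wire Debezium to the source database. The connector name, credentials, and table list are placeholders, not taken from this repo; Avro serialization additionally requires Avro converters and a schema registry in the connector config.

```shell
# Register a Debezium PostgreSQL connector with Kafka Connect.
# Topic names will follow Debezium's <topic.prefix>.<schema>.<table> convention.
curl -s -X POST http://localhost:8083/connectors \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "feed-activities-connector",
    "config": {
      "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
      "database.hostname": "postgres",
      "database.port": "5432",
      "database.user": "postgres",
      "database.password": "postgres",
      "database.dbname": "feeds",
      "topic.prefix": "feeds",
      "table.include.list": "public.user_activities",
      "plugin.name": "pgoutput"
    }
  }'
```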
- Message Queue (Apache Kafka)
  - Provides durable storage of change events
  - Maintains strict ordering within partitions
  - Enables parallel processing through multiple partitions
  - Handles back-pressure and provides buffer capacity
  - Topics configured with appropriate retention and replication
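Topic settings like the above might look as follows; the topic name, partition count, replication factor, and 7-day retention are illustrative values, not taken from this repo.

```shell
# Create a change-event topic with explicit retention and replication.
# Partitions bound the downstream consumer parallelism; replication
# lets the topic survive a broker failure.
kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic feeds.public.user_activities \
  --partitions 6 \
  --replication-factor 3 \
  --config retention.ms=604800000
```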
- Stream Processing (Apache Flink)
  - Stateful stream processing with exactly-once guarantees
  - Uses checkpointing to handle failures and state recovery
  - Processes records continuously, buffering network transfers for throughput
  - Aggregates activities by user for feed generation
  - Maintains watermarks for handling out-of-order events
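One way to sketch the aggregation step is through the Flink SQL client; the table name, window size, and `event_time` column are assumptions, and the table would first need to be declared against the Kafka topic. Checkpointing is what provides the exactly-once guarantee.

```shell
# Sketch of a per-user windowed aggregation in Flink SQL.
./bin/sql-client.sh <<'SQL'
SET 'execution.checkpointing.interval' = '10 s';
SET 'execution.checkpointing.mode' = 'EXACTLY_ONCE';

-- Count activities per user in 1-minute event-time windows;
-- watermarks on event_time handle out-of-order events.
SELECT user_id,
       TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
       COUNT(*) AS activity_count
FROM user_activities
GROUP BY user_id, TUMBLE(event_time, INTERVAL '1' MINUTE);
SQL
```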
- Cache Layer (Redis)
  - In-memory caching of frequently accessed feeds
  - Implements sliding-window cache invalidation
  - Provides sub-millisecond read latency for hot data
  - Uses sorted sets for feed pagination
  - On a cache miss, the application falls back to Cassandra and repopulates the cache
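The sorted-set pagination scheme can be sketched with `redis-cli`; the key naming (`feed:<user_id>`), member format, and timestamp scores are assumptions.

```shell
# Write feed items into a per-user sorted set, scored by event timestamp.
redis-cli ZADD feed:42 1700000100 "post:9001"
redis-cli ZADD feed:42 1700000200 "post:9002"
redis-cli ZADD feed:42 1700000300 "post:9003"

# Paginate newest-first: ZREVRANGE returns members by descending score.
redis-cli ZREVRANGE feed:42 0 1   # page 1: two most recent items
redis-cli ZREVRANGE feed:42 2 3   # page 2: the next two
```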
- Storage Layer (Apache Cassandra)
  - Log-Structured Merge-tree (LSM) based storage
  - Optimized for high-throughput writes
  - Partitioned by user_id for efficient reads
  - Stores feed items in time-sorted order
  - Supports efficient pagination queries
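An illustrative schema for the storage layer (keyspace, table, and column names are assumptions, not taken from this repo): partitioning by `user_id` keeps one user's feed on one partition, and clustering newest-first makes pagination a sequential read.

```shell
# Create a feed table partitioned by user, clustered by time descending.
cqlsh -e "
CREATE KEYSPACE IF NOT EXISTS feeds
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE IF NOT EXISTS feeds.user_feed (
  user_id    bigint,
  created_at timestamp,
  item_id    bigint,
  payload    text,
  PRIMARY KEY ((user_id), created_at, item_id)
) WITH CLUSTERING ORDER BY (created_at DESC, item_id DESC);
"
```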
Data Flow:
1. User performs an activity (follow/post/comment/like) via the REST API
2. The activity is recorded in PostgreSQL tables
3. Debezium captures change events from the PostgreSQL WAL
4. Events are published to the corresponding Kafka topics
5. Flink consumes the events with exactly-once guarantees
6. Stream processors aggregate activities into user feeds
7. Aggregated feeds are written to Cassandra
8. Redis caches frequently accessed feed segments
9. The feed API serves requests from the cache, falling back to Cassandra
The architecture ensures:
- Exactly-once processing semantics
- Horizontal scalability at each layer
- High write throughput for activity ingestion
- Low latency reads for feed serving
- Fault tolerance and high availability
- Efficient pagination support
Supported activity types (event type constants in quotes):
- User follows another user ("FOLLOW")
- User creates a new post ("SHARD")
- User comments on a post ("COMMENT")
- User likes a post ("LIKE")
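An activity request might look like the following; the endpoint path, port, and payload fields are hypothetical, not taken from this repo.

```shell
# Record a follow activity via the REST API (request shape is illustrative).
curl -s -X POST http://localhost:8000/activities \
  -H 'Content-Type: application/json' \
  -d '{"user_id": 42, "action": "FOLLOW", "target_id": 7}'
```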
Start the infrastructure with Docker Compose:

```bash
#!/bin/bash
docker-compose down -v   # stop containers and remove volumes for a clean slate
docker-compose pull      # fetch the latest images
docker-compose up -d     # start all services in the background
docker-compose ps        # verify every container is running
```
Run the development server locally:

```bash
# Create a new virtual environment
python3 -m venv venv

# Activate that environment
source venv/bin/activate

# Run the development server
uvicorn main:app --reload
```
Test the setup using the provided shell script:

```bash
chmod +x test_setup.sh
./test_setup.sh
```