Day 6: Building a Distributed Log Query Engine with Real-Time Processing
- What We’re Building Today
- Why This Matters: From Local Files to Planet-Scale Systems
- System Design Deep Dive: Five Core Patterns for Distributed Query Processing
- Implementation Walkthrough: Building Production-Grade Components
- Production Considerations: Performance and Reliability
- Scale Connection: Enterprise Production Patterns
- Working Code Demo:
- Next Steps: Tomorrow’s Advanced Concepts
- Building Your System: Step-by-Step Implementation
- Prerequisites
- Step 1: Generate the Project Files
- Step 2: Start the Infrastructure Services
- Step 3: Build the Application Services
- Step 4: Start the Application Services
- Step 5: Verify System Health
- Step 6: Your First Log Event
- Step 7: Query Your Logs
- Step 8: Send Multiple Logs (Batch Processing)
- Step 9: View System Statistics
- Step 10: Run the Load Test
- Step 11: Monitor Performance with Prometheus
- Step 12: Create Dashboards in Grafana
- Step 13: Run Integration Tests
- Understanding What You Built
- Key Concepts You’ve Learned
- Troubleshooting Common Issues
- Experimenting Further
- What’s Next
- Cleaning Up
What We’re Building Today
Today we’re transforming our basic log storage system into a production-grade distributed query engine that handles real-time log processing at scale. You’ll implement:
- Distributed Log Query API: RESTful service with advanced filtering, aggregation, and real-time search capabilities
- Event-Driven Processing Pipeline: Kafka-based system processing 10K+ logs/second with guaranteed delivery
- Intelligent Caching Layer: Redis-powered query optimization reducing response times from 2s to 50ms
- Production Monitoring Stack: Complete observability with Prometheus metrics, Grafana dashboards, and distributed tracing
Why This Matters: From Local Files to Planet-Scale Systems
The jump from flat file storage to distributed query processing is one of the most critical scaling challenges in modern systems. What starts as a simple grep command on a single server evolves into a complex distributed architecture when you need to search petabytes of logs across thousands of machines in real time.
Consider Netflix’s challenge: processing 500+ billion events daily across global infrastructure. Their logs aren’t just stored—they’re actively queried by fraud detection systems, recommendation engines, and operational dashboards. A single query might need to aggregate data across 50 different microservices, each running in multiple regions. The system you’re building today demonstrates the foundational patterns that make this scale possible.
The core architectural challenge isn’t just storage or retrieval—it’s maintaining query performance and system availability while data volume grows exponentially. Today’s implementation introduces you to the distributed systems trade-offs that separate senior engineers from those still thinking in single-machine terms.
System Design Deep Dive: Five Core Patterns for Distributed Query Processing
1. Event Sourcing with Command Query Responsibility Segregation (CQRS)
Traditional systems couple write and read operations, creating bottlenecks when query patterns differ from write patterns. Our log processing system implements CQRS by separating:
- Write Side: High-throughput log ingestion via Kafka with minimal validation
- Read Side: Optimized query structures with pre-computed aggregations and indexes
This separation allows us to scale reads and writes independently. When Netflix needs to process millions of viewing events while simultaneously running complex analytics queries, CQRS prevents read-heavy analytics from slowing down real-time event ingestion.
Trade-off Analysis: CQRS introduces eventual consistency between write and read models. Your query results might be 100-500ms behind real-time events, but you gain the ability to handle 10x more concurrent queries. For log processing, this trade-off is almost always worth it—operational dashboards can tolerate slight delays if it means avoiding system overload during traffic spikes.
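As a rough illustration of the two models, the sketch below contrasts an immutable write-side event with a read-side projection the consumer keeps up to date. The names LogEvent and LogLevelHourlyCount are illustrative assumptions, not the exact classes in the generated project.
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.Id;
import java.time.Instant;
// Write model: the raw event as it arrives, appended to Kafka and never updated in place.
record LogEvent(String message, String level, String source, Instant timestamp) {}
// Read model: a pre-computed view so dashboards never have to scan raw events.
@Entity
class LogLevelHourlyCount {
    @Id @GeneratedValue
    private Long id;
    private String level;
    private Instant hourBucket;
    private long count;   // incremented by the consumer, read directly by the stats endpoint
    // getters and setters omitted for brevity
}
The key point is that the consumer, not the query path, pays the cost of keeping the read model current.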
2. Distributed Caching with Cache-Aside Pattern
Raw log queries against PostgreSQL become prohibitively expensive beyond 1M records. Our Redis implementation uses the cache-aside pattern where:
- Application code manages cache population and invalidation
- Cache misses trigger database queries with automatic cache warming
- TTL policies balance data freshness with performance
Key Insight: Cache hit ratios above 95% are achievable for log queries because of temporal locality—recent logs are queried far more frequently than historical data. However, cache invalidation becomes complex when implementing real-time updates.
Anti-Pattern Warning: Never implement write-through caching for high-velocity log data. The cache invalidation overhead will negate performance benefits and create consistency nightmares during failure scenarios.
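In Spring terms, the whole pattern fits in one service method. This is a minimal sketch assuming Spring Data Redis; LogSearchRepository, the cache key, and the 5-minute TTL are illustrative stand-ins for the real query path.
import java.time.Duration;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;
@Service
public class CachedLogQueryService {
    private final StringRedisTemplate redis;
    private final LogSearchRepository repository;   // hypothetical database-backed search
    public CachedLogQueryService(StringRedisTemplate redis, LogSearchRepository repository) {
        this.redis = redis;
        this.repository = repository;
    }
    public String searchAsJson(String cacheKey, String logLevel) {
        // 1. Try the cache first.
        String cached = redis.opsForValue().get(cacheKey);
        if (cached != null) {
            return cached;                          // cache hit
        }
        // 2. Cache miss: fall back to the database.
        String result = repository.searchAsJson(logLevel);
        // 3. Warm the cache for the next caller; the TTL keeps results reasonably fresh.
        redis.opsForValue().set(cacheKey, result, Duration.ofMinutes(5));
        return result;
    }
}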
3. Circuit Breaker Pattern for Fault Isolation
Distributed systems fail in cascade patterns—one slow component triggers timeouts across the entire system. Our Resilience4j implementation provides:
- Automatic failure detection with configurable thresholds
- Fast-fail responses preventing resource exhaustion
- Gradual recovery with half-open state testing
When Uber’s map matching service experiences high latency, circuit breakers prevent the delay from propagating to rider assignment and pricing services. The same pattern protects our log query system from database overload scenarios.
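With the Resilience4j Spring Boot starter, wrapping the database call in a breaker is mostly an annotation. The sketch below is illustrative: the breaker name logDatabase, LogSearchRepository, and LogEntry are assumed names, and the thresholds (failure rate, wait duration in open state, half-open call count) live in configuration rather than code.
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import java.util.List;
import org.springframework.stereotype.Service;
@Service
public class ResilientLogQueryService {
    private final LogSearchRepository repository;   // hypothetical database-backed search
    public ResilientLogQueryService(LogSearchRepository repository) {
        this.repository = repository;
    }
    // "logDatabase" refers to a breaker instance configured elsewhere.
    @CircuitBreaker(name = "logDatabase", fallbackMethod = "serveFromCache")
    public List<LogEntry> queryDatabase(String logLevel) {
        return repository.findRecentByLevel(logLevel);
    }
    // Invoked when the breaker is open or the call fails: fast-fail with cached/empty data
    // instead of letting timeouts pile up.
    public List<LogEntry> serveFromCache(String logLevel, Throwable cause) {
        return List.of();
    }
}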
4. Asynchronous Processing with Guaranteed Delivery
Kafka’s producer-consumer model ensures log processing continues even when downstream services are temporarily unavailable. Our implementation includes:
- At-least-once delivery with producer retries and consumer acknowledgments
- Dead letter queues for handling poison messages
- Consumer group management for horizontal scaling
Scalability Implication: This pattern allows you to decouple ingestion rate from processing rate. During traffic spikes, Kafka acts as a buffer, preventing log loss while consumers catch up.
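On the producer side, the delivery guarantee comes down to a handful of settings. Here is a hedged sketch of a spring-kafka producer factory with the reliability knobs turned on; the bootstrap address matches the local Docker setup, and everything else is standard Kafka producer configuration rather than the project’s exact wiring.
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.DefaultKafkaProducerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.core.ProducerFactory;
@Configuration
public class ReliableProducerConfig {
    @Bean
    public ProducerFactory<String, String> producerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        // At-least-once delivery settings: wait for all in-sync replicas, retry without duplicating.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        return new DefaultKafkaProducerFactory<>(props);
    }
    @Bean
    public KafkaTemplate<String, String> kafkaTemplate(ProducerFactory<String, String> pf) {
        return new KafkaTemplate<>(pf);
    }
}
On the consumer side, spring-kafka’s DefaultErrorHandler combined with a DeadLetterPublishingRecoverer is one common way to route poison messages to a dead letter topic once retries are exhausted.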
5. Distributed Observability with Correlation IDs
Production distributed systems require comprehensive observability. Our implementation includes:
- Distributed tracing with correlation IDs propagated across service boundaries
- Structured logging with consistent metadata for aggregation
- Custom metrics measuring business logic, not just infrastructure
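The correlation-ID piece can be as small as a single servlet filter. The sketch below assumes Spring Boot 3 (jakarta.servlet) and an SLF4J/MDC-aware log encoder; the header name and MDC key are illustrative choices, not the project’s fixed contract.
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.util.UUID;
import org.slf4j.MDC;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;
@Component
public class CorrelationIdFilter extends OncePerRequestFilter {
    private static final String HEADER = "X-Correlation-ID";
    @Override
    protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response,
                                    FilterChain chain) throws ServletException, IOException {
        // Reuse the caller's ID so the trace spans service boundaries; otherwise start one at the edge.
        String correlationId = request.getHeader(HEADER);
        if (correlationId == null || correlationId.isBlank()) {
            correlationId = UUID.randomUUID().toString();
        }
        MDC.put("correlationId", correlationId);      // picked up by the structured log encoder
        response.setHeader(HEADER, correlationId);    // echo it back so clients can report it
        try {
            chain.doFilter(request, response);
        } finally {
            MDC.remove("correlationId");              // avoid leaking IDs across pooled threads
        }
    }
}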
Implementation Walkthrough: Building Production-Grade Components
Core Architecture Decisions
Our system implements a three-service architecture optimized for independent scaling:
Log Producer Service: Handles high-velocity ingestion with minimal processing overhead. Critical design decision: validation occurs asynchronously to maintain ingestion throughput. Invalid logs are routed to dead letter queues rather than blocking the main pipeline.
Log Consumer Service: Processes Kafka events and maintains read-optimized data structures. Key optimization: batch processing with configurable flush intervals balances latency with throughput. During high load, larger batches improve database write efficiency.
API Gateway Service: Provides query interface with intelligent routing and caching. Implements query complexity analysis to prevent expensive operations from overwhelming the system.
Database Schema Optimization
Traditional log tables become unusable beyond 10M records without proper indexing strategy:
-- Composite index optimized for time-range queries
CREATE INDEX idx_logs_timestamp_level ON logs(timestamp DESC, log_level);
-- Partial index for error-only queries (common pattern)
CREATE INDEX idx_logs_errors ON logs(timestamp DESC) WHERE log_level = 'ERROR';
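On the application side, the query shape should match the index shape. A Spring Data repository method like the sketch below (LogEntry and its field names are assumptions) generates SQL that filters on a timestamp range plus log level, which is exactly what the composite index is built to serve.
import java.time.Instant;
import org.springframework.data.domain.Page;
import org.springframework.data.domain.Pageable;
import org.springframework.data.jpa.repository.JpaRepository;
public interface LogEntryRepository extends JpaRepository<LogEntry, Long> {
    // Generated SQL: WHERE log_level = ? AND timestamp BETWEEN ? AND ?, paged and sortable,
    // which the composite (timestamp DESC, log_level) index can satisfy without a full scan.
    Page<LogEntry> findByLogLevelAndTimestampBetween(
            String logLevel, Instant from, Instant to, Pageable pageable);
}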
Caching Strategy Implementation
Redis serves three distinct caching layers:
- Query Result Caching: Exact query matches with 5-minute TTL
- Aggregation Caching: Pre-computed hourly/daily summaries with 1-hour TTL
- Hot Data Caching: Recent logs (last 15 minutes) with 30-second TTL
Cache key design prevents collisions while enabling pattern-based invalidation during data updates.
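One way to get collision-free keys that still support pattern-based invalidation is to give each layer its own prefix and normalize the query parameters into the key. The helper below is an illustrative sketch; the prefixes and TTL layering mirror the three layers listed above rather than the project’s exact naming.
import java.time.Instant;
import java.time.temporal.ChronoUnit;
public final class CacheKeys {
    private CacheKeys() {}
    // Layer 1: exact query results, keyed by normalized parameters (5-minute TTL).
    public static String queryKey(String level, String source, String keyword, int page, int size) {
        return String.join(":", "logs", "query",
                nullSafe(level), nullSafe(source), nullSafe(keyword),
                String.valueOf(page), String.valueOf(size));
    }
    // Layer 2: pre-computed aggregations, keyed by the hour bucket (1-hour TTL).
    public static String hourlyStatsKey(Instant hour) {
        return "logs:stats:hourly:" + hour.truncatedTo(ChronoUnit.HOURS).toEpochMilli();
    }
    // Layer 3: hot data, the most recent window (30-second TTL).
    public static String hotWindowKey() {
        return "logs:hot:last15m";
    }
    private static String nullSafe(String value) {
        return value == null || value.isBlank() ? "_" : value.toLowerCase();
    }
}
Because every query-result key shares the logs:query: prefix, a bulk update can invalidate that layer by scanning and deleting logs:query:* without touching the aggregation or hot-data layers.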
Production Considerations: Performance and Reliability
Performance Benchmarks
Under normal load (1,000 logs/second), the system maintains:
- Query response times under 100ms (95th percentile)
- Cache hit ratios above 94%
- End-to-end processing latency under 200ms
During spike conditions (10,000 logs/second), performance gracefully degrades:
- Query response times increase to 300ms (acceptable for operational dashboards)
- Circuit breakers prevent cascade failures
- Kafka buffering maintains zero data loss
Failure Scenarios and Recovery
Database Connection Loss: Circuit breakers activate after 3 consecutive failures, routing all queries to cached data. System continues serving recent logs while database recovers.
Kafka Cluster Outage: Producer buffering and retry logic maintain data integrity. Consumer lag is measured and alerts trigger when processing falls behind ingestion.
Redis Cluster Failure: System automatically falls back to direct database queries with degraded performance. Cache warming begins immediately upon Redis recovery.
Scale Connection: Enterprise Production Patterns
This architecture mirrors production systems at scale. Airbnb’s log processing system uses identical patterns to handle 2M+ events per minute across their booking, pricing, and fraud detection systems. The key scaling strategies you’ve implemented today—CQRS separation, distributed caching, and asynchronous processing—remain constant as you move from thousands to millions of events per second.
The primary difference at massive scale is data partitioning strategy and cross-region replication, concepts we’ll explore in Week 3 when building geographically distributed systems.
Working Code Demo:
Next Steps: Tomorrow’s Advanced Concepts
Day 7 focuses on service mesh integration and advanced routing patterns. You’ll implement intelligent load balancing, A/B testing infrastructure, and cross-service authentication—the missing pieces for production deployment at enterprise scale.
Building Your System: Step-by-Step Implementation
GitHub Link:
https://github.com/sysdr/sdc-java/tree/main/day6/distributed-log-processor
Prerequisites
Before starting, make sure you have these tools installed on your computer:
- Java 17 or newer - The programming language we’re using
- Maven 3.6 or newer - Builds our Java projects
- Docker and Docker Compose - Runs our infrastructure services
- A code editor - VS Code, IntelliJ IDEA, or Eclipse
You can verify your installations by running these commands in your terminal:
java -version
mvn -version
docker --version
docker-compose --version
Step 1: Generate the Project Files
We’ve created a script that automatically generates all the code and configuration files you need. This saves you from typing thousands of lines of code manually.
Run the generator script:
chmod +x generate_system_files.sh
./generate_system_files.sh
This creates a folder called distributed-log-processor containing:
- Three Spring Boot services (Producer, Consumer, Gateway)
- Database configurations
- Docker setup files
- Testing scripts
- Complete documentation
Navigate into the project:
cd distributed-log-processor
Step 2: Start the Infrastructure Services
Our application needs several supporting services to run. Think of these as the foundation of your house—you need them in place before you can build the walls.
Start all infrastructure services:
./setup.sh
This script starts:
- Kafka - Message streaming platform
- PostgreSQL - Main database for storing logs
- Redis - Fast cache for query results
- Prometheus - Collects performance metrics
- Grafana - Creates visual dashboards
The script takes about 30-60 seconds to start everything. You’ll see messages as each service comes online.
Verify everything is running:
docker-compose ps
You should see all services showing “Up” status.
Step 3: Build the Application Services
Now we’ll compile our three Spring Boot services into executable JAR files.
Build all services at once:
mvn clean install
This command:
- Downloads all necessary libraries
- Compiles your Java code
- Runs automated tests
- Packages everything into JAR files
The build takes 2-3 minutes on first run. Subsequent builds are faster because Maven caches the libraries.
You’ll see output showing each service being built. Look for “BUILD SUCCESS” messages for all three modules.
Step 4: Start the Application Services
Now comes the exciting part—starting your distributed system! Open three separate terminal windows to run each service. This lets you see the logs from each one independently.
Terminal 1 - Start the Log Producer:
java -jar log-producer/target/log-producer-1.0.0.jar
Wait until you see: “Started LogProducerApplication”
Terminal 2 - Start the Log Consumer:
java -jar log-consumer/target/log-consumer-1.0.0.jar
Wait until you see: “Started LogConsumerApplication”
Terminal 3 - Start the API Gateway:
java -jar api-gateway/target/api-gateway-1.0.0.jar
Wait until you see: “Started ApiGatewayApplication”
Step 5: Verify System Health
Let’s make sure all services can communicate with each other. We’ll check the health endpoints that each service provides.
Check each service:
# Producer health check
curl http://localhost:8081/actuator/health
# Consumer health check
curl http://localhost:8082/actuator/health
# Gateway health check
curl http://localhost:8080/actuator/health
Each should return: {"status":"UP"}
If any service returns an error, check its terminal window for error messages.
Step 6: Your First Log Event
Time to send your first log through the system! We’ll use curl to send an HTTP request to the producer service.
Send a test log:
curl -X POST http://localhost:8081/api/v1/logs \
  -H "Content-Type: application/json" \
  -d '{
    "message": "My first distributed log event!",
    "level": "INFO",
    "source": "learning-system",
    "metadata": {
      "student": "your-name",
      "lesson": "day-6"
    }
  }'
You should see: “Log event accepted for processing”
Watch the terminal windows—you’ll see the log flow through the system:
- Producer receives it and sends it to Kafka
- Consumer reads it from Kafka and saves it to the database
- Gateway makes it available for queries
Step 7: Query Your Logs
Now let’s retrieve the log you just sent. The gateway service provides a powerful query API.
Get recent logs:
curl "http://localhost:8080/api/v1/logs?size=10"
You’ll see a JSON response containing your log entry with all its details.
Try more advanced queries:
# Get only ERROR level logs
curl "http://localhost:8080/api/v1/logs?logLevel=ERROR"
# Search for specific keywords
curl "http://localhost:8080/api/v1/logs?keyword=distributed"
# Get logs from a specific source
curl "http://localhost:8080/api/v1/logs?source=learning-system"
[Image placeholder: query response example, formatted JSON output showing log query results]
Step 8: Send Multiple Logs (Batch Processing)
Our system can handle many logs at once. Let’s test the batch endpoint.
Send a batch of logs:
curl -X POST http://localhost:8081/api/v1/logs/batch \
  -H "Content-Type: application/json" \
  -d '[
    {
      "message": "User logged in successfully",
      "level": "INFO",
      "source": "auth-service"
    },
    {
      "message": "Failed to connect to database",
      "level": "ERROR",
      "source": "db-service"
    },
    {
      "message": "Payment processed",
      "level": "INFO",
      "source": "payment-service"
    }
  ]'
Query to see all your logs:
curl "http://localhost:8080/api/v1/logs?size=20"
Step 9: View System Statistics
The gateway provides an analytics endpoint that shows you interesting statistics about your logs.
Get statistics:
curl "http://localhost:8080/api/v1/logs/stats"
This returns:
- Count of logs by level (INFO, ERROR, etc.)
- Top sources generating logs
- Total log count in the time range
Step 10: Run the Load Test
Let’s stress test your system to see how it handles high volume. The load test script sends 1,000 logs and makes 50 queries simultaneously.
Run the load test:
./load-test.sh
Watch the terminal output as it:
- Sends 10 batches of 100 logs each
- Performs 50 different queries
- Reports success/failure for each operation
The test takes about 30-40 seconds to complete.
Step 11: Monitor Performance with Prometheus
Prometheus collects metrics from all your services. Let’s see what it’s tracking.
Open Prometheus in your browser:
http://localhost:9090
Try these queries in the Prometheus UI:
Query 1 - Log ingestion rate:
rate(logs_received_total[5m])
Query 2 - Cache hit ratio:
rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m])) * 100
Query 3 - Average query response time:
rate(log_query_duration_seconds_sum[5m]) / rate(log_query_duration_seconds_count[5m])
Click “Execute” and “Graph” to see visual representations.
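Those metric names follow Micrometer’s Prometheus naming rules: dots become underscores, counters gain a _total suffix, and timers are exported in seconds. A sketch of how such instruments might be registered is shown below; the exact names and wiring in the generated project may differ.
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;
@Component
public class QueryMetrics {
    private final Counter cacheHits;     // exported as cache_hits_total
    private final Counter cacheMisses;   // exported as cache_misses_total
    private final Timer queryTimer;      // exported as log_query_duration_seconds_sum / _count
    public QueryMetrics(MeterRegistry registry) {
        this.cacheHits = Counter.builder("cache.hits").register(registry);
        this.cacheMisses = Counter.builder("cache.misses").register(registry);
        this.queryTimer = Timer.builder("log.query.duration").register(registry);
    }
    public void hit()  { cacheHits.increment(); }
    public void miss() { cacheMisses.increment(); }
    public <T> T time(java.util.function.Supplier<T> query) {
        return queryTimer.record(query);   // wraps a query call and records its duration
    }
}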
Step 12: Create Dashboards in Grafana
Grafana turns your metrics into beautiful, real-time dashboards.
Open Grafana in your browser:
http://localhost:3000
Login credentials:
- Username: admin
- Password: admin
(You’ll be prompted to change the password—you can skip this for now)
The system comes with a pre-built dashboard called “Log Processor System Dashboard” that shows:
- Log ingestion rate over time
- Cache hit percentage
- Query performance
- Error rates
Step 13: Run Integration Tests
Integration tests verify that all components work together correctly. These tests simulate real-world usage patterns.
Run the integration test suite:
./integration-tests/system-integration-test.sh
The tests check:
- All services are responding
- Logs can be sent and retrieved
- Queries return correct results
- Statistics are calculated properly
You’ll see “All integration tests passed!” if everything works.
Understanding What You Built
Let’s review what each component does and how they work together:
The Log Producer Service (Port 8081)
This is the entry point for logs. Applications send their logs here via HTTP requests. The producer:
- Validates incoming logs
- Immediately returns a response (doesn’t wait for processing)
- Sends logs to Kafka for asynchronous processing
- Tracks how many logs it receives
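Condensed to its essentials, the producer endpoint might look like the sketch below. The topic name logs and the metric name are assumptions; the 202 Accepted response mirrors the "Log event accepted for processing" message you saw in Step 6.
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
@RestController
@RequestMapping("/api/v1/logs")
public class LogProducerController {
    private final KafkaTemplate<String, String> kafka;
    private final MeterRegistry metrics;
    public LogProducerController(KafkaTemplate<String, String> kafka, MeterRegistry metrics) {
        this.kafka = kafka;
        this.metrics = metrics;
    }
    @PostMapping
    public ResponseEntity<String> ingest(@RequestBody String logJson) {
        kafka.send("logs", logJson);                    // hand off to Kafka, don't wait for storage
        metrics.counter("logs.received").increment();   // tracks how many logs it receives
        return ResponseEntity.status(HttpStatus.ACCEPTED)
                .body("Log event accepted for processing");
    }
}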
Apache Kafka (Port 9092)
Think of Kafka as a super-fast, reliable message highway. It:
- Stores logs temporarily in topics (named channels)
- Guarantees logs won’t be lost even if services crash
- Allows multiple consumers to read the same logs
- Handles millions of messages per second
The Log Consumer Service (Port 8082)
This service reads from Kafka and saves logs to the database. It:
- Processes logs in batches for efficiency
- Acknowledges messages only after successful storage
- Handles failures gracefully with retries
- Keeps track of processing progress
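A batch-oriented consumer with manual acknowledgment might look roughly like this. It assumes a batch listener container with MANUAL ack mode, a LogEntry class that maps cleanly from the JSON payload, and the LogEntryRepository idea sketched earlier; the topic and group names follow the ones used elsewhere in this guide.
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.ArrayList;
import java.util.List;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.stereotype.Component;
@Component
public class LogEventListener {
    private final LogEntryRepository repository;
    private final ObjectMapper mapper = new ObjectMapper();
    public LogEventListener(LogEntryRepository repository) {
        this.repository = repository;
    }
    @KafkaListener(topics = "logs", groupId = "log-consumer-group")
    public void onLogs(List<String> messages, Acknowledgment ack) throws JsonProcessingException {
        List<LogEntry> batch = new ArrayList<>();
        for (String json : messages) {
            // A parse failure propagates to the container's error handler (e.g., a dead letter recoverer).
            batch.add(mapper.readValue(json, LogEntry.class));
        }
        repository.saveAll(batch);   // one database round trip per batch, not per message
        ack.acknowledge();           // commit offsets only after the batch is durably stored
    }
}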
PostgreSQL Database (Port 5432)
The permanent storage for all logs. It features:
- Optimized indexes for fast time-based queries
- Support for complex filtering and aggregation
- ACID guarantees for data consistency
- Ability to store billions of log records
Redis Cache (Port 6379)
An extremely fast in-memory cache that:
- Stores frequently accessed query results
- Reduces database load by 90%+
- Expires old data automatically
- Returns results in milliseconds
The API Gateway Service (Port 8080)
Your query interface. It:
- Checks Redis cache first (cache-aside pattern)
- Falls back to database on cache miss
- Implements circuit breakers for fault tolerance
- Provides REST APIs for searching logs
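The query endpoint itself is thin; the interesting work happens in the caching and circuit-breaker layers behind it. Here is a sketch of the controller, with parameter names matching the curl examples above; LogQueryService and LogEntry are assumed types, not the project’s exact classes.
import java.util.List;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
@RestController
@RequestMapping("/api/v1/logs")
public class LogQueryController {
    private final LogQueryService queryService;   // wraps the cache-aside and circuit-breaker logic
    public LogQueryController(LogQueryService queryService) {
        this.queryService = queryService;
    }
    // Mirrors the query parameters used in Step 7: logLevel, keyword, source, size.
    @GetMapping
    public List<LogEntry> search(
            @RequestParam(required = false) String logLevel,
            @RequestParam(required = false) String keyword,
            @RequestParam(required = false) String source,
            @RequestParam(defaultValue = "10") int size) {
        return queryService.search(logLevel, keyword, source, size);
    }
}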
Prometheus (Port 9090)
Collects metrics from all services:
- Scrapes metrics every 15 seconds
- Stores time-series data
- Enables alerting on thresholds
- Powers Grafana dashboards
Grafana (Port 3000)
Visualizes your metrics:
- Creates real-time dashboards
- Shows trends and patterns
- Helps identify performance issues
- Makes data accessible to everyone
Key Concepts You’ve Learned
Eventual Consistency
When you send a log to the producer, it returns immediately. But the log isn’t queryable instantly—it takes 100-500ms to flow through Kafka, get processed by the consumer, and land in the database. This delay is called eventual consistency.
Why it matters: This trade-off lets your system handle way more logs. If every log had to be saved before responding, your system would be much slower.
Circuit Breaker Pattern
If the database goes down, the circuit breaker “opens” after detecting failures. Instead of trying the database repeatedly, the gateway serves cached data only. Once the database recovers, the circuit breaker gradually “closes” again.
Why it matters: Prevents cascade failures where one slow component brings down the entire system.
Cache-Aside Pattern
The gateway doesn’t automatically populate the cache. Instead:
- Check if the result exists in the cache
- If yes, return it (cache hit)
- If no, query the database (cache miss)
- Store the result in the cache for next time
Why it matters: Simple to implement and works well for read-heavy workloads like log queries.
Horizontal Scaling
Each service is stateless—it keeps no in-memory data that can’t be rebuilt from Kafka, the database, or the cache. This means you can run multiple copies:
- 3 producer instances can share incoming requests
- 5 consumer instances can divide Kafka partitions
- 4 gateway instances can handle more queries
Why it matters: As your traffic grows, just add more servers. No fundamental redesign needed.
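Within a single consumer instance you can also scale by partition: spring-kafka lets one listener container run several concurrent consumers. The sketch below shows the idea; the concurrency value and factory wiring are illustrative assumptions, not the project’s exact configuration.
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
@Configuration
public class KafkaScalingConfig {
    // Partitions are shared across all instances in the same consumer group,
    // so adding instances (or raising concurrency) adds processing capacity.
    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        factory.setConcurrency(3);   // up to 3 partitions consumed in parallel per instance
        return factory;
    }
}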
Troubleshooting Common Issues
Problem: Services won’t start
Check 1: Are the ports already in use?
lsof -i :8080 # Check gateway port
lsof -i :8081 # Check producer port
lsof -i :8082 # Check consumer port
Solution: Stop any other programs using these ports.
Check 2: Is Docker running?
docker ps
Solution: Start Docker Desktop or the Docker daemon.
Problem: Can’t connect to Kafka
Check: Is Kafka healthy?
docker exec kafka kafka-topics --list --bootstrap-server localhost:9092
Solution: Restart Kafka:
docker-compose restart kafka
Problem: Queries are very slow
Check: What’s the cache hit rate?
curl http://localhost:8080/actuator/metrics/cache.gets
Solution: If hit rate is low, your cache TTL might be too short. Check Redis:
docker exec redis redis-cli info stats
Problem: Consumer isn’t processing logs
Check: Consumer lag in Kafka:
docker exec kafka kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group log-consumer-group
Solution: If lag is growing, the consumer might be overwhelmed. Check the consumer service logs for errors.
Experimenting Further
Now that your system is running, try these experiments to learn more:
Experiment 1: Break the Database
Stop PostgreSQL and see what happens:
docker-compose stop postgres
Send logs and try queries. Notice:
- Producer still accepts logs (Kafka buffers them)
- Gateway serves cached queries
- Circuit breaker prevents errors
Restart the database:
docker-compose start postgres
Watch the consumer catch up on processing the buffered logs.
Experiment 2: Cache Performance
Send the same query twice and compare response times:
time curl "http://localhost:8080/api/v1/logs?logLevel=ERROR"
time curl "http://localhost:8080/api/v1/logs?logLevel=ERROR"
The second query should be much faster (cached result).
Experiment 3: Batch vs. Individual
Compare sending 100 logs individually vs. one batch:
Individual (slow):
for i in {1..100}; do
  curl -X POST http://localhost:8081/api/v1/logs \
    -H "Content-Type: application/json" \
    -d '{"message":"Log '$i'","level":"INFO","source":"test"}'
done
Batch (fast): Use the batch endpoint with 100 logs in one request.
Notice how much faster batching is!
What’s Next
You’ve built a distributed system that handles thousands of logs per second. Here’s what you can explore next:
Add More Features:
- User authentication for the API
- Rate limiting to prevent abuse
- Data retention policies (auto-delete old logs)
- Full-text search with Elasticsearch
Scale It Further:
- Deploy to Kubernetes
- Add multiple Kafka brokers
- Set up database replication
- Implement cross-region deployment
Production Hardening:
- Add encryption for data in transit
- Implement audit logging
- Set up automated backups
- Create runbooks for incidents
Tomorrow in Day 7, we’ll add a service mesh to handle advanced routing, authentication between services, and implement A/B testing capabilities.
Cleaning Up
When you’re done experimenting, shut everything down:
# Stop application services
# Press Ctrl+C in each terminal window
# Stop infrastructure
docker-compose down
# Remove data volumes (optional - deletes all logs)
docker-compose down -v
Your code and configuration files remain intact, so you can restart anytime by running the setup script again.