AI/ML Engineering • Project

Incident Commander

Event-Driven Log Analyzer with Tumbling Window Batching

Gemini 2.0 Flash Lite · Python · Async (asyncio) · Event-Driven Architecture · Tumbling Window Batching · Pydantic · Streamlit
Nov 2024 - Jan 2025
Python 3.12
System Prototyping (Event-Driven AI)

Overview

Incident Commander is an event-driven log analysis system designed to solve the 'alert fatigue' problem in network operations. By intelligently filtering and triaging high-velocity log streams using a tumbling window buffer, it reduces noise while ensuring critical events are surfaced.

The system uses an async Python architecture with cascading AI triage via Gemini 2.0 Flash Lite. It features structured JSON logging, semantic root cause clustering, and intelligent event correlation to surface only actionable insights. The tumbling window pattern batches logs (5 seconds or 100 items) for efficient processing.

Built with a production-focused architecture, it handles high-velocity log streams gracefully via non-blocking I/O and maintains real-time responsiveness for incident detection.

Key Achievement: Built an event-driven log analysis system with tumbling window batching, an async architecture, and Pydantic-enforced structured outputs for real-time incident detection.

Key Metrics & Results

  • Batch Size Limit: 100
  • Time Window: 5s
  • Architecture: Async
  • Schema Validation: Pydantic

Problem Statement

Network operations centers are overwhelmed by log volumes—thousands of entries per minute, most of which are noise. When a 'Meltdown' occurs (e.g., database failure), operators are blinded by a scrolling wall of red text, making Root Cause Analysis (RCA) slow and stressful.

Business Context

Manual triage is impossible at scale, leading to missed critical events and slow incident response times. Traditional rule-based systems create more noise than they filter, exacerbating the alert fatigue problem.

Solution Architecture

A three-tier async architecture: (1) Chaos Generator simulates log streams with variable rates, (2) Tumbling Window Ingestor buffers logs (5s or 100 items), (3) Analyzer Agent uses Gemini 2.0 Flash Lite for semantic root cause clustering. All components communicate via async queues with non-blocking I/O.
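The wiring between tiers can be sketched as a producer/consumer pipeline over an asyncio queue. In this sketch the component names, log format, and sentinel convention are illustrative stand-ins, and the window batching plus AI analysis are collapsed into a simple counter:

```python
import asyncio
import random

async def chaos_generator(queue: asyncio.Queue, n: int) -> None:
    """Tier 1 (illustrative): emit simulated log lines at a variable rate."""
    for i in range(n):
        await queue.put(f"ERROR service-{random.randint(1, 3)}: failure #{i}")
        await asyncio.sleep(random.uniform(0, 0.005))  # variable emission rate
    await queue.put(None)  # sentinel: end of stream

async def ingestor(queue: asyncio.Queue) -> int:
    """Tiers 2 and 3 collapsed into a counter; the real system batches and analyzes."""
    seen = 0
    while (line := await queue.get()) is not None:
        seen += 1
    return seen

async def pipeline(n: int = 50) -> int:
    # Both coroutines share one queue and run concurrently on one event loop.
    queue = asyncio.Queue()
    _, seen = await asyncio.gather(chaos_generator(queue, n), ingestor(queue))
    return seen
```

Because producer and consumer only meet at the queue, a burst from the generator never blocks the consumer's event loop turn.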

System Components

Tumbling Window Ingestor

Buffers incoming log streams with asyncio.Queue, implementing a tumbling window pattern that flushes batches after 5 seconds OR when 100 items accumulate. Features non-blocking queue operations to handle packet storms.
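A minimal sketch of the tumbling window pattern described above, assuming plain Python objects as log records; the function name and parameter defaults are illustrative:

```python
import asyncio

async def tumbling_window(queue: asyncio.Queue, max_items: int = 100,
                          max_wait: float = 5.0):
    """Yield batches from the queue, flushing when max_wait seconds
    elapse OR max_items accumulate, whichever comes first."""
    loop = asyncio.get_running_loop()
    while True:
        batch = []
        deadline = loop.time() + max_wait
        while len(batch) < max_items:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break  # time window expired
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout=timeout))
            except asyncio.TimeoutError:
                break  # nothing arrived before the deadline
        if batch:
            yield batch  # empty windows are skipped, not emitted
```

During a packet storm the size limit dominates and batches flush immediately; during quiet periods the time limit guarantees nothing waits longer than the window.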

AI Analyzer Agent

Uses Gemini 2.0 Flash Lite for semantic understanding of log context. Clusters logs by root cause, identifies severity, and deduplicates similar events. Returns structured Pydantic IncidentReport objects with enforced schema.
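An enforced schema of this shape might look like the following; the field names are assumptions for illustration, not the project's actual IncidentReport definition:

```python
from typing import List, Literal

from pydantic import BaseModel, ValidationError

class IncidentReport(BaseModel):
    """Illustrative schema; the project's real fields may differ."""
    root_cause: str
    # Literal restricts severity to a closed set, so a hallucinated
    # value like "URGENT" fails validation instead of flowing downstream.
    severity: Literal["CRITICAL", "HIGH", "MEDIUM", "LOW"]
    affected_services: List[str] = []
    summary: str = ""
```

Validating the model's JSON against this class turns malformed output into an explicit ValidationError rather than a silent bad record.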

Streamlit Dashboard

Real-time visualization showing system status, processed log counts, and consolidated incident cards. Displays raw log stream in sidebar for contrast, demonstrating noise reduction from thousands of logs to actionable incidents.

Technology Stack Rationale

asyncio was chosen over threading and multiprocessing because the workload is I/O-bound (API calls). Gemini 2.0 Flash Lite provides the best balance of speed and cost for always-on monitoring. Pydantic enforces structured outputs, preventing hallucinated or malformed LLM responses from reaching downstream code. The tumbling window reduces API calls while maintaining real-time responsiveness.

Detailed Code Documentation

Deep dive into the technical implementation with annotated code examples


Challenges & Solutions

Challenge 1

Processing high-velocity log streams without blocking the UI or dropping packets

Solution

Implemented non-blocking I/O with asyncio.Queue. The generator emits logs in batches to reduce overhead, and the ingestor uses async generators to yield batches without blocking the event loop.

Challenge 2

Maintaining real-time responsiveness while performing expensive semantic analysis

Solution

Chose Gemini 2.0 Flash Lite over the Pro/Ultra models for its speed/cost balance. The tumbling window batches logs, reducing API call frequency, and async API calls (generate_content_async) prevent blocking the event loop.
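The non-blocking analysis pattern can be illustrated with a stub standing in for the real SDK call; the stub, prompt, and response payload below are fabricated for illustration:

```python
import asyncio
import json

async def call_model(prompt: str) -> str:
    # Stand-in for an awaitable SDK call such as generate_content_async;
    # the JSON payload here is fabricated for the example.
    await asyncio.sleep(0.05)  # simulated network latency
    return json.dumps({"root_cause": "db_connection_refused",
                       "severity": "CRITICAL"})

async def analyze_batch(batch: list[str]) -> dict:
    prompt = "Cluster these logs by root cause:\n" + "\n".join(batch)
    return json.loads(await call_model(prompt))

async def analyze_all(batches: list[list[str]]) -> list[dict]:
    # gather() awaits every batch's call concurrently on one event loop,
    # so one slow response never stalls ingestion of the next window.
    return await asyncio.gather(*(analyze_batch(b) for b in batches))
```

The key point is that awaiting the model call yields control back to the loop, letting the ingestor keep buffering while analysis is in flight.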

Challenge 3

Ensuring structured LLM outputs for downstream automation

Solution

Enforced Pydantic IncidentReport schema with JSON response format. Added response parsing to handle markdown code blocks. Implemented fallback reports for API failures or missing keys.
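The parsing-with-fallback step can be sketched as follows; the helper name and fallback fields are hypothetical:

```python
import json

# Hypothetical fallback emitted when the model's response is unusable.
FALLBACK = {"root_cause": "unknown", "severity": "LOW",
            "summary": "analysis unavailable"}

def parse_report(raw: str) -> dict:
    """Strip optional markdown code fences, parse JSON, fall back on failure."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence line (which may carry a language tag)
        # and everything from the closing fence onward.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit("```", 1)[0]
    try:
        report = json.loads(text)
    except json.JSONDecodeError:
        return dict(FALLBACK)
    if "root_cause" not in report or "severity" not in report:
        return dict(FALLBACK)  # guard against missing keys
    return report
```

Returning a well-formed fallback instead of raising keeps the dashboard loop alive through transient API failures.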

Results & Impact

Deployed as part of TRINITY Project NOC suite. System demonstrates effective noise reduction, compressing high-velocity error logs from meltdown scenarios into consolidated incident cards. Processing maintains real-time responsiveness with efficient batching.

Production Performance

  • Tumbling window batches logs efficiently (5s or 100 items)
  • Async architecture handles high-velocity log streams without blocking
  • Pydantic schema enforcement ensures structured outputs
  • Memory-efficient async architecture with queue-based backpressure
