Event-Driven Log Analyzer with Tumbling Window Batching
Incident Commander is an event-driven log analysis system designed to solve the 'alert fatigue' problem in network operations. By intelligently filtering and triaging high-velocity log streams using a tumbling window buffer, it reduces noise while ensuring critical events are surfaced.
The system uses an async Python architecture with cascading AI triage via Gemini 2.0 Flash Lite. It features structured JSON logging, semantic root cause clustering, and intelligent event correlation to surface only actionable insights. The tumbling window pattern batches logs (5 seconds or 100 items) for efficient processing.
Built with production-focused architecture, it handles high-velocity log streams gracefully with non-blocking I/O and maintains real-time responsiveness for incident detection.
Network operations centers are overwhelmed by log volumes—thousands of entries per minute, most of which are noise. When a 'Meltdown' occurs (e.g., database failure), operators are blinded by a scrolling wall of red text, making Root Cause Analysis (RCA) slow and stressful.
Manual triage is impossible at scale, leading to missed critical events and slow incident response times. Traditional rule-based systems create more noise than they filter, exacerbating the alert fatigue problem.
A three-tier async architecture: (1) Chaos Generator simulates log streams with variable rates, (2) Tumbling Window Ingestor buffers logs (5s or 100 items), (3) Analyzer Agent uses Gemini 2.0 Flash Lite for semantic root cause clustering. All components communicate via async queues with non-blocking I/O.
Buffers incoming log streams with asyncio.Queue, implementing a tumbling window pattern that flushes batches after 5 seconds OR when 100 items accumulate. Features non-blocking queue operations to handle packet storms.
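The flush-on-time-OR-size behavior described above can be sketched as a small async generator. This is a minimal illustration, not the project's actual code: the names `tumbling_window`, `MAX_BATCH`, and `WINDOW_SECS` are assumptions, and a `None` sentinel is used here to signal end of stream.

```python
import asyncio
import time

MAX_BATCH = 100     # flush when this many items accumulate...
WINDOW_SECS = 5.0   # ...or when the window expires, whichever comes first

async def tumbling_window(queue: asyncio.Queue):
    """Drain `queue` and yield batches of at most MAX_BATCH items,
    flushing any partial batch once WINDOW_SECS elapses."""
    batch = []
    deadline = time.monotonic() + WINDOW_SECS
    while True:
        timeout = max(0.0, deadline - time.monotonic())
        try:
            item = await asyncio.wait_for(queue.get(), timeout=timeout)
            if item is None:          # illustrative sentinel: flush and stop
                if batch:
                    yield batch
                return
            batch.append(item)
        except asyncio.TimeoutError:
            pass                      # window expired with no new item
        if len(batch) >= MAX_BATCH or (batch and time.monotonic() >= deadline):
            yield batch               # size cap hit OR window expired
            batch = []
            deadline = time.monotonic() + WINDOW_SECS
        elif not batch:
            deadline = time.monotonic() + WINDOW_SECS  # idle: restart window
```

Because the consumer awaits `queue.get()` with a timeout rather than polling, a quiet stream costs nothing, while a packet storm is capped at `MAX_BATCH` items per flush.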
Uses Gemini 2.0 Flash Lite for semantic understanding of log context. Clusters logs by root cause, identifies severity, and deduplicates similar events. Returns structured Pydantic IncidentReport objects with enforced schema.
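A schema like the one described might look as follows. The source does not list the actual fields of `IncidentReport`, so the field names below (`root_cause`, `severity`, `affected_services`, `log_count`) are assumptions for illustration.

```python
from typing import List
from pydantic import BaseModel, Field

class Incident(BaseModel):
    # Field names are illustrative guesses, not the project's real schema.
    root_cause: str
    severity: str = Field(description="e.g. CRITICAL, WARNING, INFO")
    affected_services: List[str] = []
    log_count: int = 0   # how many raw log lines were folded into this card

class IncidentReport(BaseModel):
    incidents: List[Incident]

# Validating a parsed model response against the schema rejects
# structurally malformed output before it reaches downstream automation.
report = IncidentReport.model_validate(
    {"incidents": [{"root_cause": "db connection pool exhausted",
                    "severity": "CRITICAL", "log_count": 42}]}
)
```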
Real-time visualization showing system status, processed log counts, and consolidated incident cards. Displays raw log stream in sidebar for contrast, demonstrating noise reduction from thousands of logs to actionable incidents.
Chose asyncio over threading/multiprocessing for I/O-bound workloads (API calls). Gemini 2.0 Flash Lite provides the best balance of speed and cost for always-on monitoring. Pydantic enforces a structured output schema, rejecting malformed LLM responses before they reach downstream automation. The tumbling window reduces API calls while maintaining real-time responsiveness.
Deep dive into the technical implementation with annotated code examples
Processing high-velocity log streams without blocking the UI or dropping packets
Implemented async queue with asyncio.Queue for non-blocking I/O. Generator emits logs in batches to reduce overhead. Ingestor uses async generators to yield batches without blocking.
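On the producer side, the "handle packet storms without blocking" idea comes down to `put_nowait` plus deliberate load shedding. This is a hedged sketch, the function name `chaos_generator` and the dropped-line handling are illustrative assumptions:

```python
import asyncio

async def chaos_generator(queue: asyncio.Queue, n_logs: int, dropped: list):
    """Emit synthetic error logs in bursts without ever blocking."""
    for i in range(n_logs):
        line = f"ERROR service=db msg=timeout id={i}"
        try:
            queue.put_nowait(line)    # non-blocking: never awaits on a full queue
        except asyncio.QueueFull:
            dropped.append(line)      # shed load (and count it) rather than stall
        if i % 10 == 9:
            await asyncio.sleep(0)    # yield to the event loop between bursts
```

Choosing `put_nowait` over `await queue.put(...)` means a saturated downstream analyzer degrades into counted drops instead of backpressure that freezes the generator, a reasonable trade for a monitoring pipeline where freshness beats completeness.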
Maintaining real-time responsiveness while performing expensive semantic analysis
Chose Gemini 2.0 Flash Lite over Pro/Ultra models for speed/cost balance. Tumbling window batches logs reducing API call frequency. Used async API calls (generate_content_async) to prevent blocking.
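The non-blocking call pattern can be sketched generically. The real system calls `generate_content_async`, which needs API credentials; `call_model` below is a stand-in stub so the shape of the pattern is visible:

```python
import asyncio

async def call_model(batch: list) -> str:
    # Stand-in for generate_content_async: simulate network latency.
    await asyncio.sleep(0.01)
    return f"summary of {len(batch)} logs"

async def analyze_batches(batches):
    # One API call per *batch*, not per log: the tumbling window amortizes
    # request cost, and gather keeps the calls concurrent so the event
    # loop (and the dashboard) stays responsive.
    return await asyncio.gather(*(call_model(b) for b in batches))
```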
Ensuring structured LLM outputs for downstream automation
Enforced Pydantic IncidentReport schema with JSON response format. Added response parsing to handle markdown code blocks. Implemented fallback reports for API failures or missing keys.
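The fence-stripping and fallback logic described above might look like this. The fallback structure and function name are assumptions; the actual project validates the parsed dict against its Pydantic schema afterward:

```python
import json

# Illustrative fallback report returned on any parse failure or missing key.
FALLBACK = {"incidents": [], "error": "analysis_failed"}

def parse_model_response(text: str) -> dict:
    """Strip optional ```json fences, parse, and fall back on failure."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence line (e.g. "```json") and the closing fence.
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        if cleaned.rstrip().endswith("```"):
            cleaned = cleaned.rstrip()[:-3]
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return FALLBACK
    if "incidents" not in data:
        return FALLBACK               # missing required key -> fallback report
    return data
```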
Deployed as part of the TRINITY Project NOC suite. The system demonstrates effective noise reduction, compressing high-velocity error logs from meltdown scenarios into consolidated incident cards, while batching keeps processing responsive in real time.