AI/ML Engineering • Project

Incident Commander

Event-Driven Log Analyzer with Tumbling Window Batching

Gemini 2.0 Flash Lite · Python · Async (asyncio) · Event-Driven Architecture · Tumbling Window Batching · Pydantic · Streamlit
Nov 2024 - Jan 2025
Python 3.12
System Prototyping (Event-Driven AI)

Overview

Incident Commander is an event-driven log analysis system designed to solve the 'alert fatigue' problem in network operations. By intelligently filtering and triaging high-velocity log streams using a tumbling window buffer, it reduces noise while ensuring critical events are surfaced.

The system uses an async Python architecture with cascading AI triage via Gemini 2.0 Flash Lite. It features structured JSON logging, semantic root cause clustering, and intelligent event correlation to surface only actionable insights. The tumbling window pattern batches logs (5 seconds or 100 items) for efficient processing.

Built with a production-focused architecture, it handles high-velocity log streams gracefully via non-blocking I/O and maintains real-time responsiveness for incident detection.

Key Achievement: Built an event-driven log analysis system with tumbling window batching, an async architecture, and Pydantic-enforced structured outputs for real-time incident detection.

Key Metrics & Results

  • Batch Size Limit: 100
  • Time Window: 5s
  • Architecture: Async
  • Schema Validation: Pydantic

Problem Statement

Network operations centers are overwhelmed by log volumes—thousands of entries per minute, most of which are noise. When a 'Meltdown' occurs (e.g., database failure), operators are blinded by a scrolling wall of red text, making Root Cause Analysis (RCA) slow and stressful.

Business Context

Manual triage is impossible at scale, leading to missed critical events and slow incident response times. Traditional rule-based systems create more noise than they filter, exacerbating the alert fatigue problem.

Solution Architecture

A three-tier async architecture: (1) Chaos Generator simulates log streams with variable rates, (2) Tumbling Window Ingestor buffers logs (5s or 100 items), (3) Analyzer Agent uses Gemini 2.0 Flash Lite for semantic root cause clustering. All components communicate via async queues with non-blocking I/O.
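The wiring between tiers can be sketched as a producer/consumer pipeline over an asyncio queue. In this sketch the component names, log format, and sentinel convention are illustrative stand-ins, and the window batching plus AI analysis are collapsed into a simple counter:

```python
import asyncio
import random

async def chaos_generator(queue: asyncio.Queue, n: int) -> None:
    """Tier 1 (illustrative): emit simulated log lines at a variable rate."""
    for i in range(n):
        await queue.put(f"ERROR service-{random.randint(1, 3)}: failure #{i}")
        await asyncio.sleep(random.uniform(0, 0.005))  # variable emission rate
    await queue.put(None)  # sentinel: end of stream

async def ingestor(queue: asyncio.Queue) -> int:
    """Tiers 2 and 3 collapsed into a counter; the real system batches and analyzes."""
    seen = 0
    while (line := await queue.get()) is not None:
        seen += 1
    return seen

async def pipeline(n: int = 50) -> int:
    # Both coroutines share one queue and run concurrently on one event loop.
    queue = asyncio.Queue()
    _, seen = await asyncio.gather(chaos_generator(queue, n), ingestor(queue))
    return seen
```

Because producer and consumer only meet at the queue, a burst from the generator never blocks the consumer's event loop turn.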

System Components

Tumbling Window Ingestor

Buffers incoming log streams with asyncio.Queue, implementing a tumbling window pattern that flushes batches after 5 seconds OR when 100 items accumulate. Features non-blocking queue operations to handle packet storms.
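A minimal sketch of the tumbling window pattern described above, assuming plain Python objects as log records; the function name and parameter defaults are illustrative:

```python
import asyncio

async def tumbling_window(queue: asyncio.Queue, max_items: int = 100,
                          max_wait: float = 5.0):
    """Yield batches from the queue, flushing when max_wait seconds
    elapse OR max_items accumulate, whichever comes first."""
    loop = asyncio.get_running_loop()
    while True:
        batch = []
        deadline = loop.time() + max_wait
        while len(batch) < max_items:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break  # time window expired
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout=timeout))
            except asyncio.TimeoutError:
                break  # nothing arrived before the deadline
        if batch:
            yield batch  # empty windows are skipped, not emitted
```

During a packet storm the size limit dominates and batches flush immediately; during quiet periods the time limit guarantees nothing waits longer than the window.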

AI Analyzer Agent

Uses Gemini 2.0 Flash Lite for semantic understanding of log context. Clusters logs by root cause, identifies severity, and deduplicates similar events. Returns structured Pydantic IncidentReport objects with enforced schema.
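An enforced schema of this shape might look like the following; the field names are assumptions for illustration, not the project's actual IncidentReport definition:

```python
from typing import List, Literal

from pydantic import BaseModel, ValidationError

class IncidentReport(BaseModel):
    """Illustrative schema; the project's real fields may differ."""
    root_cause: str
    # Literal restricts severity to a closed set, so a hallucinated
    # value like "URGENT" fails validation instead of flowing downstream.
    severity: Literal["CRITICAL", "HIGH", "MEDIUM", "LOW"]
    affected_services: List[str] = []
    summary: str = ""
```

Validating the model's JSON against this class turns malformed output into an explicit ValidationError rather than a silent bad record.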

Streamlit Dashboard

Real-time visualization showing system status, processed log counts, and consolidated incident cards. Displays raw log stream in sidebar for contrast, demonstrating noise reduction from thousands of logs to actionable incidents.

Technology Stack Rationale

asyncio was chosen over threading and multiprocessing because the workload is I/O-bound (API calls). Gemini 2.0 Flash Lite provides the best balance of speed and cost for always-on monitoring. Pydantic enforces structured outputs, preventing hallucinated or malformed LLM responses from reaching downstream code. The tumbling window reduces API calls while maintaining real-time responsiveness.

Detailed Code Documentation

Deep dive into the technical implementation with annotated code examples


Challenges & Solutions

Challenge 1

Processing high-velocity log streams without blocking the UI or dropping packets

Solution

Implemented non-blocking I/O with asyncio.Queue. The generator emits logs in batches to reduce overhead, and the ingestor uses async generators to yield batches without blocking the event loop.

Challenge 2

Maintaining real-time responsiveness while performing expensive semantic analysis

Solution

Chose Gemini 2.0 Flash Lite over the Pro/Ultra models for its speed/cost balance. The tumbling window batches logs, reducing API call frequency, and async API calls (generate_content_async) prevent blocking the event loop.
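The non-blocking analysis pattern can be illustrated with a stub standing in for the real SDK call; the stub, prompt, and response payload below are fabricated for illustration:

```python
import asyncio
import json

async def call_model(prompt: str) -> str:
    # Stand-in for an awaitable SDK call such as generate_content_async;
    # the JSON payload here is fabricated for the example.
    await asyncio.sleep(0.05)  # simulated network latency
    return json.dumps({"root_cause": "db_connection_refused",
                       "severity": "CRITICAL"})

async def analyze_batch(batch: list[str]) -> dict:
    prompt = "Cluster these logs by root cause:\n" + "\n".join(batch)
    return json.loads(await call_model(prompt))

async def analyze_all(batches: list[list[str]]) -> list[dict]:
    # gather() awaits every batch's call concurrently on one event loop,
    # so one slow response never stalls ingestion of the next window.
    return await asyncio.gather(*(analyze_batch(b) for b in batches))
```

The key point is that awaiting the model call yields control back to the loop, letting the ingestor keep buffering while analysis is in flight.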

Challenge 3

Ensuring structured LLM outputs for downstream automation

Solution

Enforced Pydantic IncidentReport schema with JSON response format. Added response parsing to handle markdown code blocks. Implemented fallback reports for API failures or missing keys.
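The parsing-with-fallback step can be sketched as follows; the helper name and fallback fields are hypothetical:

```python
import json

# Hypothetical fallback emitted when the model's response is unusable.
FALLBACK = {"root_cause": "unknown", "severity": "LOW",
            "summary": "analysis unavailable"}

def parse_report(raw: str) -> dict:
    """Strip optional markdown code fences, parse JSON, fall back on failure."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence line (which may carry a language tag)
        # and everything from the closing fence onward.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit("```", 1)[0]
    try:
        report = json.loads(text)
    except json.JSONDecodeError:
        return dict(FALLBACK)
    if "root_cause" not in report or "severity" not in report:
        return dict(FALLBACK)  # guard against missing keys
    return report
```

Returning a well-formed fallback instead of raising keeps the dashboard loop alive through transient API failures.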

Results & Impact

Deployed as part of TRINITY Project NOC suite. System demonstrates effective noise reduction, compressing high-velocity error logs from meltdown scenarios into consolidated incident cards. Processing maintains real-time responsiveness with efficient batching.

Production Performance

  • Tumbling window batches logs efficiently (5s or 100 items)
  • Async architecture handles high-velocity log streams without blocking
  • Pydantic schema enforcement ensures structured outputs
  • Memory-efficient async architecture with queue-based backpressure
