Notable Code: Telecom Digital Twin

This document highlights key code sections that demonstrate the technical strengths and architectural patterns implemented in this data generation framework.

Overview

Telecom Digital Twin is a deterministic, schema-first telecom data generator that produces multi-table LTE datasets with causal ordering and reproducible outputs. The system demonstrates production-focused patterns including cascade-based seeding, Parquet storage, and referential integrity validation.


1. Cascade-Based Seeding for Deterministic Reproducibility

File: Pipeline implementation files
Lines: Seeding logic

The system implements cascade-based seeding where each step derives its seed from the global seed and step ID, ensuring bit-exact reproducibility.

Why it's notable:
- Ensures identical outputs across runs with same config
- Enables debugging ML failures by regenerating exact same conditions
- All generators use fixed seeds and explicit config values
- Formula: Seed(Step_N) = F(Seed(Global), Step_ID)
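A minimal sketch of the cascade-seeding formula above. The function name `derive_seed` and the step IDs are illustrative, not taken from the actual codebase; the key point is using a stable hash (unlike Python's built-in `hash()`, `hashlib.sha256` is identical across runs and machines):

```python
import hashlib
import random

def derive_seed(global_seed: int, step_id: str) -> int:
    """Seed(Step_N) = F(Seed(Global), Step_ID) via a stable hash."""
    digest = hashlib.sha256(f"{global_seed}:{step_id}".encode()).digest()
    # Fold the first 8 digest bytes into an integer seed.
    return int.from_bytes(digest[:8], "big")

# Each step gets its own independent, reproducible RNG stream.
rng_cells = random.Random(derive_seed(42, "02_cells"))
rng_users = random.Random(derive_seed(42, "03_users"))

# Same global seed + step ID always yields the same seed;
# different step IDs yield independent streams.
assert derive_seed(42, "02_cells") == derive_seed(42, "02_cells")
assert derive_seed(42, "02_cells") != derive_seed(42, "03_users")
```

Because every step's seed is a pure function of the global seed and the step ID, rerunning any single step in isolation reproduces its output bit-exactly.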


2. Parquet Storage with Schema Enforcement

File: Storage layer implementation
Lines: Parquet writing logic

The system enforces Apache Parquet for all outputs, providing columnar compression and embedded schema metadata.

Why it's notable:
- 60% disk I/O reduction with columnar compression
- Embedded schema metadata ensures type consistency
- float32 remains float32 across the entire pipeline, with no silent upcasts
- Schema enforcement prevents downstream data quality issues


3. Referential Integrity Validation

File: Step 07 validation
Lines: Foreign key validation logic

The validation step acts as a gatekeeper, validating foreign keys at generation time and raising exceptions if orphan records are detected.

Why it's notable:
- Prevents downstream JOIN failures
- Acts as a unit test for data integrity
- Raises an exception if referential integrity is violated
- Ensures data quality before promotion
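A sketch of the gatekeeper idea, assuming pandas DataFrames; `validate_fk` and `ReferentialIntegrityError` are hypothetical names, but the behavior matches the description above: every foreign key must resolve, or generation fails fast.

```python
import pandas as pd

class ReferentialIntegrityError(Exception):
    """Raised when a foreign key has no matching primary key."""

def validate_fk(child: pd.DataFrame, parent: pd.DataFrame,
                fk: str, pk: str) -> None:
    """Gatekeeper check: every child[fk] value must exist in parent[pk]."""
    orphans = child.loc[~child[fk].isin(parent[pk]), fk]
    if not orphans.empty:
        raise ReferentialIntegrityError(
            f"{len(orphans)} orphan row(s): {fk} values "
            f"{sorted(orphans.unique())} have no matching {pk}"
        )

cells = pd.DataFrame({"cell_id": [100, 101]})
good = pd.DataFrame({"session_id": [1], "cell_id": [100]})
bad = pd.DataFrame({"session_id": [2], "cell_id": [999]})

validate_fk(good, cells, fk="cell_id", pk="cell_id")  # passes silently
try:
    validate_fk(bad, cells, fk="cell_id", pk="cell_id")
except ReferentialIntegrityError as exc:
    print(f"blocked promotion: {exc}")
```

Failing at generation time turns a silent downstream JOIN drop into a loud, attributable error at the point where the bad data was produced.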


Architecture Highlights

Seven-Step Pipeline

  1. Schema Design: Canonical definitions
  2. Cells Generation: Infrastructure metadata
  3. Users Generation: Demographics
  4. Behavior Generation: User and network behavior
  5. Sessions Generation: KPIs and QoE
  6. Events Generation: Causal signals
  7. Validation: Referential integrity checks
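The seven steps above can be sketched as a simple driver loop in which each step is seeded independently from the global seed. All names here (`derive_seed`, `make_step`, the step IDs) are hypothetical placeholders; the real steps read upstream Parquet outputs and write their own:

```python
import hashlib

GLOBAL_SEED = 42

def derive_seed(global_seed: int, step_id: str) -> int:
    # Stable per-step seed derived from the global seed and step ID.
    digest = hashlib.sha256(f"{global_seed}:{step_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def make_step(step_id: str):
    # Placeholder generator standing in for a real pipeline step.
    def step(seed: int) -> str:
        return f"{step_id} ran with seed {seed}"
    return step

STEP_IDS = [
    "01_schema", "02_cells", "03_users", "04_behavior",
    "05_sessions", "06_events", "07_validation",
]

# Causal ordering is encoded by list position: infrastructure before
# behavior, behavior before performance, performance before events,
# with validation gating promotion at the end.
results = [make_step(sid)(derive_seed(GLOBAL_SEED, sid)) for sid in STEP_IDS]
```

Encoding the order as data (rather than ad hoc call sequences) makes the causal chain explicit and keeps validation as the final, mandatory step.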

Design Patterns Used

  1. Cascade Seeding Pattern: Deterministic seed derivation
  2. Schema-First Pattern: Validation at generation time
  3. Gatekeeper Pattern: Validation before promotion
  4. Causal Ordering Pattern: Infrastructure → behavior → performance → events

Technical Strengths Demonstrated