Data Engineering • Project

Telecom Digital Twin

Deterministic Multi-Table Telecom Data Generator with Causal Ordering

Apache Parquet · Synthetic Data Vault (SDV) · Pandas · Deterministic Seeding · Schema Validation
2024 - 2025
Python 3.11
Data Generation Framework

Overview

Telecom Digital Twin is a deterministic, schema-first telecom data generator that produces a multi-table LTE dataset (users, cells, sessions, events) with causal ordering and reproducible outputs. Designed as an MLOps testbed, it generates realistic telecom data without private information.

The system uses cascade-based seeding, where each step's seed is derived from the global seed and a step ID, ensuring bit-exact reproducibility. It enforces strict schema validation and writes all outputs as Apache Parquet for columnar compression and embedded schema metadata. All outputs maintain referential integrity, verified by foreign key validation.

Built for ML-ready data generation, the pipeline produces clean, validated datasets with proper correlations across infrastructure, behavior, performance, and events. It supports fast development mode for quick iteration and full-scale generation for production use.

Key Achievement: Built deterministic data generation pipeline with cascade-based seeding, Parquet storage, schema-first validation, and referential integrity checks for ML-ready telecom data

Key Metrics & Results

  • Deterministic Seeding
  • 4 Output Tables
  • Parquet Storage
  • Schema-First Validation

Problem Statement

Production telecom data is proprietary and sensitive, making it unavailable for ML development and testing. Off-the-shelf synthetic data tools lack domain expertise and produce unrealistic correlations. ML pipelines require deterministic, reproducible data for debugging and validation.

Business Context

Data scientists need realistic telecom datasets for model development without accessing production data. ML pipelines require bit-exact reproducibility to debug failures. Data quality issues (orphan records, type mismatches) break downstream JOINs and model training.

Technical Challenges

Solution Architecture

A seven-step pipeline: (1) Schema design with canonical definitions, (2) Cells generation with infrastructure metadata, (3) Users generation with demographics, (4) Behavior generation (user + network), (5) Sessions generation with KPIs and QoE, (6) Events generation with causal signals, (7) Validation with referential integrity checks.
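The step ordering above can be sketched as a minimal driver loop; the step names below are illustrative, not the project's actual module names:

```python
# Sketch of the seven-step order (names are illustrative).
PIPELINE_STEPS = [
    "01_schema", "02_cells", "03_users", "04_behavior",
    "05_sessions", "06_events", "07_validate",
]

def run_pipeline() -> list[str]:
    """Execute steps strictly in causal order: each step may only
    read the outputs of the steps before it."""
    completed: list[str] = []
    for step in PIPELINE_STEPS:
        # A real step would derive its seed, generate a table,
        # and write it as Parquet before the next step runs.
        completed.append(step)
    return completed

run_pipeline()
```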

System Components

Deterministic Seeding System

Cascade-based seeding where Seed(Step_N) = F(Seed(Global), Step_ID). Ensures bit-exact reproducibility across runs. All generators use fixed seeds and explicit config values.
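A minimal sketch of the cascade derivation, assuming F is a hash of the global seed and step ID (the project's actual F may differ):

```python
import hashlib
import random

def derive_seed(global_seed: int, step_id: str) -> int:
    """Seed(Step_N) = F(Seed(Global), Step_ID).
    Here F is sketched as a SHA-256 hash, which keeps per-step
    streams statistically independent yet fully deterministic."""
    digest = hashlib.sha256(f"{global_seed}:{step_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

# Same global seed + step ID -> identical random stream on every run.
rng = random.Random(derive_seed(42, "05_sessions"))
sample = [rng.random() for _ in range(3)]
```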

Parquet Storage Layer

Enforces Apache Parquet for all outputs. Provides columnar compression and embedded schema metadata ensuring type consistency (float32 remains float32).

Schema-First Generation

Strict schema enforcement with validation at generation time. Foreign key validation in Step 07 acts as a unit test for data integrity, raising an exception if referential integrity is violated.
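A minimal sketch of the foreign key check (function and column names are illustrative): it raises rather than warns, so a broken build fails fast.

```python
import pandas as pd

def validate_foreign_keys(child: pd.DataFrame, fk: str,
                          parent: pd.DataFrame, pk: str) -> None:
    """Gatekeeper check: every child FK value must exist in the
    parent table, otherwise raise on the orphan rows."""
    orphans = ~child[fk].isin(parent[pk])
    if orphans.any():
        raise ValueError(f"{int(orphans.sum())} orphan rows via '{fk}'")

users = pd.DataFrame({"user_id": [1, 2, 3]})
sessions = pd.DataFrame({"session_id": [10, 11], "user_id": [1, 3]})
validate_foreign_keys(sessions, "user_id", users, "user_id")  # passes
```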

Causal Ordering System

Generates data with proper causal relationships: infrastructure → behavior → performance → events. Network-level events omit user/session IDs; user events include IDs.
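The network-vs-user event rule can be sketched as follows (field names and ranges are illustrative, sized to the 50K-user / 5.6M-session dataset described later):

```python
import random

def make_event(rng: random.Random, level: str) -> dict:
    """Causal ordering rule: network-level events omit user/session
    IDs; user-level events carry both."""
    event = {"event_id": rng.randrange(1_000_000), "level": level}
    if level == "user":
        event["user_id"] = rng.randrange(50_000)
        event["session_id"] = rng.randrange(5_600_000)
    return event

rng = random.Random(7)
network_event = make_event(rng, "network")
user_event = make_event(rng, "user")
```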

Technology Stack Rationale

Parquet chosen over CSV for schema enforcement and compression. Deterministic seeding enables reproducible ML pipelines. Schema validation prevents downstream JOIN failures. Causal ordering ensures realistic data relationships. SDV provides domain-informed generation.

Implementation Highlights

Key Features

Detailed Code Documentation

Deep dive into the technical implementation with annotated code examples


Challenges & Solutions

Challenge 1

Ensuring bit-exact reproducibility across pipeline runs

Solution

Implemented cascade-based seeding where each step derives its seed from the global seed and step ID. All generators use fixed seeds and explicit config values, so re-running with the same config yields identical outputs.

Challenge 2

Maintaining referential integrity across multiple tables

Solution

Implemented a Gatekeeper pattern in the Step 07 validation. It validates foreign keys at generation time, raising an exception if orphan records are detected, and thus acts as a unit test for data integrity.

Challenge 3

Optimizing storage for large-scale data generation

Solution

Enforced Apache Parquet for all outputs. Provides columnar compression and embedded schema metadata. Ensures type consistency across entire pipeline.

Challenge 4

Generating realistic telecom data with proper correlations

Solution

Used SDV for domain-informed generation. Encoded causal relationships: infrastructure → behavior → performance → events. Network events omit user IDs; user events include IDs.

Results & Impact

Generated production-scale dataset: 50K users, 2K cells, 5.6M sessions, 104K events. Achieved reproducible outputs with deterministic seeding. Validated referential integrity with foreign key checks. Parquet compression provides efficient storage with embedded schema metadata.

Production Performance

  • Generated 50K users with 5.6M sessions in reproducible fashion
  • Referential integrity validation prevents orphan records
  • Parquet columnar compression optimizes storage
  • Fast development mode: 5K users over 7 simulated days, generated in minutes
  • Full-scale generation: 50K users over 90 simulated days in ~21 minutes

Lessons Learned

What Worked Well

What I'd Do Differently

Future Enhancements
