Deterministic Multi-Table Telecom Data Generator with Causal Ordering
Telecom Digital Twin is a deterministic, schema-first telecom data generator that produces a multi-table LTE dataset (users, cells, sessions, events) with causal ordering and reproducible outputs. Designed as an MLOps testbed, it generates realistic telecom data without private information.
The system uses cascade-based seeding, in which each step's seed is derived from the global seed and the step ID, ensuring bit-exact reproducibility. All outputs are written as Apache Parquet for columnar compression and embedded schema metadata, and strict schema validation plus foreign key checks maintain referential integrity across tables.
Built for ML-ready data generation, the pipeline produces clean, validated datasets with proper correlations across infrastructure, behavior, performance, and events. It supports fast development mode for quick iteration and full-scale generation for production use.
Production telecom data is proprietary and sensitive, making it unavailable for ML development and testing. Off-the-shelf synthetic data tools lack domain expertise and produce unrealistic correlations. ML pipelines require deterministic, reproducible data for debugging and validation.
Data scientists need realistic telecom datasets for model development without accessing production data. ML pipelines require bit-exact reproducibility to debug failures. Data quality issues (orphan records, type mismatches) break downstream JOINs and model training.
A seven-step pipeline: (1) Schema design with canonical definitions, (2) Cells generation with infrastructure metadata, (3) Users generation with demographics, (4) Behavior generation (user + network), (5) Sessions generation with KPIs and QoE, (6) Events generation with causal signals, (7) Validation with referential integrity checks.
Cascade-based seeding, where Seed(Step_N) = F(Seed(Global), Step_ID), ensures bit-exact reproducibility across runs. All generators use fixed seeds and explicit config values.
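A minimal sketch of the cascade-seeding rule, assuming SHA-256 as the derivation function F; the project may use a different hash or offset scheme, and `derive_seed` is a hypothetical name:

```python
import hashlib

def derive_seed(global_seed: int, step_id: str) -> int:
    """Seed(Step_N) = F(Seed(Global), Step_ID): a pure function, so the
    same global seed and step ID always yield the same step seed."""
    digest = hashlib.sha256(f"{global_seed}:{step_id}".encode()).digest()
    # Truncate to 32 bits so the result fits common RNG seed ranges.
    return int.from_bytes(digest[:4], "big")
```

Because the derivation is stateless, steps can be re-run or run in isolation and still receive identical seeds, e.g. `derive_seed(42, "05_sessions")` returns the same value on every machine.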
Enforces Apache Parquet for all outputs, providing columnar compression and embedded schema metadata that ensure type consistency (float32 remains float32).
Strict schema enforcement with validation at generation time. Foreign key validation in Step 07 acts as a unit test for data integrity, raising exceptions if referential integrity is violated.
Generates data with proper causal relationships: infrastructure → behavior → performance → events. Network-level events omit user/session IDs; user events include IDs.
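The event-scoping rule above can be sketched as follows; the event names, column names, and `make_event` helper are assumptions for illustration, not the project's actual API:

```python
import pandas as pd

def make_event(event_type: str, cell_id: int,
               user_id=None, session_id=None) -> dict:
    """Network-level events are scoped to a cell only; user-level events
    must reference a concrete user and session."""
    if event_type.startswith("network_"):
        assert user_id is None and session_id is None
    else:
        assert user_id is not None and session_id is not None
    return {"event_type": event_type, "cell_id": cell_id,
            "user_id": user_id, "session_id": session_id}

events = pd.DataFrame([
    make_event("network_congestion", cell_id=101),
    make_event("user_call_drop", cell_id=101, user_id=7, session_id=5001),
])
```

Keeping user/session IDs null on network events prevents spurious user-level correlations from leaking into models trained on the events table.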
Parquet was chosen over CSV for schema enforcement and compression. Deterministic seeding enables reproducible ML pipelines. Schema validation prevents downstream JOIN failures. Causal ordering ensures realistic data relationships. SDV provides domain-informed generation.
Deep dive into the technical implementation with annotated code examples
Ensuring bit-exact reproducibility across pipeline runs
Implemented cascade-based seeding where each step derives its seed from the global seed and step ID. All generators use fixed seeds and explicit config values, so re-running with the same config yields identical outputs.
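A sketch of how that guarantee can be checked, assuming a generator that takes an explicit seed (`generate_users` and its columns are hypothetical): two runs with the same config must produce identical frames.

```python
import numpy as np
import pandas as pd

def generate_users(seed: int, n: int = 1000) -> pd.DataFrame:
    """Toy stand-in for a pipeline step: all randomness flows from `seed`."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "user_id": np.arange(n),
        "age": rng.integers(18, 90, size=n),
        "plan": rng.choice(["basic", "premium"], size=n),
    })

# Same seed, same config -> identical output, row for row.
assert generate_users(seed=42).equals(generate_users(seed=42))
```

Hashing the resulting Parquet files (or `pd.util.hash_pandas_object`) turns this into a cheap regression test for determinism.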
Maintaining referential integrity across multiple tables
Implemented a Gatekeeper pattern in the Step 07 validation. It validates foreign keys at generation time, raising exceptions if orphan records are detected, and acts as a unit test for data integrity.
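A minimal sketch of such a foreign-key gatekeeper; the function name and error message are assumptions about the project's internals:

```python
import pandas as pd

def validate_foreign_keys(child: pd.DataFrame, child_col: str,
                          parent: pd.DataFrame, parent_col: str) -> None:
    """Raise if any child row references a key absent from the parent table."""
    orphans = ~child[child_col].isin(parent[parent_col])
    if orphans.any():
        raise ValueError(
            f"{int(orphans.sum())} orphan record(s): {child_col} values "
            f"missing from {parent_col}"
        )

# Example: every session must reference an existing user.
users = pd.DataFrame({"user_id": [1, 2, 3]})
sessions = pd.DataFrame({"session_id": [10, 11], "user_id": [1, 2]})
validate_foreign_keys(sessions, "user_id", users, "user_id")  # passes silently
```

Raising at generation time, rather than discovering orphans during a downstream JOIN, is what makes the check behave like a unit test for the dataset.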
Optimizing storage for large-scale data generation
Enforced Apache Parquet for all outputs. Provides columnar compression and embedded schema metadata. Ensures type consistency across entire pipeline.
Generating realistic telecom data with proper correlations
Used SDV for domain-informed generation. Encoded causal relationships: infrastructure → behavior → performance → events. Network events omit user IDs; user events include IDs.
Generated a production-scale dataset: 50K users, 2K cells, 5.6M sessions, and 104K events. Achieved reproducible outputs with deterministic seeding and validated referential integrity with foreign key checks. Parquet compression provides efficient storage with embedded schema metadata.