sim2ml — Synthetic Dataset Generation for ML Testing
December 31, 2022
sim2ml (Scenario Manipulator) is a toolchain that generates synthetic datasets for search-based testing of machine learning systems. It targets traffic monitoring systems and measures how environmental parameters such as weather, lighting, and time of day affect the accuracy of a license plate detector.
Scenario Creation
Scenario Generation
How It Works
- A base scenario.JSON is defined in the BERGE simulator
- The ScenarioManipulator vectorizes the scenario parameters (lighting, time, paths)
- The tool mutates the scenario by applying noise vectors to generate diverse variants
- Each variant is rendered in BERGE and evaluated using the CAMEA LP detector
- Metrics (IoU, Levenshtein distance, box score, OCR score) are collected per frame
- A linear regression model is fitted to estimate the effect of each parameter on detection quality
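The vectorize, mutate, and regress steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the sim2ml implementation: the parameter names, the [0, 1] normalization, the Gaussian noise model, and the helper names `vectorize`, `mutate`, and `fit_linear_effects` are all hypothetical, and the real scenario.JSON schema is not shown in this document. Rendering in BERGE and scoring with the detector are left out, since their interfaces are not described here.

```python
import numpy as np

# Hypothetical parameter names; the real scenario.JSON schema may differ.
PARAM_NAMES = ["lighting", "time_of_day"]


def vectorize(scenario):
    """Flatten the chosen scenario parameters into a 1-D float vector."""
    return np.array([scenario[name] for name in PARAM_NAMES], dtype=float)


def mutate(base_vector, rng, sigma=0.1, n_variants=5):
    """Create variants by adding Gaussian noise vectors to the base vector.

    Assumes every parameter is normalized to [0, 1]; values are clipped
    back into that range after the noise is applied.
    """
    noise = rng.normal(0.0, sigma, size=(n_variants, base_vector.size))
    return np.clip(base_vector + noise, 0.0, 1.0)


def fit_linear_effects(X, y):
    """Ordinary least squares fit of y on X with an intercept term.

    Returns (intercept, coefficients); each coefficient estimates how
    strongly the matching parameter moves the detection metric.
    """
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]
```

In use, `X` would hold one row per rendered variant and `y` the corresponding per-variant metric (for example the mean iou), so the returned coefficients can be read as the per-parameter effects the last step refers to.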
Sensors Supported
The BERGE simulator offers cameras, distorted cameras, depth sensors, radars, and semantic segmentation sensors.
Metrics
For each frame image, four metrics are tracked:
- iou — intersection over union of detected plate bounding boxes
- lev — Levenshtein distance between detected and ground-truth plate text
- box_score — bounding box confidence
- ocr_score — OCR confidence
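For reference, the two computed metrics (iou and lev) can be written in a few lines of plain Python. This is a minimal sketch using the textbook definitions, not the toolchain's own code. It assumes boxes are given as (x_min, y_min, x_max, y_max) tuples and that iou compares a detected box against a reference box; both are assumptions. The box_score and ocr_score values come from the detector itself and are not recomputed here.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max); this layout is an assumption.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, zero if the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between strings."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                    # delete ch_a
                curr[j - 1] + 1,                # insert ch_b
                prev[j - 1] + (ch_a != ch_b),   # substitute or match
            ))
        prev = curr
    return prev[-1]
```

A lev of 0 means the recognized plate text matches the ground truth exactly, and an iou of 1.0 means the two boxes coincide.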
Funding
This work was carried out under the VALU3S project, funded by the ECSEL Joint Undertaking (JU) under grant agreement No 876852 (EU Horizon 2020).