
# Synthetic Data Quality Assurance 🔎


Documentation | Sample Reports | Technical White Paper

Assess the fidelity and novelty of synthetic samples with respect to original samples:

  1. calculate a rich set of accuracy, similarity and distance metrics
  2. visualize statistics for easy comparison to training and holdout samples
  3. generate a standalone, easy-to-share, easy-to-read HTML summary report

...all with a few lines of Python code πŸ’₯.


## Installation

The latest release of `mostlyai-qa` can be installed via pip:

```bash
pip install -U mostlyai-qa
```

On Linux, one can explicitly install the CPU-only variant of torch together with `mostlyai-qa`:

```bash
pip install -U torch==2.6.0+cpu torchvision==0.21.0+cpu mostlyai-qa --extra-index-url https://download.pytorch.org/whl/cpu
```
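To verify that the installation succeeded, a minimal sanity check (the distribution name `mostlyai-qa` is the only package-specific detail here):

```python
# minimal post-install sanity check: the import succeeds only if the
# package and its dependencies installed correctly
from importlib.metadata import version

from mostlyai import qa  # noqa: F401

print(version("mostlyai-qa"))
```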

## Quick Start

```python
import webbrowser

import pandas as pd

from mostlyai import qa

# initialize logging to stdout
qa.init_logging()

# fetch original + synthetic data
base_url = "https://github.com/mostly-ai/mostlyai-qa/raw/refs/heads/main/examples/quick-start"
syn = pd.read_csv(f"{base_url}/census2k-syn_mostly.csv.gz")
# syn = pd.read_csv(f"{base_url}/census2k-syn_flip30.csv.gz")  # a 30% perturbation of trn
trn = pd.read_csv(f"{base_url}/census2k-trn.csv.gz")
hol = pd.read_csv(f"{base_url}/census2k-hol.csv.gz")

# calculate metrics
report_path, metrics = qa.report(
    syn_tgt_data=syn,
    trn_tgt_data=trn,
    hol_tgt_data=hol,
)

# pretty-print metrics
print(metrics.model_dump_json(indent=4))

# open the HTML report in a new browser window
webbrowser.open(f"file://{report_path.absolute()}")
```

## Basic Usage

```python
from mostlyai import qa

# initialize logging to stdout
qa.init_logging()

# analyze single-table data
report_path, metrics = qa.report(
    syn_tgt_data=synthetic_df,
    trn_tgt_data=training_df,
    hol_tgt_data=holdout_df,  # optional
)

# analyze sequential data
report_path, metrics = qa.report(
    syn_tgt_data=synthetic_df,
    trn_tgt_data=training_df,
    hol_tgt_data=holdout_df,  # optional
    tgt_context_key="user_id",
)

# analyze sequential data with context
report_path, metrics = qa.report(
    syn_tgt_data=synthetic_df,
    trn_tgt_data=training_df,
    hol_tgt_data=holdout_df,  # optional
    syn_ctx_data=synthetic_context_df,
    trn_ctx_data=training_context_df,
    hol_ctx_data=holdout_context_df,  # optional
    ctx_primary_key="id",
    tgt_context_key="user_id",
)
```
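For the sequential cases, the context table is expected to hold one row per entity and the target table many rows per entity, joined via `ctx_primary_key` on the context side and `tgt_context_key` on the target side. A minimal sketch of that layout follows; the toy data and column names are invented for illustration:

```python
import pandas as pd

# context table: one row per user, identified by the primary key "id"
training_context_df = pd.DataFrame({
    "id": [1, 2],
    "signup_channel": ["web", "mobile"],
})

# target table: many rows per user, linked back via "user_id"
training_df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "amount": [9.99, 4.50, 12.00, 3.20, 7.75],
})
```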

## Sample Reports

## Citation

Please consider citing our project if you find it useful:

```bibtex
@misc{mostlyai-qa,
  title={Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework},
  author={Andrey Sidorenko and Michael Platzer and Mario Scriminaci and Paul Tiwald},
  year={2025},
  eprint={2504.01908},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2504.01908},
}
```