What is declarative data generation?
How can you generate synthetic data for testing?
What declarative data generation libraries are available for Python?
In this short article I will show you how to use a little bit of YAML to generate synthetic data that can be used for testing.
Declarative data generation is a simple yet powerful concept. All you have to do is declare the structure of your target datasets, and a library synthesizes the data for you. This is very useful when you are developing new tools, applications, or libraries: you can use such data for unit tests or for previewing new UI features. The schema of your data can be defined either as generic documents or as a relational schema.
An example of a tool that allows you to generate data this way is https://github.com/getsynth/synth.
Another important concept in declarative data generation is the use of simple functions that each generate a specific kind of data. In https://github.com/joke2k/faker these are called `formatters`. You build your schema out of these formatters.
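To make the idea concrete, here is a minimal sketch of a formatter registry that uses only the Python standard library. The names `FORMATTERS`, `generate_rows`, and the three toy formatters are hypothetical and chosen purely for illustration; this is not the API of Faker or Synth, which ship far richer sets of formatters.

```python
import random

# Hypothetical toy formatters: each is a plain function that returns one value.
def random_int(rng, low=0, high=9999):
    return rng.randint(low, high)

def safe_email(rng):
    user = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(8))
    return f"{user}@example.com"

def country(rng):
    return rng.choice(["Poland", "Germany", "Japan", "Brazil"])

# Registry that maps a formatter name to the function implementing it.
FORMATTERS = {"random_int": random_int, "safe_email": safe_email, "country": country}

def generate_rows(schema, size, seed=None):
    """Build `size` rows; `schema` maps column name -> (formatter name, kwargs)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(size):
        row = {}
        for column, (formatter, kwargs) in schema.items():
            row[column] = FORMATTERS[formatter](rng, **kwargs)
        rows.append(row)
    return rows

# The declaration itself: only names and parameters, no generation logic.
user_schema = {
    "id": ("random_int", {"low": 1, "high": 100}),
    "email": ("safe_email", {}),
    "country": ("country", {}),
}

rows = generate_rows(user_schema, size=3, seed=42)
```

The point of the split is that the schema stays pure data, so adding a new kind of value only means registering one more function.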
Okay, so now let's get to the fun part. How do you declaratively generate data? First, let's take Synth into our crosshairs. It proposes that you declare your documents in this way:

    {
        "type": "array",
        "length": {
            "type": "number",
            "constant": 1
        },
        "content": {
            "type": "object",
            "id": {
                "type": "number",
                "id": {}
            },
            "email": {
                "type": "string",
                "faker": {
                    "generator": "safe_email"
                }
            },
            "joined_on": {
                "type": "string",
                "date_time": {
                    "format": "%Y-%m-%d",
                    "subtype": "naive_date",
                    "begin": "2010-01-01",
                    "end": "2020-01-01"
                }
            }
        }
    }

You can see how concise and simple that is. To generate your data you only need to run the `synth generate` command. However, the library does not offer Python support, so you are limited to the generators/formatters it already provides unless you implement new ones in Rust yourself. If you are reading this article, you are most likely more interested in Python.
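For Python readers, the meaning of such a document schema can be illustrated with a toy interpreter. The sketch below is not how Synth is implemented and covers only the handful of node types shown in the example. The function `generate`, the `counters` argument, the assumption that ids auto-increment from 1, and the stand-in for the `safe_email` generator are all hypothetical.

```python
import random
from datetime import datetime, timedelta

def generate(node, rng, counters):
    """Recursively interpret a small subset of a Synth-style schema node."""
    kind = node["type"]
    if kind == "array":
        length = generate(node["length"], rng, counters)
        return [generate(node["content"], rng, counters) for _ in range(length)]
    if kind == "object":
        # Every key other than "type" is treated as a field of the object.
        return {k: generate(v, rng, counters) for k, v in node.items() if k != "type"}
    if kind == "number":
        if "constant" in node:
            return node["constant"]
        if "id" in node:
            # Assumption: ids auto-increment, starting at 1.
            counters["id"] = counters.get("id", 0) + 1
            return counters["id"]
    if kind == "string":
        if "faker" in node:
            # Stand-in for a real generator; the generator name is ignored here.
            return f"user{rng.randint(1, 9999)}@example.com"
        if "date_time" in node:
            spec = node["date_time"]
            begin = datetime.strptime(spec["begin"], spec["format"])
            end = datetime.strptime(spec["end"], spec["format"])
            offset = rng.randint(0, (end - begin).days)
            return (begin + timedelta(days=offset)).strftime(spec["format"])
    raise ValueError(f"unsupported node: {node}")

# The schema from the article, with the array length raised to 2.
schema = {
    "type": "array",
    "length": {"type": "number", "constant": 2},
    "content": {
        "type": "object",
        "id": {"type": "number", "id": {}},
        "email": {"type": "string", "faker": {"generator": "safe_email"}},
        "joined_on": {
            "type": "string",
            "date_time": {
                "format": "%Y-%m-%d",
                "subtype": "naive_date",
                "begin": "2010-01-01",
                "end": "2020-01-01",
            },
        },
    },
}

users = generate(schema, random.Random(0), {})
```

Each node type maps to one small branch, which is why a declarative schema like this stays easy to extend.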
There are multiple libraries in Python that try to tackle this problem. Here are a few of them:
- https://pythonrepo.com/repo/fillmula-jsonclasses-python-deep-learning
- https://github.com/sdv-dev/SDV (includes https://github.com/sdv-dev/CTGAN and https://github.com/sdv-dev/Copulas)
- https://github.com/MTG/DeepConvSep
- https://github.com/tirthajyoti/pydbgen
- https://github.com/databrickslabs/dbldatagen
- https://github.com/matousc89/signalz
SDV is a highly interesting library. It uses a similar schema definition, but it infers what the generated data should look like from the data in the original tables.
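The difference between declaring and inferring can be shown with a toy example. The sketch below is not SDV's API and does not reflect its modelling approach; it only illustrates the general idea of learning simple per-column statistics from real rows and then sampling from them. The functions `fit` and `sample` are hypothetical, and every column is sampled independently, so any correlation between columns is lost.

```python
import random
import statistics

def fit(rows):
    """Learn a simple per-column model from real rows (a list of dicts)."""
    model = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric column: remember the mean and the standard deviation.
            model[column] = ("gauss", statistics.mean(values), statistics.stdev(values))
        else:
            # Any other column: keep the observed values and resample from
            # them, which preserves their relative frequencies.
            model[column] = ("choice", values)
    return model

def sample(model, size, seed=None):
    """Draw `size` synthetic rows from a fitted model."""
    rng = random.Random(seed)
    rows = []
    for _ in range(size):
        row = {}
        for column, spec in model.items():
            if spec[0] == "gauss":
                row[column] = rng.gauss(spec[1], spec[2])
            else:
                row[column] = rng.choice(spec[1])
        rows.append(row)
    return rows

real = [
    {"country": "Poland", "price": 10.0},
    {"country": "Poland", "price": 12.5},
    {"country": "Japan", "price": 11.0},
]
synthetic = sample(fit(real), size=5, seed=1)
```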
However, what I was looking for was something very simple and yet flexible, like Synth, but in Python.
This is why I have created a very simple library called Declarative Faker - https://github.com/FranekJemiolo/declarative-faker.
It utilizes Faker formatters and so it is very easy to extend.
Unlike Synth, it uses a relational schema that looks like this:

    # Schema names have to be unique among multiple directories
    tables:
      - name: users
        size: 1000 # Define how many records you want to generate
        columns:
          - name: id
            formatter: random_int # Use something that will be registered in faker
            args: [] # args are optional
            kwargs: {} # kwargs also are optional
            unique: true # unique elements have to be hashable!
          - name: name
            formatter: name # if you do not include args,kwargs,unique default will be used, which are [],{},False
          - name: country
            formatter: current_country
      - name: trades
        size: 10000
        columns:
          - name: id
            formatter: random_int
            unique: true
          - name: buyer_id
            relationship: # For relationships you do not include any other fields, as they rely on foreign keys
              kind: "1-many" # Available are 1-1, 1-many
              to: "schema.users.id"
          - name: seller_id
            relationship:
              kind: "1-many"
              to: "schema.users.id"
          - name: price
            formatter: pyfloat
            kwargs:
              positive: true
      - name: trades_report
        size: 1000
        columns:
          - name: id
            formatter: random_int
            unique: true
          - name: trade_id
            relationship:
              kind: "1-1"
              to: "schema.trades.id"
          - name: message
            formatter: text
          - name: buyer_id
            relationship:
              kind: "1-many"
              to: "schema.users.id"
          - name: seller_id
            relationship:
              kind: "1-many"
              to: "schema.users.id"

As you can see, it is very easy to read and to define.
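How a relational schema of this shape can be turned into rows is sketched below. This is a simplified, standard-library-only illustration and not the actual implementation of Declarative Faker: the dict mirrors a trimmed-down version of the YAML, `random.randint` stands in for real Faker formatters, and the helper names are hypothetical. It assumes that a referenced table is listed before the tables pointing to it, that `1-1` means each parent key is used at most once, and that `1-many` allows a parent key to be reused.

```python
import random

# A trimmed-down dict mirroring the YAML schema (formatters omitted for brevity).
schema = {
    "tables": [
        {"name": "users", "size": 5,
         "columns": [{"name": "id", "unique": True}]},
        {"name": "trades", "size": 20,
         "columns": [
             {"name": "id", "unique": True},
             {"name": "buyer_id",
              "relationship": {"kind": "1-many", "to": "schema.users.id"}},
         ]},
        {"name": "trades_report", "size": 10,
         "columns": [
             {"name": "trade_id",
              "relationship": {"kind": "1-1", "to": "schema.trades.id"}},
         ]},
    ]
}

def unique_values(make_value, size):
    """Call make_value until `size` distinct values exist (values must be hashable)."""
    seen, values = set(), []
    while len(values) < size:
        value = make_value()
        if value not in seen:
            seen.add(value)
            values.append(value)
    return values

def generate_tables(schema, seed=None):
    rng = random.Random(seed)
    tables = {}
    # Assumption: a referenced table appears before the tables that point to it.
    for table in schema["tables"]:
        size = table["size"]
        columns = {}
        for column in table["columns"]:
            relationship = column.get("relationship")
            if relationship:
                # "schema.users.id" -> table "users", column "id"
                _, parent, parent_column = relationship["to"].split(".")
                keys = [row[parent_column] for row in tables[parent]]
                if relationship["kind"] == "1-1":
                    # Each parent key is used at most once.
                    values = rng.sample(keys, size)
                else:
                    # "1-many": a parent key may be referenced many times.
                    values = [rng.choice(keys) for _ in range(size)]
            elif column.get("unique"):
                values = unique_values(lambda: rng.randint(1, 10**6), size)
            else:
                values = [rng.randint(1, 10**6) for _ in range(size)]
            columns[column["name"]] = values
        tables[table["name"]] = [
            {name: values[i] for name, values in columns.items()}
            for i in range(size)
        ]
    return tables

data = generate_tables(schema, seed=7)
```

Tracking already seen values in a set is also the reason why unique elements have to be hashable.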
To generate the data, run `gen_fake --schemas-dir <your_schema_dir> --out-dir <your_output_dir>`; it creates a CSV file for each table in every schema file you have defined. There is no support yet for generating the schema from an SQL database, but you can easily do that yourself by introspecting the database schema and emitting this simple YAML.
The YAML structure was inspired by the one used for model definitions in DBT - https://github.com/dbt-labs/dbt-core, and you can adapt the code to work with DBT too (I might add support for that at a later stage).
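As a rough illustration of such introspection, the sketch below reads table and foreign-key metadata from a SQLite database using the standard `sqlite3` module and builds a dict with the same shape as the schema file. The function `schema_from_sqlite`, the type-to-formatter mapping, the default `size`, and the decision to treat every foreign key as `1-many` are assumptions made for this example, not part of Declarative Faker.

```python
import sqlite3

# Assumed mapping from SQLite column types to Faker formatter names.
TYPE_TO_FORMATTER = {"INTEGER": "random_int", "REAL": "pyfloat", "TEXT": "text"}

def schema_from_sqlite(conn, schema_name="schema", size=100):
    """Build a dict shaped like the YAML schema from a SQLite database."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]
    tables = []
    for table in names:
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...).
        # This sketch expects the referenced column to be named explicitly.
        foreign_keys = {
            fk[3]: f"{schema_name}.{fk[2]}.{fk[4]}"
            for fk in conn.execute(f'PRAGMA foreign_key_list("{table}")')
        }
        columns = []
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        for _, name, col_type, _, _, pk in conn.execute(
                f'PRAGMA table_info("{table}")'):
            if name in foreign_keys:
                # Assumption: every foreign key is treated as 1-many.
                columns.append({
                    "name": name,
                    "relationship": {"kind": "1-many", "to": foreign_keys[name]},
                })
            else:
                column = {
                    "name": name,
                    "formatter": TYPE_TO_FORMATTER.get(col_type.upper(), "text"),
                }
                if pk:
                    column["unique"] = True
                columns.append(column)
        tables.append({"name": table, "size": size, "columns": columns})
    return {"tables": tables}

# Demo on an in-memory database with two related tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE trades (
        id INTEGER PRIMARY KEY,
        buyer_id INTEGER REFERENCES users(id),
        price REAL
    );
""")
schema = schema_from_sqlite(conn)
```

The resulting dict can then be written to a file with a YAML library, for example PyYAML's `yaml.safe_dump`.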
I hope you have found this quick read at least slightly valuable and that it has sparked your interest in declarative data generation.