Declarative Data Generation in Python

A quick overview of available open source declarative data generation libraries

What is declarative data generation?

How can you generate synthetic data for testing?

What declarative data generation libraries are available for Python?

In this short article I will show you how to use a little YAML to generate synthetic data for testing.

Declarative data generation is a simple yet powerful concept: you declare the structure of your target datasets, and a library synthesizes the data for you. It is very useful when you are developing new tools, applications, and libraries. You can use such data for unit tests, or for previewing new UI features. The schema of your data can be defined either as generic documents or as a relational schema.
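To make the idea concrete, here is a minimal, hypothetical sketch of the pattern in plain Python: a schema declared as a dict, and a generic function that synthesizes rows from it. The names (`SCHEMA`, `generate`, `make_value`) and the spec format are made up for illustration only.

```python
import random
import string

# A hypothetical declarative schema: each column maps to a generator spec.
SCHEMA = {
    "size": 3,
    "columns": {
        "id": {"type": "int", "min": 1, "max": 1000},
        "name": {"type": "str", "length": 8},
    },
}

def make_value(spec):
    """Synthesize a single value from a column spec."""
    if spec["type"] == "int":
        return random.randint(spec["min"], spec["max"])
    if spec["type"] == "str":
        return "".join(random.choices(string.ascii_lowercase, k=spec["length"]))
    raise ValueError(f"unknown type: {spec['type']}")

def generate(schema):
    """Turn a declarative schema into a list of row dicts."""
    return [
        {name: make_value(spec) for name, spec in schema["columns"].items()}
        for _ in range(schema["size"])
    ]

rows = generate(SCHEMA)
```

Real libraries add much more on top (uniqueness, relationships, rich formatters), but the core loop is exactly this: interpret the declaration, emit the rows.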

An example of a tool that allows you to generate data this way is https://github.com/getsynth/synth.

Another important concept in declarative data generation is the use of small functions that each generate a specific kind of data. In https://github.com/joke2k/faker these are called `formatters`, and you build your schema out of them.

Okay, so now let's get to the fun part: how do you declaratively generate data? First, let's take Synth into our crosshairs. The library has you declare your documents like this:

```json
{
    "type": "array",
    "length": {
        "type": "number",
        "constant": 1
    },
    "content": {
        "type": "object",
        "id": {
            "type": "number",
            "id": {}
        },
        "email": {
            "type": "string",
            "faker": {
                "generator": "safe_email"
            }
        },
        "joined_on": {
            "type": "string",
            "date_time": {
                "format": "%Y-%m-%d",
                "subtype": "naive_date",
                "begin": "2010-01-01",
                "end": "2020-01-01"
            }
        }
    }
}
```


You can see how concise and simple that is. To generate your data you only need to run the `synth generate` command. However, the library does not offer Python support, so you are either limited to the data generators/formatters it already provides, or you have to implement new ones in Rust yourself. If you are reading this article, you are most likely more interested in Python.

There are multiple libraries in Python that try to tackle this problem. Here are a few of them:
- https://pythonrepo.com/repo/fillmula-jsonclasses-python-deep-learning
- https://github.com/sdv-dev/SDV (includes https://github.com/sdv-dev/CTGAN and https://github.com/sdv-dev/Copulas)
- https://github.com/MTG/DeepConvSep
- https://github.com/tirthajyoti/pydbgen
- https://github.com/databrickslabs/dbldatagen
- https://github.com/matousc89/signalz

SDV is a highly interesting library: it uses a similar schema definition, but it learns what the data should look like from the original tables' contents.
However, what I was looking for was something very simple yet flexible, like Synth, but in Python.

This is why I have created a very simple library called Declarative Faker - https://github.com/FranekJemiolo/declarative-faker.

It utilizes Faker formatters, so it is very easy to extend.

Unlike Synth, it uses a relational schema that looks like this:

```yaml
# Schema names have to be unique among multiple directories
tables:
  - name: users
    size: 1000  # Define how many records you want to generate
    columns:
      - name: id
        formatter: random_int  # Use something that will be registered in faker
        args: []  # args are optional
        kwargs: {}  # kwargs also are optional
        unique: true  # unique elements have to be hashable!
      - name: name
        formatter: name  # if you do not include args,kwargs,unique the defaults are used: [],{},false
      - name: country
        formatter: current_country
  - name: trades
    size: 10000
    columns:
      - name: id
        formatter: random_int
        unique: true
      - name: buyer_id
        relationship:  # For relationships you do not include any other fields, as they rely on foreign keys
          kind: "1-many"  # Available are 1-1, 1-many
          to: "schema.users.id"
      - name: seller_id
        relationship:
          kind: "1-many"
          to: "schema.users.id"
      - name: price
        formatter: pyfloat
        kwargs:
          positive: true
  - name: trades_report
    size: 1000
    columns:
      - name: id
        formatter: random_int
        unique: true
      - name: trade_id
        relationship:
          kind: "1-1"
          to: "schema.trades.id"
      - name: message
        formatter: text
      - name: buyer_id
        relationship:
          kind: "1-many"
          to: "schema.users.id"
      - name: seller_id
        relationship:
          kind: "1-many"
          to: "schema.users.id"
```


As you can see, it is very easy to read and to define.

To generate the data you run `gen_fake --schemas-dir <your_schema_dir> --out-dir <your_output_dir>`, and it creates a CSV file for each table in every schema file you have defined. There is no support yet for generating a schema from a SQL database, but you can easily do that yourself by introspecting the database schema and emitting this simple YAML.
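As a rough illustration of that introspection idea, here is a hypothetical sketch using Python's built-in `sqlite3`: it reads the columns of each table and builds a schema dict in the shape shown above, which you could then dump to YAML with a library such as PyYAML. The type-to-formatter mapping is a made-up assumption, not part of the library.

```python
import sqlite3

# Hypothetical mapping from SQLite column types to Faker formatter names.
TYPE_TO_FORMATTER = {"INTEGER": "random_int", "TEXT": "text", "REAL": "pyfloat"}

def introspect(conn, size=1000):
    """Build a declarative schema dict from a SQLite database."""
    tables = []
    cur = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    for (table_name,) in cur.fetchall():
        columns = []
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        for _, col_name, col_type, *_ in conn.execute(
            f"PRAGMA table_info({table_name})"
        ):
            columns.append({
                "name": col_name,
                "formatter": TYPE_TO_FORMATTER.get(col_type, "text"),
            })
        tables.append({"name": table_name, "size": size, "columns": columns})
    return {"tables": tables}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, balance REAL)")
schema = introspect(conn)
```

Foreign-key constraints (via `PRAGMA foreign_key_list`) could similarly be translated into the `relationship` entries, but that is left out of this sketch.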

The YAML structure was inspired by the structure used for model definitions in DBT - https://github.com/dbt-labs/dbt-core, and you can adapt the code to work with DBT too (I might add support for that at a later stage).

I hope you found this quick read at least slightly valuable and that it has piqued your interest in declarative data generation.