AI wastes a lot of time when used without careful oversight
LLM code generation dramatically speeds up development until it doesn’t. If you’ve worked with AI-generated code, you’ve likely experienced:
Code that does something completely different than what you wanted
Bugs buried in AI spaghetti code that take longer to fix than writing the code yourself
Dangerous mishaps: engineers running unsupervised AI code that deletes git history or drops databases, unable to explain what the script actually does
This post shows you how to prevent that. You’ll learn to specify requirements so that LLMs generate exactly what you need, without having to wade through thousands of lines of code to verify correctness.
What you’ll gain:
- A framework for constraining LLM requirements to generate correct pipelines
- Human-readable documentation that non-technical stakeholders can understand
The Solution: Structured Specifications + Templates
LLMs are great for code generation, but they can produce code that’s incorrect or inconsistent with your codebase.
Describe problems using Gherkin (a constrained syntax ideal for data pipelines) and provide a sample code template to guide the LLM toward your desired structure.
Setup
To follow along, install uv and run the following commands.
1. Clone & cd into the repo for the code examples
2. Use uv to install libraries
3. Run behaviour tests with the behave library
Step 1: Define Requirements with Gherkin
LLMs can generate code from English descriptions, but vague language leads to incorrect code and wasted debugging time.
As a data engineer, you identify inputs, define output schemas, columns, and constraints, and then write transformation code. LLMs can automate that last step, but only if requirements are clear and consistent.
Gherkin provides keywords that constrain English into unambiguous specifications that your team can adopt and that stakeholders can read.
Here’s an example:
Feature: customer_store_metrics Data Pipeline
  As a data engineer
  I want to validate the customer_store_metrics ETL pipeline
  So that I can ensure data quality and correctness

  Background:
    Given the following input data sources
      | input_data_name | input_data_location |
      | transactions | transactions.csv |
      | stores | stores.csv |
      | customers | customers.csv |

  Scenario: Validate output schema and primary key grain
    When I transform the input data
    Then the output should have primary key grain of "customer_id, store_id"
    And the output schema should match
      | column_name | column_data_type | desc |
      | customer_id | IntegerType | Unique identifier for the customer |
      | customer_name | StringType | Full name of the customer |
      | tier | StringType | Customer tier level (bronze/silver/gold) |
      | signup_date | DateType | Date when customer registered/signed up |
      | store_id | IntegerType | Unique identifier for the store |
      | store_name | StringType | Name of the store location |
      | region | StringType | Geographic region where store is located |
      | total_transactions | LongType | Count of all transactions for this customer at this store |
      | total_spend | DecimalType(18,2) | Sum of all transaction amounts for this customer at this store |
      | avg_transaction_value | DecimalType(18,2) | Average amount per transaction |
      | first_purchase_date | DateType | Date of customer's first purchase at this store |
      | last_purchase_date | DateType | Date of customer's most recent purchase at this store |
      | avg_monthly_spend | DecimalType(18,2) | Average spend per month (total spend divided by number of distinct months with transactions) |
      | days_since_last_purchase | IntegerType | Number of days from last purchase to current date |
      | pct_of_customer_total_spend | DecimalType(10,2) | Percentage of customer's total spend across all stores that occurred at this specific store |
      | spend_growth_rate_90d | DecimalType(10,4) | Growth rate comparing last 90 days spend vs previous 90 days (days 90-180 ago), expressed as decimal (e.g., 0.25 = 25% growth) |
      | customer_rank_in_store | IntegerType | Rank of customer by total spend within this store (1 = highest spender) |
      | customer_lifetime_value | DecimalType(18,2) | Total spend by customer across all stores |
Each part of the spec plays a role:
1. Feature: High-level pipeline overview and specification intent.
2. Background/Given: Sample data for validation. Include data setup as needed. Our join keys are obvious to the LLMs from the data sample; if that is not the case, make sure to specify the join criteria here.
3. Scenario: The test case being validated. A Feature can have multiple Scenarios.
4. When: The action or transformation to be performed.
5. Then/And: Then defines the expected output; And adds additional expectations of the output. E.g. Then -> PK is valid & And -> Schema must match the given schema.
For details on Gherkin, read the specs here.
Clear column descriptions are critical because they determine the quality of the generated code. Examine the column definitions in the Gherkin file above. How could the descriptions be better? Let me know in the comments below.
Use a query like the one sketched below to extract samples of input data, with headers, from your production system (anonymize if needed). The query:
1. Gets all columns from table_1
2. Always orders by table_1's id, which ensures that the data is joinable when we test locally
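For illustration, here is a minimal PySpark sketch of such a sampling query; table_1, its id column, and the 100-row limit are placeholders, not the post's actual query:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sample_extract").getOrCreate()

sample = spark.sql(
    """
    SELECT *      -- 1. gets all columns from table_1
    FROM table_1
    ORDER BY id   -- 2. stable ordering keeps local test data joinable
    LIMIT 100     -- illustrative sample size
    """
)
sample.toPandas().to_csv("table_1.csv", index=False)  # CSV with headers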
In our example, we have the input files transactions.csv, stores.csv, & customers.csv stored in the ./data folder.
Step 2: Create a Code Template
Most data pipelines follow Extract → Transform → Validate → Load. Create a template (and save it in your version control) that shows the LLM your desired code structure.
from pyspark.sql import DataFrame, SparkSession


def extract(spark: SparkSession) -> dict[str, DataFrame]:
    """Function to extract data from input sources and parse them into Spark DataFrames

    Args:
        spark: SparkSession to connect to input sources

    Returns:
        A dictionary with data name as key and Spark DataFrame as value
    """
    pass


def transform(input_dataframes: dict[str, DataFrame]) -> DataFrame:
    """Function to transform input data into output

    Args:
        input_dataframes: A dictionary with data name as key and Spark DataFrame as value

    Returns:
        The transformed dataframe
    """
    pass


def validate(transformed_dataframe: DataFrame) -> bool:
    """Function to run data quality checks (if any) on the transformed_dataframe
    before loading it into the final destination

    Args:
        transformed_dataframe: The transformed dataframe

    Returns:
        A boolean that is True if DQ checks pass, else False
    """
    return True


def load(transformed_dataframe: DataFrame) -> None:
    """Function to load data into the final destination.

    Args:
        transformed_dataframe: The transformed_dataframe that has been quality checked
    """
    pass


def run(spark: SparkSession) -> None:
    """Function to run Extract, Transform, Validate and Load functions
    in the expected order

    Args:
        spark: The SparkSession to connect to a running Spark Application

    Raises:
        ValueError: Raise a value error in case the data quality checks fail
    """
    transformed_dataframe = transform(extract(spark))
    if not validate(transformed_dataframe):
        raise ValueError("DataFrame does not meet DQ expectations")
    load(transformed_dataframe)


if __name__ == '__main__':
    # TODO: Other inputs as arguments
    spark = (
        SparkSession.builder.appName("table_name")
        .enableHiveSupport()
        .getOrCreate()
    )
    spark.sparkContext.setLogLevel("ERROR")
    run(spark)

The input and output data types, and the function documentation, guide the LLMs in designing the code to match your conventions.
See the details for the E->T->V->L pattern here.
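As a hedged illustration of how two of these slots might be filled in for our example, here is one possible extract and validate; the CSV read options and the grain check are assumptions based on the spec above, not code from the post:

from pyspark.sql import DataFrame, SparkSession


def extract(spark: SparkSession) -> dict[str, DataFrame]:
    """Read the sampled CSVs from ./data, keyed by input_data_name."""
    return {
        name: spark.read.csv(f"./data/{name}.csv", header=True, inferSchema=True)
        for name in ("transactions", "stores", "customers")
    }


def validate(transformed_dataframe: DataFrame) -> bool:
    """Check the spec's primary key grain: one row per (customer_id, store_id)."""
    total_rows = transformed_dataframe.count()
    distinct_keys = (
        transformed_dataframe.select("customer_id", "store_id").distinct().count()
    )
    return total_rows == distinct_keys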
Step 3: Generate Code and Tests
Provide both the Gherkin spec and template to an LLM, and you’ll get working pipeline code. Here is the prompt that was used.
We can also create tests for the scenarios.
Use the same Gherkin definition to generate tests via the behave library, which executes Gherkin specifications as test cases. Here is the prompt that was used.
Use the Gherkin definition (pasted) to create a python file with
the tests runnable with the python behave library
(assume that this is installed).
Assume that there exists a transform function that transforms
the input and this is the function to be tested.
Assume the input data are present in a data folder one level up.

Note that these tests are more akin to data quality checks than unit tests.
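For reference, a minimal sketch of the behave step definitions the prompt asks for might look like this; the pipeline module name, the steps/ file location, and the Spark session handling are all assumptions, and only the grain check is shown:

from behave import given, when, then
from pyspark.sql import SparkSession

from pipeline import transform  # hypothetical module generated from the template


@given("the following input data sources")
def step_load_inputs(context):
    spark = SparkSession.builder.appName("behave_tests").getOrCreate()
    # context.table holds the Gherkin table (input_data_name -> input_data_location)
    context.inputs = {
        row["input_data_name"]: spark.read.csv(
            f"../data/{row['input_data_location']}", header=True, inferSchema=True
        )
        for row in context.table
    }


@when("I transform the input data")
def step_transform(context):
    context.output = transform(context.inputs)


@then('the output should have primary key grain of "{grain}"')
def step_check_grain(context, grain):
    keys = [c.strip() for c in grain.split(",")]
    total = context.output.count()
    distinct = context.output.select(*keys).distinct().count()
    assert total == distinct, f"{grain} is not a unique grain"

A similar @then step would walk context.table to compare the declared columns and types against context.output.schema.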
Gherkin was designed to be a shared document between stakeholders and engineers (BDD), and as such, it reflects the characteristics of data that stakeholders care about.
See it in action: LLM Chat.
Make sure to clearly understand the code that LLMs produce, especially your transformations.
Key Takeaways
Code generation is the easy part. The hard part remains: knowing your inputs and defining your expected output (aka data modeling & system design).
Even without the behave library, the Gherkin pattern (Scenario → Given → When → Then → And) provides a constrained framework for specifying requirements to LLMs, improving your chances of generating correct pipelines. It’s a way of thinking—the tooling just supports it.
Planning to use this approach? Share your experience in the comments below.

