AI wastes a lot of time when used without careful oversight
LLM code generation dramatically speeds up development until it doesn’t. If you’ve worked with AI-generated code, you’ve likely experienced:
Code that does something completely different than what you wanted
Bugs buried in AI spaghetti code that take longer to fix than writing the code yourself
Dangerous mishaps: engineers running unsupervised AI code that deletes git history or drops databases, unable to explain what the script actually does
This post shows you how to prevent that. You’ll learn to specify requirements so that LLMs generate exactly what you need, without having to wade through thousands of lines of code to verify correctness.
What you’ll gain:
- A framework for constraining LLM requirements to generate correct pipelines
- Human-readable documentation that non-technical stakeholders can understand
The Solution: Structured Specifications + Templates
LLMs are great for code generation, but they can produce code that’s incorrect or inconsistent with your codebase.
Describe problems using Gherkin (a constrained syntax ideal for data pipelines) and provide a sample code template to guide the LLM toward your desired structure.
Setup
To follow along, install uv and run the following commands.
1. Clone & cd into the repo for the code examples
2. Use uv to install libraries
3. Run behaviour tests with the behave library
Step 1: Define Requirements with Gherkin
LLMs can generate code from English descriptions, but vague language leads to incorrect code and wasted debugging time.
As a data engineer, you identify inputs, define output schemas, columns, and constraints, and then write transformation code. LLMs can automate that last step, but only if requirements are clear and consistent.
Gherkin provides keywords that constrain English into unambiguous specifications that your team can adopt and that stakeholders can read.
Here’s an example:
Feature: customer_store_metrics Data Pipeline
  As a data engineer
  I want to validate the customer_store_metrics ETL pipeline
  So that I can ensure data quality and correctness

  Background:
    Given the following input data sources
      | input_data_name | input_data_location |
      | transactions | transactions.csv |
      | stores | stores.csv |
      | customers | customers.csv |

  Scenario: Validate output schema and primary key grain
    When I transform the input data
    Then the output should have primary key grain of "customer_id, store_id"
    And the output schema should match
      | column_name | column_data_type | desc |
      | customer_id | IntegerType | Unique identifier for the customer |
      | customer_name | StringType | Full name of the customer |
      | tier | StringType | Customer tier level (bronze/silver/gold) |
      | signup_date | DateType | Date when customer registered/signed up |
      | store_id | IntegerType | Unique identifier for the store |
      | store_name | StringType | Name of the store location |
      | region | StringType | Geographic region where store is located |
      | total_transactions | LongType | Count of all transactions for this customer at this store |
      | total_spend | DecimalType(18,2) | Sum of all transaction amounts for this customer at this store |
      | avg_transaction_value | DecimalType(18,2) | Average amount per transaction |
      | first_purchase_date | DateType | Date of customer's first purchase at this store |
      | last_purchase_date | DateType | Date of customer's most recent purchase at this store |
      | avg_monthly_spend | DecimalType(18,2) | Average spend per month (total spend divided by number of distinct months with transactions) |
      | days_since_last_purchase | IntegerType | Number of days from last purchase to current date |
      | pct_of_customer_total_spend | DecimalType(10,2) | Percentage of customer's total spend across all stores that occurred at this specific store |
      | spend_growth_rate_90d | DecimalType(10,4) | Growth rate comparing last 90 days spend vs previous 90 days (days 90-180 ago), expressed as decimal (e.g., 0.25 = 25% growth) |
      | customer_rank_in_store | IntegerType | Rank of customer by total spend within this store (1 = highest spender) |
      | customer_lifetime_value | DecimalType(18,2) | Total spend by customer across all stores |
Each part of the spec plays a role:
1. Feature: High-level pipeline overview and specification intent.
2. Background/Given: Sample data for validation. Include data setup as needed. Our join keys are obvious to the LLMs from the data sample; if that is not the case, make sure to specify the join criteria here.
3. Scenario: The test case being validated. A Feature can have multiple Scenarios.
4. When: The action or transformation to be performed.
5. Then/And: Then defines the expected output; And adds additional expectations of the output. E.g. Then -> PK is valid & And -> Schema must match the given schema.
For details on Gherkin, read the specs here.
Clear column descriptions are critical because they determine the quality of the generated code. Examine the column definitions in the Gherkin file above. How could the descriptions be better? Let me know in the comments below.
Use a query like the one sketched below to extract samples of input data, with headers, from your production system (anonymize if needed). The query:
1. Gets all columns from table_1
2. Always orders by table_1's id, which ensures that the data is joinable when we test locally
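For illustration, here is a minimal PySpark sketch of such a sampling query; table_1, its id column, and the 100-row limit are placeholders, not the post's actual query:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sample_extract").getOrCreate()

sample = spark.sql(
    """
    SELECT *      -- 1. gets all columns from table_1
    FROM table_1
    ORDER BY id   -- 2. stable ordering keeps local test data joinable
    LIMIT 100     -- illustrative sample size
    """
)
sample.toPandas().to_csv("table_1.csv", index=False)  # CSV with headers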
In our example, we have the input files transactions.csv, stores.csv, & customers.csv stored in the ./data folder.
Step 2: Create a Code Template
Most data pipelines follow Extract → Transform → Validate → Load. Create a template (and save it in your version control) that shows the LLM your desired code structure.
from pyspark.sql import DataFrame, SparkSession


def extract(spark: SparkSession) -> dict[str, DataFrame]:
    """Function to extract data from input sources and parse them into Spark DataFrames

    Args:
        spark: SparkSession to connect to input sources

    Returns:
        A dictionary with data name as key and Spark DataFrame as value
    """
    pass


def transform(input_dataframes: dict[str, DataFrame]) -> DataFrame:
    """Function to transform input data into output

    Args:
        input_dataframes: A dictionary with data name as key and Spark DataFrame as value

    Returns:
        The transformed dataframe
    """
    pass


def validate(transformed_dataframe: DataFrame) -> bool:
    """Function to run data quality checks (if any) on the transformed_dataframe
    before loading it into the final destination

    Args:
        transformed_dataframe: The transformed dataframe

    Returns:
        A boolean that is True if DQ checks pass, else False
    """
    return True


def load(transformed_dataframe: DataFrame) -> None:
    """Function to load data into the final destination.

    Args:
        transformed_dataframe: The transformed_dataframe that has been quality checked
    """
    pass


def run(spark: SparkSession) -> None:
    """Function to run Extract, Transform, Validate and Load functions
    in the expected order

    Args:
        spark: The SparkSession to connect to a running Spark Application

    Raises:
        ValueError: Raise a value error in case the data quality checks fail
    """
    transformed_dataframe = transform(extract(spark))
    if not validate(transformed_dataframe):
        raise ValueError("DataFrame does not meet DQ expectations")
    load(transformed_dataframe)


if __name__ == '__main__':
    # TODO: Other inputs as arguments
    spark = (
        SparkSession.builder.appName("table_name")
        .enableHiveSupport()
        .getOrCreate()
    )
    spark.sparkContext.setLogLevel("ERROR")
    run(spark)

The input and output data types, and the function documentation, guide the LLMs in designing the code to match your conventions.
See the details for the E->T->V->L pattern here.
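As a hedged illustration of how two of these slots might be filled in for our example, here is one possible extract and validate; the CSV read options and the grain check are assumptions based on the spec above, not code from the post:

from pyspark.sql import DataFrame, SparkSession


def extract(spark: SparkSession) -> dict[str, DataFrame]:
    """Read the sampled CSVs from ./data, keyed by input_data_name."""
    return {
        name: spark.read.csv(f"./data/{name}.csv", header=True, inferSchema=True)
        for name in ("transactions", "stores", "customers")
    }


def validate(transformed_dataframe: DataFrame) -> bool:
    """Check the spec's primary key grain: one row per (customer_id, store_id)."""
    total_rows = transformed_dataframe.count()
    distinct_keys = (
        transformed_dataframe.select("customer_id", "store_id").distinct().count()
    )
    return total_rows == distinct_keys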
Step 3: Generate Code and Tests
Provide both the Gherkin spec and template to an LLM, and you’ll get working pipeline code. Here is the prompt that was used.
We can also create tests for the scenarios.
Use the same Gherkin definition to generate tests via the behave library, which executes Gherkin specifications as test cases. Here is the prompt that was used.
Use the Gherkin definition (pasted) to create a python file with
the tests runnable with the python behave library
(assume that this is installed).
Assume that there exists a transform function that transforms
the input and this is the function to be tested.
Assume the input data are present in a data folder one level up.

Note that these tests are more akin to data quality checks than unit tests.
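For reference, a minimal sketch of the behave step definitions the prompt asks for might look like this; the pipeline module name, the steps/ file location, and the Spark session handling are all assumptions, and only the grain check is shown:

from behave import given, when, then
from pyspark.sql import SparkSession

from pipeline import transform  # hypothetical module generated from the template


@given("the following input data sources")
def step_load_inputs(context):
    spark = SparkSession.builder.appName("behave_tests").getOrCreate()
    # context.table holds the Gherkin table (input_data_name -> input_data_location)
    context.inputs = {
        row["input_data_name"]: spark.read.csv(
            f"../data/{row['input_data_location']}", header=True, inferSchema=True
        )
        for row in context.table
    }


@when("I transform the input data")
def step_transform(context):
    context.output = transform(context.inputs)


@then('the output should have primary key grain of "{grain}"')
def step_check_grain(context, grain):
    keys = [c.strip() for c in grain.split(",")]
    total = context.output.count()
    distinct = context.output.select(*keys).distinct().count()
    assert total == distinct, f"{grain} is not a unique grain"

A similar @then step would walk context.table to compare the declared columns and types against context.output.schema.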
Gherkin was designed to be a shared document between stakeholders and engineers (BDD), and as such, it reflects the characteristics of data that stakeholders care about.
See it in action: LLM Chat.
Make sure to clearly understand the code that LLMs produce, especially your transformations.
Key Takeaways
Code generation is the easy part. The hard part remains: knowing your inputs and defining your expected output (aka data modeling & system design).
Even without the behave library, the Gherkin pattern (Scenario → Given → When → Then → And) provides a constrained framework for specifying requirements to LLMs, improving your chances of generating correct pipelines. It’s a way of thinking—the tooling just supports it.
Planning to use this approach? Share your experience in the comments below.

