How to gather requirements for your data project

1. Introduction

Data engineers are often caught off guard by undefined end-user assumptions. As a data engineer, if you feel that

  1. Requirements gathering is terrible
  2. Scope creep kills your ability to deliver on time
  3. You rarely get specific requirements
  4. End-users do not understand the complexities of a data pipeline
  5. Changing requirements keep interrupting your work

then this post is for you. You will learn to help end-users define requirements, break down the project into small deliverables, deliver iteratively, and handle feature/change requests.

2. Gathering requirements

Here are some steps you can take to keep the development of your data project on track. Note that, depending on your team/company, PMs may perform some of these steps.

In this post, end-users are the people who use the output, and stakeholders are the PMs/BAs/managers.

2.1. Identify the end-users

The first step is to identify the end-user(s). The request for the project usually comes up during meetings. Understanding the capabilities and preferences of the end-user is crucial for designing an appropriate solution.

End-users (& their preferences) for data projects are usually one of:

  1. Data analysts/scientists: SQL, files
  2. Business users: Dashboard, report, Excel
  3. Software engineers: SQL, APIs
  4. External clients: Cloud storage, SFTP/FTP, APIs

2.2. Help end-users define the requirements

Assume the end-users don’t know everything they want, and don’t expect them to define it clearly for you. You can’t get 100% of the requirements right the first time.

Talk with the end-user about their objectives and their difficulties. Understanding the current status of end-user operations will provide valuable insights.

The end-user might say “We want X data”. Help the end-user define requirements by asking the following questions:

  1. Business impact: How does having this data impact the business? What is the measurable improvement in the bottom line, business OKR, etc? Knowing the business impact helps in determining if this project is worth doing.
  2. Semantic understanding: What does the data represent? What business process generates this data? Knowing this will help you model the data and understand its relation to other tables in your warehouse.
  3. Data source: Where does the data originate? (an application DB, external vendor via SFTP/Cloud store dumps, API data pull, manual upload, etc).
  4. Frequency of data pipeline: How fresh does the data need to be? (n minutes, hourly, daily, weekly, etc). Is there a business case for not allowing a higher frequency? What is the highest frequency of data load acceptable to end-users?
  5. Historical data: Does historical data need to be stored? When loading data into a warehouse, the answer is usually yes.
  6. Data caveats: Does the data have any caveats? (e.g. seasonality affecting size, data skew, inability to join, or data unavailability). Are there any known issues with upstream data sources, such as late arriving data, or missing data?
  7. Access pattern: How will the end-user access the data? Is access via SQL, dashboard tool, APIs, or cloud storage? In the case of SQL or dashboard access, what are the commonly used filter columns (e.g. date, some business entity)? What is the expected access latency?
  8. Business rules check (QA): What data quality metrics do the end-users care about? What are business logic-based data quality checks? Which numeric fields should be checked for divergence (e.g. can’t differ by more than x%) across data pulls?
  9. Data output requirements: What is the data output schema? (column names, API field names, Cloud storage file name/size, etc)

Answering the above questions will give you a good starting point.
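One way to keep track of the answers is to record them in a small structure and list what is still open before the next conversation with the end-user. The sketch below is only an illustration: the class name, field names, and sample answers are assumptions made for this example, not a standard template.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class DataRequirements:
    """Answers to the nine questions above (field names are illustrative)."""
    business_impact: str = ""
    semantic_meaning: str = ""
    data_source: str = ""
    freshness: str = ""                      # e.g. "hourly", "daily"
    keep_history: Optional[bool] = None      # None means "not answered yet"
    caveats: list[str] = field(default_factory=list)
    access_pattern: str = ""
    quality_checks: list[str] = field(default_factory=list)
    output_schema: dict[str, str] = field(default_factory=dict)

    def unanswered(self) -> list[str]:
        """Names of fields that are still empty.

        Simplification: an empty list or dict counts as unanswered, even
        though "no caveats" could be a legitimate answer.
        """
        empty_values = ("", None, [], {})
        return [f.name for f in fields(self) if getattr(self, f.name) in empty_values]


# Hypothetical answers collected after a first conversation.
req = DataRequirements(
    business_impact="Reduce time spent on manual weekly reporting",
    data_source="application DB",
    freshness="daily",
    keep_history=True,
)
open_questions = req.unanswered()
print(open_questions)
```

The list of open questions can double as the agenda for the next meeting with the end-user.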

Help the end-users feel invested in the project by following the steps below.

  1. Thank end-users for their time/expertise
  2. Update them on progress
  3. Ask & incorporate their feedback
  4. Recommend solutions (or different ways to do things) for their common issues
  5. Acknowledge their help & expertise when presenting the project to a wider audience

End-users who feel invested will root for the project, and help evangelize it. Having end-users who root for the project helps a lot with resource allocation.

Clearly define the requirements, record them (e.g. in JIRA), and get sign-off from the stakeholders.

2.3. End-user validation

Provide end-users with sample data they can analyze (if possible, in the same format as the expected output). Ask the end-users to validate the data (with a timeline). End-users have a deep understanding of data distribution and business rules. Validation includes ensuring that the data has the expected business metrics, can be “sliced and diced” as needed, and is easy for the end-user to use. This also creates an opportunity to observe the access patterns, such as which filters and columns are used often.

End-user validation may surface new requirements and business rule checks.
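A business rule check that often comes out of validation is a limit on how much a numeric field may change between two data pulls. A minimal sketch of such a divergence check follows; the function names, metric names, and the 10% threshold are assumptions for illustration, and the real threshold should come from the end-users.

```python
def pct_change(previous: float, current: float) -> float:
    """Absolute percentage change of `current` relative to `previous`."""
    if previous == 0:
        # Relative change is undefined at zero; treat any move away from
        # zero as infinite divergence so it always gets flagged.
        return 0.0 if current == 0 else float("inf")
    return abs(current - previous) * 100 / abs(previous)


def diverging_fields(
    previous: dict[str, float],
    current: dict[str, float],
    max_pct: float,
) -> dict[str, float]:
    """Return the fields that moved by more than `max_pct` percent.

    Only fields present in both pulls are compared.
    """
    shared = previous.keys() & current.keys()
    changes = {name: pct_change(previous[name], current[name]) for name in shared}
    return {name: pct for name, pct in changes.items() if pct > max_pct}


# Hypothetical aggregate metrics from two consecutive pulls.
previous_pull = {"total_revenue": 1000.0, "order_count": 200.0}
current_pull = {"total_revenue": 1300.0, "order_count": 204.0}

flagged = diverging_fields(previous_pull, current_pull, max_pct=10.0)
print(flagged)
```

With these sample numbers only total_revenue is flagged, since order_count moved by 2%, which is under the threshold.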

Record any new requirements or changes (e.g. in JIRA), and get sign-off from the stakeholders. Do not start work on the transformation logic until you get a sign-off from the stakeholders.

2.4. Deliver iteratively

Break down a large project into smaller parts. Work with the stakeholders to decide on a timeline and prioritization. For example, if you are building an ELT pipeline (REST API => dashboard), you can split it into modeling the data, pulling data from a REST API, loading it into a raw warehouse table, and building a dashboard for the modeled data.

Delivering in small chunks enables a short feedback cycle with the end-user, which makes changing requirements easier to handle. Track your work (tickets, etc) with clear acceptance criteria.
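As a rough sketch, the ELT example above can be written down as deliverables with acceptance criteria and dependencies, and then ordered for delivery. The deliverable names, acceptance criteria, and the dependency chain below are assumptions for illustration; agree on the real ones with your stakeholders. The ordering uses `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Deliverable:
    name: str
    acceptance_criteria: str
    depends_on: list[str] = field(default_factory=list)


def delivery_order(deliverables: list[Deliverable]) -> list[str]:
    """Order deliverables so each one comes after everything it depends on.

    TopologicalSorter raises graphlib.CycleError on circular dependencies.
    """
    graph = {d.name: set(d.depends_on) for d in deliverables}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical breakdown of the REST API => dashboard pipeline.
plan = [
    Deliverable("build_dashboard", "End-user signs off on the dashboard",
                depends_on=["model_data"]),
    Deliverable("model_data", "Modeled tables pass the agreed quality checks",
                depends_on=["load_raw_table"]),
    Deliverable("load_raw_table", "Raw table row count matches the API pull",
                depends_on=["pull_from_api"]),
    Deliverable("pull_from_api", "One full day of data is pulled without errors"),
]

order = delivery_order(plan)
print(order)
```

Each entry can map to one ticket, with its acceptance criteria copied into the ticket description.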

2.5. Handling changing requirements/new features

Do not accept ad hoc change/feature requests (unless it’s an emergency). Create a process for requesting a new feature/change, and educate the end-users on it. Following a process helps prevent scope creep and allows you to deliver on time.

3. Conclusion

Dealing with ever-changing requirements, scope creep, and unspecified assumptions is a frustrating part of a data engineer’s job. But following the steps in this article can help you navigate the challenges of gathering requirements and delivering on time.

The next time you start a data project, follow the steps shown above to

  1. Deliver on time
  2. Make a huge impact
  3. Make working on the data projects a joy, and
  4. Build supportive end-users

Please leave your questions or comments in the comment section below. If these steps helped, I’d love to hear about how they helped you.

4. Further reading

  1. Re-engineering data pipelines
  2. Responsibilities of a data engineer
  3. Adding tests to your data pipeline

