How to gather requirements to re-engineer a legacy data pipeline

Introduction

As a data engineer, you will eventually have to re-engineer a legacy data pipeline. If, while doing so, you have struggled with

  1. a lack of clarity about deliverables among the project’s stakeholders, or
  2. constantly being questioned about what you are working on and why it is worth spending time on,

then this post is for you. In this post, we go over the steps you can take to ensure that your data pipeline re-engineering work has a real impact and genuinely helps your end users.

Gathering requirements

If you are tasked with re-engineering a legacy data pipeline, do not jump straight into designing and implementing upgrades. Instead, take the time to deeply assess the data pipeline, find its faults, and understand why a working data pipeline needs re-engineering at all. We will go over seven steps (numbered 0 through 6) that you can take to maximize your impact and identify the core issues to fix.

0. Understand the current state of the data pipeline

Whether you are revamping an existing data pipeline or automating a manual process, it’s important to first understand what is currently being done. Try drawing out the data flow through your data pipeline and note any side effects and dependencies. Make sure to understand the needs of the systems/end users that depend on the data produced by this data pipeline.
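
One way to make this concrete is to write the data flow down as code. Below is a minimal Python sketch, using hypothetical stage names, that records each stage's upstream dependencies (with side effects and consumers noted in comments) and uses the standard library's graphlib to derive a valid execution order:

```python
from graphlib import TopologicalSorter

# Hypothetical stages of a legacy pipeline, mapped to their upstream
# dependencies. Side effects and downstream consumers are noted inline.
pipeline = {
    "extract_orders": set(),                    # reads from the orders DB
    "extract_customers": set(),                 # reads from a CRM export
    "transform_sales": {"extract_orders", "extract_customers"},
    "load_warehouse": {"transform_sales"},      # consumed by BI dashboards
    "send_report_email": {"load_warehouse"},    # side effect: emails analysts
}

# static_order() yields one valid execution order and raises CycleError
# if the documented dependencies contain a cycle.
print(list(TopologicalSorter(pipeline).static_order()))
```

Even this level of documentation tends to surface hidden side effects (like the report email above) that a quick diagram can miss.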

1. Think like the end user

The first step when working with any data pipeline is to understand the end user. Your end users may be other engineers, analysts, non-technical employees, external clients, etc. Make sure to understand

  1. who your end user(s) are
  2. how they currently use the data
  3. how (and whether) they are notified of new data or of failures
  4. what issues they face when handling this data

Try simulating how they would interact with the end product, without leaning on your insider knowledge of how the data pipeline works. For example, you might know that the data pipeline runs every 6 hours, but the end user may not be aware of this. Read the documents meant for your end users and meet with them to get an idea of how they use this data.
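
You can even script the first question most end users ask: is the data fresh? The sketch below is a toy example; the `last_loaded_at` function and the 6-hour schedule are assumptions standing in for a real warehouse query and your actual pipeline cadence:

```python
from datetime import datetime, timedelta, timezone

PIPELINE_SCHEDULE = timedelta(hours=6)  # you know this; the end user may not

def last_loaded_at() -> datetime:
    # Stand-in for a real lookup, e.g. SELECT max(loaded_at) FROM sales_fact.
    return datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)

def freshness_report(now: datetime) -> str:
    """Describe data freshness the way an end user would experience it."""
    age = now - last_loaded_at()
    if age <= PIPELINE_SCHEDULE:
        return f"data is {age} old: within the expected window"
    return f"data is {age} old: stale from the end user's point of view"

print(freshness_report(datetime(2024, 1, 1, 11, 30, tzinfo=timezone.utc)))
```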

2. Know the why

Before starting work on the data pipeline, make sure to understand the key shortcomings of the current data pipeline. Ideally, solving these shortcomings should significantly benefit the end user. Some common reasons for re-engineering data pipelines are

  1. slow data delivery
  2. manual processes that need automation
  3. high data processing costs
  4. data quality issues
  5. new feature requests
  6. missing monitoring/logging/metrics (see the sketch after this list)
  7. flaky, unreliable pipelines
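
To illustrate reason 6, here is a minimal sketch of the kind of instrumentation legacy pipelines often lack: per-task duration, row counts, and a clear failure log line. The `load_sales` task is hypothetical:

```python
import logging
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("sales_pipeline")

def run_task(name, task):
    """Run one pipeline task, logging duration, row count, and failures."""
    start = time.monotonic()
    try:
        rows = task()
        logger.info("task=%s status=success rows=%d duration_s=%.2f",
                    name, rows, time.monotonic() - start)
        return rows
    except Exception:
        logger.exception("task=%s status=failed duration_s=%.2f",
                         name, time.monotonic() - start)
        raise

# Hypothetical task: returns the number of rows it processed.
run_task("load_sales", lambda: 1200)
```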

3. End user interviews

By now you will already have some idea of the major shortcomings of the legacy data pipeline, but it is crucial to ask the end users what they would like to see. It is a good idea to validate any assumptions you have made about how the data produced by the legacy data pipeline is used. Make sure to talk about

  1. How do they currently use the data?
  2. What issues do they face in accessing the data? Is the data available too late? Are there data quality problems?
  3. What are the top 3 features they would want from this data pipeline?
  4. What would help them deliver results faster and be more confident in the data?

Also, talk to your end users about the engineering concerns for the data pipeline and why they matter. By the end of this step, you will have a good idea of the high-priority features expected from the data pipeline.

4. Reduce the scope

Based on what you learned in the previous step, create a high-level design for the data pipeline that addresses both end user concerns and engineering concerns. Create a list of the top 3 highest-priority features required by the end user (the number may vary with your team size and sprint length). Add any crucial engineering features to this list as well. Typically, data engineers want to build things that are clean, efficient, distributed, idempotent, etc., but depending on your project timelines this may not always be possible.
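
To make "idempotent" concrete: re-running a load for the same day should not duplicate data. One common pattern, sketched here with sqlite3 and a hypothetical `sales` table, is to delete the run's slice of data before inserting it, inside a single transaction:

```python
import sqlite3

def load_sales_for_day(conn, day, rows):
    """Idempotent load: re-running for the same day gives the same result."""
    with conn:  # one transaction: the delete and insert commit (or roll back) together
        conn.execute("DELETE FROM sales WHERE sale_date = ?", (day,))
        conn.executemany("INSERT INTO sales (sale_date, amount) VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
rows = [("2024-01-01", 9.99), ("2024-01-01", 19.99)]
load_sales_for_day(conn, "2024-01-01", rows)
load_sales_for_day(conn, "2024-01-01", rows)  # safe to re-run
print(conn.execute("SELECT count(*) FROM sales").fetchone())  # (2,)
```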

Take some time to think deeply about these crucial tasks. Are they absolutely necessary? Which work provides the biggest return in terms of data pipeline stability or new features? Prioritizing these tasks gets easier with experience and as you get to know your end users. Sometimes there is no clear way to prioritize a task; in such cases, pick the more foundational work.

Make sure you are not unnecessarily throwing away existing legacy code; it usually encodes institutional knowledge about your data.

5. End user walkthrough for proposed solution

Now that you have a list of high-priority features to deliver, create a simple flow chart/doc/README of the user flow, i.e., how the end user interacts with the data pipeline, how they would use the data, and how the new features fit in.

Walk the end user through a typical workflow. You will probably learn new things about how they interact with the system, what they expect, etc. Be very detailed in your walkthrough; often, the batch nature of data pipelines confuses end users who expect data to update continuously. Make changes to the scope if necessary.

6. Timelines & deliverables

After the user walkthrough, do another round of prioritizing the work items. Work with the product team to figure out short-term deliverables and a long-term plan to deliver all the necessary features. Make sure that the short-term deliverables and their timelines are visible to everyone, especially the end users. This is typically tracked in software like JIRA and should be kept up to date.

Deliver iteratively

With a set timeline, a delivery schedule, and correct expectations communicated to your end users, you will have a much easier time delivering value with your data pipelines. Make sure to deliver iteratively; this lets you course-correct as needed. Build modular components so that, as things and priorities change, your code can be reused.

There will be cases where some items need to be deprioritized, changed, or dropped entirely. Modular code helps you reuse what you have built as requirements change and evolve. For example, if you are building an API service for your metadata, build the data layer first and the API layer afterward: the API specs may change, but the data layer rarely does.
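
Here is a rough sketch of that layering, with hypothetical names and the web framework left out. The data layer owns all storage details; the API layer is a thin wrapper that can be rewritten (or swapped for Flask/FastAPI routes) without touching the data layer:

```python
import sqlite3

# Data layer: owns storage details. This part rarely changes.
def get_dataset_metadata(conn, name):
    row = conn.execute(
        "SELECT name, owner, updated_at FROM dataset_metadata WHERE name = ?",
        (name,),
    ).fetchone()
    return dict(zip(("name", "owner", "updated_at"), row)) if row else None

# API layer: a thin wrapper over the data layer. Rewrite freely as specs change.
def handle_get_dataset(conn, name):
    meta = get_dataset_metadata(conn, name)
    return (200, meta) if meta else (404, {"error": f"unknown dataset: {name}"})

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset_metadata (name TEXT, owner TEXT, updated_at TEXT)")
conn.execute("INSERT INTO dataset_metadata VALUES ('sales', 'data-eng', '2024-01-01')")
print(handle_get_dataset(conn, "sales"))
```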

Conclusion

I hope this article gives you a good understanding of how to approach re-engineering a data pipeline. The key is to understand and empathize with the end user and to deliver the features with the highest impact. The next time you re-engineer or build a data pipeline, start with the end user and then design the pipeline.

Please feel free to leave any questions or comments in the comment section below.

Further reading

  1. What is a data warehouse
  2. What is staging
  3. 6 job responsibilities of a data engineer
