Modern Data Stack: A Deep Dive

Data Engineering has always been a rapidly growing field, requiring engineers to learn new technologies at breakneck speed. It’s one of those industries where learning is constant, whether that’s hands-on learning through direct use of a technology or earning certifications to prove competency. However, that doesn’t necessarily mean that whole systems change daily; there are still key components that are important to the overall process.

Scheduler/Orchestrator

The first key concept is a scheduler. The scheduler is essential simply because it follows a key principle of software engineering: automation. A huge reason engineers are hired is to take actions repeated by a single point (human or machine) and turn them into something that runs automatically.

In modern Data Engineering, scheduling is usually part of an orchestrator or ETL tool. On a traditional Linux box, it can be as simple as running scripts based on a cron expression. In the modern world of D.E., however, you need to factor in components like security, storage, and reliability, especially if you’re working with a large volume of data and/or a client’s data, in which case a simple script + cron expression combination won’t suffice.

To expand upon a scheduler, schedulers are usually built into modern-day Orchestrators, a key component of modern Data Engineering. An Orchestrator is an overseer of your data pipelines, from scheduling to monitoring to error handling and everything in between.

Airflow is the most popular orchestrator on the market and the one I’m most familiar with. And yes, cron expressions are still part of Airflow scheduling, but you can also use presets such as @daily, @weekly, and @monthly to make things easier. Airflow can be run locally, on your own infrastructure, or on MWAA, a fully managed Airflow service hosted by AWS. Airflow’s popularity is also a huge benefit, especially when you run into niche problems that already have a solution posted online or a community of experienced members you can ask. In that respect, it reminds me of programming languages that are well established in the technology space.
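
To make the scheduling piece concrete, here is a minimal sketch of an Airflow DAG that runs a single Python task daily (assuming Airflow 2.4+, where the schedule parameter accepts presets; the DAG id, task id, and extract_data function are hypothetical placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_data():
    # Placeholder for whatever extraction logic your pipeline needs.
    print("Extracting data...")


# `schedule` accepts cron expressions ("0 6 * * *") or presets like "@daily".
with DAG(
    dag_id="example_daily_extract",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract_data",
        python_callable=extract_data,
    )
```

Airflow picks this file up from its DAGs folder, runs the task on the schedule, and layers on the monitoring, retry, and error-handling behavior described above.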

Programming Language

Now, while orchestrators are powerful at scheduling and running tasks, you still need to give them instructions on what to do. And how does one do that? (Not a trick question!) To put it simply: a programming language. As with most data engineering services, the default language (but not the only one! See: Java and Scala) is Python, which popular orchestrators like Airflow and Dagster utilize.

For beginners in Data Engineering, or programming in general, I highly recommend Python because, syntactically speaking, it reads close to plain English. One important thing to note about programming is that it is simply a tool: your code will do exactly what you ask it to do. This is why you should always think through (or write out) what you want your code to do before actually typing it. As a native English speaker, thinking out loud about what I want my code to do and then translating that into Python makes things very simple.
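
As a tiny (hypothetical) illustration of that approach, you can write the English plan as comments first and then fill in the Python underneath:

```python
from pathlib import Path

# Plan, in plain English:
# 1. Look inside the data folder.
# 2. Keep only the CSV files.
# 3. Report how many files are ready to process.
data_dir = Path("data")  # hypothetical folder name
csv_files = list(data_dir.glob("*.csv"))
print(f"Found {len(csv_files)} CSV files to process.")
```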

Like Airflow, Python has a considerable amount of learning resources, whether that’s online documentation, books, videos, or courses. It is deeply ingrained in Data Engineering and in programming overall.

You will need to use the programming language of your choice to extract data from various sources, such as APIs, public datasets, a website that hosts a file, application logs, etc. Not only is the programming language itself essential, but also the libraries you utilize. Pandas is one of the most important libraries in Python Data Engineering, as you can use it for basically any transformation you desire. Again, Pandas has a vast amount of documentation, community, and other resources.
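
Here is a short sketch of the kind of Pandas transformation a pipeline might run; the file names, column names, and cleaning rules are hypothetical:

```python
import pandas as pd

# Extract: read a raw CSV export (hypothetical file and columns).
df = pd.read_csv("raw_events.csv")

# Transform: parse timestamps, drop duplicate events, fill missing countries.
df["event_time"] = pd.to_datetime(df["event_time"], errors="coerce")
df = df.drop_duplicates(subset=["event_id"])
df["country"] = df["country"].fillna("unknown")

# Aggregate daily event counts per country and write the result as Parquet
# (to_parquet needs the pyarrow or fastparquet engine installed).
daily_counts = (
    df.groupby([df["event_time"].dt.date, "country"])
      .size()
      .reset_index(name="event_count")
)
daily_counts.to_parquet("daily_event_counts.parquet", index=False)
```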

Lastly, an important Python API for large-scale data processing is PySpark, the Python interface to Apache Spark, a distributed computing framework. Unlike Pandas, which operates on in-memory data, PySpark is meant to handle massive datasets (“Big Data”) across a cluster and efficiently processes both batch and streaming data.
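
And a minimal PySpark sketch of the same style of batch aggregation, this time distributed across a cluster; the paths and column names are hypothetical, and the cluster is assumed to be configured for S3 access:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or reuse) a Spark session; on a managed platform this is usually
# configured for you rather than built by hand.
spark = SparkSession.builder.appName("daily_event_counts").getOrCreate()

# Read partitioned Parquet files from a data lake path (hypothetical).
events = spark.read.parquet("s3a://my-data-lake/raw/events/")

# Aggregate daily event counts per country, computed across the cluster.
daily_counts = (
    events
    .withColumn("event_date", F.to_date("event_time"))
    .groupBy("event_date", "country")
    .count()
)

daily_counts.write.mode("overwrite").parquet(
    "s3a://my-data-lake/curated/daily_event_counts/"
)
```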

Cloud Provider

Next, we can’t forget about our actual data. Data is usually stored in file formats such as JSON, CSV/TSV, XML, and Parquet (and surprisingly, even PDFs and Word docs! And yes, they are quite annoying to scrape for data!). These files need to live somewhere, which brings us to our next point: storage. Storage is a massive part of Data Engineering because files are like anything physical: they take up space! What’s the best way to take care of storage?

The most common solution, and one I would argue is the most effective, is using a cloud provider, the big three being Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure. The best one to use is typically the one you’re already familiar with or the one your organization is currently using, as a cloud provider offers many more services than just storage: security, ETL, computing, networking, and more. A cloud provider is its own ecosystem that can cover data engineering, application, and analysis needs end-to-end. As you can tell, there is a big learning curve when working with cloud providers.

Loading your Data

Once your data is in storage and you’ve cleaned it up with your programming language, you’ve covered the E and T in ETL (Extract, Transform, Load). After that, you’ll need somewhere to load it. In most cases, this is a data lake (e.g., S3), data warehouse (e.g., Redshift), or data lakehouse (e.g., Databricks), which is usually where your client or stakeholder will see the data. In my experience, the most common place to load data for clients/stakeholders is a warehouse or a lakehouse; usually, the data lake is where your team stores the raw and transformed data. There are also data marketplaces, but the data is generally delivered in one of the ways I previously mentioned.
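
As a minimal sketch of pushing a transformed file into an S3-based data lake with boto3 (the bucket name and key are hypothetical, and credentials are assumed to already be configured via the environment or an IAM role):

```python
import boto3

# boto3 resolves credentials from the environment, an IAM role, or ~/.aws/.
s3 = boto3.client("s3")

# Upload the transformed file to a curated prefix in the data lake.
s3.upload_file(
    Filename="daily_event_counts.parquet",
    Bucket="my-data-lake",  # hypothetical bucket
    Key="curated/daily_event_counts/2024-01-01.parquet",
)
```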

With data warehouses like Redshift, you have to explicitly set up a schema with the correct column names and data types; it’s a hard requirement, and Redshift will not hesitate to throw plenty of errors at you until you resolve schema issues.
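
For example, the target table has to be declared up front before any load can write to it. This is a hedged sketch with hypothetical names, using psycopg2 since Redshift speaks a PostgreSQL-compatible dialect:

```python
import psycopg2

# Connection details are placeholders; in practice they come from a secrets store.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="...",
)

ddl = """
CREATE TABLE IF NOT EXISTS curated.daily_event_counts (
    event_date  DATE        NOT NULL,
    country     VARCHAR(64),
    event_count BIGINT
);
"""

# Redshift will reject any load whose columns or types don't match this schema.
with conn, conn.cursor() as cur:
    cur.execute(ddl)
```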

This strictness is quite beneficial compared to another service I’ve used, GCP, where files can easily be uploaded but no error is thrown when something is wrong with the schema or data types. That can leave an engineer believing everything is fine, since the pipeline passed, until they look at the data and realize there are a bunch of null, empty, or corrupted fields. While encountering errors can be frustrating, it is better for them to be loud and immediately noticeable than to remain silent and cause hidden issues.

Infrastructure (as Code)

Infrastructure is an integral part of technology, and it often pushes data engineers into skill sets they are generally less comfortable with. Remember that the Data Engineer title can span skill sets that overlap with DevOps, Software Engineering, Cloud, Platform, and more. Infrastructure can also be automated using Infrastructure as Code (IaC) tools, specifically Terraform, which is well established in the field. With Terraform, you can define your infrastructure from scratch in code and build everything you need. The way to think about Terraform is that if all your resources were somehow destroyed, you could redeploy them and everything would be back to normal. Another benefit is that if you have existing resources in your platform, you can import them into Terraform to start tracking them and redeploying them when necessary.

Version Control + CI/CD

While some Data Engineering work is low-to-no code, or the code is stuffed into a notebook, it is always essential for your work to live in version control, such as GitHub or GitLab. Having your codebase in a repository keeps everything organized in a central location, and anyone can go through it and see how your work fits in with the rest of the system. Not only does version control keep track of everything you’ve committed and pushed, it can also be used to deploy your code to a platform using CI/CD.

Automation is an essential part of programming, and one practice that helps us with that is Continuous Integration/Continuous Deployment (CI/CD). CI/CD automates testing, building, and deploying code to ensure faster and more reliable software updates: it is where you run your tests and builds and ship your codebase to a platform. GitHub Actions and Jenkins are standard CI/CD tools.

Monitoring & Observability

As your organization scales, it’s difficult to manually track what is working and what is not. That’s where monitoring and observability come into play. You need to monitor both the status of your pipelines and your data. For pipeline status, you can have your orchestrator send a built-in email alert when a task succeeds, fails, or retries, or use a webhook to a messaging service (Slack, Teams, etc.) to send status reports to a custom channel. For the data itself, you can use tools like Great Expectations, which can be added directly as a task in your pipeline. Great Expectations will automatically generate a few tests for you based on your data, but I’d recommend going through all of them and adding or removing tests as necessary.

The tests cover things like whether there will always be a fixed number of columns, which data types each column should have, min/max/null values, etc. Another useful tool is a data staleness checker, custom built by the data team at Rearc, which checks when a dataset was last updated or appended. This helps us determine whether or not the source is still adding new data.
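
The staleness idea is simple enough to sketch. This is not the actual tool, just a minimal illustration, assuming the dataset exposes a timestamp (or date) column; the file and column names are hypothetical:

```python
from datetime import timedelta

import pandas as pd


def is_stale(df: pd.DataFrame, timestamp_col: str, max_age_days: int = 7) -> bool:
    """Return True if the newest record is older than max_age_days."""
    latest = pd.to_datetime(df[timestamp_col], utc=True).max()
    if pd.isna(latest):
        return True  # no usable timestamps at all counts as stale
    return pd.Timestamp.now(tz="UTC") - latest > timedelta(days=max_age_days)


# Example: fail the pipeline task loudly if the source has gone quiet.
df = pd.read_parquet("curated/daily_event_counts.parquet")
if is_stale(df, timestamp_col="event_date", max_age_days=3):
    raise RuntimeError("Dataset looks stale: no new records in the last 3 days.")
```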

Summary

Now that we’ve reviewed the essential components of the modern data stack, let’s tie everything together and summarize the real-world impacts. While the modern data stack offers powerful capabilities, there are drawbacks that engineers must consider. Below, we’ll break down the key advantages, challenges, and common pitfalls to help make informed decisions about what is important in Data Engineering.

Pros of the Modern Data Stack

  • Abundant resources, community, and documentation
    • See: Python, Airflow, PySpark, AWS, Databricks
  • Almost every Data Engineering problem is solvable
    • Wide variety of tools tailored for specific use cases
  • Modern technology does the difficult work for you
    • Programming and setting up ETL, infrastructure, computing, and other resources isn’t too complex; you just need to figure out what problem to solve
  • Technical support from Cloud Providers
    • 24/7 assistance (based on plan)
    • Access to best practices and architecture guidance

Cons of the Modern Data Stack

  • Vendor lock-in
    • Ex: Once you’ve set everything up in a cloud provider ecosystem and want to migrate, it will be difficult due to the learning curve and the vast number of different services.
  • Abundance of tools
    • More potential for bugs and inconsistencies
    • Versioning Issues – Compatibility challenges between different tools and updates.
    • Integration Overhead – Significant effort is required to connect and maintain interoperability across the stack
  • Cost and need of capital
    • Platforms, services, and other tools can run your bill up quickly, and few of them are free, so you will need a sufficient amount of capital.
    • Watch expenses closely when scaling or when usage of a tool increases.

Common Mistakes

  • Overengineering!
    • While it is impressive to engineer a product well, remember that your business has clients/stakeholders who may be focused more on the end result than the process.
    • It can exhaust time, money, and other resources.
      • Once you’ve finished the task(s) the stakeholder is looking for, you can follow up with them and share the ideas you want to engineer out. Then “overengineering” isn’t an issue; it becomes a thought process for expanding and improving your product.
  • Lack of cost awareness
    • When scaling up, it is easy to lose track of how many services and tools are draining your capital. An engineer can work with stakeholders to cut or reduce costs, e.g., switching from a server-based data warehouse to a serverless one so that resources scale dynamically.
  • Lack of proper documentation
    • Documentation needs to be valuable and concise for tasks such as setup tutorials.
    • Conciseness: poorly written or excessive documentation leads to fewer eyes reading it, or to useful pages getting lost in a sea of documentation. Make sure the documentation you create actually gets read, and keep it simple.
  • Skipping Data Quality checks
    • Bad data is useless and leads to inaccurate analytics.
    • Missing values, duplicates, or inconsistencies may go unnoticed until they cause failures.
    • Stakeholders may lose confidence in data.

Conclusion

Data Engineering is an ever-changing field, but some components are essential in the modern world: an orchestrator, a cloud provider, IaC, a data warehouse or lakehouse, CI/CD, a programming language and its libraries, and more. While this blog covers essential components, it is not an exhaustive list. The tools I’ve talked about have worked well for me in building reliable, scalable, and efficient pipelines, but the needs of each individual or organization vary. The best data stack is one that aligns with your business, data, and team. Regardless, adaptability and continuous learning are surefire ways to be successful in the ever-changing landscape of Data Engineering.
