
Data Engineering has always been a rapidly growing field, requiring engineers to learn new technologies at breakneck speed. It's one of those industries where learning is a constant, whether that's hands-on learning with new technology or earning certifications to prove competency. However, that doesn't necessarily mean that whole systems are changing daily; there are still key components that anchor the overall process.
The first key concept is a scheduler. The scheduler is essential because it embodies a core principle of software engineering: automation. A huge reason engineers are hired is to take repetitive actions performed at a single point (human or machine) and turn them into something that runs automatically.
In the modern world of Data Engineering, scheduling is usually built into an orchestrator or ETL platform. In traditional Linux, scheduling can be as simple as running scripts based on a cron expression. However, modern Data Engineering also has to factor in components like security, storage, and reliability, especially if you're working with a large volume of data and/or a client's data, in which case a simple script-plus-cron combination won't suffice.
To expand on schedulers: they are usually built into modern-day Orchestrators, a key component of modern Data Engineering. An Orchestrator is the overseer of your data pipelines, from scheduling to monitoring to error handling and everything in between.
Airflow is the most popular orchestrator on the market and the one I'm most familiar with. And yes, cron expressions are still a part of Airflow scheduling, but you can also use presets such as @daily, @weekly, and @monthly to make it easier. Airflow can run locally, on your own infrastructure, or on MWAA (Amazon Managed Workflows for Apache Airflow), a fully managed Airflow service hosted by AWS. Airflow's popularity is also a huge benefit, especially when you run into niche problems that already have a solution posted online or a board of experienced members to ask. In that way it reminds me of the established programming languages in the technology space.
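To make that concrete, here's a minimal sketch of an Airflow DAG. It assumes Airflow 2.4+ (where the schedule parameter accepts presets and cron strings alike), and the DAG name, task, and schedule are purely illustrative.

```python
# A minimal, hypothetical Airflow DAG sketch (assumes Airflow 2.4+).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_data():
    # Placeholder for real extraction logic (API call, file download, etc.)
    print("extracting data...")


with DAG(
    dag_id="example_daily_pipeline",   # hypothetical name
    schedule="@daily",                 # a preset; a cron string like "0 6 * * *" also works
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract",
        python_callable=extract_data,
    )
```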
Now, while Orchestrators are powerful at running your code and your tasks, you still need to give them instructions on what to do. And how does one do that? (Not a trick question!) To put it simply: with a programming language. As with most data engineering services, the default language (but not the only one! See: Java and Scala) is Python, which popular orchestrators like Airflow and Dagster utilize.
For beginners in Data Engineering or programming in general, I highly recommend Python because, strictly speaking syntactically, it reads close to plain English. One important thing to note about programming is that it is simply a tool: your code will do exactly what you ask it to do. This is why you should always think through what you want your code to do, and even write it out in plain language, before actually typing it. As a native English speaker, thinking out loud about what I want my code to do and then translating that into Python makes things very simple.
Like Airflow, Python has a considerable amount of learning resources, whether that's online documentation, books, videos, or courses. It is deeply ingrained in Data Engineering and in programming overall.
You will need to use the programming language of your choice to extract data from various sources, such as APIs, public datasets, a website that hosts a file, application logs, etc. Not only is the programming language itself essential, but also the libraries you utilize. Pandas is one of the most important libraries in Python Data Engineering, as you can use it for basically any transformation you desire. Again, Pandas has a vast amount of documentation, community, and other resources.
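For example, here's a minimal sketch of a typical Pandas clean-up step; the file and column names are hypothetical, not from any real pipeline.

```python
# A hypothetical Pandas transformation: read raw data, clean it, write Parquet.
import pandas as pd

# Extract: read a raw CSV pulled from an API or public dataset.
df = pd.read_csv("raw_orders.csv")

# Transform: drop incomplete rows, normalize types, and derive a column.
df = df.dropna(subset=["order_id", "amount"])
df["amount"] = df["amount"].astype(float)
df["order_date"] = pd.to_datetime(df["order_date"])
df["is_large_order"] = df["amount"] > 1000

# Load: write the cleaned data out as Parquet (requires pyarrow or fastparquet).
df.to_parquet("clean_orders.parquet", index=False)
```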
Lastly, an important Python API for large-scale data processing is PySpark, the Python interface for Apache Spark, a distributed computing framework. Unlike Pandas, which operates on data in a single machine's memory, PySpark is meant to handle massive datasets ("Big Data") across a cluster and efficiently processes both batch and streaming data.
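Here's a minimal sketch of a similar aggregation in PySpark; the paths and column names are again hypothetical, and it assumes a working Spark installation (for example via pip install pyspark).

```python
# A hypothetical PySpark aggregation over a Parquet dataset.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders_aggregation").getOrCreate()

# Read a Parquet dataset that may be far larger than a single machine's memory.
# In practice this path could point at a data lake (e.g., an S3 bucket).
orders = spark.read.parquet("clean_orders_parquet/")

# Aggregate across the cluster: total and average order amount per customer.
summary = orders.groupBy("customer_id").agg(
    F.sum("amount").alias("total_amount"),
    F.avg("amount").alias("avg_amount"),
)

summary.write.mode("overwrite").parquet("order_summaries_parquet/")
spark.stop()
```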
Next, we can't forget about our actual data. Data is usually stored in file formats such as JSON, CSV/TSV, XML, and Parquet (and surprisingly, even PDFs and Word docs! And yes, those are quite annoying to scrape for data). These files need to live somewhere, which brings us to our next point: storage. Storage is a massive part of Data Engineering because files, like anything physical, take up space! What's the best way to take care of storage?
The most common solution, and the one I would argue is most effective, is using a cloud provider, the big three being Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure. I would argue that the best one to use is the one you're already familiar with, or the one your organization is currently using, since your cloud provider offers many more services and resources than just storage: security, ETL, compute, networking, and more. A cloud provider is its own ecosystem that can cover data engineering, application, and analysis needs end to end. As you can tell, there is a significant learning curve when working with cloud providers.
Once your data is in storage, you clean it up with your programming language; that covers the E and T in ETL (Extract, Transform, Load). After that, you'll need somewhere to load it. In most cases, this is a data lake (e.g., S3), a data warehouse (e.g., Redshift), or a data lakehouse (e.g., Databricks), which is usually where your client or stakeholder will see the data. In my experience, the most common place to load data for clients/stakeholders is a warehouse or a lakehouse; the data lake is usually where your team stores the raw and transformed data. There are also data marketplaces, but the data is generally still delivered in one of the ways mentioned above.
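As a concrete example, here's a minimal sketch of landing a transformed file in an S3-based data lake with boto3; the bucket and key names are hypothetical, and AWS credentials are assumed to be configured in your environment.

```python
# A hypothetical upload of a transformed file to a data-lake bucket on S3.
import boto3

s3 = boto3.client("s3")

# Place the cleaned Parquet file in the "transformed" zone of the lake.
s3.upload_file(
    Filename="clean_orders.parquet",
    Bucket="my-data-lake-bucket",                      # hypothetical bucket
    Key="transformed/orders/clean_orders.parquet",     # hypothetical key
)
```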
With data warehouses like Redshift, you have to explicitly set up a schema with the correct column names and data types; it's a hard requirement, and Redshift will not hesitate to throw plenty of errors at you until schema issues are resolved.
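Here's a minimal sketch of what that explicit schema setup can look like from Python, using psycopg2 (Redshift speaks the PostgreSQL wire protocol); the connection details and table definition are hypothetical.

```python
# A hypothetical explicit table definition for Redshift via psycopg2.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",  # hypothetical
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="...",  # in practice, pull this from a secrets manager
)

with conn, conn.cursor() as cur:
    # Column names and types are declared up front; mismatched data will fail
    # loudly at load time instead of silently landing as bad values.
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS orders (
            order_id    BIGINT        NOT NULL,
            customer_id BIGINT        NOT NULL,
            amount      DECIMAL(12,2) NOT NULL,
            order_date  DATE          NOT NULL
        );
        """
    )
```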
This is quite beneficial compared to another service I've used on GCP, where files can be uploaded easily but no error is thrown to show that something is wrong with the schema or data types. The result is an engineer believing everything is fine because the pipeline passed, until they look at the data and find a bunch of null, empty, or corrupted fields. While encountering errors can be frustrating, it is better for them to be loud and immediately noticeable than to remain silent and cause hidden issues.
Infrastructure is an integral part of technology, and many data engineers have to dip into skill sets they are generally not comfortable with. Remember that the Data Engineer title can encompass vast skill sets that dip into DevOps, Software Engineering, Cloud, Platform, and more. Infrastructure can also be automated using Infrastructure as Code (IaC) tools, most notably Terraform, which is well established in the field. With Terraform, you can define your infrastructure from scratch in code and build everything you need. The way to think about Terraform is that if all your resources were somehow destroyed, you could redeploy them and be back to normal. Another benefit of Terraform is that if you have existing resources on your platform, you can import them into Terraform to start tracking them and redeploy them when necessary.
While some Data Engineering practices are low-to-no code, or some code is stuffed into a notebook, it is always essential for your work to live in version control, such as GitHub or GitLab. Having your codebase in a repository keeps everything organized in a central location, and anyone can go through it and see how your work fits in with the rest of the system. Not only does version control keep track of everything you've committed and pushed, it can also be used to deploy your code to a platform using CI/CD.
Automation is an essential part of programming, and one practice that helps us with that is Continuous Integration/Continuous Deployment (CI/CD). CI/CD automates the testing, building, and deployment of code to deliver faster, more reliable software updates. GitHub Actions and Jenkins are common CI/CD tools.
As your organization scales, it becomes difficult to manually track what is working and what is not. That's where monitoring and observability come into play. You need to monitor both the status of your pipelines and the data itself. For pipeline status, your orchestrator can send a built-in email alert when a task succeeds, fails, or retries, or post status reports to a custom channel via a webhook to a messaging service (Slack, Teams, etc.). For the data itself, you can use tools like Great Expectations, added directly as a task in your pipeline. Great Expectations will automatically generate a set of tests based on your data, but I'd recommend going through all of them and adding or removing tests as necessary.
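As an illustration, here's a minimal sketch of those kinds of checks, assuming the classic pandas-backed Great Expectations API (pre-1.0 releases); the column names follow the hypothetical orders data from earlier, not any real pipeline.

```python
# Hypothetical data-quality checks using the classic Great Expectations API.
import great_expectations as ge
import pandas as pd

df = pd.read_parquet("clean_orders.parquet")
ge_df = ge.from_pandas(df)

# Expectations of the kind described above: column count, nulls, value ranges.
ge_df.expect_table_column_count_to_equal(5)
ge_df.expect_column_values_to_not_be_null("order_id")
ge_df.expect_column_values_to_be_between("amount", min_value=0)

results = ge_df.validate()
if not results.success:
    raise ValueError("Data quality checks failed")
```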
These tests cover things like whether there will always be a certain number of columns, what type each column should have, min/max/null values, and so on. Another useful tool is a data staleness checker, custom built by the data team at Rearc, which checks when a dataset was last updated or appended. This tool helps us determine whether or not the source is still adding new data.
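The sketch below is not Rearc's tool; it's just a hypothetical illustration of the general idea, flagging a dataset whose newest S3 object is older than some threshold.

```python
# A hypothetical staleness check: is the newest file under a prefix too old?
from datetime import datetime, timedelta, timezone

import boto3


def dataset_is_stale(bucket: str, prefix: str, max_age_days: int = 7) -> bool:
    """Return True if the newest object under the prefix is older than max_age_days."""
    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    objects = response.get("Contents", [])
    if not objects:
        return True  # no data at all counts as stale
    newest = max(obj["LastModified"] for obj in objects)
    return datetime.now(timezone.utc) - newest > timedelta(days=max_age_days)


# Hypothetical usage:
# if dataset_is_stale("my-data-lake-bucket", "transformed/orders/"):
#     print("Source may have stopped publishing new data")
```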
Now that we've reviewed the essential components of the modern data stack, let's tie everything together and summarize the real-world impact. While the modern data stack offers powerful capabilities, it also comes with challenges and common pitfalls that engineers must consider when deciding what is important in Data Engineering.
Data Engineering is an ever-changing field, but some components are essential in the modern world: an orchestrator, a cloud provider, IaC, a data warehouse or lakehouse, CI/CD, and a programming language with its libraries, among others. While this blog covers essential components, it is not an exhaustive list. The tools I've talked about have worked well for me in building reliable, scalable, and efficient pipelines, but the needs of each individual or organization vary. The best data stack is the one that aligns with your business, your data, and your team. Regardless, adaptability and continuous learning are surefire ways to succeed in the ever-changing landscape of Data Engineering.