10 Short Rules for a Data Engineer
— Hey, yesterday’s job failed. How can I recover the data?
— Oh! That’s bad. Well, first you need to get a list of all the files the job partially wrote to S3, delete those files, and remove their corresponding rows from the table with a “DELETE FROM … WHERE” statement. Then you can rerun the job.
— Why can’t I just rerun the job?
— Because you could create duplicates in the table.
It was just an example, but you should never have a conversation like this with a co-worker. As a data engineer, I have dealt with thorny pipelines that have taught me valuable lessons. Here are ten rules that have helped me develop scalable and resilient jobs. I have learned them from experience: breaking things, triggering alarms, and spending long hours debugging. These rules are biased toward my experience with Python, Airflow, and Kubernetes, but they may still be helpful in your career as a data engineer.
- Decouple configuration from implementation. Build Once, Deploy Anywhere. You don’t need two different pieces of code to bring two tables from PostgreSQL to Redshift. Projects like Meltano and Airbyte are good open-source integration tools, so you don’t have to reinvent the wheel.
- Make your jobs idempotent. Other people will use your pipelines. If they run your jobs multiple times (there are countless reasons to do that), make sure they will produce the same results as running them just once.
- Isolate development and production environments. This will allow you and your team to test the pipelines safely. Do you really want to overwrite a production table while developing a new version of a job?
- Keep scalability always in mind. If you work in a start-up, this is very important. Kubernetes + Airflow is an excellent combination when the number of jobs starts to grow.
- Learn how to code good CLIs. Implement user-friendly CLIs and use arguments and environment variables to configure them. Tools like Typer make it even more fun.
- Isolate job dependencies. Don’t fall into dependency hell. Virtual environments and containers are great tools to avoid conflicts between packages. And don’t forget to pin your dependencies!
- Make your pipelines easy to configure. Give visibility to your pipelines by moving configuration parameters to a human-readable format. This will add a metadata layer to your pipelines and allow non-Python users to configure them. YAML or JSON files are easier to configure than an Airflow DAG.
- Measure resource usage. If you can’t measure it, you can’t improve it. Rely on tools and services like Grafana or Datadog to monitor your jobs.
- Don’t expose your secrets. Limit permissions and keep your sensitive variables secured. Do not pass sensitive information around when possible. Kubernetes secrets or a cloud secret manager are better alternatives than Airflow variables.
- Document your pipelines, sources, and transformations. Tools like dbt allow you to document your datasets and pipelines so other people can understand them. Did you know that you can embed markdown documentation into your Airflow DAGs?
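Pinning dependencies, as the isolation rule suggests, just means locking every package to an exact version so a rebuild next month installs the same environment. A hypothetical `requirements.txt` might look like this (example versions only):

```text
# requirements.txt — every dependency pinned to an exact version
apache-airflow==2.6.3
pandas==1.5.3
requests==2.31.0
```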
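The “decouple configuration from implementation” rule can be sketched in a few lines: one generic function handles any table, and only the configuration changes. The table names, paths, and config fields below are hypothetical examples, not a real integration.

```python
# A minimal sketch: one generic job, many configurations.
# Table names, S3 paths, and config fields are hypothetical examples.

def sync_table(config: dict) -> str:
    """Render the copy statement for a single source -> target pair.

    A real job would open connections and execute the copy;
    here we only build the statement to show the idea.
    """
    return (
        f"COPY {config['target_table']} "
        f"FROM '{config['source_path']}' "
        f"REGION '{config['region']}'"
    )

# The same code handles any number of tables: only the config changes.
CONFIGS = [
    {"target_table": "analytics.users", "source_path": "s3://bucket/users/", "region": "eu-west-1"},
    {"target_table": "analytics.orders", "source_path": "s3://bucket/orders/", "region": "eu-west-1"},
]

statements = [sync_table(c) for c in CONFIGS]
```

Adding a third table means adding a dict to the config, not writing a third script.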
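The idempotency rule is the one from the opening dialogue. A common way to achieve it is the delete-then-insert pattern: the job first removes the rows for the partition it is about to write, so rerunning it replaces data instead of appending duplicates. Here is a minimal sketch using SQLite; the table and column names are hypothetical.

```python
import sqlite3

def load_partition(conn: sqlite3.Connection, day: str, rows: list) -> None:
    """Idempotent load: reruns produce the same table state as one run."""
    conn.execute("CREATE TABLE IF NOT EXISTS events (day TEXT, value INTEGER)")
    # Delete before insert: a rerun replaces the partition instead of appending.
    conn.execute("DELETE FROM events WHERE day = ?", (day,))
    conn.executemany("INSERT INTO events (day, value) VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
rows = [("2023-05-01", 1), ("2023-05-01", 2)]
load_partition(conn, "2023-05-01", rows)
load_partition(conn, "2023-05-01", rows)  # rerun: same result as running once
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

After two runs the table still holds exactly two rows, so the conversation at the top of this article never needs to happen.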
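For the CLI rule, here is a minimal sketch using the standard library’s argparse; Typer builds a friendlier interface on the same ideas (arguments plus environment-variable fallbacks). The `--table` argument and the `DB_HOST` environment variable are hypothetical examples.

```python
import argparse
import os

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Load a table into the warehouse.")
    parser.add_argument("--table", required=True, help="Fully qualified table name")
    parser.add_argument(
        "--db-host",
        default=os.environ.get("DB_HOST", "localhost"),
        help="Database host (falls back to the DB_HOST environment variable)",
    )
    return parser

# Parse an example invocation instead of sys.argv for demonstration.
args = build_parser().parse_args(["--table", "analytics.users"])
```

Defaulting an option to an environment variable lets the same CLI work locally and inside a container without code changes.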
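The configuration-file rule can be sketched with a JSON file (the article also mentions YAML, which works the same way): the schedule and table list live in a human-readable document that non-Python users can edit, while the pipeline-building code stays generic. The field names below are hypothetical.

```python
import json

# Config that would normally live in its own file next to the DAG code.
CONFIG_TEXT = """
{
  "pipeline": "daily_sync",
  "schedule": "0 6 * * *",
  "tables": ["users", "orders", "payments"]
}
"""

config = json.loads(CONFIG_TEXT)

# In an Airflow project this loop would create one task per table;
# here we just build the task identifiers to show the pattern.
task_ids = [f"{config['pipeline']}.sync_{table}" for table in config["tables"]]
```

Adding a table to the pipeline becomes a one-line config change that an analyst can review without reading any Python.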
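Before reaching for Grafana or Datadog, the measurement rule can start with the standard library: wrap the job’s work with a timer and a memory tracer. This is a minimal sketch; a production job would export these numbers to a monitoring backend instead of returning them.

```python
import time
import tracemalloc

def measured_job():
    """Run some work and report (elapsed seconds, peak bytes allocated)."""
    tracemalloc.start()
    start = time.perf_counter()

    data = [i * i for i in range(100_000)]  # stand-in for real work

    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak

elapsed, peak_bytes = measured_job()
```

Even this crude measurement tells you whether a slow job is CPU-bound or allocating far more memory than its pod requests.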
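For the secrets rule, a common pattern is to read credentials from the environment or from a file mounted by a Kubernetes secret volume, rather than storing them in plain Airflow variables. This is a hedged sketch: the variable name and mount path below are hypothetical.

```python
import os
from pathlib import Path
from typing import Optional

def get_secret(name: str, mount_dir: str = "/var/run/secrets") -> Optional[str]:
    """Look up a secret from the environment, then a mounted file."""
    # 1. Environment variable (e.g. injected from a Kubernetes secret).
    value = os.environ.get(name)
    if value is not None:
        return value
    # 2. File mounted by a Kubernetes secret volume (hypothetical path).
    secret_file = Path(mount_dir) / name.lower()
    if secret_file.exists():
        return secret_file.read_text().strip()
    return None

os.environ["DB_PASSWORD"] = "example-only"  # for demonstration only
password = get_secret("DB_PASSWORD")
```

The job code never embeds the credential, so rotating it is a deployment change, not a code change.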