10 Short Rules for a Data Engineer

They will save you headaches and your co-workers will be grateful.

Credit: MichaelGaida/Pixabay
  1. Decouple configuration from implementation. Build Once, Deploy Anywhere. You don’t need two different pieces of code to bring two tables from PostgreSQL to Redshift. Projects like Meltano and Airbyte are good open-source integration tools, so you don’t have to reinvent the wheel.
  2. Make your jobs idempotent. Other people will use your pipelines. If they run your jobs multiple times (there are countless reasons to do that), make sure they produce the same result as running them just once.
  3. Isolate development and production environments. This will allow you and your team to test the pipelines safely. Do you want to overwrite a production table when developing a new version of a job?
  4. Always keep scalability in mind. If you work in a start-up, this is especially important. Kubernetes + Airflow is an excellent combination when the number of jobs starts to grow.
  5. Learn how to build good CLIs. Implement user-friendly CLIs and use arguments and environment variables to configure them. Tools like Typer make it even more fun.
  6. Isolate job dependencies. Don’t fall into dependency hell. Virtual environments and containers are great tools to avoid conflicts between packages. And don’t forget to pin your dependencies!
  7. Make your pipelines easy to configure. Give visibility to your pipelines by moving configuration parameters to a human-readable format. This will add a metadata layer to your pipelines and allow non-Python users to configure them. YAML or JSON files are easier to configure than an Airflow DAG.
  8. Measure resource usage. If you can’t measure it, you can’t improve it. Rely on tools and services like Grafana or Datadog to monitor your jobs.
  9. Don’t expose your secrets. Limit permissions and keep your sensitive variables secured. Avoid passing sensitive information around whenever possible. Kubernetes secrets or a cloud secret manager are safer alternatives to Airflow variables.
  10. Document your pipelines, sources, and transformations. Tools like dbt allow you to document your datasets and pipelines so other people can understand them. Did you know that you can embed Markdown documentation into your Airflow DAGs?
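Rule 2 in practice: a minimal sketch of an idempotent load, using the standard library’s sqlite3 as a stand-in warehouse (the table and columns are hypothetical). The job replaces the day’s partition instead of appending, so re-running it converges on the same state:

```python
import sqlite3

def load_daily_sales(conn, day, rows):
    """Idempotent load: replace the day's partition instead of appending."""
    cur = conn.cursor()
    # Delete-then-insert inside one transaction: re-runs leave the same rows.
    cur.execute("DELETE FROM sales WHERE day = ?", (day,))
    cur.executemany(
        "INSERT INTO sales (day, amount) VALUES (?, ?)",
        [(day, amount) for amount in rows],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, amount REAL)")
load_daily_sales(conn, "2021-06-01", [10.0, 20.0])
load_daily_sales(conn, "2021-06-01", [10.0, 20.0])  # accidental re-run
count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(count)  # 2, not 4
```

A naive `INSERT`-only job would have doubled the rows on the second run.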
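One simple way to apply rule 3 is an environment switch that routes writes to a sandbox schema. `PIPELINE_ENV` and the schema names here are assumptions for illustration; real setups usually isolate credentials and clusters too:

```python
import os

# Default to dev so a forgotten variable can never hit production.
ENV = os.environ.get("PIPELINE_ENV", "dev")

def target_table(name, env=ENV):
    # Dev runs write to a sandbox schema, never to the production one.
    schema = "analytics" if env == "prod" else "analytics_dev"
    return f"{schema}.{name}"

print(target_table("orders"))          # sandbox unless PIPELINE_ENV=prod
print(target_table("orders", "prod"))  # the production table
```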
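For rule 5, here is a sketch of a CLI configured through both arguments and environment variables, using the standard library’s argparse (Typer gives you the same pattern with less boilerplate). The flag and variable names are made up for illustration:

```python
import argparse
import os

def build_parser():
    parser = argparse.ArgumentParser(
        description="Copy a table between databases."
    )
    parser.add_argument("table", help="table to copy")
    parser.add_argument(
        "--target-schema",
        # Environment variable as a fallback, CLI flag wins.
        default=os.environ.get("TARGET_SCHEMA", "public"),
        help="target schema (defaults to the TARGET_SCHEMA env var)",
    )
    return parser

args = build_parser().parse_args(["events", "--target-schema", "analytics"])
print(args.table, args.target_schema)
```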
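Rule 7 sketched with a JSON config (stdlib-only; YAML works the same way with PyYAML). The keys are hypothetical, but the idea is that a non-Python user can change the source table or schedule without touching the DAG code:

```python
import json

# Hypothetical pipeline config a non-Python user could edit.
config_text = """
{
  "source": {"type": "postgres", "table": "orders"},
  "target": {"type": "redshift", "schema": "analytics"},
  "schedule": "0 3 * * *"
}
"""

config = json.loads(config_text)
# The pipeline code reads everything from the config, not from constants.
print(config["source"]["table"], "->", config["target"]["schema"])
```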
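A minimal way to start measuring (rule 8) before wiring up Grafana or Datadog: a timing decorator that records each step’s wall-clock duration. This is a sketch, not a metrics pipeline; in production you would ship the number to your monitoring backend instead of keeping it on the function:

```python
import functools
import time

def timed(func):
    """Record the wall-clock duration of the last call on the wrapper."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_duration = time.perf_counter() - start
        return result
    return wrapper

@timed
def transform(rows):
    return [r * 2 for r in rows]

out = transform([1, 2, 3])
print(f"transform took {transform.last_duration:.6f}s")
```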
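For rule 9, a basic habit is reading secrets from the environment, where Kubernetes secrets or a cloud secret manager can inject them, and failing fast when they are missing instead of falling back to a default. The variable name is an assumption:

```python
import os

def get_db_password():
    # Fail fast on a missing secret; never hardcode a fallback value.
    password = os.environ.get("DB_PASSWORD")
    if password is None:
        raise RuntimeError("DB_PASSWORD is not set")
    return password

# In production this is injected by the secret manager, not set in code.
os.environ["DB_PASSWORD"] = "s3cret"
print(get_db_password())
```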



Fran Lozano

Data engineer, software developer, continuous learner, curious, stoic, investor.