What is the Easiest Way to Move Data from Stripe to Postgres?
Lately, I've been finding quite a few opportunities to satisfy my curiosity on Twitter/X. The latest one came from this tweet:
> What is the best way to sync data from Stripe to Postgres? I don’t want to write any code or pay anything if possible.
>
> Ideally, something that automatically creates the necessary tables and syncs with a single command, and if it’s open source, even better.
>
> — Xoel López
I believe extracting data from Stripe is quite a common use case in many digital businesses that want to track metrics like their MRR (Monthly Recurring Revenue) or analyze payment data by connecting it with their CRM and/or product backend. In this case, the request could be generalized to “I want to move data from <insert SaaS here> to <insert database here>” with the following requirements:
- It should be free.
- Without writing a single line of code.
For this case, I assumed real-time data was not needed.
I have been moving data for the last five years, and to this day I'm still not sure which tool to recommend. I've built ad-hoc integrations with Stripe and have used Meltano and Airbyte for similar tasks, though never with their Stripe connectors. Both Meltano and Airbyte seem like good tools to me, but I recommended Airbyte because it has a cloud offering that allows for a quick, free, and visual proof of concept right in the browser.
After recommending it (my mistake, I should have tested the connector first 🤦🏻), I created a trial account in Airbyte Cloud and set up a new connection, but creating it failed with a 502 error. I tried on several different days, so I assume it wasn't a temporary issue.
It seems that for this use case, Airbyte (or at least Airbyte Cloud) wasn’t the best option. Another comment in the thread suggested dlt, so I thought it would be a good opportunity to try it out and draw my own conclusions.
Testing dlt
dlt is a Python library for extracting data from multiple sources and loading it into a database or file system. In fact, its name (dlt = data load tool) is very reminiscent of another well-known library for data engineers: dbt (data build tool). dlt aims to be the EL part of the ELT (Extract-Load-Transform) paradigm, leaving the T to dbt.
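To give a sense of how little ceremony it requires, here is a minimal sketch of a dlt pipeline (my own toy example, not from the Stripe source; it loads a Python list into DuckDB):

```python
import dlt

# A minimal end-to-end load: extract an in-memory list and load it into DuckDB.
# dlt infers the schema and creates the table automatically.
pipeline = dlt.pipeline(
    pipeline_name="quickstart",
    destination="duckdb",
    dataset_name="demo_data",
)

load_info = pipeline.run(
    [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}],
    table_name="users",
)
print(load_info)
```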
One of the things that surprised me was its simplicity and lightness. With `dlt init stripe_analytics postgres`, I had a fully functional skeleton where I only had to fill in a `secrets.toml` file to make it work. However, this broke one of the original requirements: not writing a single line of code. Even though it's just configuration, you still need to clone the repository, create a virtual environment, install dependencies... and then automate its execution on a remote machine. Certainly a lot of work for a no-code tool.
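For reference, the entry point that `dlt init` generates looks roughly like this (simplified; take the exact imports and defaults as an approximation of the `stripe_analytics` verified source):

```python
import dlt
from stripe_analytics import stripe_source  # generated by `dlt init stripe_analytics postgres`

# Credentials (Stripe API key, Postgres connection) come from .dlt/secrets.toml,
# so making this run is configuration, not code.
pipeline = dlt.pipeline(
    pipeline_name="stripe_analytics",
    destination="postgres",
    dataset_name="stripe_data",
)

load_info = pipeline.run(stripe_source())
print(load_info)
```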
So, I set out to put together a way to deploy an automated pipeline using nothing but the browser.
To start, I modified the main script generated by dlt to expose a much more configurable interface: load type (full or incremental) and time interval (start and end date). I don't like incremental pipelines that rely on stored state, because they make it hard to backfill data when, for example, the data wasn't ready at the source (very well explained in this article).
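A sketch of what that interface looks like, assuming the `stripe_source` / `incremental_stripe_source` functions exposed by the verified source (the environment variable names and the `parse` helper are mine):

```python
import os
from datetime import datetime

import dlt
from stripe_analytics import incremental_stripe_source, stripe_source

# All knobs come from environment variables, so the same script can run
# a full backfill or an incremental load without touching the code.
load_type = os.environ.get("LOAD_TYPE", "full")  # "full" or "incremental"
start_date = os.environ.get("START_DATE")        # e.g. "2023-01-01"
end_date = os.environ.get("END_DATE")            # e.g. "2023-02-01"


def parse(value):
    return datetime.fromisoformat(value) if value else None


pipeline = dlt.pipeline(
    pipeline_name="stripe_analytics",
    destination="postgres",
    dataset_name="stripe_data",
)

if load_type == "full":
    # Explicit date window: backfills are just a matter of passing a
    # different interval, with no stored state involved.
    source = stripe_source(start_date=parse(start_date), end_date=parse(end_date))
else:
    source = incremental_stripe_source(initial_start_date=parse(start_date))

print(pipeline.run(source))
```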
I also added a Docker image so the pipeline can run without creating a virtual environment or installing dependencies, and can be configured entirely with environment variables. This is very useful when running jobs in Kubernetes (point 5 of my 10 rules for a data engineer).
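Credentials can travel the same way: dlt resolves anything it would read from `secrets.toml` from environment variables too, nesting sections with double underscores. The exact names below assume a `[sources.stripe_analytics]` / `[destination.postgres]` layout in `secrets.toml`, so treat them as an assumption:

```python
import os

# Equivalent of .dlt/secrets.toml, expressed as environment variables that
# dlt resolves at runtime (set via `docker run -e ...` or a Kubernetes Secret):
#
#   SOURCES__STRIPE_ANALYTICS__STRIPE_SECRET_KEY -> the Stripe API key
#   DESTINATION__POSTGRES__CREDENTIALS           -> the Postgres connection string
#
# A quick sanity check before running the pipeline inside the container:
for var in (
    "SOURCES__STRIPE_ANALYTICS__STRIPE_SECRET_KEY",
    "DESTINATION__POSTGRES__CREDENTIALS",
):
    assert os.environ.get(var), f"missing required environment variable: {var}"
```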
To meet the first requirement (it should be free), I used GitHub Actions, which gives you 2000 minutes per month for free, to automate the loading process. To host the PostgreSQL database, I chose Tembo, which offers free PostgreSQL instances with 0.25 vCPU and 1 GB of memory and supports the Hydra extension in its DataWarehouse stack (because a data warehouse can be “just Postgres”).
With all this, you could automate the extraction and loading of Stripe to PostgreSQL without leaving the browser simply by:
- Forking this repository on GitHub.
- Configuring the environment variables as GitHub secrets.
I have to admit that I cheated a bit: this works well as a proof of concept, but if you want to tweak anything in the pipeline, you have to get your hands dirty and touch some code. Even so, dlt is so well designed that just by changing a couple of variables you can make the pipeline write to a different database (like DuckDB). A real find, and one that I will probably recommend from now on.
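For example, switching the destination from Postgres to DuckDB is essentially a one-word change in the pipeline definition:

```python
import dlt
from stripe_analytics import stripe_source

# Same pipeline, different destination: dlt creates the equivalent schema
# in DuckDB exactly as it did in Postgres.
pipeline = dlt.pipeline(
    pipeline_name="stripe_analytics",
    destination="duckdb",  # was "postgres"
    dataset_name="stripe_data",
)
print(pipeline.run(stripe_source()))
```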
See you soon!