Building a data application with DuckDB
Visualizing the car and motorcycle market in Spain from my computer.
DuckDB is a technology that has gained a lot of attention in recent months. I discovered its full potential a few months ago, when I was able to query parquet files from my terminal in a very simple way.
Since I work with parquet files in my day-to-day job, using the DuckDB CLI significantly increased my productivity: I could check that table schemas were correct, add data in a matter of seconds, and even read files directly from S3. Quite a discovery.
But why is DuckDB so revolutionary?
As Kojo explains in their blog, DuckDB fills the gap in embedded databases for online analytical processing (OLAP).
It is no longer necessary to use a stand-alone data warehouse like Snowflake or BigQuery to develop an application that performs complex analytical queries. With DuckDB you have the simplicity of SQLite and the functionality of Snowflake, on your local machine. This opens up a whole new world of possibilities for developing data applications with open-source tools.
This trend toward data applications was underscored this year by Snowflake's acquisition of Streamlit. DuckDB is not limited to embedded data applications: you can also use it for rapid prototyping with tools such as Streamlit, and later migrate to Snowflake or another data warehouse to serve a web application.
On the plus side, dbt has an integration with DuckDB, and take my word for it, it is blazing fast. It lets you use dbt to transform data without having to run a local database like PostgreSQL. In addition, dbt can snapshot your data, keeping a history of changes in a table so you can analyze it over time. You can even query the data from DataGrip (or PyCharm Professional), which makes it especially attractive for data application development.
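To give an idea of how light that setup is, a dbt-duckdb profile only needs to point at a database file (project name and path below are illustrative):

```yaml
# profiles.yml -- hypothetical project name and path
car_market:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: data/app.duckdb   # the single file DuckDB stores everything in
      threads: 4
```

No server to start, no credentials to manage: `dbt run` builds the tables straight into that file.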
In the middle of the year, I felt the need to buy my first car. In some big cities like Madrid, regulations on polluting cars are getting stricter and stricter, so I opted for a hybrid. But there was a problem: the second-hand market for hybrid cars is not as developed as the market for traditional fuel cars, which meant there were not many good opportunities. My hypothesis was that there was a lack of supply of second-hand hybrids, so I wanted to analyze the market and see whether opportunities could still be found.
Here I could see that the share of hybrids among new cars (green and purple, second and third charts) was greater than among used cars (first chart), where it barely reached 5%.
At this point, I wanted to go a little further and see if it was possible to build an application to track the change in car prices over time. I managed to convince my friend Jorge, who was interested in buying a vehicle (a motorcycle instead of a car), and I presented him with a design of what I wanted the application to look like.
Jorge, always willing to learn and work on new things, was fine with it, so we got down to work.
When we started, we only had a Bash script, and a Jupyter notebook with some charts, but we needed to create an interactive application. This is when we decided to use DuckDB as the application database because of its speed and simplicity.
In addition, we needed a snapshot of prices every day so we could chart their evolution and spot opportunities. With these requirements, I decided to use Dagster instead of Airflow: I wanted to try the platform, it integrates well with dbt, and Jorge already had experience with it. We used dbt both to create the tables the application would read and to take the snapshots.
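A dbt snapshot for the daily prices could look roughly like this (table and column names are made up for illustration; with the `check` strategy, dbt records a new row whenever the price changes):

```sql
-- snapshots/listing_prices.sql (illustrative)
{% snapshot listing_prices %}
{{
    config(
        target_schema='snapshots',
        unique_key='listing_id',
        strategy='check',
        check_cols=['price']
    )
}}
select listing_id, price, url
from {{ source('raw', 'listings') }}
{% endsnapshot %}
```

Each run of `dbt snapshot` then appends validity ranges (`dbt_valid_from`, `dbt_valid_to`), which is exactly the history you need to chart price evolution.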
We migrated the Bash script to a Python module, and Jorge adapted it to also download the motorcycle dataset. Using dbt macros, we reused a lot of code in the data modeling, since the two models were practically identical. We defined the dependencies between tasks in Dagster by creating a job.
Jorge was in charge of the development of the visualization application. For this purpose, he used Dash, a framework for building data apps in Python. With this tool, we were able to explore from different perspectives the car and motorcycle market in Spain, and track changes in prices.
Finally, we prepared the application so that it could be deployed with a single command using Docker Compose. We also tried deploying it on an EC2 machine so it would refresh the database daily in an automated way, but we ran into a problem: AWS IPs were blocked by coches.net, so we couldn't automate this last step.
Thanks to open-source tools, and especially DuckDB, it was possible to build a data application in a simple way. I also learned that the second-hand market for hybrid cars in Spain still has room to grow and lacks supply compared to fuel cars, which made it difficult to find good opportunities.
You can find the full code of the application on GitHub. See you soon!