March 2024
A typical data load for me looks like this:
To run a Databricks script, though, you have to start up a cluster, which means a 3-4 minute wait. When running multiple pipelines, this adds up.
Serverless is probably a term you’ve heard thrown around. It doesn’t actually mean there are no servers, just that developers don’t have to be aware of them. So with Databricks, I have to pick a cluster configuration (e.g. the amount of RAM, the compute type), wait for the VM to start up, and then wait for Spark and any Python libraries to install. With serverless compute, you don’t have to worry about the VM config because it just scales based on the workload. You pay for what you use, no less*, no more.
*Azure Functions have a generous free tier, so I actually never incurred any cost using them.
And so, I decided to try out serverless ETL with Azure Functions.
I won’t walk you through setting up a Function App and Functions, as there’s a perfectly good Microsoft tutorial for that. I diverged slightly by deploying the function app as Infrastructure as Code (IaC) using this guide.
Overall, getting started was pretty simple:
I used the HTTP trigger template to get started, which means your block of code runs whenever someone makes an HTTP request to the Function App. Authorization is set at the function level, meaning requests must include the function key, so only I, or applications I set up, can run the code.
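For reference, a minimal HTTP-triggered function in the v2 Python programming model looks roughly like this (the route name and handler body are just placeholders, not my actual pipeline code):

import json

import azure.functions as func

# One FunctionApp object collects all the functions in the app
app = func.FunctionApp()

# Function-level auth: callers must include the function key
@app.route(route="run_etl", auth_level=func.AuthLevel.FUNCTION)
def run_etl(req: func.HttpRequest) -> func.HttpResponse:
    # The ETL logic would go here; this stub just echoes the query parameters
    params = dict(req.params)
    return func.HttpResponse(json.dumps(params), mimetype="application/json")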
Local development works really well, and is especially nice after using Databricks, which has no way to do a proper local setup (shame on Databricks for not providing this!).
The next challenge was connecting to our data lake from within the function.
For this, we need to install a Python library:
pip install azure-storage-file-datalake
This example reads a JSON file, using an account key for authentication:
import json

from azure.storage.filedatalake import DataLakeServiceClient

account_url = "https://ACCOUNT_NAME.dfs.core.windows.net"
account_key = "ACCOUNT_KEY"
container = "CONTAINER"
json_path = "PATH/TO/FILE.JSON"

# Authenticate against the storage account with the account key
service_client = DataLakeServiceClient(account_url, credential=account_key)
file_system_client = service_client.get_file_system_client(container)
file_client = file_system_client.get_file_client(json_path)

# Download the file contents and parse them as JSON
downloaded_file = file_client.download_file().readall()
json_data = json.loads(downloaded_file)
My goal initially was just to read in a Parquet file and write it out as a Delta table. The simplest way to do this is with Polars; I’d recommend checking it out if you haven’t tried it. It’s similar to Pandas, but faster and with a more functional API.
pip install polars
And a simple read -> write:
import polars as pl

storage_options = {
    "account_name": "ACCOUNT_NAME",
    "account_key": "ACCOUNT_KEY"
}

# Read the Parquet file straight from the data lake
parquet_path = "az://CONTAINER/path/to/file.parquet"
df = pl.read_parquet(parquet_path, storage_options=storage_options)

# Write it back out as a Delta table
delta_path = "az://CONTAINER/path/to/delta"
df.write_delta(delta_path, storage_options=storage_options)
Success! I deployed this to the Function App in Azure (through VS Code), and set it up to be called from Data Factory. I now had a cheaper, quicker, and fully serverless ETL pipeline.
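For context, calling a function secured at the function level just means including the key with the HTTP request. This rough Python sketch (with a placeholder URL, key, and payload) shows what Data Factory is effectively doing when it calls the function:

import requests

# Placeholder values; the real URL and key come from the deployed Function App
function_url = "https://MY_FUNCTION_APP.azurewebsites.net/api/run_etl"
function_key = "FUNCTION_KEY"

# The function key goes in the x-functions-key header (a ?code= query parameter also works)
response = requests.post(
    function_url,
    headers={"x-functions-key": function_key},
    json={"table": "my_table"},  # hypothetical payload telling the function what to load
)
response.raise_for_status()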
When running this on an entire database, 90% of the tables loaded successfully, but the rest failed with out-of-memory errors, as the consumption plan has a maximum of 1.5 GB of RAM. Polars has streaming options for larger-than-memory datasets, but it doesn’t have one for writing Delta tables, although they plan to implement one.
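To show what that streaming path looks like, here’s a rough sketch of a lazy Parquet-to-Parquet copy (local paths for simplicity; how well the streaming engine handles cloud paths depends on your Polars version):

import polars as pl

# Lazily scan the source file so nothing is loaded into memory up front
lf = pl.scan_parquet("path/to/large_file.parquet")

# sink_parquet runs the query with the streaming engine and writes the result
# out in chunks, so the full dataset never has to fit in RAM. There's no
# equivalent sink for Delta tables (yet), which is the limitation I hit above.
lf.sink_parquet("path/to/output.parquet")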
It’s a really cheap option for loading relatively small datasets, and it could handle bigger datasets wherever a streaming option is available in Polars (e.g. reading/writing CSV/Parquet). I decided not to stick with it for ETL, but I still used it for generating some JSON config files from database metadata.
Given the RAM limitations in Azure Functions, I plan to try out Azure Container Apps / Container Instances next as they should be able to provide a more robust and flexible serverless experience while still costing next to nothing.