Deploying Prefect Data Pipeline Directly from GitHub
Orchestrating a data pipeline with Prefect, with the code stored on GitHub.
Prefect Deployments provide a neat way to keep data pipeline code in GitHub (as well as S3, GCS, Azure and other storage options). As part of the DE Zoomcamp 2023 week-2 homework (topic: workflow orchestration), I had to try out the Prefect-GitHub integration.
There were a couple of gotchas!
Gotcha 1: You need to have the Python code locally for the initial deployment. (After that, any changes in the GitHub repo take effect in the pipeline automatically.)
Gotcha 2: The local Python script has to sit in the same relative folder structure as in the GitHub repo. More on this below.
Below I describe how I set up the orchestration using GitHub:
Step 1: Upload the code to a GitHub repository. My public repository was: link.
Step 2: Open a shell and change to the folder that holds a copy of the Python script (or to a folder whose sub-folders mirror the folder structure of the GitHub repo and contain the script). In my case, I created a folder named youtube_github and put the local copy of my script etl_web_to_gcs.py there.
Step 3: Run the prefect deployment command, as below:
prefect deployment build etl_web_to_gcs.py:etl_parent_flow -sb github/github-hw-2-de-zoomcamp -n github_deployment_parent_flow -a
*Notes on gotcha 2: If your code is in a sub-folder of the GitHub repo (let's say a sub-folder named “flows”), then you need to put etl_web_to_gcs.py in a local folder named “flows” and modify the deployment command as follows:
prefect deployment build ./flows/etl_web_to_gcs.py:etl_parent_flow -sb github/github-hw-2-de-zoomcamp -n github_deployment_parent_flow -a
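The folder-mirroring rule can be captured in a small helper. This is only a sketch: build_deployment_command is a hypothetical function of my own, not part of Prefect. It does nothing more than assemble the same CLI arguments used above from the script's path inside the repo, and it assumes the command is then run from a local folder that mirrors the repo root.

```python
from pathlib import PurePosixPath


def build_deployment_command(repo_script_path, flow_name, storage_block, deployment_name):
    """Assemble the `prefect deployment build` arguments for a script in a repo.

    repo_script_path is the script's path relative to the repo root; the same
    relative path must exist locally, under the folder the command is run from.
    """
    path = PurePosixPath(repo_script_path)
    if path.is_absolute():
        raise ValueError("use a path relative to the repo root")
    # Entrypoint has the form <relative path>:<flow function name>
    entrypoint = f"{path}:{flow_name}"
    return [
        "prefect", "deployment", "build", entrypoint,
        "-sb", storage_block,    # storage block, as in the commands above
        "-n", deployment_name,   # deployment name
        "-a",                    # apply the deployment right after building it
    ]


cmd = build_deployment_command(
    "flows/etl_web_to_gcs.py",
    "etl_parent_flow",
    "github/github-hw-2-de-zoomcamp",
    "github_deployment_parent_flow",
)
print(" ".join(cmd))
```

The printed line can be pasted into the shell, or the list can be handed to subprocess.run from the mirrored local folder.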
Step 4: Run the agent that will pick up the deployment:
prefect agent start --work-queue "default"
Step 5: Run the Prefect Orion server:
prefect orion start
Step 6: Provide parameters for the run. I did a quick run from the web UI. You can also do a custom run.
Notes to myself:
Some bugs that I need to reproduce (prefect version 2.7.7):
The Prefect Orion UI did not pass default values to quick runs.
If a value equal to the default is entered as a modified value, the UI treats it as the default, and the run fails saying no value was passed, just as a quick run does. For example, the default value for year was 2020. Whenever I tried to run with 2020 (as the default year), the quick run from the UI failed saying year is a required value. When I modified the year parameter in the UI but still entered 2020 (as the modified year value), the run again failed saying year is a required parameter. Workaround: I changed the default year to 1900 so that I could at least run 2020 from the UI.
When I did not provide any default values, the month parameter from the UI got messed up! For example, when I passed month as a list, [11,12], the web UI passed the parameter as the string “[11,12]” (the wrong data type, rather than the list [11,12]). Not surprisingly, the quick run failed. As a workaround, I now provide [19,20] as the default months list. With that in place, whenever I pass months from the UI, they arrive as a list (and at least the code works!).
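Another way to guard against the string-versus-list problem is to normalise the value inside the code itself. The sketch below is not a fix for the UI behaviour, and normalize_months is a hypothetical helper name of my own; it simply accepts either a real list or a JSON-style string such as "[11,12]" and returns a list of integers.

```python
import json


def normalize_months(months):
    """Return months as a list of ints, given a list or a string like "[11,12]"."""
    if isinstance(months, str):
        # A string such as "[11,12]" is parsed as JSON into a real list
        months = json.loads(months)
    if isinstance(months, int):
        # A single month (e.g. parsed from "3") becomes a one-element list
        months = [months]
    return [int(m) for m in months]


print(normalize_months("[11,12]"))  # [11, 12]
print(normalize_months([11, 12]))   # [11, 12]
```

Calling something like this at the top of the flow function would let the rest of the code assume a list, whatever the UI sent.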
The YAML file generated by the “prefect deployment build” command takes its parameters from the local Python script, not from the GitHub repo. Preliminary experimenting suggests to me that Prefect doesn’t really check that the local Python script (which is used to generate the deployment YAML file) matches the script in the GitHub repo. This needs further checking. But in subsequent runs, changes in the GitHub repo are reflected in the output. Perhaps the default parameters are defined in the YAML file (which is in turn derived from the local Python script).
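If the defaults really do come from the local script (my guess, not something I have verified), then a quick way to see what the local copy would contribute is to read the defaults off the function signature. The sketch below is plain Python introspection, not a Prefect API. The function is only a stand-in for my flow, and the parameter names months and year are assumptions based on the notes above.

```python
import inspect


def etl_parent_flow(months=[19, 20], year=1900):
    """Stand-in for the local flow function; the body does not matter here."""


def local_defaults(func):
    """Map each parameter that has a default value to that default."""
    return {
        name: param.default
        for name, param in inspect.signature(func).parameters.items()
        if param.default is not inspect.Parameter.empty
    }


print(local_defaults(etl_parent_flow))
```

Comparing this output for the local copy and for the copy in the GitHub repo would show whether the two scripts disagree on defaults before a deployment is built.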
Relevant resources:
YouTube video on prefect with GitHub.