Template for a Dataflow Flex Template in Python

Overview


This repository contains a template for a Dataflow Flex Template written in Python that can be used to build Dataflow jobs that run in STOIX using the Dataflow runner.

The code is based on the same example data as the Google Cloud Python Quickstart: "King Lear", a tragedy by William Shakespeare.

The Dataflow job reads the file content, counts the occurrences of each word, and inserts the results into a BigQuery table. The scheduled date is also appended to the table name, producing a sharded output table.
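
Below is a minimal sketch of such a word-count pipeline in Apache Beam. It is an illustration only: the function name run, the argument names, and the BigQuery schema are assumptions and may differ from the template's actual main.py.

import re

import apache_beam as beam


def run(input_file, output_table, options):
    # output_table is the output_table_prefix with the scheduled date appended,
    # e.g. "my_dataset.kinglear_20210101", which yields a sharded table per run.
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromText(input_file)
            | "ExtractWords" >> beam.FlatMap(lambda line: re.findall(r"[\w']+", line))
            | "CountWords" >> beam.combiners.Count.PerElement()
            | "ToTableRow" >> beam.Map(lambda kv: {"word": kv[0], "count": kv[1]})
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                output_table,
                schema="word:STRING,count:INTEGER",
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            )
        )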

Source data: gs://dataflow-samples/shakespeare/kinglear.txt

Template maintained by STOIX.

Configuration

The job is configured with the following pipeline options:

  • stoix_scheduled - Scheduled datetime in RFC 3339 format
  • input_file - Text file to read
  • output_dataset - BigQuery dataset for the output table
  • output_table_prefix - BigQuery output table name prefix
  • project - Google Cloud project ID

When using the Dataflow runner, stoix_scheduled is set automatically; the other pipeline options can be added as described in the Dataflow runner README.
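
As an illustration, the custom options above could be declared as an Apache Beam PipelineOptions subclass along the following lines (the class name CountWordsOptions is hypothetical; project and temp_location are standard Google Cloud options and need no declaration here):

from apache_beam.options.pipeline_options import PipelineOptions


class CountWordsOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Names mirror the pipeline options listed above.
        parser.add_argument("--stoix_scheduled", help="Scheduled datetime in RFC 3339 format")
        parser.add_argument("--input_file", help="Text file to read")
        parser.add_argument("--output_dataset", help="BigQuery dataset for the output table")
        parser.add_argument("--output_table_prefix", help="BigQuery output table name prefix")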

Test the code

Tox is used to format, lint, and test the code. Install it with pip install tox and then run tox from within the project folder.
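
For example, from within the project folder:

$ pip install tox
$ tox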

Run pipeline

In order to work with the code locally, you can use a Python virtual environment. Make sure to use Python 3.7.10, as it is the version supported by Google Dataflow.

$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -e .

Run on local machine

See the Dataflow quickstart for Python for a further description of the arguments.

python -m main \
    --region europe-north1 \
    --runner DirectRunner \
    --stoix_scheduled 2021-01-01T00:00:00Z \
    --input_file gs://dataflow-samples/shakespeare/kinglear.txt \
    --output_table_prefix kinglear \
    --output_dataset <DATASET> \
    --project <PROJECT_ID> \
    --temp_location gs://<BUCKET>/tmp/

Build Docker image for STOIX

In order to run the pipeline, the Flex Template needs to be packaged in a Docker image and pushed to a Docker image repository. In this example, Docker Hub is used.
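
For reference, a Python Flex Template image is typically built on Google's Flex Template launcher base image. The Dockerfile sketch below is an assumption about what such a Dockerfile roughly looks like, not a copy of this repository's:

FROM gcr.io/dataflow-templates-base/python3-template-launcher-base

ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE=/template/setup.py
ENV FLEX_TEMPLATE_PYTHON_PY_FILE=/template/main.py

COPY . /template
WORKDIR /template

RUN pip install -e .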

Set the tag to the name and version of your pipeline, e.g. stoix/count-words:1.0.0.

$ docker build --tag stoix/count-words:1.0.0 .

Then upload the image to the Docker image repository.

$ docker push stoix/count-words:1.0.0

Run Dataflow on STOIX

Now the Dataflow Flex Template job can be run using the Dataflow runner. Add a new job with the image stoix/dataflow-runner and the following environment variables:

  • GCP_PROJECT_ID: Google Cloud project ID
  • GCP_REGION: europe-north1
  • GCP_SERVICE_ACCOUNT: BASE64 encoded service account JSON
  • JOB_IMAGE: stoix/count-words:1.0.0
  • JOB_NAME_PREFIX: count-words
  • JOB_PARAM_INPUT_FILE: gs://dataflow-samples/shakespeare/kinglear.txt
  • JOB_PARAM_OUTPUT_DATASET: dataflow
  • JOB_PARAM_OUTPUT_TABLE_PREFIX: kinglear
  • JOB_SDK_LANGUAGE: python

Note: When running this in production, set GCP_SERVICE_ACCOUNT as a secret instead of an environment variable.

License

MIT

Owner
STOIX