How to Set Up a Local MWAA Development Environment

Tejas Marvadi · Published in CodeX · 6 min read · Oct 10, 2021


Apache Airflow is an open-source tool to programmatically author, schedule, and monitor workflows. It is one of the most robust platforms used by Data Engineers for orchestrating workflows or pipelines. Setting up and managing this tool along with its underlying infrastructure requires quite an effort. This is where Amazon Managed Workflows for Apache Airflow (MWAA) shines. MWAA makes it easier to set up and operate end-to-end data pipelines in the cloud at scale. With MWAA, much of the management and infrastructure overhead goes away, but it also becomes harder for Data Engineers to develop and test on their local machines. This article demonstrates how you can replicate an MWAA environment locally, so development can be conducted without frequently pushing your code and DAGs to your AWS S3 bucket.

Prerequisites:

  • Linux (64-bit)
  • Docker
  • Docker-Compose

Step 1: Structure Setup

In order to mimic the infrastructure, we are going to spin up four different docker containers using docker-compose, as depicted below. We are leveraging the Localstack docker image, a cloud emulator that runs in a single container, and an Amazon-provided local MWAA docker image, which is similar to the MWAA production image. For the MWAA metadata backend, we are using the publicly available Postgres docker image. Finally, we use a containerized AWS CLI to initialize and set up services in the Localstack container.

Before we get started, let’s first review the folder structure for this project. The parent folder mwaa contains four folders (.aws, dags, plugins, and landing_bucket) and one file, docker-compose.yml. The dags folder contains a sample DAG, sample_dag.py, which we will execute at the end to demonstrate our local MWAA environment. The remaining folders support the MWAA configuration setup.

mwaa
|__.aws
|  |__config
|  |__credentials
|__dags
|  |__sample_dag.py
|__plugins
|  |__plugins.zip
|__landing_bucket
|  |__customer.csv
|__docker-compose.yml
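
If you are starting from scratch, you can create this layout with a couple of shell commands (a minimal sketch; the remaining files are created in the following steps):

# Create the project skeleton shown in the tree above
mkdir -p mwaa/.aws mwaa/dags mwaa/plugins mwaa/landing_bucket
touch mwaa/docker-compose.yml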

Step 2: Define Localstack Service

Localstack is a cloud service emulator that offers a wide variety of AWS services, so you can run your AWS applications entirely on your local machine without connecting to an actual cloud provider. For this demonstration, we are going to use only three services: S3, IAM, and SecretsManager. You can refer to the official Localstack GitHub repository for a complete list of supported services. Let’s begin by creating a docker-compose.yml file as indicated below. We are exposing port 4566 so we can interact with services such as S3 and SecretsManager.

version: "3"services:
localstack:
image: localstack/localstack:latest
container_name: "localstack"
ports:
- "4566:4566"
environment:
- SERVICES=s3,iam,secretsmanager,ec2
- DOCKER_HOST=unix:///var/run/docker.sock
- DEFAULT_REGION=us-east-1
volumes:
- "/var/run/docker.sock:/var/run/docker.sock"
- "/tmp/localstack:/tmp/localstack"

Step 3: Define Postgres Service

MWAA uses an Aurora PostgreSQL database as the backend for the Airflow metadata database, where DAG runs and task instances are stored. This Postgres container fills that role when emulating our MWAA environment locally. Let’s append the following code to the docker-compose.yml file.

  postgres:
    image: postgres:10-alpine
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow
    volumes:
      - "${PWD}/db-data:/var/lib/postgresql/data"

Step 4: Define Amazon Local MWAA Service

The Amazon local MWAA docker image allows you to run a local Apache Airflow environment to develop and test DAGs, custom plugins, and dependencies before deploying them to MWAA. Add the following code to the docker-compose.yml file.

  local-runner:
    image: amazon/mwaa-local:2.0.2
    user: "1000:1000"
    restart: always
    depends_on:
      - postgres
    environment:
      - LOAD_EX=n
      - EXECUTOR=Local
      - AIRFLOW_CONN_AWS_DEFAULT=aws://a:a@?host=http://localstack:4566&region_name=us-east-1
      - AWS_DEFAULT_REGION=us-east-1
      - AWS_PROFILE=default
    volumes:
      - "${PWD}/dags:/usr/local/airflow/dags"
      - "${PWD}/plugins:/usr/local/airflow/plugins"
      - "${PWD}/.aws:/usr/local/airflow/.aws"
    ports:
      - "8080:8080"
    command: local-runner
    healthcheck:
      test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
      interval: 30s
      timeout: 30s
      retries: 3
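
After bringing the stack up in Step 8, you can follow the local-runner startup and confirm that its health check passes. A couple of hedged examples; the exact container name is generated by docker-compose, so we filter on the service name:

# Tail the Airflow scheduler and webserver startup logs
docker-compose logs -f local-runner
# Check the container status, which includes the health check result
docker ps --filter "name=local-runner" --format "{{.Names}}: {{.Status}}"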

Step 5: Define AWS CLI Service

We are using the mesosphere/aws-cli docker image, which contains the AWS CLI and allows us to easily create mock resources in Localstack by issuing CLI commands. We are mapping the local landing_bucket folder from our host into this container so we can push a CSV file to an S3 bucket, my-landing-bucket, which we also create from this container. This is a short-lived container; it terminates soon after it finishes creating the bucket and copying the file into it.

  aws-cli:
    image: mesosphere/aws-cli
    container_name: "aws-cli"
    volumes:
      - ./landing_bucket:/tmp/landing:ro
    environment:
      - AWS_ACCESS_KEY_ID=dummyaccess
      - AWS_SECRET_ACCESS_KEY=dummysecret
      - AWS_DEFAULT_REGION=us-east-1
    entrypoint: /bin/sh -c
    command: >
      "
      echo Waiting for localstack service start...;
      while ! nc -z localstack 4566;
      do
      sleep 1;
      done;
      echo Connected!;

      aws s3api create-bucket --bucket my-landing-bucket --endpoint-url http://localstack:4566;
      aws s3 cp /tmp/landing s3://my-landing-bucket --recursive --endpoint-url http://localstack:4566"
    depends_on:
      - localstack
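
Once Localstack is up, you can confirm that this short-lived container created the bucket and copied the CSV file. A minimal sketch, assuming the AWS CLI is installed on your host and the dummy credentials from the earlier sanity check are still exported:

# The container logs should end with "Connected!" followed by the upload output
docker logs aws-cli
# List the objects in the mock bucket through the Localstack endpoint
aws --endpoint-url=http://localhost:4566 s3 ls s3://my-landing-bucket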

Step 6: Create a DAG & Data File for Testing

This is a sample DAG file to demonstrate our working MWAA environment using the S3 service. We are using a PythonOperator to execute a read_file function, which reads data directly from an S3 bucket. Use the following code to create sample_dag.py and place it under the dags folder.

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.hooks.S3_hook import S3Hook
from datetime import datetime

default_args = {
    "owner": "airflow",
    "start_date": datetime(2021, 10, 1),
}

def read_file():
    hook = S3Hook()
    file_content = hook.read_key(
        key="customer.csv", bucket_name="my-landing-bucket"
    )
    print(file_content)

with DAG("sample_dag", schedule_interval=None, catchup=False, default_args=default_args) as dag:
    t1 = PythonOperator(
        task_id='read_s3_file',
        python_callable=read_file
    )
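
Once the stack is running, you can confirm that the scheduler picked up the DAG without import errors. A quick check, assuming the airflow CLI is available inside the running local-runner container:

# sample_dag should appear in the list of parsed DAGs
docker-compose exec local-runner airflow dags list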

Let’s create a new file, customer.csv, and place it under the landing_bucket folder; this is the file the DAG reads from S3.

id,firstname,lastname,email,email2,profession
100,Anestassia,Yate,Anestassia.Yate@opmail.com,Anestassia.Yate@gmail.com,police officer
101,Demetris,Taima,Demetris.Taima@opmail.com,Demetris.Taima@gmail.com,police officer
102,Neila,Karylin,Neila.Karylin@opmail.com,Neila.Karylin@gmail.com,worker
....

Step 7: Create a Credentials File for Localstack

Create the following credentials file under the .aws folder so our local MWAA can use it to interact with Localstack resources.

[default]
aws_access_key_id = dummyaccess
aws_secret_access_key = dummysecret
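
The folder structure in Step 1 also lists a config file under .aws. A minimal sketch to create it from the mwaa directory; the region matches the compose settings, while the output format is an assumption:

# Create .aws/config alongside the credentials file (output format is assumed)
cat > .aws/config <<'EOF'
[default]
region = us-east-1
output = json
EOF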

Step 8: Deploy Local MWAA

If everything is configured correctly, you should be able to execute the following command from the mwaa directory. If you encounter any issues starting the docker containers, check your configuration files.

docker-compose up -d

Let’s confirm all containers are up and running by executing the following command:

docker ps

If you see all three containers running, then you have successfully deployed your MWAA environment on your local machine. Note that the fourth container, aws-cli, may not appear in the list because it terminates soon after it finishes executing the CLI commands defined in our docker-compose.yml file. Now let’s verify everything is working by executing our sample DAG. Navigate to the Airflow UI at http://localhost:8080 and log in with the username admin and the password test.
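
If you prefer the command line over the UI, you can also unpause and trigger the DAG from inside the local-runner container. A hedged sketch using the standard Airflow 2.x CLI:

# Unpause the DAG so the manually triggered run is picked up, then trigger it
docker-compose exec local-runner airflow dags unpause sample_dag
docker-compose exec local-runner airflow dags trigger sample_dag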

After triggering the DAG, check the log for the read_s3_file task; if the run succeeded, you should see the contents of the CSV file from the S3 bucket.

If you made it this far, then you have successfully deployed a local MWAA environment on your machine. Congratulations!

Conclusion

In this article, I explained how to quickly stand up an MWAA development environment using Docker on your local machine. This setup will help you during workflow development and testing. I hope the scripts in this article serve as a starting point for quickly standing up an environment for your own needs. If you found these step-by-step instructions useful, please leave a comment below.

You can read my article How to Scale-out Apache Airflow 2.0 with Redis and Celery if you would like to stand up a distributed Airflow environment with Redis and Celery on your local machine.
