Apache Airflow has become a de facto tool for Data Engineering, but don’t overlook other tools that can boost your productivity. The Hop (Hop Orchestration Platform) project is one of the lesser-known yet powerful tools incubating at Apache. In this article, I will briefly explain what Apache Hop is and demonstrate how to create a simple pipeline in Hop without writing any code.
Hop is an open-source data integration tool forked from Pentaho Data Integration (PDI), also known as Kettle. It offers a visual development tool that can make developers more productive…
Apache Airflow has become one of the most prevalent tools in the Data Engineering space. It is a platform that lets you programmatically author, schedule, and monitor workflows. Airflow can be configured to run with different executors, such as Sequential, Debug, Local, Dask, Celery, Kubernetes, and CeleryKubernetes. As a data engineer, you would generally configure Apache Airflow on your local machine to run with either the Sequential or the Local executor for development work. However, this setup prevents you from testing your workflow in a distributed environment. In this article, I will walk you through a step-by-step process to…
Are you looking to build a modern data warehouse or a data lake for your organization? If so, should you build one from the ground up or buy an off-the-shelf product? It’s a tough question to answer. In this article, I am going to share some details on a couple of open source technologies, Trino and MinIO, that can be leveraged to build this modern data warehouse either on-premises or in a public cloud. I briefly shared a step-by-step process in my previous article (How to Build a Data Lake with MinIO) on how to set it…
In this article, the focus is on building a modern data lake using only open source technologies. I will walk through a step-by-step process to demonstrate how we can leverage an S3-compatible object store (MinIO) and a distributed SQL query engine (Trino) to achieve this. This setup can be used for development by data engineers and developers. The method I discuss here can also be extended to a distributed production deployment at your organization.
Let’s download the executable from MinIO using wget and make these files executable using the…
Working as a Data Engineer. I prefer Ubuntu over Windows and nano over notepad.