CODEX

Apache Airflow has become a de facto tool for Data Engineering, but don’t overlook other tools out there that can boost your productivity. The Hop (Hop Orchestration Platform) project is one of the lesser-known yet powerful tools incubating at Apache. In this article, I am going to briefly explain what Apache Hop is and demonstrate how to create a simple pipeline with Hop without writing any code.

Official Apache Hop Project Logo

Hop is an open-source data integration tool that began as a fork of Pentaho Data Integration (PDI, also known as Kettle). It offers a visual development environment that can make developers more productive…



Apache Airflow has become one of the most prevalent tools in the Data Engineering space. It is a platform that lets you programmatically author, schedule, and monitor workflows. Airflow can be configured to run with different executors, such as Sequential, Debug, Local, Dask, Celery, Kubernetes, and CeleryKubernetes. As a data engineer, you would generally configure Apache Airflow on your local machine with either the Sequential or the Local executor for development work. However, that setup precludes you from testing your workflow in a distributed environment. In this article, I will walk you through a step-by-step process to…
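Switching from a local executor to a distributed one is largely a configuration change. As a rough sketch (the connection URLs are placeholders, and this assumes Redis and PostgreSQL are available; section names can vary slightly between Airflow versions), an `airflow.cfg` set up for the Celery executor might look like:

```ini
[core]
# Switch from SequentialExecutor/LocalExecutor to a distributed executor
executor = CeleryExecutor

[database]
# Celery workers need a real database, not SQLite (placeholder credentials)
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow

[celery]
# Placeholder URLs; point these at your own broker and result backend
broker_url = redis://localhost:6379/0
result_backend = db+postgresql://airflow:airflow@localhost:5432/airflow
```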



Are you looking to build a modern data warehouse or a data lake for your organization? If so, should you build one from the ground up or buy an off-the-shelf product? It’s a tough question to answer. In this article, I am going to share some details on two open-source technologies, Trino and MinIO, that can be leveraged to build this modern data warehouse either on-premises or on a public cloud. I briefly shared a step-by-step process in my previous article (How to Build a Data Lake with MinIO) on how to set it…



In this article, the focus is on building a modern data lake using only open-source technologies. I will walk through a step-by-step process to demonstrate how we can leverage an S3-compatible object store (MinIO) and a distributed SQL query engine (Trino) to achieve this. This setup is suitable for development work by data engineers and developers, and the method discussed here can also be extended to a distributed production deployment at your organization.
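Once both services are running, Trino talks to MinIO through a catalog file. As an illustrative sketch only (the endpoint, credentials, and metastore URI below are placeholders, and this assumes a Hive metastore is available), a catalog at `etc/catalog/minio.properties` might look like:

```properties
# Hive connector backed by MinIO's S3-compatible API
connector.name=hive
# Placeholder: address of your Hive metastore
hive.metastore.uri=thrift://localhost:9083
# Placeholder MinIO endpoint and credentials
hive.s3.endpoint=http://localhost:9000
hive.s3.aws-access-key=minioadmin
hive.s3.aws-secret-key=minioadmin
# MinIO is addressed by path-style URLs rather than virtual-hosted buckets
hive.s3.path-style-access=true
```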


Prerequisite

  1. Linux OS (64-bit)
  2. Python 2.6.x, 2.7.x, or 3.x
  3. Java JDK

Step 1: Download MinIO and MinIO Client

Let’s download the server and client executables from MinIO using wget and make these files executable using the…

Tejas Marvadi

Working as a Data Engineer. I prefer Ubuntu over Windows and nano over notepad.
