The data lake has revolutionized the data world: it allows you to store relational data from business applications and operational databases, as well as non-relational data from mobile apps, IoT devices, and social media, at any scale. Having all your data in this centralized repository (the data lake) opens up new ways to perform analytics, from building a basic dashboard to running complex machine learning models that help businesses make better decisions.

The data lake solves many problems that were once challenging, such as storing data at scale, keeping structured and unstructured data in one place, and running analytics and…


Apache Airflow has become the de facto tool for data engineering, but don't overlook other tools that can boost your productivity. Hop (the Hop Orchestration Platform) is one of the lesser-known yet powerful projects incubating at Apache. In this article, I briefly explain what Apache Hop is and demonstrate how to create a simple pipeline with Hop without writing any code.

Official Apache Hop Project Logo

What is Apache Hop?

Hop is an open-source data integration tool that began as a fork of Pentaho Data Integration (PDI), also known as Kettle. It offers a visual development environment that can make developers more productive…


Apache Airflow has become one of the most prevalent tools in the data engineering space. It is a platform that lets you programmatically author, schedule, and monitor workflows. Airflow can be configured to run with different executors, such as Sequential, Debug, Local, Dask, Celery, Kubernetes, and CeleryKubernetes. As a data engineer, you would generally configure Apache Airflow on your local machine with either the Sequential or Local executor for your development work. However, this setup precludes you from testing your workflow in a distributed environment. In this article, I will walk you through a step-by-step process to…
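To make the executor discussion concrete, here is a minimal sketch of where that setting lives. The file path and contents below are illustrative stand-ins for a real `$AIRFLOW_HOME/airflow.cfg`, not values from the article:

```shell
# Create an illustrative airflow.cfg (real installs generate this
# file under $AIRFLOW_HOME when you first run Airflow).
mkdir -p /tmp/airflow_demo
cat > /tmp/airflow_demo/airflow.cfg <<'EOF'
[core]
executor = SequentialExecutor
EOF

# Switch to LocalExecutor to run tasks in parallel on one machine.
# Note: executors other than Sequential require a real metadata
# database (e.g. PostgreSQL or MySQL) instead of the default SQLite.
sed -i 's/^executor = .*/executor = LocalExecutor/' /tmp/airflow_demo/airflow.cfg
```

Moving to Celery or Kubernetes executors is the same one-line change in `[core]`, plus the corresponding broker or cluster configuration.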


Are you looking to build a modern data warehouse or a data lake for your organization? If so, should you build one from the ground up or use an off-the-shelf product? It's a tough question to answer. In this article, I share some details on a couple of open-source technologies, Trino and MinIO, that can be leveraged to build a modern data warehouse either on-premises or on a public cloud. In my previous article (How to Build a Data Lake with MinIO), I briefly shared a step-by-step process on how to set it…


In this article, the focus is on building a modern data lake using only open-source technologies. I will walk through a step-by-step process to demonstrate how we can leverage an S3-compatible object store (MinIO) and a distributed SQL query engine (Trino) to achieve this. This setup is suitable for development work by data engineers and developers, and the method discussed here can also be extended to a distributed production deployment at your organization.
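As a sketch of how Trino and MinIO are typically wired together, the snippet below writes a Trino catalog file that points the Hive connector at a local MinIO endpoint. The endpoint, credentials, and metastore URI are placeholder assumptions for illustration, not values from the article:

```shell
# Illustrative Trino catalog directory (a real install uses etc/catalog
# under the Trino server directory).
mkdir -p /tmp/trino_demo/etc/catalog

# Catalog file: Hive connector backed by MinIO's S3-compatible API.
# All hosts, ports, and keys below are placeholders.
cat > /tmp/trino_demo/etc/catalog/minio.properties <<'EOF'
connector.name=hive
hive.metastore.uri=thrift://localhost:9083
hive.s3.endpoint=http://localhost:9000
hive.s3.aws-access-key=minioadmin
hive.s3.aws-secret-key=minioadmin
hive.s3.path-style-access=true
EOF
```

With a catalog like this in place, tables whose data lives in MinIO buckets can be queried from Trino as ordinary SQL tables.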


Prerequisites

  1. Linux OS (64-bit)
  2. Python 2.6.x, 2.7.x, or 3.x
  3. Java JDK

Step 1: Download MinIO and MinIO Client

Let’s download the executables from MinIO using wget and make these files executable using the…
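The commands for this step typically look like the following. The download URLs are MinIO's standard linux-amd64 release locations (verify against min.io for your platform), and the offline fallback exists only so the sketch still demonstrates the permission change:

```shell
cd /tmp

# Download the MinIO server and the MinIO Client (mc).
# URLs assumed from MinIO's standard release location; the
# `|| touch` fallback creates empty placeholders when offline.
wget -q https://dl.min.io/server/minio/release/linux-amd64/minio \
  || touch minio
wget -q https://dl.min.io/client/mc/release/linux-amd64/mc \
  || touch mc

# Make both files executable.
chmod +x minio mc
```

After this, `./minio server <data-dir>` starts the server and `./mc` administers it from the command line.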

Tejas Marvadi

Working as a Data Engineer. I prefer Ubuntu over Windows and nano over notepad.
