Version Control Your Data Lake with lakeFS

Tejas Marvadi · Published in CodeX · 4 min read · Apr 1, 2021


The data lake brought a revolution to the data world: it lets you store relational data from business applications and operational databases, as well as non-relational data from mobile apps, IoT devices, and social media, all at any scale. Having all of your data in this centralized repository opens up new ways to perform analytics, from building a basic dashboard to running complex machine learning that helps businesses make better decisions.

The data lake solves many problems that were once challenging: storing data at scale, keeping structured and unstructured data in one place, and running analytics and machine learning on large data sets. This is a huge advancement; however, there is still a big gap in the data lake, and that gap is data management. Without proper cataloging mechanisms, a data lake turns into a “data swamp”. I would like to introduce you to lakeFS, an open-source platform that may help you avoid that fate. It brings resilience and manageability to your existing data lake, and it allows you to build repeatable, atomic, and versioned data lake operations.

What is lakeFS?

lakeFS is an open-source platform that enables you to manage your data lake the way you manage your code. It transforms your object storage into a Git-like repository, so you can run parallel pipelines for experimentation and CI/CD for your data. It works seamlessly with modern data frameworks such as Spark, Kafka, Presto, Databricks, Airflow, Delta Lake, and many more. You can deploy it on any S3-compatible storage solution, either in the cloud or on-premises. lakeFS is distributed as a single binary that comprises several logical services; the complete architecture is described on the official website.
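To give a sense of how transparent this integration is, here is a minimal sketch of reading data through the lakeFS S3 gateway with PySpark. The endpoint, repository name, branch, path, and credentials are all illustrative placeholders; check the lakeFS documentation for the configuration that matches your deployment.

```python
# A minimal sketch: reading from a lakeFS repository with PySpark through
# the S3 gateway. Endpoint, repository, branch, and credentials are
# illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakefs-read-example")
    # Point the S3A connector at the lakeFS endpoint instead of AWS S3
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:8000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.access.key", "<LAKEFS_ACCESS_KEY_ID>")
    .config("spark.hadoop.fs.s3a.secret.key", "<LAKEFS_SECRET_ACCESS_KEY>")
    .getOrCreate()
)

# lakeFS paths follow the pattern s3a://<repository>/<branch>/<object path>
df = spark.read.parquet("s3a://example-repo/main/events/")
df.show()
```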

lakeFS is format-agnostic, so you can perform Git-like operations over any object storage. It uses a relational database (PostgreSQL) to store its metadata, keeping the metadata separate from your object storage; this in turn saves costs because it avoids duplicating data. In a plain data lake, there is no easy way to isolate data: the only option is to copy it into a different bucket. This is where lakeFS shines, because it allows you to isolate data without copying it. A single file in your data lake can therefore have many pointers to it.
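This is easiest to see in a branch creation, which is a pure metadata operation. The sketch below, with hypothetical names, calls the lakeFS REST API using Python’s requests library; the endpoint and payload follow the lakeFS OpenAPI specification, so verify them against the docs for your version.

```python
# A sketch of creating an isolated branch without copying any objects.
# Names, endpoint, and credentials are illustrative placeholders.
import requests

LAKEFS_API = "http://localhost:8000/api/v1"
AUTH = ("<LAKEFS_ACCESS_KEY_ID>", "<LAKEFS_SECRET_ACCESS_KEY>")  # basic auth

# Branching creates only pointers to existing data -- nothing is duplicated
resp = requests.post(
    f"{LAKEFS_API}/repositories/example-repo/branches",
    json={"name": "experiment-1", "source": "main"},
    auth=AUTH,
)
resp.raise_for_status()
```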

Why do you need lakeFS?

lakeFS provides a Git-like branching and committing model that allows changes to happen in isolation, which makes it a natural fit for a development environment for data. You can manipulate production data by creating a branch: it gives you an isolated snapshot to experiment with, without risking the actual production data. lakeFS also keeps different versions of the same data over time and lets you time-travel between them. You can read from the data lake as it was at any point in time, compare changes, and safely roll back if necessary.
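Because the S3 gateway accepts a ref (a branch name, tag, or commit ID) as the first path segment, time travel amounts to reading the same path at a different ref. Here is a small sketch with boto3; the repository, path, and commit ID are made up for illustration.

```python
# A sketch of "time travel": reading the same path at two different refs
# through the lakeFS S3 gateway (bucket = repository, key = <ref>/<path>).
# All names and the commit ID are illustrative placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8000",  # lakeFS S3 gateway, not AWS S3
    aws_access_key_id="<LAKEFS_ACCESS_KEY_ID>",
    aws_secret_access_key="<LAKEFS_SECRET_ACCESS_KEY>",
)

# The latest state of the object on the main branch
latest = s3.get_object(Bucket="example-repo", Key="main/users/daily.csv")

# The same path as it existed at an earlier commit
earlier = s3.get_object(Bucket="example-repo", Key="c1047f1a/users/daily.csv")
```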

Data engineers will find lakeFS especially helpful when they are ingesting new data or performing validation. Instead of ingesting data directly into the main branch, the data can first be ingested into an isolated branch. This allows them to perform validation (data format and schema enforcement) before introducing the new data into the data lake, as sketched below. Ingesting into an isolated branch also minimizes data corruption in production, whether it comes from upstream changes or from recently deployed code.
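Here is what that ingest-validate-merge workflow might look like, again with hypothetical names and with a trivial check standing in for real validation logic. The commit and merge calls follow the lakeFS OpenAPI specification; confirm the exact endpoints for your version (lakectl exposes the same operations on the command line).

```python
# A sketch of the ingest-validate-merge workflow: write to an isolated
# branch, validate, commit, and only then merge into main.
# All names, paths, and endpoints are illustrative placeholders.
import boto3
import requests

LAKEFS = "http://localhost:8000"
AUTH = ("<LAKEFS_ACCESS_KEY_ID>", "<LAKEFS_SECRET_ACCESS_KEY>")

s3 = boto3.client("s3", endpoint_url=LAKEFS,
                  aws_access_key_id=AUTH[0], aws_secret_access_key=AUTH[1])

# 1. Ingest into the isolated branch, never directly into main
with open("events.json", "rb") as f:
    s3.put_object(Bucket="example-repo",
                  Key="ingest-2021-04-01/raw/events.json", Body=f)

# 2. Validate before exposing the data (a real pipeline would enforce
#    format and schema here)
obj = s3.head_object(Bucket="example-repo",
                     Key="ingest-2021-04-01/raw/events.json")
assert obj["ContentLength"] > 0, "empty file -- abort before merge"

# 3. Commit the branch, then merge it into main
api = f"{LAKEFS}/api/v1/repositories/example-repo"
requests.post(f"{api}/branches/ingest-2021-04-01/commits",
              json={"message": "ingest events for 2021-04-01"},
              auth=AUTH).raise_for_status()
requests.post(f"{api}/refs/ingest-2021-04-01/merge/main",
              json={}, auth=AUTH).raise_for_status()
```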

In the event of low-quality data, lakeFS allows you to revert the data to its previous version instantly, in one atomic action. The ability to move between versions of the data can also be useful to data scientists: they can train multiple versions of their machine learning models and compare their accuracy with ease, without duplicating the data set in the data lake.
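A rollback might look like the sketch below. The revert endpoint and payload are based on the lakeFS OpenAPI specification and may differ between versions, so treat this as an assumption to verify; lakectl provides the same operation from the command line.

```python
# A sketch of atomically rolling back a bad change: create a new commit on
# main that undoes an earlier commit. The commit ID is an illustrative
# placeholder, and the endpoint should be verified against your version.
import requests

LAKEFS_API = "http://localhost:8000/api/v1"
AUTH = ("<LAKEFS_ACCESS_KEY_ID>", "<LAKEFS_SECRET_ACCESS_KEY>")

resp = requests.post(
    f"{LAKEFS_API}/repositories/example-repo/branches/main/revert",
    json={"ref": "c1047f1a"},  # the commit to undo
    auth=AUTH,
)
resp.raise_for_status()
```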

lakeFS allows multiple developers or teams to work in isolation on the same data without stepping on each other’s toes. It also minimizes data duplication in your data lake, which can significantly reduce storage costs if you are using cloud storage. Finally, it helps you avoid turning your data lake into a “data swamp”.

Conclusion

lakeFS is a very interesting project, and one that provides a lot of value and flexibility for teams working on a data lake. It enables your data lake with Git-like operations: branching, merging, committing, reverting, and rolling back. Like many other open-source technologies, lakeFS is evolving and adding new features as more people join its community and contribute. I have briefly highlighted some lakeFS use cases in this article. What other use cases do you foresee benefiting from lakeFS? Please share your thoughts and ideas in the comments below.
