CODEX

Modern Data Platform using Open Source Technologies

Published in

CodeX

5 min readJan 11, 2021

Are you looking to build a modern data warehouse or a data lake for your organization? If so, should you build one from the ground up or use off-the-shelf from the market? It’s a tough question to answer. In this article, I am going to share some detail on a couple of open source technologies — Trino and MinIO — that can be leveraged to build this modern data warehouse either on-premises or on a public cloud. I briefly shared a step-by-step process in my previous article (How to Build a Data Lake with MinIO) on how to set it up. These two powerful yet lightweight and scalable tools have so much to offer. It’s worth looking at them before you decide to shell out money to a cloud-based data-warehousing company. This article is going to give you an overview of Trino and MinIO, and it also touches upon some features that they can offer once you implement them together as a data platform.

What is Trino?

Trino is really fast and proven at scale with performance. It is a distributed SQL query engine designed to handle a large volume of data over heterogeneous data sources. It can perform a cross-platform query — meaning you can join data across multiple databases that are on separate servers, and they do not need to be within the same data center. It supports both structured and semi-structured data. This gives advantages to data analysts and data scientists because the data is readily available for them to use.

Trino architecture is very similar architecture as MPP (Massively Parallel Processing) database management system.

It has a coordinator node working in-sync with multiple worker nodes. When users submit their SQL query to the coordinator, which uses a custom query and execution engine to parse, plan, and schedule a distributed query plan across the worker nodes. It is designed to support standard ANSI SQL semantics, including complex queries, aggregations, joins, sub-queries, window functions, distinct counts, and approximate percentiles.

No matter where your data lives, Trino uniformly allows access to data through its SQL abstraction layer. The query result returns are extremely fast because of in-memory processing. The intermediate data is pipelined across nodes in the MPP fashion, thus it leads to highly interactive query performance. Trino allows querying data where it lives, and this works for both NoSQL and SQL databases. A single query can combine data from multiple sources allowing you to analyze the data across your entire organization. Trino offers a wide range of connectors such as MySQL, Oracle, SQL Server, Redshift, Kafka, Elasticseach, and many more (see complete list).

What is MinIO?

MinIO is a high performance distributed object storage server, which is compatible with Amazon S3. It can store unstructured data such as photos, videos, log files, backups, and container images. It can be deployed on different hardware. The power of MinIO unleased when you deploy it as a cluster in multiple nodes. It is designed to be minimal and scalable. It is light enough to be bundled along with other application stacks. It protects against data corruption and hardware failure using erasure code and bit-rot protection. It offers a wide range of authentication integration such as Active Directory and Okta. It provides both high-performance and strict read after write consistency for all I/O operations. It is an open-source distributed object storage solution that runs on commodity servers.

Trino + MinIO = Modern Data Platform

Trino and MinIO together can create a modern data platform or you can call it a modern data warehouse. The implementation of MinIO provides a storage solution, whereas Trino serves as a compute. The separation of both the storage and compute provide a scalable platform without any disruption. With this implementation, you have full flexibility to scale out the MinIO cluster without impacting the Trino cluster and vice versa. You can even deploy Trino and MinIO on a public cloud platform in Kubernetes or in a managed service Kubernetes, and it would provide an ultimate solution for your data platform.

The combination of Trino and MinIO provide essentially similar functionalities as what cloud-based data-warehousing companies have to offer. You do not have to worry about data ingress and egress cost with this setup. It also eliminates the vendor lock-in and allows us much-needed flexibility. If you have the right staff and expertise in your team, it’s worth exploring these technologies. The benefit includes having all your data in your private cloud if you decide to deploy it on-premises.

Here are some arbitrary examples demonstrating some of the operations that can be performed when you build your data platform with both Trino and MinIO.

Create a schema-on-read on top of the data files that are in the MinIO bucket. You can essentially create schema-on-read on a variety of file types including ORC, Parquet, JSON, and many more.

CREATE TABLE datalake.first_schema.customer 
(customer_id VARCHAR, 
 name VARCHAR, 
 full_name VARCHAR
)
WITH (external_location='s3a://raw-data/',format = 'csv', skip_header_line_count=1);

Running a query across disparate source systems without moving any data into the MinIO cluster. With a predicate push-down support, the performance of your query would be lightning fast.

SELECT erp_account.id, erp_account_name, sum(crm_customer.amount) 
FROM 
oracle_erp.erp_schema.account erp_account,   # Oracle
mysql_crm.crm_schema.customer crm_customer,  # MySQL datalake.first_schema.customer csv_customer  # CSV File
WHERE erp_account.cmr_id = crm_customer.id
  AND crm_customer.id = csv_customer.id
GROUP BY erp_account.id, erp_account_name;

The Machine Learning plugin provides ML functionality that can train Support Vector Machine (SVM) based classifiers and regressors for the supervised learning problems using SQL.

SELECT
classify(features(2.9, 3, 5.1, 1.8), model) AS predicted_label
FROM 
(SELECT learn_classifier(species, features(sepal_length, sepal_width, petal_length, petal_width)) AS model 
FROM 
datalake.data_analytics.iris
) t;

Trino provides a client interface for different languages such as Python and GO to connect Trino. Please refer to their Github for more details. Here’s a Python code example of connecting to Trino using Basic Authentication.

import trino
conn = trino.dbapi.connect(
    host='coordinator url',
    port=8443,
    user='the-user',
    catalog='the-catalog',
    schema='the-schema',
    http_scheme='https',
    auth=trino.auth.BasicAuthentication("principal id", "password"),
)
cur = conn.cursor()
cur.execute('SELECT * FROM system.runtime.nodes')
rows = cur.fetchall()

Conclusion

In this article, I share detail on two powerful open-source technologies — Trino and MinIO. Together they allow you to build a modern data platform either on-premises or on a public cloud. If you have the right resources in your team, I recommend considering this setup because it provides flexibility and full control over your data.

If you are interested in building this platform, you can refer to the step-by-step basic implementation guide in my previous article (How to Build a Data Lake with MinIO).

I hope you find this article useful. Please leave your comment below for this article.