Adaltas Summit 2022 Morzine

For its third edition, the whole Adaltas crew is gathering in Morzine for a full week, with 2 days dedicated to technology: the 15th and 16th of September 2022.

The speakers choose one of the 3 formats available:

  • Presentation: from 20 minutes to 1 hour
  • Demonstration: from 45 minutes to 2 hours
  • Training: from 1 hour to 2 hours

Program

Once a talk has been given, its supporting materials and an article covering it will be published on the Adaltas website. Here is the calendar and the list of topics covered during this week.

Thursday, September 15th, 2022

  • 9:30 Kubernetes Networking Lab
  • 10:45 Operating Kafka clusters on Kubernetes with Strimzi
  • 12:00 Expose your containers and virtual machines with a public IP
  • 14:30 DuckDB introduction
  • 15:30 Using LXD with Terraform for Local Development Environments
  • 16:30 Comparison of frameworks for data quality validation
  • 17:15 A brief look at Apache Arrow

Friday, September 16th, 2022

  • 9:30 Introduction to SingleStoreDB, the database for transactional and analytical workloads
  • 10:45 Introduction to Apache Iceberg, the open table format
  • 12:00 Introduction to Apache Kyuubi
  • 14:30 Vector Databases, Milvus Overview
  • 15:30 tdp-server, rest service management for tdp clusters
  • 16:30 Ballista, a Rust based distributed query engine
  • 17:15 Data protection around the world

Abstracts

Kubernetes Networking Lab

  • Speaker: Paul-Adrien CORDONNIER
  • Duration: 1h15
  • Format: talk + demo
  • Schedule: Thursday, September 15th, 2022 at 9:30

The goal of this lab is to give anyone an introduction to the world of Kubernetes network communications. We will try to cover most of the concepts at a high level and practice them in a sandbox environment.

At the end of the session, we should all know the purpose of each element in the networking stack and how it is used. The lab should also serve as a reminder when confusion inevitably occurs during your Kubernetes journey.

Here are the covered concepts:

  • Low level basic networking (CNI)
  • Kubernetes networking API (Services)
  • DNS
  • Exposing Kubernetes applications outside the cluster (LoadBalancer, Ingress, Gateways)
  • Service Mesh

Operating Kafka clusters on Kubernetes with Strimzi

  • Speaker: Leo SCHOUKROUN
  • Duration: 1h15
  • Format: talk + demo
  • Schedule: Thursday, September 15th, 2022 at 10:45

Kubernetes is not the first platform that comes to mind to run Apache Kafka clusters.

We will go through the basics of Strimzi, a Kafka operator for Kubernetes curated by Red Hat. We will focus in particular on the storage problem, which is often a pain point on bare metal Kubernetes clusters.

We will also compare Strimzi with other Kafka operators by providing their pros and cons.

The presentation will end with a demonstration presenting various use cases for Strimzi.

Expose your containers and virtual machines with a public IP

  • Speaker: David WORMS
  • Duration: 1h
  • Format: discussion + demo
  • Schedule: Thursday, September 15th, 2022 at 12:00

Virtual machines and containers are commonly exposed to the web with port forwarding. In such cases, the public IP is shared with the host machine. While this works well in many scenarios, it is sometimes necessary to associate the guest machine with its own distinct public IP, for example to host your own email server, to gain access to an internal network, or to expose Kubernetes services.

The general idea is to route the traffic from a public IP or a CIDR subnet to a guest machine running inside a host machine. Said differently, the connectivity exposes containers and virtual machines with a static public address.

It works seamlessly with any hypervisor, including VMware ESXi, Citrix Xen Server, OpenStack, and Proxmox. The procedure covered uses LXD in cluster mode.
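
To make the routing idea concrete, here is a minimal sketch in Python of the lookup it implies: given a destination address, find the guest that owns the matching public IP or CIDR subnet. It only models the decision, not the actual network configuration on the host or in LXD. The `ROUTES` table, the guest names, and the `guest_for` function are hypothetical, and the addresses are taken from the documentation ranges of RFC 5737.

```python
import ipaddress
from typing import Optional

# Hypothetical routing table: public address or CIDR block -> guest name.
# Addresses come from documentation ranges (RFC 5737), not from a real setup.
ROUTES = {
    ipaddress.ip_network("203.0.113.10/32"): "mail-container",
    ipaddress.ip_network("198.51.100.0/28"): "k8s-vm",
}


def guest_for(destination: str) -> Optional[str]:
    """Return the guest that owns the destination address, if any."""
    address = ipaddress.ip_address(destination)
    # Keep every network containing the address, then prefer the most
    # specific one (longest prefix), as a router would.
    matches = [network for network in ROUTES if address in network]
    if not matches:
        return None
    best = max(matches, key=lambda network: network.prefixlen)
    return ROUTES[best]
```

With this table, `guest_for("203.0.113.10")` resolves to the single-address guest, any address inside `198.51.100.0/28` resolves to the second guest, and an address outside both ranges yields `None`.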

DuckDB introduction

  • Speaker: Stephan BAUM
  • Duration: 1h
  • Format: presentation + demo
  • Schedule: Thursday, September 15th, 2022 at 14:30

DuckDB is an embedded columnar-vectorized OLAP DBMS using SQL queries.

We will present the architecture and specificities of the DuckDB DBMS, explain why it was created and how it achieves its performance, describe the ART indexing process, and discuss in which cases DuckDB should or should not be used.
Finally, a demo will illustrate the basic usage of DuckDB in a Python notebook and how it relates to Pandas and Apache Arrow.

Using LXD with Terraform for Local Development Environments

  • Speaker: Gauthier LEONARD
  • Duration: 1h
  • Format: talk + demo
  • Schedule: Thursday, September 15th, 2022 at 15:30

LXD is a modern, secure and powerful system container and virtual machine manager. LXD presents significant advantages over other standard virtualization tools (namely Vagrant):

  • Unified interface for managing containers, VMs, and networks
  • Super fast provisioning thanks to system containers
  • Live resizing of containers/VMs
  • Works both locally and on multi-host clusters (therefore usable for both development and production)

Yet the LXD API, the LXC CLI, and cloud-init are fairly hard for new users to grasp and do not allow easy versioning of environment configurations.

The LXD Terraform provider is an elegant solution to do infra-as-code on top of LXD. In the demo, we will see how to migrate from Vagrant+VirtualBox to Terraform+LXD for local development environments.

Comparison of frameworks for data quality validation

Data quality is an important issue that a lot of companies have not yet addressed effectively.

Even when tests are implemented, they are often executed manually on a subset of tables. Lately, I took part in setting up an automated pipeline. Based on the requirements and the technical stack, I proposed several libraries that could be used for the purpose and built a PoC with the selected one.

I would like to share my experience on the subject, describe the currently most popular frameworks for data validation, and present their pros and cons.
Namely, those frameworks are:

  • Deequ
  • Great Expectations
  • Delta Live Tables (DLT)
  • Soda
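
To illustrate what automated validation means in practice, here is a minimal sketch in plain Python of declarative checks applied to every row of a table. It deliberately does not use the API of any of the frameworks listed above. The `Check` class, the `validate` function, and the sample rows are hypothetical and only show the general idea of declaring expectations once and running them in a pipeline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    """A named expectation on a single column (hypothetical helper)."""
    name: str
    column: str
    predicate: Callable[[object], bool]


def validate(rows, checks):
    """Return a report mapping each check name to its number of failing rows."""
    failures = {check.name: 0 for check in checks}
    for row in rows:
        for check in checks:
            if not check.predicate(row.get(check.column)):
                failures[check.name] += 1
    return failures


# Made-up sample data: one null id and one negative amount.
rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": -5.0},
    {"id": None, "amount": 3.5},
]
checks = [
    Check("id_not_null", "id", lambda value: value is not None),
    Check("amount_non_negative", "amount",
          lambda value: value is not None and value >= 0),
]
report = validate(rows, checks)
```

Here `report` counts one failing row for each check. The frameworks compared in the talk offer far richer features on top of this basic pattern.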

A brief look at Apache Arrow

  • Speaker: Albert Konrad
  • Duration: 45min
  • Format: talk + demo
  • Schedule: Thursday, September 15th, 2022 at 17:15

Is it a software development platform? Is it an in-memory data storage format? Or is it just a file format? No, it is Apache Arrow.

We’ll take a very brief look at what Apache Arrow is, what problem(s) it solves, and discuss how it appeals to data engineers. In a quick demo, we’ll also test whether Apache Arrow delivers on its promise.

Introduction to SingleStoreDB, the database for transactional and analytical workloads

  • Speaker: Sergei Kudinov
  • Duration: 1h15
  • Format: presentation
  • Schedule: Friday, September 16th, 2022 at 9:30

SingleStoreDB unifies transactions and analytics in a single engine to drive low-latency access to large datasets. With its patented Universal Storage, SingleStore allows operational and analytical workloads to be processed using a single table type. Built for developers and architects, SingleStoreDB is based on a distributed SQL architecture, delivering 10-100 millisecond performance on complex queries.

The presentation will cover the architecture and optimisation techniques by which SingleStore gains performance.

Introduction to Apache Iceberg, the open table format

  • Speaker: Yanis Bariteau
  • Duration: 1h15
  • Format: presentation + demo
  • Schedule: Friday, September 16th, 2022 at 10:45

Iceberg is presently employed by organizations including Netflix, Apple, Adobe, LinkedIn, Expedia, Stripe, and others as the open standard for large analytic tables in the cloud.

It is a table format for analytical datasets that can interface with a wide range of compute engines.
It offers many capabilities that enable data professionals to handle large datasets, even up to tens of petabytes in size, and to run high-performance searches on data at rest.

Introduction to Apache Kyuubi

  • Speaker: Guillaume Holdorf
  • Duration: 45min
  • Format: presentation
  • Schedule: Friday, September 16th, 2022 at 12:00

Apache Kyuubi democratizes access to your data storage solution by allowing SQL requests from any ODBC/JDBC client. The Kyuubi servers allow you to serve a large number of requests in a distributed way and ensure high availability, high performance, and secure access to your data.

In this presentation, we will see the different features of Apache Kyuubi and what they make possible.

Vector Databases, Milvus Overview

  • Speaker: Tobias Chavarria
  • Duration: 45min
  • Format: presentation + demo
  • Schedule: Friday, September 16th, 2022 at 14:30

Milvus is an open source vector database, built for scalable similarity search. It is part of the LF AI & Data Foundation.

Milvus provides capabilities like CRUD operations, metadata filtering, and horizontal scaling. It is designed to be:

  • Highly Available
  • Highly Scalable
  • Cloud-native
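
As a rough illustration of similarity search with metadata filtering, here is a brute-force sketch in plain Python. It is not the Milvus API, and a real vector database relies on approximate indexes rather than scanning every vector. The `ITEMS` collection, the `search` function, and its `where` parameter are hypothetical, and vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(items, query, top_k=2, where=None):
    """Return the ids of the top_k items most similar to the query.

    `where` is an optional predicate on the item metadata, applied
    before ranking (a simple stand-in for metadata filtering).
    """
    candidates = [
        item for item in items if where is None or where(item["meta"])
    ]
    ranked = sorted(
        candidates,
        key=lambda item: cosine_similarity(item["vector"], query),
        reverse=True,
    )
    return [item["id"] for item in ranked[:top_k]]


# Made-up two-dimensional vectors; real embeddings have many more dimensions.
ITEMS = [
    {"id": "a", "vector": [1.0, 0.0], "meta": {"lang": "en"}},
    {"id": "b", "vector": [0.9, 0.1], "meta": {"lang": "fr"}},
    {"id": "c", "vector": [0.0, 1.0], "meta": {"lang": "en"}},
]
```

Calling `search(ITEMS, [1.0, 0.0])` ranks "a" then "b", while adding `where=lambda meta: meta["lang"] == "en"` excludes "b" before ranking.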

tdp-server, rest service management for tdp clusters

  • Speaker: Guillaume BOUTRY
  • Duration: 1h
  • Format: talk + demo
  • Schedule: Friday, September 16th, 2022 at 15:15

tdp-server is the web service exposing REST APIs over the tdp-lib core functionalities, while adding multi-user capabilities, security, and more contextual information to deployments.

As a reminder, tdp-lib core functionalities are task scheduling (through a DAG definition) and variable versioning (through git repositories).

With tdp-server, you can manage services and components as resources and use the different endpoints to read or modify their configuration: GET reads it, PUT replaces it, and PATCH modifies the current one. You cannot add services or components using POST, or delete them using DELETE. Discovering which services and components are available is done through tdp-lib and its discovery functionality.
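
The difference between replacing and modifying a configuration can be sketched in a few lines of Python. This is not the tdp-server implementation: the `put` and `patch` functions and the sample configuration are hypothetical, and the shallow merge used for PATCH is an assumption about how a partial update could behave.

```python
import copy


def put(current: dict, body: dict) -> dict:
    """PUT semantics: the body replaces the stored configuration entirely."""
    return copy.deepcopy(body)


def patch(current: dict, body: dict) -> dict:
    """PATCH semantics: the body is merged into the stored configuration.

    Assumption: a shallow merge, where keys absent from the body are kept.
    """
    merged = copy.deepcopy(current)
    merged.update(copy.deepcopy(body))
    return merged


# Made-up component configuration.
config = {"heap_size": "1g", "port": 8020}
replaced = put(config, {"heap_size": "2g"})    # "port" is gone
modified = patch(config, {"heap_size": "2g"})  # "port" is preserved
```

After a PUT, only the keys sent by the client remain. After a PATCH, the untouched keys of the current configuration survive.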

The most important feature is deploy, which lets you perform actions on the cluster. It is a simple endpoint that takes three parameters: targets, sources, and filter.
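
To give an intuition of DAG-based scheduling, here is a small sketch using Python's standard `graphlib` module. It is not the tdp-lib implementation. The task names and the `plan` function are hypothetical, and interpreting `targets` as "run these tasks and everything they depend on" is an assumption made for the example. The `sources` and `filter` parameters are not modeled.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: each task maps to the set of tasks it depends on.
DAG = {
    "zookeeper_install": set(),
    "zookeeper_start": {"zookeeper_install"},
    "hdfs_install": set(),
    "hdfs_start": {"hdfs_install", "zookeeper_start"},
    "yarn_start": {"hdfs_start"},
}


def plan(dag, targets):
    """Return an execution order for the targets and their dependencies."""
    # Collect the targets and all their transitive dependencies.
    needed, stack = set(), list(targets)
    while stack:
        task = stack.pop()
        if task not in needed:
            needed.add(task)
            stack.extend(dag[task])
    # Sort the selected sub-graph so that dependencies always come first.
    sub_dag = {task: dag[task] for task in needed}
    return list(TopologicalSorter(sub_dag).static_order())


order = plan(DAG, ["hdfs_start"])
```

In this example, `order` contains the four tasks required to start HDFS, with `hdfs_start` last, and leaves out `yarn_start` because no target depends on it.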

Ballista, a Rust based distributed query engine

  • Speaker: Gonzalo Etse
  • Duration: 45min
  • Format: presentation
  • Schedule: Friday, September 16th, 2022 at 16:15

Ballista is a distributed compute engine built with Rust that leverages Apache Arrow, Arrow Flight, and DataFusion. Its modern architecture lets other programming languages, like Python, C++, and Java, work with it without serialization issues.

Apache Arrow provides the in-memory format, while Arrow Flight enables efficient data transfer between processes. DataFusion, alongside technologies such as Google Protocol Buffers, enables fast and efficient use of memory across applications.

Ballista is still under development and is being implemented on top of DataFusion. While still in its early stages, the architecture provides excellent memory efficiency: memory usage can be 5x to 10x lower than Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of distributed compute.

Data protection around the world

  • Speaker: Paul Farault
  • Duration: 45min
  • Format: talk
  • Schedule: Friday, September 16th, 2022 at 17:00

Data protection is a fundamental subject for companies. Not only for personal data (of customers, users or employees), but also for the data of the company itself.

Both of these are discussed, starting from the Alstom case – confronted with the FCPA and the DOJ in 2014 – to the fundamental rules concerning the protection of personal data imposed by the GDPR.

This presentation marks the first step in a series about data protection. Future episodes will address technical responses to these problems.