I am Duyệt

Data Engineering Tools written in Rust

Python is a popular language among data engineers, but it is not the most robust or secure option. Data engineering is an essential part of modern software development, and Rust is becoming an increasingly popular language for the job. Rust is efficient, offering fast computation and low-latency responses together with a high level of safety and security. Its feature set may make it an ideal choice for many data engineering tasks.

Rust also offers a growing number of libraries and frameworks for data engineering. They cover areas such as data analysis, data visualization, and machine learning, which can make data engineering easier and more efficient.

This blog post will provide an overview of the data engineering tools available in Rust, their advantages and benefits, as well as a discussion on why Rust is a great choice for data engineering.

DataFusion

DataFusion is an SQL query engine built on Apache Arrow. It covers much of the same ground as Apache Spark and similar query engines, and uses Arrow's in-memory columnar format as its backbone to process data efficiently. Developers can use it to build applications that query millions of records at once and process complex data quickly. DataFusion also supports a wide variety of data sources, so the data can be read from wherever it lives.

Highlight features:

  • Feature-rich SQL support & DataFrame API.
  • Blazingly fast, vectorized, multi-threaded, streaming exec engine.
  • Native support for Parquet, CSV, JSON & Avro.
  • Extension points: user-defined functions, custom plan & exec nodes. Streaming, async IO from object stores.

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
	// create the dataframe
	let ctx = SessionContext::new();
	let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new()).await?;
	
	let df = df.filter(col("a").lt_eq(col("b")))?
	  .aggregate(vec![col("a")], vec![min(col("b"))])?
	  .limit(0, Some(100))?;
	
	// execute and print results
	df.show().await?;
	Ok(())
}

Output:
+---+--------+
| a | MIN(b) |
+---+--------+
| 1 | 2      |
+---+--------+
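
To make the query above concrete, here is a minimal, std-only sketch of what it computes: keep rows where a <= b, group by a, take MIN(b), and cap the number of groups. The `Table` type, the `min_b_by_a` function, and the sample data are assumptions made up for illustration. This is not how DataFusion is implemented; the real engine works on Arrow record batches with vectorized, multi-threaded operators.

```rust
use std::collections::BTreeMap;

/// A tiny column-oriented table with two integer columns, `a` and `b`.
/// Hypothetical type for illustration; not part of DataFusion.
struct Table {
    a: Vec<i64>,
    b: Vec<i64>,
}

/// Keep rows where a <= b, group by `a`, and take MIN(b) per group.
/// Returns at most `limit` groups, ordered by `a` (a side effect of using
/// a BTreeMap here; a SQL engine gives no ordering without ORDER BY).
fn min_b_by_a(table: &Table, limit: usize) -> Vec<(i64, i64)> {
    let mut groups: BTreeMap<i64, i64> = BTreeMap::new();
    for (&a, &b) in table.a.iter().zip(table.b.iter()) {
        if a <= b {
            groups
                .entry(a)
                .and_modify(|current| *current = (*current).min(b))
                .or_insert(b);
        }
    }
    groups.into_iter().take(limit).collect()
}

fn main() {
    // Sample data invented for this sketch.
    let table = Table {
        a: vec![1, 1, 3, 2],
        b: vec![2, 5, 1, 2],
    };
    for (a, min_b) in min_b_by_a(&table, 100) {
        println!("a = {a}, MIN(b) = {min_b}");
    }
}
```

Storing each column as its own vector mirrors the columnar layout Arrow uses, which is what lets an engine scan one column without touching the others.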

Polars

Polars is a blazingly fast DataFrames library implemented in Rust that uses the Apache Arrow columnar format for memory management. It is often described as a faster alternative to pandas; h2oai's db-benchmark compares the two.

The columnar format allows for high-performance data processing. Combined with Rust's performance, it makes Polars a strong choice for data scientists looking for an efficient and powerful DataFrames library.

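use polars::prelude::*;

fn main() -> PolarsResult<()> {
	// Method names follow the Polars release this post was written against;
	// newer releases rename some of them (for example, groupby became group_by).
	let df = LazyCsvReader::new("reddit.csv")
		.has_header(true)
		.with_delimiter(b',')
		.finish()? // build a LazyFrame; nothing is read yet
		.groupby([col("comment_karma")])
		.agg([
			col("name").n_unique().alias("unique_names"),
			col("link_karma").max(),
		])
		.fetch(100)?; // run the query on a small sample of rows

	println!("{df}");
	Ok(())
}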

Highlight features:

  • Lazy | eager execution
  • Multi-threaded
  • SIMD
  • Query optimization
  • Powerful expression API
  • Hybrid Streaming (larger than RAM datasets)
  • Rust | Python | NodeJS | …
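
The difference between lazy and eager execution can be sketched in a few lines of plain Rust. In the sketch below, `LazyQuery` and `Step` are hypothetical names invented for illustration and are not part of the Polars API. Calling `filter`, `map`, or `limit` only records a step; nothing runs until `collect` is called. A real lazy engine such as Polars would also optimize the recorded plan before running it, which this sketch does not attempt.

```rust
/// One recorded step in a deferred query plan (hypothetical, not Polars).
enum Step {
    Filter(fn(&i64) -> bool),
    Map(fn(i64) -> i64),
    Limit(usize),
}

/// A lazy query: it only records steps until `collect` is called.
struct LazyQuery {
    source: Vec<i64>,
    steps: Vec<Step>,
}

impl LazyQuery {
    fn new(source: Vec<i64>) -> Self {
        LazyQuery { source, steps: Vec::new() }
    }

    fn filter(mut self, pred: fn(&i64) -> bool) -> Self {
        self.steps.push(Step::Filter(pred));
        self
    }

    fn map(mut self, f: fn(i64) -> i64) -> Self {
        self.steps.push(Step::Map(f));
        self
    }

    fn limit(mut self, n: usize) -> Self {
        self.steps.push(Step::Limit(n));
        self
    }

    /// Run the recorded steps in order and materialize the result.
    fn collect(self) -> Vec<i64> {
        let mut data = self.source;
        for step in self.steps {
            data = match step {
                Step::Filter(pred) => data.into_iter().filter(|x| pred(x)).collect(),
                Step::Map(f) => data.into_iter().map(f).collect(),
                Step::Limit(n) => data.into_iter().take(n).collect(),
            };
        }
        data
    }
}

fn main() {
    let query = LazyQuery::new((1..=10).collect())
        .filter(|x| *x % 2 == 0)
        .map(|x| x * 10)
        .limit(3);
    // Up to this point only the plan exists; the work happens in collect().
    println!("{:?}", query.collect());
}
```

Because the whole plan is known before any work starts, an engine has room to reorder or combine steps, for example by pushing a filter down into the file scan.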

Delta Lake Rust

Delta Lake provides a native interface in Rust that gives low-level access to Delta tables. This interface can be used with data processing frameworks such as datafusion, ballista, polars, vega, etc. Additionally, bindings to higher-level languages like Python and Ruby are also available.

Vector.dev

A high-performance observability data pipeline for pulling system data (logs, metadata).

Vector is an end-to-end observability data pipeline that puts you in control. It offers high-performance collection, transformation, and routing of your logs, metrics, and traces to any vendor you choose. Vector helps reduce costs, enrich data, and ensure data security. It is open source, and the project describes it as up to 10 times faster than alternatives.

Vector can be deployed in several topologies to collect and forward data: distributed, centralized, or stream-based.

To get started, follow the Vector quickstart guide or install Vector.

Create a configuration file called vector.toml with the following information to help Vector understand how to collect, transform, and sink data.

[sources.generate_syslog]
type = "demo_logs"
format = "syslog"
count = 100

[transforms.remap_syslog]
inputs = ["generate_syslog"]
type = "remap"
source = '''
  structured = parse_syslog!(.message)
  . = merge(., structured)
'''

[sinks.emit_syslog]
inputs = ["remap_syslog"]
type = "console"
encoding.codec = "json"
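
The remap transform in this configuration does two things: it parses the `message` field into structured fields, then merges those fields into the event. The std-only sketch below imitates that shape. `Event`, `parse_pairs`, `merge`, and `remap` are hypothetical names for illustration, and `parse_pairs` is only a stand-in that splits `key=value` pairs; it is not syslog parsing and not Vector's implementation. The sketch assumes that on a key clash the parsed value replaces the existing one.

```rust
use std::collections::BTreeMap;

/// An event modeled as a flat map of string fields (an assumption for this sketch).
type Event = BTreeMap<String, String>;

/// Stand-in parser: splits "key=value" pairs separated by whitespace.
/// It only plays the role that `parse_syslog!` has in the config above.
fn parse_pairs(message: &str) -> Event {
    message
        .split_whitespace()
        .filter_map(|pair| pair.split_once('='))
        .map(|(key, value)| (key.to_string(), value.to_string()))
        .collect()
}

/// Overlay `structured` onto `event`; on a key clash the parsed value wins.
fn merge(mut event: Event, structured: Event) -> Event {
    event.extend(structured);
    event
}

/// The transform step: parse the `message` field, then merge the result in.
fn remap(event: Event) -> Event {
    let structured = event
        .get("message")
        .map(|message| parse_pairs(message))
        .unwrap_or_default();
    merge(event, structured)
}

fn main() {
    let mut event = Event::new();
    event.insert("message".to_string(), "host=web-1 severity=info".to_string());
    event.insert("host".to_string(), "unknown".to_string());

    let out = remap(event);
    println!("{out:?}");
}
```

The original `message` field is kept alongside the parsed fields, so downstream sinks can still see the raw line.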

ROAPI

ROAPI automatically spins up read-only APIs for static datasets without requiring you to write a single line of code. It builds on top of Apache Arrow and DataFusion. The core of its design can be boiled down to the following:

  • Query frontends to translate SQL, GraphQL and REST API queries into DataFusion plans.
  • DataFusion for query plan execution.
  • Data layer to load datasets from a variety of sources and formats with automatic schema inference.
  • Response encoding layer to serialize intermediate Arrow record batches into the formats requested by the client.

For example, to spin up APIs for test_data/uk_cities_with_headers.csv and test_data/spacex_launches.json:

roapi \
  --table "uk_cities=test_data/uk_cities_with_headers.csv" \
  --table "test_data/spacex_launches.json"

After that, we can query tables using SQL, GraphQL or REST:

curl -X POST -d "SELECT city, lat, lng FROM uk_cities LIMIT 2" localhost:8080/api/sql
curl -X POST -d "query { uk_cities(limit: 2) {city, lat, lng} }" localhost:8080/api/graphql
curl "localhost:8080/api/tables/uk_cities?columns=city,lat,lng&limit=2"
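
The REST request above asks for the same data as the SQL request. As a rough illustration of what a query frontend does, the sketch below turns a REST-style query string into the equivalent SQL text. `rest_to_sql` is a hypothetical helper written for this post; ROAPI's real frontends produce DataFusion plans rather than SQL strings, and they handle many more parameters than the two shown here.

```rust
/// Translate a REST-style query string for a table into a SQL statement.
/// Hypothetical helper for illustration only. It interpolates identifiers
/// without validation, so real code would need to check them first.
fn rest_to_sql(table: &str, query: &str) -> String {
    let mut columns = String::from("*");
    let mut limit: Option<u64> = None;

    for pair in query.split('&') {
        match pair.split_once('=') {
            Some(("columns", value)) if !value.is_empty() => {
                columns = value.split(',').collect::<Vec<_>>().join(", ");
            }
            Some(("limit", value)) => limit = value.parse().ok(),
            _ => {} // ignore unknown or malformed parameters
        }
    }

    let mut sql = format!("SELECT {columns} FROM {table}");
    if let Some(n) = limit {
        sql.push_str(&format!(" LIMIT {n}"));
    }
    sql
}

fn main() {
    let sql = rest_to_sql("uk_cities", "columns=city,lat,lng&limit=2");
    println!("{sql}");
}
```

Keeping translation separate from execution is what lets one engine serve SQL, GraphQL, and REST clients from the same datasets.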

Cube

Cube is a headless business intelligence platform. It enables data engineers and application developers to access data from modern data stores, organize it into consistent definitions, and deliver it to any application.

Cube is designed to work with all SQL-enabled data sources, such as cloud data warehouses like Snowflake or Google BigQuery, query engines like Presto or Amazon Athena, and application databases like Postgres. It has a built-in relational caching engine that provides sub-second latency and high concurrency for API requests.

Databend

Databend (https://databend.rs) is an open-source, elastic, and workload-aware modern cloud data warehouse that focuses on low cost and low complexity for your massive-scale analytics needs.

Databend uses the latest techniques in vectorized query processing to allow you to do blazing-fast data analytics on object storage: S3, Azure Blob, Google Cloud Storage, Alibaba Cloud OSS, Tencent Cloud COS, Huawei Cloud OBS, Cloudflare R2, Wasabi, or MinIO.

[Figure: Databend architecture]

SurrealDB

SurrealDB is a cloud-native database for web, mobile, serverless, Jamstack, backend, and traditional apps. It simplifies your database and API stack, removing server-side components and allowing you to build secure, performant apps faster and cheaper. Features include SQL querying, GraphQL, ACID transactions, WebSocket connections, structured and unstructured data, graph querying, full-text indexing, geospatial querying, and row-by-row permissions access. It can run as a single server or in distributed mode.

See the SurrealDB website for the features, the latest releases, the product roadmap, and documentation.

GreptimeDB

GreptimeDB is an open-source, next-generation hybrid timeseries/analytics processing database in the cloud. It is designed to provide scalability, analytics, and efficiency in modern cloud infrastructure, offering users the advantages of elasticity and cost-effective storage.

Conclusion

There are many more options such as Meilisearch, Tantivy, PRQL, Dozer, Neon, etc. These may be relatively new, but you should give them a try.

At first glance, Rust may seem like an over-engineered choice for data engineering. However, its features may make it the ideal choice for some data engineering tasks. Rust offers fast computation, low-latency responses, and a high level of safety and security, along with a growing set of libraries and frameworks for data analysis, data visualization, and machine learning. As its popularity grows, Rust is becoming an attractive option for data engineers.