If you’ve ever used Python for analytics, you’ll no doubt have heard of Pandas, the incredibly popular and easy-to-use data analysis library. With the rise of Rust (and its eight-year-long love story with Stack Overflow developers), however, a newer and more modern DataFrame library is in development that utilises the Rust language to enable incredibly performant, multi-core analytics.
Polars is a high-performance DataFrame library implemented in Rust, and it can be used natively from Rust or via its Python wrapper. It is designed to handle large datasets with ease, providing a user-friendly interface for data manipulation and analysis. The library offers two modules: polars-core for the core functionality, and polars-io for input/output operations, allowing you to read and write data in various formats such as CSV, JSON, Parquet, Delta and more.
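For instance, moving between these formats is largely a matter of swapping the read/write call. Here is a minimal sketch, with 'data.csv' and 'data.parquet' as placeholder file names:
import polars as pl
# Read a CSV file into a DataFrame
df = pl.read_csv('data.csv')
# Write the same data back out as Parquet, then read it again
df.write_parquet('data.parquet')
df_parquet = pl.read_parquet('data.parquet')
# Recent releases can also read Delta tables via pl.read_delta (requires the deltalake package)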
Polars is powered by Rust and Apache Arrow. Rust is a systems programming language that prioritizes performance, reliability, and productivity. It offers strict memory safety without garbage collection (addressing a major issue with C and C++), making it ideal for high-performance applications. Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized, language-agnostic columnar memory format, which enables efficient data sharing across a variety of programming languages without costly and wasteful serialization overhead.
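To illustrate, here is a minimal sketch (assuming the pyarrow package is installed) of handing a Polars DataFrame to the Arrow ecosystem and back; in most cases the underlying Arrow buffers are shared rather than copied:
import polars as pl
df = pl.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# Convert to a pyarrow.Table; the Arrow memory is typically shared, not copied
arrow_table = df.to_arrow()
# Build a Polars DataFrame straight back from the Arrow data
df_again = pl.from_arrow(arrow_table)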
Polars has several advantages over other DataFrame libraries like Pandas and Dask:
Performance: Polars is built for speed. Thanks to Rust’s performance and the efficient memory representation of Apache Arrow, Polars can handle large datasets much faster than other libraries.
Memory Efficiency: Polars’ use of Apache Arrow’s columnar format results in high memory efficiency. This article shows an example of Polars using 1/10th of the memory that Pandas uses to handle a ten million row DataFrame!
Concurrency: Rust’s excellent support for concurrency allows Polars to execute operations in parallel, taking full advantage of multi-core processors.
Interoperability: Because Polars uses the Arrow columnar format, it can interoperate with other tools in the Arrow ecosystem.
Lazy Evaluation: Polars supports both eager and lazy execution; in lazy mode it builds and optimises a query plan before running it, skipping redundant processing (see the sketch after this list).
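As a rough illustration of the lazy pattern (a minimal sketch; 'sales.csv' and the column names are placeholders), a query is only described until collect() is called, at which point Polars optimises the plan and executes it in parallel:
import polars as pl
# scan_csv builds a lazy query rather than reading the file immediately
lazy_query = (
    pl.scan_csv('sales.csv')
    .filter(pl.col('amount') > 100)
    .group_by('region')              # spelled groupby in older Polars releases
    .agg(pl.col('amount').sum())
)
# Nothing has been read yet; collect() triggers the optimised execution
result = lazy_query.collect()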
In terms of syntax compatibility, Polars aligns nicely with Pandas (if using the Python wrapper). Here are some examples:
import pandas as pd
import polars as pl

# Creating a DataFrame
df_pandas = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6]
})
df_polars = pl.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6]
})
# Reading a CSV file
df_pandas = pd.read_csv('file.csv')
df_polars = pl.read_csv('file.csv')
# Filtering data
df_pandas_filtered = df_pandas[df_pandas['A'] > 1]
df_polars_filtered = df_polars.filter(pl.col('A') > 1)
# Grouping and Aggregation
df_pandas_agg = df_pandas.groupby('column_name').agg({'other_column': 'mean'})
df_polars_agg = df_polars.groupby('column_name').agg(pl.col('other_column').mean())
# Note: newer Polars releases spell the method group_by
Polars’ performance has been compared to that of other DataFrame libraries in various studies. In the benchmark tests conducted by H2O.ai, Polars demonstrated superior performance on aggregation tasks, completing the 50GB dataset aggregation in just 143 seconds; Pandas, in comparison, was unable to complete the task due to insufficient memory.
Polars has also been shown to be up to 10x faster than Pandas in other benchmark tests, highlighting its processing speed and its ability to handle large datasets efficiently.
In a comparison study titled "Polars vs Dask — Fighting on Parallel Computing", Polars and Dask were compared on small, medium, and large-sized datasets to see which library is faster. The results showed that Polars consistently demonstrated faster processing speed.
It’s important to note that while benchmarks can provide useful insights, the performance can vary based on the specific use case and data characteristics. Therefore, it’s always a good idea to conduct your own benchmarks that closely resemble your actual workloads.
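A simple way to do that is to time the same operation in both libraries on data shaped like your own. The sketch below uses synthetic data and a plain group-by mean; adjust the size and the operations to mirror your real workload:
import time
import numpy as np
import pandas as pd
import polars as pl
# Synthetic data standing in for a real workload
n = 10_000_000
keys = np.random.randint(0, 1_000, n)
values = np.random.rand(n)
df_pandas = pd.DataFrame({'key': keys, 'value': values})
df_polars = pl.DataFrame({'key': keys, 'value': values})
start = time.perf_counter()
df_pandas.groupby('key')['value'].mean()
print(f'pandas: {time.perf_counter() - start:.3f}s')
start = time.perf_counter()
df_polars.group_by('key').agg(pl.col('value').mean())   # groupby in older Polars releases
print(f'polars: {time.perf_counter() - start:.3f}s')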
Despite Polars’ impressive credentials, it isn’t always the best fit for your analytics workload. It is a single-node DataFrame library, and will certainly struggle to support distributed computation workloads that are typically handled by Spark or similar libraries (if it’s distributed compute you’re looking for, consider checking out Hydr8, our Databricks-powered solution accelerator!). It also can’t always outperform Pandas when working with data that fits comfortably into available memory, and it doesn’t offer the same level of integration with common Python visualisation libraries such as Matplotlib as Pandas does.
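A common workaround for that last point is to convert to Pandas only at the plotting step. Here is a minimal sketch (assuming matplotlib, pandas and pyarrow are installed; the column names are made up for illustration):
import matplotlib.pyplot as plt
import polars as pl
df_polars = pl.DataFrame({'month': [1, 2, 3], 'revenue': [10.0, 12.5, 9.8]})
# Convert to Pandas just for visualisation
df_polars.to_pandas().plot(x='month', y='revenue', kind='line')
plt.show()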
As we move towards a future where data is increasingly voluminous and complex, the need for efficient and robust data processing tools becomes ever more critical. Polars, with its high performance, memory efficiency, and interoperability, is well-positioned to play a significant role in this future.
The application of Rust in data analytics is a relatively recent development, but it’s quickly gaining momentum due to its performance and safety features. Polars is leading this trend, but there are other ongoing open source projects to expand Rust’s interaction with the analytics world such as delta-rs, a native Rust library for low-level interaction with Delta tables.
The active development and expanding community surrounding Polars indicate a promising future. Regular additions of new features and improvements, coupled with its rapidly growing popularity (evidenced by over 15,000,000 downloads and 21,000 stars on GitHub!), make Polars an exciting library to keep an eye on.
In conclusion, whether you’re a data scientist seeking a more efficient DataFrame library for large-scale analysis, a Rust enthusiast curious about its use in data analytics, or someone who just wants to give Polars a spin on a Kaggle dataset, it’s definitely worth a look!
Here are some handy links to get started with Polars:
Polars website: Polars
Polars Python documentation: API reference — Polars documentation
Polars Rust documentation: polars - Rust (docs.rs)