Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
Updated Oct 8, 2024 - Python
The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Converts a whole subdirectory with a big (or small) volume of PDF documents to a dataset (pandas DataFrame) with error tracking and choice of features
An open-source tool for reading OvertureMaps data with multiprocessing and additional Quality-of-Life features
db2ixf is a Python package with a CLI that simplifies the parsing and processing of IBM Integration eXchange Format (IXF) files.
Seamlessly switch the pandas DataFrame backend to PyArrow.
High-speed time-series database for pandas DataFrames
A loose implementation of the Delta Lake protocol, written in Python on top of pyarrow, focused on extensibility, customizability, and distributed data.
Python scripts to process and analyze log files using PySpark.
Poor man's data lake - a simple API to efficiently query your Parquet datasets using DuckDB or Polars
A bioinformatics extension of 🤗 Datasets library, built for ML applications on biological and omics data, offering easy integration of metadata and low-code data management tools.
Code examples / snippets for website news post