The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Updated Oct 23, 2024 - Python
Refine high-quality datasets and visual AI models
A Doctor for your data
The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact.
Interactively explore unstructured datasets from your dataframe.
Resources for Data Centric AI
A curated, but incomplete, list of data-centric AI resources.
Automatically find issues in image datasets and practice data-centric computer vision.
Curated list of open source tooling for data-centric AI on unstructured data.
Lab assignments for Introduction to Data-Centric AI, MIT IAP 2024 👩🏽‍💻
[NeurIPS 2021] WRENCH: Weak supeRvision bENCHmark
Official Repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale"
[NeurIPS 2023] This is the code for the paper `Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias`.
pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
Introduction to Data-Centric AI, MIT IAP 2023 🤖
OpenDataVal: a Unified Benchmark for Data Valuation in Python (NeurIPS 2023)
Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning
nbsynthetic is a simple and robust tabular synthetic data generation library for small and medium-sized datasets
[ECCV 2022] Official Implementation for Unsupervised Selective Labeling for More Effective Semi-Supervised Learning
A Data Centric NER annotation tool for your Named Entity Recognition projects