AI TRAINING DATASETS
👁️ COMPUTER VISION
ImageNet-22K
Large-scale image database with 14.2M images across 21,841 classes, organized by the WordNet hierarchy. Widely used to pre-train vision transformers and CNNs.
COCO-2017
Common Objects in Context dataset for object detection, segmentation, and captioning. 330K images annotated with bounding boxes, instance segmentation masks, and captions.
CelebA-HQ
High-quality celebrity faces dataset for face generation, editing, and attribute prediction. 30K high-resolution images.
🧠 NATURAL LANGUAGE PROCESSING
The Pile
Roughly 800 GB (825 GiB) of diverse text data for language modeling. Includes academic papers, web text, books, and specialized domains.
Common Crawl
Web-scale text corpus derived from the Common Crawl web archives, filtered and processed to retain high-quality text for language model training.
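The entry does not say how the filtering is done. As a rough illustration only, a minimal sketch of the kind of step involved might combine exact deduplication with simple quality heuristics. The function names, the thresholds (`min_words`, `min_alpha_ratio`), and the heuristics themselves are assumptions made for this example, not the pipeline behind this dataset.

```python
import hashlib


def keep_document(text, min_words=5, min_alpha_ratio=0.6):
    """Return True if text passes two illustrative quality heuristics.

    Both thresholds are hypothetical defaults chosen for this sketch.
    """
    words = text.split()
    if len(words) < min_words:
        return False
    non_space = [c for c in text if not c.isspace()]
    if not non_space:
        return False
    # Share of non-whitespace characters that are letters.
    alpha = sum(c.isalpha() for c in non_space)
    return alpha / len(non_space) >= min_alpha_ratio


def filter_corpus(docs, **kwargs):
    """Drop exact duplicates, then drop documents that fail keep_document."""
    seen = set()
    kept = []
    for doc in docs:
        # Normalize whitespace and case so trivially different copies collide.
        normalized = " ".join(doc.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        if keep_document(doc, **kwargs):
            kept.append(doc)
    return kept
```

Production pipelines typically add language identification, near-duplicate detection, and learned quality classifiers on top of checks like these.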
GLUE Benchmark
General Language Understanding Evaluation benchmark: a collection of nine sentence-level and sentence-pair tasks for evaluating natural language understanding models.
🔗 MULTIMODAL
LAION-5B
Large-scale image-text dataset with 5.85B CLIP-filtered image-text pairs for training multimodal models.
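"CLIP-filtered" here means that pairs are kept only when the CLIP embeddings of the image and its text are similar enough. A minimal sketch of that idea, assuming the embeddings have already been computed and are stored as plain lists of floats; the dictionary keys and the `threshold=0.28` default are assumptions for illustration, so check the LAION documentation for the values actually used.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(pairs, threshold=0.28):
    """Keep pairs whose image and text embeddings are similar enough.

    Each pair is a dict with hypothetical keys "image_emb" and "text_emb".
    The default threshold is an illustrative assumption.
    """
    return [
        p for p in pairs
        if cosine_similarity(p["image_emb"], p["text_emb"]) >= threshold
    ]
```

In practice the embeddings would come from a CLIP model's image and text encoders, and the comparison would be batched on an accelerator rather than run pair by pair.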
Conceptual Captions
Large dataset of images paired with descriptive captions for vision-language understanding.
🎵 AUDIO & SPEECH
LibriSpeech
Large corpus of read English speech for automatic speech recognition. About 1,000 hours of audio sampled at 16 kHz.
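To get a feel for the scale, the figures above can be turned into a sample count and a rough uncompressed size. This is a back-of-the-envelope sketch: the 16-bit (2 bytes per sample), single-channel PCM layout is an assumption, and the distributed files are compressed, so the actual download is smaller.

```python
def audio_footprint(hours, sample_rate_hz, bytes_per_sample=2, channels=1):
    """Return (total_samples, total_bytes) for uncompressed PCM audio.

    bytes_per_sample=2 and channels=1 are assumptions (16-bit mono).
    """
    seconds = hours * 3600
    samples = seconds * sample_rate_hz * channels
    return samples, samples * bytes_per_sample


samples, size_bytes = audio_footprint(1000, 16_000)
print(f"{samples:,} samples, {size_bytes / 1e9:.1f} GB uncompressed")
# prints: 57,600,000,000 samples, 115.2 GB uncompressed
```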
AudioSet
Large-scale dataset of manually annotated audio events. About 2M human-labeled 10-second clips drawn from YouTube videos.
📈 TIME SERIES & SCIENTIFIC
UCI Time Series Archive
Collection of time series datasets for forecasting and anomaly detection. Multivariate sensor data streams.
Protein Data Bank
3D molecular structures for protein folding prediction and drug discovery. AlphaFold integration available.