AI TRAINING DATASETS

👁️ COMPUTER VISION

ImageNet-22K

Large-scale hierarchical image database with 14.2M images across 21,841 classes. Pre-training dataset for vision transformers and CNNs.

IMAGES 14.2M
CLASSES 21,841
SIZE 1.3 TB
RESOLUTION 224x224 avg
STATUS LOADED
USAGE 87%

COCO-2017

Common Objects in Context dataset for object detection, segmentation, and captioning. 330K images with detailed annotations.

IMAGES 330K
OBJECTS 80 classes
SIZE 25 GB
ANNOTATIONS 2.5M
STATUS ACTIVE
USAGE 92%
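
COCO annotations ship as JSON with top-level `images`, `annotations`, and `categories` lists; each annotation points at its image through `image_id` and stores `bbox` as `[x, y, width, height]`. A minimal sketch of counting annotations per image, assuming a tiny hand-written sample in place of the real annotation file (the helper names are illustrative, not part of any COCO tooling):

```python
from collections import Counter

# Hand-written sample in the COCO layout; all values are made up for illustration.
sample = {
    "images": [{"id": 1, "file_name": "a.jpg"}, {"id": 2, "file_name": "b.jpg"}],
    "categories": [{"id": 1, "name": "person"}, {"id": 18, "name": "dog"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 18, "bbox": [10.0, 20.0, 50.0, 40.0]},
        {"id": 11, "image_id": 1, "category_id": 1, "bbox": [0.0, 0.0, 30.0, 80.0]},
        {"id": 12, "image_id": 2, "category_id": 1, "bbox": [5.0, 5.0, 20.0, 60.0]},
    ],
}


def annotations_per_image(coco: dict) -> Counter:
    """Map each image id to the number of annotations that reference it."""
    return Counter(ann["image_id"] for ann in coco["annotations"])


def mean_annotations_per_image(coco: dict) -> float:
    """Average annotation count over all listed images."""
    return len(coco["annotations"]) / len(coco["images"])


print(annotations_per_image(sample))
print(mean_annotations_per_image(sample))
```

Applied to the figures on this card, 2.5M annotations over 330K images works out to roughly 7.6 annotations per image.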

CelebA-HQ

High-quality celebrity faces dataset for face generation, editing, and attribute prediction. 30K high-resolution images.

IMAGES 30K
RESOLUTION 1024x1024
SIZE 2.9 GB
ATTRIBUTES 40 labels
STATUS IDLE
USAGE 35%

🧠 NATURAL LANGUAGE PROCESSING

The Pile

800 GB of diverse text data for language modeling. Includes academic papers, web text, books, and specialized domains.

SIZE 800 GB
TOKENS 400B
DOMAINS 22 sources
LANGUAGES English
STATUS TRAINING
USAGE 76%

Common Crawl

Web-scale text corpus drawn from Common Crawl archives, filtered and processed for language model training.

SIZE 570 GB
PAGES 250M
LANGUAGES 100+
QUALITY Filtered
STATUS ACTIVE
USAGE 89%

GLUE Benchmark

General Language Understanding Evaluation benchmark for natural language understanding tasks.

TASKS 9 NLU
SAMPLES 130K
SIZE 250 MB
METRICS F1, Accuracy
STATUS LOADED
USAGE 45%
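
The two metrics named on this card can be computed directly from gold labels and predictions. A minimal sketch for binary labels, assuming illustrative function names and toy data; it is not the official GLUE evaluation code, and some GLUE tasks report other metrics (for example Matthews correlation or Pearson/Spearman correlation):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, made up for illustration.
gold = [1, 0, 1, 1, 0, 1]
pred = [1, 0, 0, 1, 1, 1]
print(accuracy(gold, pred))
print(f1_score(gold, pred))
```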

🔗 MULTIMODAL

LAION-5B

Large-scale image-text dataset with 5.85B CLIP-filtered image-text pairs for training multimodal models.

PAIRS 5.85B
SIZE 240 TB
LANGUAGES 100+
FILTER CLIP Score
STATUS STREAMING
USAGE 67%
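
CLIP-score filtering keeps an image-text pair only when the cosine similarity between its CLIP image embedding and CLIP text embedding clears a threshold. A minimal sketch of that idea, assuming toy 2-D vectors in place of real CLIP embeddings and an illustrative threshold (the function names and the 0.3 cutoff are assumptions, not the LAION pipeline):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filter_pairs(pairs, threshold):
    """Return ids of pairs whose image/text similarity is at least `threshold`.

    `pairs` is an iterable of (pair_id, image_embedding, text_embedding).
    """
    return [
        pair_id
        for pair_id, image_emb, text_emb in pairs
        if cosine_similarity(image_emb, text_emb) >= threshold
    ]


# Toy embeddings, made up for illustration.
toy_pairs = [
    ("match", [1.0, 0.0], [0.9, 0.1]),      # nearly parallel vectors
    ("mismatch", [1.0, 0.0], [0.0, 1.0]),   # orthogonal vectors
]
print(filter_pairs(toy_pairs, threshold=0.3))
```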

Conceptual Captions

Large dataset of images paired with descriptive captions for vision-language understanding.

IMAGES 15M
CAPTIONS 15M
SIZE 350 GB
QUALITY Auto-filtered
STATUS ACTIVE
USAGE 82%

🎵 AUDIO & SPEECH

LibriSpeech

Large corpus of read English speech for automatic speech recognition. 1,000 hours of 16 kHz audio.

HOURS 1,000h
SPEAKERS 2,484
SIZE 60 GB
SAMPLING 16 kHz
STATUS LOADED
USAGE 71%
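
The hours and sampling rate on this card give a quick sanity check on the listed size. A minimal sketch, assuming 16-bit mono PCM (the bit depth and channel count are assumptions; the card lists only the sampling rate):

```python
def pcm_size_bytes(hours, sample_rate_hz, bytes_per_sample=2, channels=1):
    """Uncompressed PCM size for the given duration and audio format."""
    seconds = hours * 3600
    return seconds * sample_rate_hz * bytes_per_sample * channels


raw_bytes = pcm_size_bytes(hours=1000, sample_rate_hz=16_000)
print(f"{raw_bytes / 1e9:.1f} GB")  # prints 115.2 GB
```

Under those assumptions the raw audio would take about 115 GB, so the 60 GB listed here is consistent with losslessly compressed storage such as FLAC.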

AudioSet

Large-scale dataset of manually annotated audio events. 2M 10-second audio clips from YouTube videos.

CLIPS 2M
CLASSES 527 audio
DURATION 5,800h
SIZE 230 GB
STATUS IDLE
USAGE 28%
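
The clip count and duration on this card are both rounded, which a short calculation makes visible. A minimal sketch, assuming every clip is exactly 10 seconds long (helper names are illustrative):

```python
def total_hours(num_clips, clip_seconds):
    """Total audio duration in hours for equally long clips."""
    return num_clips * clip_seconds / 3600


def clips_for_hours(hours, clip_seconds):
    """Number of equally long clips needed to reach a total duration."""
    return hours * 3600 / clip_seconds


print(round(total_hours(2_000_000, 10)))  # prints 5556
print(clips_for_hours(5800, 10))
```

Exactly 2M clips would give about 5,556 hours; the listed 5,800h corresponds to roughly 2.09M ten-second clips.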

📈 TIME SERIES & SCIENTIFIC

UCI Time Series Archive

Collection of time series datasets for forecasting and anomaly detection. Multivariate sensor data streams.

DATASETS 128
SAMPLES 50M
SIZE 15 GB
FREQUENCY 1Hz - 1kHz
STATUS ACTIVE
USAGE 63%

Protein Data Bank

3D molecular structures for protein folding prediction and drug discovery. AlphaFold integration available.

STRUCTURES 200K
PROTEINS 190K
SIZE 750 GB
FORMAT PDB, mmCIF
STATUS TRAINING
USAGE 91%
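
Every card in this catalog carries the same STATUS and USAGE fields, so the catalog maps naturally onto a small record type. A minimal sketch, assuming a hypothetical `DatasetCard` class and a 50% cutoff chosen only for illustration; the entries are copied from a few of the cards above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetCard:
    name: str
    category: str
    status: str
    usage_pct: int  # the USAGE figure shown on the card


CARDS = [
    DatasetCard("ImageNet-22K", "Computer Vision", "LOADED", 87),
    DatasetCard("COCO-2017", "Computer Vision", "ACTIVE", 92),
    DatasetCard("CelebA-HQ", "Computer Vision", "IDLE", 35),
    DatasetCard("AudioSet", "Audio & Speech", "IDLE", 28),
    DatasetCard("Protein Data Bank", "Time Series & Scientific", "TRAINING", 91),
]


def underused(cards, max_usage_pct=50):
    """Cards whose usage is below the cutoff, lowest usage first."""
    selected = [card for card in cards if card.usage_pct < max_usage_pct]
    return sorted(selected, key=lambda card: card.usage_pct)


for card in underused(CARDS):
    print(card.name, card.status, f"{card.usage_pct}%")
```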