Data - AI Algorithm Control Center

👁️ COMPUTER VISION

ImageNet-22K

Large-scale hierarchical image database with 14.2M images across 21,841 classes. Pre-training dataset for vision transformers and CNNs.

IMAGES 14.2M

CLASSES 21,841

SIZE 1.3 TB

RESOLUTION 224x224 avg

STATUS LOADED

USAGE 87%

COCO-2017

Common Objects in Context dataset for object detection, segmentation, and captioning. 330K images with detailed annotations.

IMAGES 330K

OBJECTS 80 classes

SIZE 25 GB

ANNOTATIONS 2.5M

STATUS ACTIVE

USAGE 92%

CelebA-HQ

High-quality celebrity faces dataset for face generation, editing, and attribute prediction. 30K high-resolution images.

IMAGES 30K

RESOLUTION 1024x1024

SIZE 2.9 GB

ATTRIBUTES 40 labels

STATUS IDLE

USAGE 35%

🧠 NATURAL LANGUAGE PROCESSING

The Pile

800 GB of diverse text data for language modeling. Includes academic papers, web text, books, and specialized domains.

SIZE 800 GB

TOKENS 400B

DOMAINS 22 sources

LANGUAGES English

STATUS TRAINING

USAGE 76%

Common Crawl

Web-scale corpus drawn from the Common Crawl archives, filtered and processed for high-quality language model training.

SIZE 570 GB

PAGES 250M

LANGUAGES 100+

QUALITY Filtered

STATUS ACTIVE

USAGE 89%

GLUE Benchmark

General Language Understanding Evaluation benchmark for natural language understanding tasks.

TASKS 9 NLU

SAMPLES 130K

SIZE 250 MB

METRICS F1, Accuracy

STATUS LOADED

USAGE 45%

🔗 MULTIMODAL

LAION-5B

Large-scale image-text dataset with 5.85B CLIP-filtered image-text pairs for training multimodal models.

PAIRS 5.85B

SIZE 240 TB

LANGUAGES 100+

FILTER CLIP Score

STATUS STREAMING

USAGE 67%

Conceptual Captions

Large dataset of images paired with descriptive captions for vision-language understanding.

IMAGES 15M

CAPTIONS 15M

SIZE 350 GB

QUALITY Human-curated

STATUS ACTIVE

USAGE 82%

🎵 AUDIO & SPEECH

LibriSpeech

Large corpus of read English speech for automatic speech recognition. 1,000 hours of 16 kHz audio.

HOURS 1000h

SPEAKERS 2,484

SIZE 60 GB

SAMPLING 16 kHz

STATUS LOADED

USAGE 71%
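As a rough sanity check on the LibriSpeech figures, the sketch below estimates the uncompressed size of 1,000 hours of 16 kHz audio. It assumes 16-bit mono PCM, which this page does not state; the helper name `pcm_size_gb` is hypothetical.

```python
def pcm_size_gb(hours, sample_rate_hz, bytes_per_sample=2, channels=1):
    """Estimate uncompressed PCM size in decimal gigabytes.

    bytes_per_sample=2 (16-bit) and channels=1 (mono) are assumptions,
    not values taken from the dataset card.
    """
    seconds = hours * 3600
    return seconds * sample_rate_hz * bytes_per_sample * channels / 1e9


# 1,000 hours at 16 kHz, as listed on the card
raw_gb = pcm_size_gb(1000, 16_000)
print(f"Uncompressed estimate: {raw_gb:.1f} GB")
```

Under these assumptions the raw audio comes to roughly twice the listed 60 GB, which would be consistent with the corpus being stored in a losslessly compressed format such as FLAC.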

AudioSet

Large-scale dataset of manually annotated audio events. 2M 10-second audio clips from YouTube videos.

CLIPS 2M

CLASSES 527 audio

DURATION 5,800h

SIZE 230 GB

STATUS IDLE

USAGE 28%
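The AudioSet clip count and duration can be cross-checked with simple arithmetic. This is a minimal sketch that assumes every clip is exactly 10 seconds long; the function names are hypothetical.

```python
def total_hours(num_clips, clip_seconds):
    """Total audio duration in hours for equally sized clips."""
    return num_clips * clip_seconds / 3600


def clips_for_hours(hours, clip_seconds):
    """Number of equally sized clips needed to reach a given duration."""
    return hours * 3600 / clip_seconds


# Figures from the card: 2M clips of 10 seconds, 5,800 hours listed
est_hours = total_hours(2_000_000, 10)
implied_clips = clips_for_hours(5800, 10)
print(f"2M clips x 10 s = {est_hours:.0f} h")
print(f"5,800 h implies about {implied_clips:,.0f} clips")
```

The estimate lands slightly below the listed 5,800h, which suggests the "2M" clip count is a rounded figure rather than an exact one.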

📈 TIME SERIES & SCIENTIFIC

UCI Time Series Archive

Collection of time series datasets for forecasting and anomaly detection. Multivariate sensor data streams.

DATASETS 128

SAMPLES 50M

SIZE 15 GB

FREQUENCY 1Hz - 1kHz

STATUS ACTIVE

USAGE 63%

Protein Data Bank

3D molecular structures for protein folding prediction and drug discovery. AlphaFold integration available.

STRUCTURES 200K

PROTEINS 190K

SIZE 750 GB

FORMAT PDB, mmCIF

STATUS TRAINING

USAGE 91%
