Dataset Torrent
Dataset Card for AO3
A dataset of works from Archive of Our Own (AO3), distributed as compressed data archives.
Dataset Summary
This dataset contains approximately 12.6 million publicly available works from AO3. It was created by processing work IDs 1 through 63,200,000 and keeping the works that are publicly accessible. Each entry contains the full text of the work along with metadata including title, author, fandom, relationships, characters, tags, warnings, and other classification information.
Languages
The dataset is multilingual, with works in many different languages, though English is predominant.
Dataset Structure
Data Files
The dataset is stored as Zstandard-compressed JSONL files (.jsonl.zst), with each archive covering a range of 100,000 sequential work IDs. For example, ao3_40500001-40600000.jsonl.zst contains works with IDs from 40,500,001 through 40,600,000.
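As a minimal sketch of how the file layout above could be used, the following Python maps a work ID to the archive expected to contain it and streams records from one archive. It assumes every archive follows the naming pattern of the example filename (no zero padding, ranges starting at 1) and that each line is a single JSON object; the helper names archive_name_for_id and iter_records are hypothetical, and reading requires the third-party zstandard package.

```python
import io
import json

CHUNK_SIZE = 100_000  # IDs per archive, as stated in the card


def archive_name_for_id(work_id: int) -> str:
    """Return the archive filename expected to hold the given work ID.

    Assumption: all archives follow the pattern of the example
    ao3_40500001-40600000.jsonl.zst.
    """
    if work_id < 1:
        raise ValueError("work IDs start at 1")
    start = ((work_id - 1) // CHUNK_SIZE) * CHUNK_SIZE + 1
    end = start + CHUNK_SIZE - 1
    return f"ao3_{start}-{end}.jsonl.zst"


def iter_records(path):
    """Yield one parsed JSON object per non-empty line of a .jsonl.zst archive."""
    import zstandard  # third-party: pip install zstandard

    with open(path, "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        # Wrap the binary stream so it can be iterated line by line as UTF-8.
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            if line.strip():
                yield json.loads(line)
```

Streaming line by line avoids holding a whole decompressed archive in memory. Because only publicly accessible works are included, a given ID may be absent from the archive that covers its range.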
Data Fields
Each record has four fields: id, title, metadata, and text. The metadata field covers archive warnings, category, characters, fandom, language, rating, relationships, series, author, chapters, completion status, publication date, and word count.
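The sketch below shows one way a single JSONL line might be unpacked into the fields listed above. It assumes metadata is a nested JSON object; the card does not give the exact key names inside it, so the keys in the sample record ("language", "rating") and the sample values are purely illustrative, and parse_record is a hypothetical helper.

```python
import json


def parse_record(line: str):
    """Split one JSONL line into (id, title, metadata, text).

    Uses .get() so a missing field yields a default instead of raising.
    """
    record = json.loads(line)
    metadata = record.get("metadata") or {}
    return record.get("id"), record.get("title"), metadata, record.get("text", "")


# Illustrative record only: the metadata key names are assumptions,
# not taken from the dataset itself.
sample_line = json.dumps({
    "id": 123,
    "title": "Example Work",
    "metadata": {"language": "English", "rating": "General Audiences"},
    "text": "Once upon a time.",
})

work_id, title, metadata, text = parse_record(sample_line)
print(work_id, title, sorted(metadata), len(text))
```

Inspecting the keys of metadata on a few real records is a reasonable first step before relying on any particular key name.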
Data Splits
All examples are in a single split.