Datasets

In the interest of space and clarity, not every available dataset is listed in the subcategories below. The Speech Resources Consortium page, for example, hosts dozens of corpora, and the Japan Data Catalog for the Humanities and Social Sciences listed nearly 8,000 open-access datasets as of June 2022. Please refer to the individual pages for up-to-date information on available datasets.

Repositories and Portals

Text Data

Dataset of Premodern Japanese Text (text and page images)
Dataset of Edo Cooking Recipes (text and page images)
eStat (statistical information from government agencies)
Statistics Japan (statistical information from the Statistics Bureau)
Oxford Corpus of Old Japanese
Japanese Text Initiative (University of Virginia)
NINJAL - Balanced Corpus of Contemporary Written Japanese (BCCWJ)
NINJAL - Corpus of Japanese Dialects (COJADS)
NINJAL - Corpus of Spontaneous Japanese (CSJ)
NINJAL - Corpus of Everyday Japanese Conversation (CEJC) (https://www2.ninjal.ac.jp/conversation/cejc.html)
NINJAL - International Corpus of Japanese as a Second Language (I-JAS)
NINJAL - Nagoya University Conversation Corpus (NUCC)
Gen-Nichi-Ken Corpus of Workplace Conversation (CWPC)
NINJAL Web Japanese Corpus (NWJC)
Corpus of Modern Japanese (CMJ)
NINJAL - Annotation Data (Anno)
NINJAL - Showa Speech Corpus (SSC)
NINJAL - Corpus of Historical Japanese (CHJ)
NII - Yahoo! Datasets
NII - Rakuten Datasets
NII - Osaka University Multimodal Dialogue Corpus (Hazumi)
Ritsumeikan ARC Ukiyo-e Database
JAST Medical Dataset
At Home Co. Ltd. Real Estate Dataset
Bengo4.com Lawyer Dataset
Diet products Dataset
Oricon customer satisfaction Dataset
INTAGE retail Dataset
Insight Tech Co. Ltd. Dissatisfaction Inquiry Dataset
Recruit Co. Ltd. Hot Pepper Beauty Dataset
Nico Nico video comment Dataset
Nichibun Haikai Database (text in HTML)
Nichibun Renga Database (text in HTML)
Nichibun Waka Database (text in HTML)
Electoral Datasets (Amy Catalinac, Harvard Dataverse Repository)

OCR Training

Kuzushiji Dataset
KMNIST Dataset (kuzushiji)
Dataset of Modern Magazines (includes 東洋学芸雑誌, 国民之友, 明六雑誌)
Seal Script Dataset
