======Datasets======

In the interest of space and clarity, not every available dataset is listed in the subcategories below. The Speech Resources Consortium page, for example, lists dozens of corpora, and the Japan Data Catalog for the Humanities and Social Sciences had nearly 8,000 open-access datasets as of June 2022. Please refer to their individual pages for up-to-date information on available datasets.

=====Repositories and Portals=====

  * [[http://codh.rois.ac.jp/dataset/|Center for Open Data in the Humanities (人文学オープンデータ共同利用センター)]]
  * [[https://jdcat.jsps.go.jp/|Japan Data Catalog for the Humanities and Social Sciences (人文学・社会科学総合データカタログ)]]
  * [[https://clrd.ninjal.ac.jp/en/|National Institute for Japanese Language and Linguistics Corpora List (国立国語研究所コーパスリスト)]]
  * [[https://research.nii.ac.jp/src/list.html|Speech Resources Consortium (音声資源コンソーシアム) Corpus List]]

=====Text Data=====

  * [[http://codh.rois.ac.jp/pmjt/|Dataset of Premodern Japanese Text]] (text and page images)
  * [[http://codh.rois.ac.jp/edo-cooking/|Dataset of Edo Cooking Recipes]] (text and page images)
  * [[https://www.e-stat.go.jp/stat-search/database?page=1|eStat]] (statistical information from government agencies)
  * [[http://www.stat.go.jp/data/chouki/mokuji.html|Statistics Japan]] (statistical information from the Statistics Bureau)
  * [[http://vsarpj.orinst.ox.ac.uk/corpus/texts.html|Oxford Corpus of Old Japanese]]
  * [[http://jti.lib.virginia.edu/japanese/|Japanese Text Initiative (University of Virginia)]]
  * [[https://clrd.ninjal.ac.jp/bccwj/en/index.html|NINJAL - Balanced Corpus of Contemporary Written Japanese (BCCWJ)]]
  * [[https://www2.ninjal.ac.jp/cojads/index.html|NINJAL - Corpus of Japanese Dialects (COJADS)]]
  * [[https://clrd.ninjal.ac.jp/csj/en/index.html|NINJAL - Corpus of Spontaneous Japanese (CSJ)]]
  * [[https://www2.ninjal.ac.jp/conversation/cejc.html|NINJAL - Corpus of Everyday Japanese Conversation (CEJC)]]
  * [[https://www2.ninjal.ac.jp/jll/lsaj/|NINJAL - International Corpus of Japanese as a Second Language (I-JAS)]]
  * [[https://mmsrv.ninjal.ac.jp/nucc/|NINJAL - Nagoya University Conversation Corpus (NUCC)]]
  * [[https://www2.ninjal.ac.jp/conversation/shokuba.html|Gen-Nichi-Ken Corpus of Workplace Conversation (CWPC)]]
  * [[https://masayu-a.github.io/NWJC/|NINJAL Web Japanese Corpus (NWJC)]]
  * [[https://clrd.ninjal.ac.jp/cmj/index.html|Corpus of Modern Japanese (CMJ)]]
  * [[https://masayu-a.github.io/anno/|NINJAL - Annotation Data (Anno)]]
  * [[https://www2.ninjal.ac.jp/conversation/showaCorpus/|NINJAL - Showa Speech Corpus (SSC)]]
  * [[https://clrd.ninjal.ac.jp/chj/overview-en.html|NINJAL - Corpus of Historical Japanese (CHJ)]]
  * [[https://www.nii.ac.jp/dsc/idr/en/yahoo/|NII - Yahoo! Datasets]]
  * [[https://www.nii.ac.jp/dsc/idr/en/rakuten/|NII - Rakuten Datasets]]
  * [[https://www.nii.ac.jp/dsc/idr/en/rdata/Hazumi/|NII - Osaka University Multimodal Dialogue Corpus (Hazumi)]]
  * [[https://www.nii.ac.jp/dsc/idr/rdata/Ritsumei-ARC/|Ritsumeikan ARC Ukiyo-e Database]]
  * [[https://www.nii.ac.jp/dsc/idr/jast/|JAST Medical Dataset]]
  * [[https://www.nii.ac.jp/dsc/idr/athome/|At Home Co. Ltd. Real Estate Dataset]]
  * [[https://www.nii.ac.jp/dsc/idr/bengo4/|Bengo4.com Lawyer Dataset]]
  * [[https://www.nii.ac.jp/dsc/idr/diet-review/|Diet Products Dataset]]
  * [[https://www.nii.ac.jp/dsc/idr/oricon/|Oricon Customer Satisfaction Dataset]]
  * [[https://www.nii.ac.jp/dsc/idr/intage/|INTAGE Retail Dataset]]
  * [[https://www.nii.ac.jp/dsc/idr/fuman/|Insight Tech Co. Ltd. Dissatisfaction Inquiry Dataset]]
  * [[https://www.nii.ac.jp/dsc/idr/recruit/|Recruit Co. Ltd. Hot Pepper Beauty Dataset]]
  * [[https://www.nii.ac.jp/dsc/idr/nico/|Nico Nico Video Comment Dataset]]
  * [[https://lapis.nichibun.ac.jp/haikai/menu.html|Nichibun Haikai Database]] (text in HTML)
  * [[https://lapis.nichibun.ac.jp/renga/menu.html|Nichibun Renga Database]] (text in HTML)
  * [[https://lapis.nichibun.ac.jp/waka/menu.html|Nichibun Waka Database]] (text in HTML)
  * [[https://dataverse.harvard.edu/dataverse/amycatalinac|Electoral Datasets]] (Amy Catalinac, Harvard Dataverse Repository)

=====OCR Training=====

  * [[http://codh.rois.ac.jp/char-shape/|Kuzushiji Dataset]]
  * [[http://codh.rois.ac.jp/kmnist/|KMNIST Dataset]] (kuzushiji)
  * [[http://codh.rois.ac.jp/modern-magazine/|Dataset of Modern Magazines]] (includes 東洋学芸雑誌, 国民之友, 明六雑誌)
  * [[http://codh.rois.ac.jp/tensho/|Seal Script Dataset]]

=====Maps/GIS=====

  * [[http://geoshape.ex.nii.ac.jp/|Geoshape Repository (Geoshapeリポジトリ)]]
  * [[https://www.gsi.go.jp/ENGLISH/index.html|Geospatial Information Authority of Japan]]
  * [[https://sites.fas.harvard.edu/~chgis/data/japan/|Japan Historical Map Data (Harvard)]]
  * [[https://www.jodc.go.jp/aboutJODC_work_data.html|Japan Oceanographic Data Center (JODC)]]
  * [[http://www.gsi.go.jp/kankyochiri/gm_japan_e.html|Global Map Japan]]
  * [[http://www.gsi.go.jp/kankyochiri/eodas_index_e.html|Environmental Monitoring of Japan]]
  * [[https://www.e-stat.go.jp/gis|eStat GIS data (jSTAT MAP)]]
  * [[https://www.nihu.jp/ja/publication/source_map|Historical Place Name Data (歴史地名データ)]]
  * [[http://codh.rois.ac.jp/edo-shops/|Edo Shopping Guide]]
  * [[http://codh.rois.ac.jp/edo-maps/|Edo Maps Beta]]
  * [[http://codh.rois.ac.jp/edo-spots/|Edo Sightseeing Guide]]

=====Stopwords=====

**[[https://github.com/stopwords-iso/stopwords-ja/blob/master/stopwords-ja.txt|Common Stopwords for Japanese]]** [[https://github.com/stopwords-iso|Stopwords ISO]]

=====Image & Video Data=====

  * [[http://codh.rois.ac.jp/face/|Collection of Facial Expressions]]
  * [[http://codh.rois.ac.jp/ukiyo-e/|Ukiyo-e Face Image Dataset]]
  * [[https://www.nii.ac.jp/dsc/idr/en/yahoo/|NII - LIFULL HOME'S Dataset]]
  * [[https://www.nii.ac.jp/dsc/idr/video/video.html|Video Data]]
  * [[https://www.nii.ac.jp/dsc/idr/rdata/NII-GC/|NII - Grand Challenge Conversation Corpus]]
  * [[https://www.nii.ac.jp/dsc/idr/rdata/KoSign/|Kokugakuin University Japanese Sign Language Database (KoSign)]]
  * [[https://www.nii.ac.jp/dsc/idr/rdata/Hazumi/|Osaka University Multimodal Dialogue Corpus (Hazumi)]]
  * [[https://www.nii.ac.jp/dsc/idr/rdata/TDU-NEDO/|Group Communication Corpus (TDU-NEDO)]]
  * [[https://www.nii.ac.jp/dsc/idr/trigger/|Trigger Co. Ltd. Animation Dataset]]
  * [[https://www.nii.ac.jp/dsc/idr/sansan/|Sansan Business Card Dataset]]

=====IIIF=====

  * [[http://bauddha.dhii.jp/SAT/iiifmani/show.php|IIIF Manifests for Buddhist Studies]]