datasets

Differences

This shows you the differences between two versions of the page.

datasets [2022/06/07 16:14] prcurtis
datasets [2022/07/16 02:07] (current) – [Text Data] prcurtis
Line 1: Line 1:
 ======Datasets======
  
-Please note that in the interest of space and clarity not every dataset available will be listed in the subcategories below. The Speech Resources Consortium page, for example, provides dozens of corpora, as does the Japan Data Catalog for the Humanities and Social Sciences. Please refer to their pages for more updated information on available datasets.
+Please note that, in the interest of space and clarity, not every available dataset is listed in the subcategories below. The Speech Resources Consortium page, for example, provides dozens of corpora, as does the Japan Data Catalog for the Humanities and Social Sciences (which had nearly 8,000 open-access datasets as of June 2022). Please refer to their individual pages for more up-to-date information on available datasets.
  
 =====Repositories and Portals=====
Line 17: Line 17:
   * [[http://vsarpj.orinst.ox.ac.uk/corpus/texts.html|Oxford Corpus of Old Japanese]]
   * [[http://jti.lib.virginia.edu/japanese/|Japanese Text Initiative (University of Virginia)]]
-  * [[https://clrd.ninjal.ac.jp/bccwj/en/index.html|Balanced Corpus of Contemporary Written Japanese (BCCWJ)]] +  * [[https://clrd.ninjal.ac.jp/bccwj/en/index.html|NINJAL - Balanced Corpus of Contemporary Written Japanese (BCCWJ)]] 
-  * [[https://www2.ninjal.ac.jp/cojads/index.html|Corpus of Japanese Dialects (COJADS)]] +  * [[https://www2.ninjal.ac.jp/cojads/index.html|NINJAL - Corpus of Japanese Dialects (COJADS)]] 
-  * [[https://clrd.ninjal.ac.jp/csj/en/index.html|Corpus of Spontaneous Japanese (CSJ) +  * [[https://clrd.ninjal.ac.jp/csj/en/index.html|NINJAL - Corpus of Spontaneous Japanese (CSJ)]] 
-  * [[https://www2.ninjal.ac.jp/conversation/cejc.html|Corpus of Everyday Japanese Conversation (CEJC)]] +  * [[https://www2.ninjal.ac.jp/conversation/cejc.html|NINJAL - Corpus of Everyday Japanese Conversation (CEJC)]] 
-  * [[https://www2.ninjal.ac.jp/jll/lsaj/|International Corpus of Japanese as a Second Language (I-JAS)]] +  * [[https://www2.ninjal.ac.jp/jll/lsaj/|NINJAL - International Corpus of Japanese as a Second Language (I-JAS)]] 
-  * [[https://mmsrv.ninjal.ac.jp/nucc/|Nagoya University Conversation Corpus (NUCC)]]+  * [[https://mmsrv.ninjal.ac.jp/nucc/|NINJAL - Nagoya University Conversation Corpus (NUCC)]]
   * [[https://www2.ninjal.ac.jp/conversation/shokuba.html|Gen-Nichi-Ken Corpus of Workplace Conversation (CWPC)]]
   * [[https://masayu-a.github.io/NWJC/|NINJAL Web Japanese Corpus (NWJC)]]
   * [[https://clrd.ninjal.ac.jp/cmj/index.html|Corpus of Modern Japanese (CMJ)]]
-  * [[https://masayu-a.github.io/anno/|Annotation Data (Anno)]] +  * [[https://masayu-a.github.io/anno/|NINJAL - Annotation Data (Anno)]] 
-  * [[https://www2.ninjal.ac.jp/conversation/showaCorpus/|Showa Speech Corpus (SSC)]] +  * [[https://www2.ninjal.ac.jp/conversation/showaCorpus/|NINJAL - Showa Speech Corpus (SSC)]] 
-  * [[https://clrd.ninjal.ac.jp/chj/overview-en.html|Corpus of Historical Japanese (CHJ)]]+  * [[https://clrd.ninjal.ac.jp/chj/overview-en.html|NINJAL - Corpus of Historical Japanese (CHJ)]] 
 +  * [[https://www.nii.ac.jp/dsc/idr/en/yahoo/|NII - Yahoo! Datasets]] 
 +  * [[https://www.nii.ac.jp/dsc/idr/en/rakuten/|NII - Rakuten Datasets]] 
 +  * [[https://www.nii.ac.jp/dsc/idr/en/rdata/Hazumi/|NII - Osaka University Multimodal Dialogue Corpus (Hazumi)]] 
 +  * [[https://www.nii.ac.jp/dsc/idr/rdata/Ritsumei-ARC/|Ritsumeikan ARC Ukiyo-e Database]] 
 +  * [[https://www.nii.ac.jp/dsc/idr/jast/|JAST Medical Dataset]] 
 +  * [[https://www.nii.ac.jp/dsc/idr/athome/|At Home Co. Ltd. Real Estate Dataset]] 
 +  * [[https://www.nii.ac.jp/dsc/idr/bengo4/|Bengo4.com Lawyer Dataset]] 
 +  * [[https://www.nii.ac.jp/dsc/idr/diet-review/|Diet products Dataset]] 
 +  * [[https://www.nii.ac.jp/dsc/idr/oricon/|Oricon customer satisfaction Dataset]] 
 +  * [[https://www.nii.ac.jp/dsc/idr/intage/|INTAGE retail Dataset]] 
 +  * [[https://www.nii.ac.jp/dsc/idr/fuman/|Insight Tech Co. Ltd. Dissatisfaction Inquiry Dataset]] 
 +  * [[https://www.nii.ac.jp/dsc/idr/recruit/|Recruit Co. Ltd. Hot Pepper Beauty Dataset]] 
 +  * [[https://www.nii.ac.jp/dsc/idr/nico/|Nico Nico video comment Dataset]] 
 +  * [[https://lapis.nichibun.ac.jp/haikai/menu.html|Nichibun Haikai Database]] (text in HTML) 
 +  * [[https://lapis.nichibun.ac.jp/renga/menu.html|Nichibun Renga Database]] (text in HTML) 
 +  * [[https://lapis.nichibun.ac.jp/waka/menu.html|Nichibun Waka Database]] (text in HTML) 
 +  * [[https://dataverse.harvard.edu/dataverse/amycatalinac|Electoral Datasets]] (Amy Catalinac, Harvard Dataverse Repository)
 =====OCR Training=====
  
Line 39: Line 56:
  
   * [[http://geoshape.ex.nii.ac.jp/|Geoshape Repository (Geoshapeリポジトリ)]]
 +  * [[https://www.gsi.go.jp/ENGLISH/index.html|Geospatial Information Authority of Japan]]
   * [[https://sites.fas.harvard.edu/~chgis/data/japan/|Japan Historical Map Data (Harvard)]]
   * [[https://www.jodc.go.jp/aboutJODC_work_data.html|Japan Oceanographic Data Center (JODC)]]
Line 49: Line 67:
   * [[http://codh.rois.ac.jp/edo-spots/|Edo Sightseeing Guide]]
  
 +=====Stopwords=====
  
 +**[[https://github.com/stopwords-iso/stopwords-ja/blob/master/stopwords-ja.txt|Common Stopwords for Japanese]]**
 +[[https://github.com/stopwords-iso|Stopwords ISO]]
  
-=====Image Data=====
+=====Image & Video Data=====
  
   * [[http://codh.rois.ac.jp/face/|Collection of Facial Expressions]]
   * [[http://codh.rois.ac.jp/ukiyo-e/|Ukiyo-e Face Image Dataset]]
 +  * [[https://www.nii.ac.jp/dsc/idr/en/yahoo/|NII - LIFULL HOME'S Dataset]] 
 +  * [[https://www.nii.ac.jp/dsc/idr/video/video.html|Video Data]] 
 +  * [[https://www.nii.ac.jp/dsc/idr/rdata/NII-GC/|NII - Grand Challenge Conversation Corpus]] 
 +  * [[https://www.nii.ac.jp/dsc/idr/rdata/KoSign/|Kokugakuin University Japanese Sign Language Database (KoSign)]] 
 +  * [[https://www.nii.ac.jp/dsc/idr/rdata/Hazumi/|Osaka University Multimodal Dialogue Corpus (Hazumi)]] 
 +  * [[https://www.nii.ac.jp/dsc/idr/rdata/TDU-NEDO/|Group Communication Corpus (TDU-NEDO)]] 
 +  * [[https://www.nii.ac.jp/dsc/idr/trigger/|Trigger Co. Ltd. Animation Dataset]] 
 +  * [[https://www.nii.ac.jp/dsc/idr/sansan/|Sansan business card Dataset]]
 =====IIIF=====
  
   * [[http://bauddha.dhii.jp/SAT/iiifmani/show.php|IIIF Manifests for Buddhist Studies]]
  
datasets.1654618468.txt.gz · Last modified: 2022/06/07 16:14 by prcurtis