tutorials

Differences

This shows you the differences between two versions of the page.

tutorials [2022/06/21 14:48] – [OCR & Kuzushiji Reading] prcurtis tutorials [2022/12/10 17:10] (current) – [OCR & Kuzushiji Reading] prcurtis
Line 9: Line 9:
 [[https://experiencing.art/|Christopher Morse]] [[https://experiencing.art/|Christopher Morse]]
  
-**Regular Expressions (regex) for Japanese** +===== Dictionaries for Word Segmentation =====
-//Provided by Hoyt Long. Download as a text file by clicking the link in the tab above.// +
-<file text jpn_reg.txt> +
-HTML TAGS: <[^<]+?> +
  
-WORD1 OR WORD2: 学校|學校 +**[[https://www.dampfkraft.com/nlp/japanese-tokenizer-dictionaries.html|An Overview of Japanese Tokenizer Dictionaries]]** 
 +[[https://www.dampfkraft.com/|Paul McCann]]
  
-SENTENCE: [^!?。]*[!?。」] 
- 
-QUOTATION: 「[^」]*」 
- 
-ALL HIRAGANA: [ぁ-ゟ ]+   
- 
-ALL KATAKANA: [゠-ヿ]+    
- 
-ALL KANJI: [\u4E00-\u9FEF] 
- 
-METAPHOR (?): .{3}(のように|みたいに).{3} 
-</file> 
 ===== Encoding ===== ===== Encoding =====
  
Line 43: Line 29:
 [[https://www.mstavros.com/home|Matthew Stavros]] [[https://www.mstavros.com/home|Matthew Stavros]]
  
 +**[[https://digitalorientalist.com/2021/11/09/i-just-want-the-data-a-short-guide-to-gsi-japan-for-non-japanese-speaking-users/|“I Just Want the Data!”: A Short Guide to GSI Japan for Non-Japanese-Speaking Users]]**
 +[[https://digitalorientalist.com/author/pulpbindandbond/|Matthew Hayes]]
 ===== OCR & Kuzushiji Reading===== ===== OCR & Kuzushiji Reading=====
  
Line 56: Line 44:
 **[[https://digitalorientalist.com/2021/04/09/google-docs-and-ocr-some-experiments-transcribing-japanese-language-texts/|Google Docs and OCR: Some Experiments Transcribing Japanese Language Texts]]** **[[https://digitalorientalist.com/2021/04/09/google-docs-and-ocr-some-experiments-transcribing-japanese-language-texts/|Google Docs and OCR: Some Experiments Transcribing Japanese Language Texts]]**
 [[https://digitalorientalist.com/author/morrisjh/|James Harry Morris]], //[[https://digitalorientalist.com|The Digital Orientalist]]// [[https://digitalorientalist.com/author/morrisjh/|James Harry Morris]], //[[https://digitalorientalist.com|The Digital Orientalist]]//
 +
 ===== Text Segmentation ===== ===== Text Segmentation =====
  
 **[[https://digitalorientalist.com/2019/03/19/japanese-text-segmentation-with-web-chamame/|Japanese Text Segmentation and Analysis with Web ChaMame]]** **[[https://digitalorientalist.com/2019/03/19/japanese-text-segmentation-with-web-chamame/|Japanese Text Segmentation and Analysis with Web ChaMame]]**
 [[https://tsukuba.academia.edu/JamesMorris|James Harry Morris]], //[[https://digitalorientalist.com|The Digital Orientalist]]// [[https://tsukuba.academia.edu/JamesMorris|James Harry Morris]], //[[https://digitalorientalist.com|The Digital Orientalist]]//
 +
 +**[[https://aclanthology.org/2020.nlposs-1.7/|fugashi, a Tool for Tokenizing Japanese in Python]]**
 +**[[https://slideslive.com/38939744/fugashi-a-tool-for-japanese-tokenization|fugashi: A Tool for Japanese Tokenization]]**
 +[[https://www.dampfkraft.com/|Paul McCann]]
 +
 +**[[https://towardsdatascience.com/how-japanese-tokenizers-work-87ab6b256984|How Japanese Tokenizers Work]]**
 +[[https://medium.com/@wanasit?source=post_page-----87ab6b256984--------------------------------|Wanasit Tanakitrungruang]]
  
 **[[https://digitalorientalist.com/2021/05/11/basic-python-for-japanese-studies-using-fugashi-for-text-segmentation/|Basic Python for Japanese Studies: Using fugashi for Text Segmentation]]** **[[https://digitalorientalist.com/2021/05/11/basic-python-for-japanese-studies-using-fugashi-for-text-segmentation/|Basic Python for Japanese Studies: Using fugashi for Text Segmentation]]**
Line 66: Line 62:
 **[[https://clrd.ninjal.ac.jp/tutorial.html|Tutorials on linguistic corpora (J)]]** **[[https://clrd.ninjal.ac.jp/tutorial.html|Tutorials on linguistic corpora (J)]]**
 [[https://www.ninjal.ac.jp/english/|National Institute for Japanese Language and Linguistics (国立国語研究所)]] [[https://www.ninjal.ac.jp/english/|National Institute for Japanese Language and Linguistics (国立国語研究所)]]
 +
 +
 +**[[https://digitalorientalist.com/2022/12/09/genius-loci-extracting-names-and-places-from-japanese-texts/|Genius loci: extracting names and places from Japanese texts]]**
 +[[https://digitalorientalist.com/about-anna-oskina/|Anna Oskina]], //[[https://digitalorientalist.com|The Digital Orientalist]]//
 +
 ===== Text Mining ===== ===== Text Mining =====
  
Line 74: Line 75:
 **[[https://digitalorientalist.com/2021/06/18/using-voyant-tools-with-historical-japanese-texts/|Using Voyant Tools with Historical Japanese Texts]]** **[[https://digitalorientalist.com/2021/06/18/using-voyant-tools-with-historical-japanese-texts/|Using Voyant Tools with Historical Japanese Texts]]**
 [[https://digitalorientalist.com/author/morrisjh/|James Harry Morris]], //[[https://digitalorientalist.com|The Digital Orientalist]]// [[https://digitalorientalist.com/author/morrisjh/|James Harry Morris]], //[[https://digitalorientalist.com|The Digital Orientalist]]//
 +
 +**[[https://leanpub.com/japanesenlp|Introduction to Japanese Natural Language Processing]]**
 +[[https://twitter.com/mhagiwara|Masato Hagiwara]] and [[https://www.dampfkraft.com/|Paul O'Leary McCann]]
 +
 +
 +**[[https://digitalorientalist.com/2022/12/09/genius-loci-extracting-names-and-places-from-japanese-texts/|Genius loci: extracting names and places from Japanese texts]]**
 +[[https://digitalorientalist.com/about-anna-oskina/|Anna Oskina]], //[[https://digitalorientalist.com|The Digital Orientalist]]//
 +
 +**[[https://steviepoppe.net/blog/2020/04/a-quick-guide-to-data-mining-textual-analysis-of-japanese-twitter/|A Quick Guide to Data-mining & (Textual) Analysis of (Japanese) Twitter Part 1: Twitter Data Collection]]
 +[[https://steviepoppe.net/blog/2020/05/a-quick-guide-to-data-mining-textual-analysis-of-japanese-twitter-part-2/|A Quick Guide to Data-mining & (Textual) Analysis of (Japanese) Twitter Part 2: Basic Metrics & Graphs]]
 +[[https://steviepoppe.net/blog/2020/06/a-quick-guide-to-data-mining-textual-analysis-of-japanese-twitter-part-3/|A Quick Guide to Data-mining & (Textual) Analysis of (Japanese) Twitter Part 3: Natural Language Processing With MeCab, Neologd and KH Coder]]
 +[[https://steviepoppe.net/blog/2020/07/a-quick-guide-to-data-mining-textual-analysis-of-japanese-twitter-part-4/|A Quick Guide to Data-mining & (Textual) Analysis of (Japanese) Twitter Part 4: Natural Language Processing With MeCab, Neologd and NLTK]]**
 +[[https://steviepoppe.net/|Stevie Poppe]]
 ===== Webscraping ===== ===== Webscraping =====
  
 **[[https://www.mollydesjardin.com/blog/crawling-aozora-bunko/|Crawling Aozora Bunko]]** **[[https://www.mollydesjardin.com/blog/crawling-aozora-bunko/|Crawling Aozora Bunko]]**
 [[https://www.mollydesjardin.com/|Molly Des Jardin]] [[https://www.mollydesjardin.com/|Molly Des Jardin]]
- 
  
 **[[https://digitalorientalist.com/2020/01/14/web-scraping-with-python-for-beginners/|Web Scraping with Python for Beginners]]** **[[https://digitalorientalist.com/2020/01/14/web-scraping-with-python-for-beginners/|Web Scraping with Python for Beginners]]**
 [[https://tsukuba.academia.edu/JamesMorris|James Harry Morris]], //[[https://digitalorientalist.com|The Digital Orientalist]]// [[https://tsukuba.academia.edu/JamesMorris|James Harry Morris]], //[[https://digitalorientalist.com|The Digital Orientalist]]//
tutorials.1655822939.txt.gz · Last modified: 2022/06/21 14:48 by prcurtis