Regular Expressions (Regex) for Japanese

The regular expressions provided below are separated by source and included as code blocks for easy copy-pasting. They can also be downloaded as text files by clicking the link in the tab of each code block.

Expressions below provided by Hoyt Long (University of Chicago).

jpn_reg.txt
HTML TAGS: <[^<]+?>
 
WORD1 OR WORD2: 学校|學校
 
SENTENCE: [^!?。]*[!?。」]
 
QUOTATION: 「[^」]*」
 
ALL HIRAGANA: [ぁ-ゟ ]+  
 
ALL KATAKANA: [゠-ヿ]+   
 
ALL KANJI: [\u4E00-\u9FEF]
 
METAPHOR (?): .{3}(のように|みたいに).{3}
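
The expressions above can be tried out directly in a scripting language. What follows is a minimal Python sketch, not part of the original list: the variable and function names and the sample sentence are assumptions made for illustration, while the patterns are copied verbatim from the block above.

```python
import re

# Patterns copied from the list above; the names are illustrative only.
HTML_TAG = re.compile(r"<[^<]+?>")
SENTENCE = re.compile(r"[^!?。]*[!?。」]")
QUOTATION = re.compile(r"「[^」]*」")
KANJI = re.compile(r"[\u4E00-\u9FEF]")
METAPHOR = re.compile(r".{3}(のように|みたいに).{3}")


def strip_tags(text: str) -> str:
    """Remove anything that looks like an HTML tag."""
    return HTML_TAG.sub("", text)


def sentences(text: str) -> list[str]:
    """Split text into sentence-like chunks that end in closing punctuation."""
    return SENTENCE.findall(text)


def quotations(text: str) -> list[str]:
    """Return every 「...」 quotation, brackets included."""
    return QUOTATION.findall(text)


def kanji_count(text: str) -> int:
    """Count characters that fall inside the kanji range."""
    return len(KANJI.findall(text))


def metaphor_contexts(text: str) -> list[str]:
    """Return each simile marker with three characters of context per side."""
    # group(0) is the whole match; findall() would return only the marker,
    # because the pattern contains a capturing group.
    return [m.group(0) for m in METAPHOR.finditer(text)]


# Hypothetical sample input.
sample = "<p>彼は「行こう」と言った。雪のように白い花が咲く。</p>"
clean = strip_tags(sample)

print(sentences(clean))
print(quotations(clean))
print(kanji_count(clean))
print(metaphor_contexts(clean))
```

Compiling each pattern once and wrapping it in a small function keeps the expressions themselves unchanged, so they can still be pasted into any other regex tool.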

Expressions below collected from the defunct Crunchytoast page and Terrance Snyder's GitHub repository.

jpn_reg_crunchytoast.txt
Regex for matching ALL Japanese common & uncommon Kanji (U+4E00 to U+9FAF)
([一-龯])
 
Regex for matching Hiragana or Katakana
([ぁ-んァ-ン])
 
Regex for matching Non-Hiragana or Non-Katakana
([^ぁ-んァ-ン])
 
Regex for matching Hiragana or Katakana or word characters (\w; the punctuation 、。’ is not matched by this class)
([ぁ-んァ-ン\w])
 
Regex for matching Hiragana or Katakana and random other characters
([ぁ-んァ-ン!:/])
 
Regex for matching Hiragana
([ぁ-ん])
 
Regex for matching full-width Katakana (zenkaku 全角)
([ァ-ン])
 
Regex for matching half-width Katakana (hankaku 半角, U+FF67 to U+FF9F)
([ｧ-ﾝﾞﾟ])
 
Regex for matching full-width Numbers (zenkaku 全角, U+FF10 to U+FF19)
([０-９])
 
Regex for matching full-width Letters (zenkaku 全角, U+FF21 to U+FF3A and U+FF41 to U+FF5A)
([Ａ-Ｚａ-ｚ])
 
Regex for matching Hiragana codespace characters 
(includes non phonetic characters)
([ぁ-ゞ])
 
Regex for matching full-width (zenkaku) Katakana codespace characters 
(includes non phonetic characters)
([ァ-ヶ])
 
Regex for matching half-width (hankaku) Katakana codespace characters (U+FF66 to U+FF9F)
(this is an old character set so the order is inconsistent with the hiragana)
([ｦ-ﾟ])
 
Regex for matching Japanese Post Codes
/^\d{3}\-\d{4}$/
/^\d{3}-\d{4}$|^\d{3}-\d{2}$|^\d{3}$/
 
Regex for matching Japanese mobile phone numbers (keitai bangou)
/^\d{3}-\d{4}-\d{4}$|^\d{11}$/
/^0\d0-\d{4}-\d{4}$/
 
Regex for matching Japanese fixed line phone numbers
/^[0-9-]{6,9}$|^[0-9-]{12}$/
/^\d{1,4}-\d{4}$|^\d{2,5}-\d{1,4}-\d{4}$/
 
Update from 2014 by user cb372
Hiragana = [ぁ-ゔゞ゛゜ー]  // 0x3041-0x3094, 0x309E, 0x309B, 0x309C, 0x30FC
Katakana = [ァ-・ヽヾ゛゜ー]  // 0x30A1-0x30FB, 0x30FD, 0x30FE, 0x309B, 0x309C, 0x30FC
Hiragana or katakana = [ぁ-ゔゞァ-・ヽヾ゛゜ー]  // 0x3041-0x3094, 0x309E, 0x30A1-0x30FB, 0x30FD, 0x30FE, 0x309B, 0x309C, 0x30FC
 
Update from 2019 by user minhloc2011
Extended the full-width Katakana range to U+30A1 through U+30FE, which includes ・ (U+30FB).
Regex for matching full-width Katakana (zenkaku 全角)
([ァ-ン])
Replace with:
([ァ-ヾ])
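
As a rough illustration of how the character classes and the post code and phone patterns above might be used together, here is a minimal Python sketch. It is not taken from the Crunchytoast page or the GitHub repository: every function and variable name is hypothetical, and the ranges are written as code-point escapes (an assumption made here so that half-width and full-width forms survive copy-pasting). The kanji range is 一-龯, the hiragana range is ぁ-ん, the full-width Katakana range is the 2019 one (U+30A1 to U+30FE), and the half-width Katakana range is U+FF66 to U+FF9F. The numeric patterns are written with a backslash (`\d`).

```python
import re
import unicodedata

# Ordered list of (name, single-character pattern). Names are illustrative.
SCRIPTS = [
    ("kanji", re.compile(r"[\u4E00-\u9FAF]")),                # 一-龯
    ("hiragana", re.compile(r"[\u3041-\u3093]")),             # ぁ-ん
    ("katakana", re.compile(r"[\u30A1-\u30FE]")),             # ァ-ヾ
    ("half-width katakana", re.compile(r"[\uFF66-\uFF9F]")),  # half-width block
]


def classify(ch: str) -> str:
    """Return the name of the first script class that matches one character."""
    for name, pattern in SCRIPTS:
        if pattern.fullmatch(ch):
            return name
    return "other"


# re.ASCII restricts \d to 0-9. Without it, Python's \d also accepts
# full-width digits, which is usually not wanted when validating input.
POST_CODE = re.compile(r"^\d{3}-\d{4}$", re.ASCII)
MOBILE = re.compile(r"^0\d0-\d{4}-\d{4}$", re.ASCII)


def is_post_code(s: str) -> bool:
    """True for strings shaped like 123-4567 (ASCII digits only)."""
    return POST_CODE.match(s) is not None


def is_mobile(s: str) -> bool:
    """True for strings shaped like 090-1234-5678 (ASCII digits only)."""
    return MOBILE.match(s) is not None


def fold_width(text: str) -> str:
    """Apply NFKC normalization.

    This folds half-width katakana to full-width, but it also folds
    full-width letters and digits to their ASCII forms.
    """
    return unicodedata.normalize("NFKC", text)


print([classify(ch) for ch in "漢字とカナ"])
print(is_post_code("123-4567"), is_mobile("090-1234-5678"))
print(fold_width("\uFF76\uFF85"))
```

Normalizing with `fold_width` before matching is one way to avoid maintaining separate half-width and full-width patterns; whether that is appropriate depends on whether the width distinction matters for the text being analyzed.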
regex.1655824322.txt.gz · Last modified: 2022/06/21 15:12 by prcurtis