Regular Expressions (Regex) for Japanese

The regular expressions provided below are separated by source and included as code blocks for easy copy-pasting. They can also be downloaded as text files by clicking the link in the tab of each code block.

Expressions below provided by Hoyt Long (University of Chicago).

jpn_reg.txt
HTML TAGS: <[^<]+?>
 
WORD1 OR WORD2: 学校|學校
 
SENTENCE: [^!?。]*[!?。」]
 
QUOTATION: 「[^」]*」
 
ALL HIRAGANA: [ぁ-ゟ ]+  
 
ALL KATAKANA: [゠-ヿ]+   
 
ALL KANJI: [\u4E00-\u9FEF]
 
METAPHOR (?): .{3}(のように|みたいに).{3}
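
As a rough illustration, a few of the expressions above might be applied with Python's standard `re` module as sketched below. The sample text and the variable names are assumptions made for this example and are not part of the original list.

```python
import re

# Patterns copied from the list above; the names are illustrative only.
HTML_TAG = re.compile(r"<[^<]+?>")
SENTENCE = re.compile(r"[^!?。]*[!?。」]")
QUOTATION = re.compile(r"「[^」]*」")
KANJI = re.compile(r"[\u4E00-\u9FEF]")
METAPHOR = re.compile(r".{3}(のように|みたいに).{3}")

# Hypothetical sample input, not taken from any corpus.
sample = "<p>彼は「行こう」と言った。雪のように白い花が咲いた。</p>"

# Strip markup first so the tags do not leak into later matches.
text = HTML_TAG.sub("", sample)

sentences = SENTENCE.findall(text)
quotations = QUOTATION.findall(text)
kanji = KANJI.findall(text)

# group(0) is the whole match, including the 3 characters of context
# on either side of the comparison marker.
metaphors = [m.group(0) for m in METAPHOR.finditer(text)]

print(sentences)
print(quotations)
print(len(kanji))
print(metaphors)
```

Because the METAPHOR pattern uses a fixed window of three characters on each side, the match can cross a sentence boundary; widening or narrowing `.{3}` is a judgment call for the corpus at hand.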

Additional formulas from Regular Expressions 101.

Expressions below collected from the defunct Crunchytoast page and Terrance Snyder's GitHub repository.

jpn_reg_crunchytoast.txt
Regex for matching ALL Japanese common & uncommon Kanji (4e00 – 9faf)
([一-龯])
 
Regex for matching Hiragana or Katakana
([ぁ-んァ-ン])
 
Regex for matching Non-Hiragana or Non-Katakana
([^ぁ-んァ-ン])
 
Regex for matching Hiragana or Katakana or word characters (\w does not match the punctuation 、。’)
([ぁ-んァ-ン\w])
 
Regex for matching Hiragana or Katakana and random other characters
([ぁ-んァ-ン!:/])
 
Regex for matching Hiragana
([ぁ-ん])
 
Regex for matching full-width Katakana (zenkaku 全角)
([ァ-ン])
 
Regex for matching half-width Katakana (hankaku 半角)
([ｧ-ﾝﾞﾟ])
 
Regex for matching full-width Numbers (zenkaku 全角)
([０-９])
 
Regex for matching full-width Letters (zenkaku 全角)
([Ａ-ｚ])
 
Regex for matching Hiragana codespace characters 
(includes non phonetic characters)
([ぁ-ゞ])
 
Regex for matching full-width (zenkaku) Katakana codespace characters 
(includes non phonetic characters)
([ァ-ヶ])
 
Regex for matching half-width (hankaku) Katakana codespace characters 
(this is an old character set so the order is inconsistent with the hiragana)
([ｦ-ﾟ])
 
Regex for matching Japanese Post Codes
/^\d{3}\-\d{4}$/
/^\d{3}-\d{4}$|^\d{3}-\d{2}$|^\d{3}$/
 
Regex for matching Japanese mobile phone numbers (keitai bangou)
/^\d{3}-\d{4}-\d{4}$|^\d{11}$/
/^0\d0-\d{4}-\d{4}$/
 
Regex for matching Japanese fixed line phone numbers
/^[0-9-]{6,9}$|^[0-9-]{12}$/
/^\d{1,4}-\d{4}$|^\d{2,5}-\d{1,4}-\d{4}$/
 
Update from 2014 by user cb372
Hiragana = [ぁ-ゔゞ゛゜ー]  // 0x3041-0x3094, 0x309E, 0x309B, 0x309C, 0x30FC
Katakana = [ァ-・ヽヾ゛゜ー]  // 0x30A1-0x30FB, 0x30FD, 0x30FE, 0x309B, 0x309C, 0x30FC
Hiragana or katakana = [ぁ-ゔゞァ-・ヽヾ゛゜ー]  // 0x3041-0x3094, 0x309E, 0x30A1-0x30FB, 0x30FD, 0x30FE, 0x309B, 0x309C, 0x30FC
 
Update from 2019 by user minhloc2011
Updated the full-width Katakana range to 「30A1」~「30FE」 (Unicode: 30FB).
Regex for matching full-width Katakana (zenkaku 全角)
([ァ-ン])
Replace with:
([ァ-ヾ])
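
The character classes and the post code pattern above could be combined in Python roughly as follows. This is only a sketch: the function names `count_scripts` and `is_postcode` are hypothetical, the classes are written with `\u` escapes (using the code points given in the comments above) so they survive copy-pasting, and the half-width range U+FF66 to U+FF9F is assumed to correspond to the half-width Katakana codespace entry.

```python
import re

# Character classes written as code point escapes to avoid encoding damage.
KANJI = re.compile(r"[\u4E00-\u9FAF]")
HIRAGANA = re.compile(r"[\u3041-\u3094\u309E\u309B\u309C\u30FC]")
KATAKANA = re.compile(r"[\u30A1-\u30FB\u30FD\u30FE\u309B\u309C\u30FC]")
# Assumption: half-width Katakana block, U+FF66 through U+FF9F.
HALFWIDTH_KATAKANA = re.compile(r"[\uFF66-\uFF9F]")

# In Python 3, \d also matches full-width digits unless re.ASCII is set,
# so the flag is added here to accept ASCII digits only.
POSTCODE = re.compile(r"^\d{3}-\d{4}$", re.ASCII)


def count_scripts(text):
    """Count characters per script; the names here are hypothetical."""
    return {
        "kanji": len(KANJI.findall(text)),
        "hiragana": len(HIRAGANA.findall(text)),
        "katakana": len(KATAKANA.findall(text)),
        "halfwidth_katakana": len(HALFWIDTH_KATAKANA.findall(text)),
    }


def is_postcode(value):
    """Return True if value looks like a 7-digit post code (NNN-NNNN)."""
    return POSTCODE.match(value) is not None


print(count_scripts("東京でカメラを買った"))
print(is_postcode("100-0001"))
print(is_postcode("1000001"))
```

Note that the 2014 Hiragana and Katakana classes both include the long vowel mark ー (0x30FC) and the voicing marks (0x309B, 0x309C), so text containing those characters is counted under both scripts.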
regex.txt · Last modified: 2022/06/21 18:06 by prcurtis