Unicode Regular Expressions in J/Ruby - Servlet Garden @ Hatena

In http://d.hatena.ne.jp/yokolet/20080801#1217681564 I mentioned that the Regular Expressions Reference Table of Contents summarizes how well each language (regular expression engine) supports Unicode regular expressions. Here is an excerpt covering Perl, whose support is excellent, Java, whose support is decent, and Ruby.

Unicode Characters, Properties, Scripts and Blocks
                                                    Java     Perl     Ruby
\X (Unicode grapheme)                               no       YES      no
\u0000 through \uFFFF (Unicode character)           YES      no       no
\x{0} through \x{FFFF} (Unicode character)          no       YES      no
\pL through \pC (Unicode properties)                YES      YES      no
\p{L} through \p{C} (Unicode properties)            YES      YES      no
\p{Lu} through \p{Cn} (Unicode property)            YES      YES      no
\p{L&} and \p{Letter&}                              no       YES      no
  (equivalent of [\p{Lu}\p{Ll}\p{Lt}] Unicode
  properties)
\p{IsL} through \p{IsC} (Unicode properties)        YES      YES      no
\p{IsLu} through \p{IsCn} (Unicode property)        YES      YES      no
\p{Letter} through \p{Other} (Unicode properties)   no       YES      no
\p{Lowercase_Letter} through \p{Not_Assigned}       no       YES      no
  (Unicode property)
\p{IsLetter} through \p{IsOther}                    no       YES      no
  (Unicode properties)
\p{IsLowercase_Letter} through \p{IsNot_Assigned}   no       YES      no
  (Unicode property)
\p{Arabic} through \p{Yi} (Unicode script)          no       YES      no
\p{IsArabic} through \p{IsYi} (Unicode script)      no       YES      no
\p{BasicLatin} through \p{Specials} (Unicode block) no       YES      no
\p{InBasicLatin} through \p{InSpecials}             YES      YES      no
  (Unicode block)
Spaces, hyphens and underscores allowed in all long Java 5   YES      no
  names listed above (e.g. BasicLatin can be written
  as Basic-Latin or Basic_Latin or Basic Latin)
\P (negated variants of all \p as listed above)     YES      YES      no
\p{^...} (negated variants of all \p{...} as        no       YES      no
  listed above)

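Support is currently very poor not just in Ruby but across scripting languages in general (ECMA, Python). The table presumably describes Ruby 1.8; with 1.9, Ruby appears to have started its own implementation based on Oniguruma (http://www.geocities.jp/kosako3/oniguruma/), a regular expression engine that supports Unicode. JRuby, the Java implementation of Ruby, has long incorporated Joni, the Java port of Oniguruma, and the org.jruby.RubyRegexp and org.jruby.RubyString classes use classes from the org.joni package. I wondered how well J/Ruby actually supports Unicode regular expressions, so I looked into it. These are my notes.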

As a test, I defined the following function and ran it.

require 'strscan'

def strscan_regexp(str, re)
  puts re
  s = StringScanner.new(str)
  until s.eos?
    s.skip(/\s*/)
    case
      when s.scan(/(\w+)|([^\s\w]+)/)
        if s[0] =~ re
          puts "<<#{$&}>>"
        end
    end
  end
end

strscan_regexp("a b c あ い う ア イ ウ", /\p{Alpha}/)
strscan_regexp("a b c あ い う ア イ ウ", /\p{Hiragana}/u)

Running this with Ruby 1.9 gives:

$ ruby1.9 regex_test.rb 
(?-mix:\p{Alpha})
<<a>>
<<b>>
<<c>>
(?-mix:\p{Hiragana})
regex_test.rb:10:in `=~': incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string) (ArgumentError)
        from regex_test.rb:10:in `strscan_regexp'
        from regex_test.rb:18:in `<main>'

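That was the result. According to the Oniguruma documentation (http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt), \p{Hiragana} should match Unicode Hiragana when the encoding is UTF-8, but Ruby 1.9 does not seem able to do that yet. Note that the error reported is an encoding mismatch between the UTF-8 regexp and an ASCII-8BIT string, so the string's encoding, rather than \p{Hiragana} itself, may be what fails here. The version I tested is 1.9.0 (2007-12-25 revision 14709), the rather old-looking one installable through Ubuntu's package manager, so the latest 1.9 may well behave differently.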
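To make the encoding point concrete, here is a minimal sketch. It assumes a Ruby where every string carries an encoding tag and where \p{Hiragana} is accepted, which is newer than the 1.9.0 snapshot tested above; scan_hiragana is a hypothetical helper introduced only for this illustration, and I have not checked it against that old build.

```ruby
# -*- coding: utf-8 -*-
# The magic comment marks string literals in this file as UTF-8.
# It is harmless on Ruby 2.0+, where UTF-8 is already the default.

# Hypothetical helper: returns the Hiragana characters in str, or an
# "error: ..." string when the regexp and the string encodings clash.
def scan_hiragana(str)
  str.scan(/\p{Hiragana}/u)
rescue EncodingError, ArgumentError => e
  # Same kind of failure as in the traceback above.
  "error: #{e.message}"
end

text   = "a b c あ い う ア イ ウ"
binary = text.dup.force_encoding("ASCII-8BIT")  # same bytes, binary tag
utf8   = binary.dup.force_encoding("UTF-8")     # same bytes, UTF-8 tag

p scan_hiragana(binary)  # the encoding clash is reported here
p scan_hiragana(utf8)    # the Hiragana characters are found here
```

The bytes are identical in both calls; only the encoding tag differs, which is why checking str.encoding is a reasonable first step when this error shows up.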

So what about JRuby?

strscan_regexp("a b c あ い う ア イ ウ", /\p{Alpha}/)
strscan_regexp("a b c あ い う ア イ ウ", /\p{Alpha}/u)
strscan_regexp("a b c あ い う ア イ ウ", /\p{Hiragana}/u)

I tried these three calls and got:

(?-mix:\p{Alpha})
<<a>>
<<b>>
<<c>>
(?-mix:\p{Alpha})
<<a>>
<<b>>
<<c>>
<<あ>>
<<い>>
　:1: invalid character property name {Hiragana}: /\p{Hiragana}/u (RegexpError)
<<う>>
<<ア>>
<<イ>>
<<ウ>>

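That was the result. Surprisingly, /\p{Alpha}/u matched not only the alphabet but Hiragana and Katakana as well, although under Ruby running this raised an error. (Hiragana and Katakana are classified as letters in Unicode, so this may actually be the intended behavior of a Unicode-aware \p{Alpha}.) And /\p{Hiragana}/u is rejected as a regular expression outright!? (Probably a bug, I suspect.)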
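The difference between \p{Alpha} and script-specific properties can be sketched as follows. This assumes a newer Ruby on which \p{Latin} and \p{Katakana} are accepted as property names (the JRuby build above rejected \p{Hiragana}, so it would likely reject these too); tokens_matching is a hypothetical helper, not something from the test script.

```ruby
# -*- coding: utf-8 -*-

# Hypothetical helper: splits text on whitespace and keeps the tokens
# that match re in full.
def tokens_matching(text, re)
  text.split.grep(re)
end

text = "a b c あ い う ア イ ウ"

# Unicode-aware \p{Alpha} accepts letters of any script.
p tokens_matching(text, /\A\p{Alpha}+\z/)

# Restricting by script, or by an explicit ASCII range, narrows it down.
p tokens_matching(text, /\A\p{Latin}+\z/)
p tokens_matching(text, /\A[a-z]+\z/i)
p tokens_matching(text, /\A\p{Katakana}+\z/)
```

If only ASCII letters are wanted, the explicit [a-z] range is the most portable of the three, since it does not depend on Unicode property support at all.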
Since the Ruby-style approach still falls short, I tried the Java-style approach instead, like this:

require 'java'
import 'java.util.regex.Pattern'

def java_regexp(a, re)
  puts re
  pattern = Pattern.compile(re)
  matcher = pattern.matcher(a)
  while(matcher.find())
    puts "<<#{matcher.group()}>>"
  end
end

java_regexp("a b c あ い う ア イ ウ", "\\p{InHiragana}")

Running it with JRuby 1.1.3 gives:

\p{InHiragana}
<<あ>>
<<い>>
<<う>>

As expected, only the characters in the Unicode Hiragana block matched. So for now, if you are on JRuby, it seems Unicode regular expressions are best handled with the Java classes.
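One way to apply that advice without tying a script to JRuby is to pick the engine at run time. The following is only a sketch: hiragana_matches is a hypothetical helper, the native branch assumes a Ruby on which \p{Hiragana} works, and the two branches are not strictly equivalent, because \p{InHiragana} is a Unicode block while \p{Hiragana} is a script. They agree for the sample text used here.

```ruby
# -*- coding: utf-8 -*-

# Hypothetical helper: returns the Hiragana characters found in str.
def hiragana_matches(str)
  if RUBY_PLATFORM == "java"
    # On JRuby, delegate to java.util.regex as in the listing above.
    require "java"
    matcher = java.util.regex.Pattern.compile("\\p{InHiragana}").matcher(str)
    found = []
    found << matcher.group while matcher.find
    found
  else
    # Elsewhere, assume the native engine understands \p{Hiragana}.
    str.scan(/\p{Hiragana}/)
  end
end

p hiragana_matches("a b c あ い う ア イ ウ")
```

Keeping the choice inside one method means the calling code does not need to know which engine did the matching.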