icu46/source/test/testdata/rbbitst.txt - Issue 6370014: CJK segmentation patch for ICU 4.6...

Unified Diff: icu46/source/test/testdata/rbbitst.txt

Issue 6370014: CJK segmentation patch for ICU 4.6... (Closed) Base URL: svn://chrome-svn/chrome/trunk/deps/third_party/

Patch Set: Created 9 years, 11 months ago

Index: icu46/source/test/testdata/rbbitst.txt

===================================================================

--- icu46/source/test/testdata/rbbitst.txt (revision 68397)

+++ icu46/source/test/testdata/rbbitst.txt (working copy)

@@ -161,14 +161,32 @@

# Hiragana & Katakana stay together, but separates from each other and Latin.

-<data>•abc<200>\N{HIRAGANA LETTER SMALL A}<300>\N{HIRAGANA LETTER VU}\N{COMBINING ACUTE ACCENT}<300>\N{HIRAGANA ITERATION MARK}<300>\N{KATAKANA LETTER SMALL A}\N{KATAKANA ITERATION MARK}\N{HALFWIDTH KATAKANA LETTER WO}\N{HALFWIDTH KATAKANA LETTER N}<300>def<200>#•</data>

+# *** what to do about theoretical combos of chars? i.e. hiragana + accent

+#<data>•abc<200>\N{HIRAGANA LETTER SMALL A}<300>\N{HIRAGANA LETTER VU}\N{COMBINING ACUTE ACCENT}<300>\N{HIRAGANA ITERATION MARK}<300>\N{KATAKANA LETTER SMALL A}\N{KATAKANA ITERATION MARK}\N{HALFWIDTH KATAKANA LETTER WO}\N{HALFWIDTH KATAKANA LETTER N}<300>def<200>#•</data>

+# test normalization/dictionary handling of halfwidth katakana: same dictionary phrase in fullwidth and halfwidth

+<data>•芽キャベツ<400>芽キャﾍﾞツ<400></data>

+# more Japanese tests

+# TODO: Currently, U+30FC and other characters (script=common) in the Hiragana

+# and the Katakana block are not treated correctly. Enable this later.

+#<data>•どー<400>せ<400>日本語<400>を<400>勉強<400>する<400>理由<400>について<400>　•て<400>こと<400>は<400>我<400>でも<400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data>

+<data>•日本語<400>を<400>勉強<400>する<400>理由<400>について<400>　•て<400>こと<400>は<400>我<400>でも<400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data>

+# Testing of word boundary for dictionary word containing both kanji and kana

+<data>•中だるみ<400>蔵王の森<400>ウ離島<400></data>

+# Testing of Chinese segmentation (taken from a Chinese news article)

+<data>•400<100>余<400>名<400>中央<400>委员<400>和<400>中央<400>候补<400>委员<400>都<400>领<400>到了<400>“•推荐<400>票<400>”•，•有<400>资格<400>在<400>200<100>多<400>名<400>符合<400>条件<400>的<400>63<100>岁<400>以下<400>中共<400>正<400>部<400>级<400>干部<400>中<400>，•选出<400>他们<400>属意<400>的<400>中央<400>政治局<400>委员<400>以<400>向<400>政治局<400>常委<400>会<400>举荐<400>。•</data>

# Words with interior formatting characters

<data>•def\N{COMBINING ACUTE ACCENT}\N{SYRIAC ABBREVIATION MARK}ghi<200> •</data>

# to test for bug #4097779

<data>•aa\N{COMBINING GRAVE ACCENT}a<200> •</data>

+# fullwidth numeric, midletter characters etc should be treated like their halfwidth counterparts

+<data>•ＩＳＮ'Ｔ<200> •１９<100>日<400></data>

# to test for bug #4098467

# What follows is a string of Korean characters (I found it in the Yellow Pages

@@ -178,10 +196,16 @@

# precomposed syllables...

-<data>•abc<200>\u4e01<400>\u4e02<400>\u3005<200>\u4e03<400>\u4e03<400>abc<200> •</data>

+# more Korean tests (Jamo not tested here, not counted as dictionary characters)

+# Disable them now because we don't include a Korean dictionary.

+#<data>•\ud55c\uad6d<200>\ub300\ud559\uad50<200>\uc790\uc5f0<200>\uacfc\ud559<200>\ub300\ud559<200>\ubb3c\ub9ac\ud559\uacfc<200></data>

+#<data>•\ud604\uc7ac<200>\ub294<200> •\uac80\ucc30<200>\uc774<200> •\ubd84\uc2dd<200>\ud68c\uacc4<200>\ubb38\uc81c<200>\ub97c<200> •\uc870\uc0ac<200>\ud560<200> •\uac00\ub2a5\uc131<200>\uc740<200> •\uc5c6\ub2e4<200>\u002e•</data>

-<data>•\u06c9\uc799\ufffa<200></data>

+<data>•abc<200>\u4e01<400>\u4e02<400>\u3005<400>\u4e03\u4e03<400>abc<200> •</data>

+<data>•\u06c9<200>\uc799<200>\ufffa•</data>

# Try some words from other scripts.

@@ -491,8 +515,7 @@

# conjoining jamo...

-# TODO: rules update needed

-#<data>•\u1109\u1161\u11bc•\u1112\u1161\u11bc •\u1112\u1161\u11ab•\u110b\u1175\u11ab #•\u110b\u1167\u11ab•\u1112\u1161\u11b8 •\u110c\u1161\u11bc•\u1105\u1169•\u1100\u116d•\u1112\u116c•</data>

+<data>•\u1109\u1161\u11bc•\u1112\u1161\u11bc •\u1112\u1161\u11ab•\u110b\u1175\u11ab •\u110b\u1167\u11ab•\u1112\u1161\u11b8 •\u110c\u1161\u11bc•\u1105\u1169•\u1100\u116d•\u1112\u116c•</data>

# to test for bug #4117554: Fullwidth .!? should be treated as postJwrd
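
A note on the notation in the hunks above: as far as I understand the conventions of rbbitst.txt, a bullet (•) inside `<data>…</data>` marks an expected break with rule status 0, and `<nnn>` marks an expected break with rule status nnn (so `<400>` after a CJK word and `<200>` after a Latin word are different status tags). The Python sketch below only illustrates that reading. It is not part of this patch or of ICU's test harness (that is the C++ in rbbitst.cpp), `parse_data_line` is a hypothetical helper name, and it deliberately does not unescape `\uXXXX` or `\N{...}` sequences, so the offsets are only meaningful for lines without escapes.

```python
import re

# A break marker is either a bullet (U+2022, status 0) or <nnn>;
# an empty <> is assumed to mean status 0 as well.
_BREAK = re.compile("\u2022|<(\\d*)>")


def parse_data_line(line):
    """Split one <data> line into its plain text and expected breaks.

    Returns (text, breaks), where text has all markers removed and
    breaks is a list of (offset_in_text, status) tuples.
    Assumption: escapes such as \\uXXXX are left as literal characters.
    """
    body = line.strip()
    if body.startswith("<data>") and body.endswith("</data>"):
        body = body[len("<data>"):-len("</data>")]

    text_parts = []
    breaks = []
    pos = 0      # current index into body (markers included)
    offset = 0   # current index into the marker-free text
    for m in _BREAK.finditer(body):
        chunk = body[pos:m.start()]
        text_parts.append(chunk)
        offset += len(chunk)
        status = int(m.group(1)) if m.group(1) else 0
        breaks.append((offset, status))
        pos = m.end()
    text_parts.append(body[pos:])
    return "".join(text_parts), breaks


if __name__ == "__main__":
    # The fullwidth test line added by this patch.
    text, breaks = parse_data_line("<data>•ＩＳＮ'Ｔ<200> •１９<100>日<400></data>")
    print(text)
    for offset, status in breaks:
        print(offset, status)
```

Read this way, the changed line for `\u3005` (status `<200>` replaced by `<400>`, and `\u4e03\u4e03` now expected as one segment) is a change in the expected status and break positions, not in the input text.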
