Side by Side Diff: icu46/source/test/testdata/rbbitst.txt

Issue 6370014: CJK segmentation patch for ICU 4.6... (Closed) Base URL: svn://chrome-svn/chrome/trunk/deps/third_party/
Patch Set: Created 9 years, 10 months ago
OLD NEW
1 # Copyright (c) 2001-2009 International Business Machines 1 # Copyright (c) 2001-2009 International Business Machines
2 # Corporation and others. All Rights Reserved. 2 # Corporation and others. All Rights Reserved.
3 # 3 #
4 # RBBI Test Data 4 # RBBI Test Data
5 # 5 #
6 # File: rbbitst.txt 6 # File: rbbitst.txt
7 # 7 #
8 # The format of this file looks vaguely like some kind of xml-ish markup, 8 # The format of this file looks vaguely like some kind of xml-ish markup,
9 # but it is NOT. The syntax is this.. 9 # but it is NOT. The syntax is this..
10 # 10 #
(...skipping 143 matching lines...)
154 <data>•\uc5f0\ud569<200> •\uc7a5\ub85c\uad50\ud68c<200> •\u1109\u1161\u11bc\u111 2\u1161\u11bc<200> •\u1112\u1161\u11ab\u110b\u1175\u11ab<200> •Hello<200>,• •how <200> •are<200> •you<200> •</data> 154 <data>•\uc5f0\ud569<200> •\uc7a5\ub85c\uad50\ud68c<200> •\u1109\u1161\u11bc\u111 2\u1161\u11bc<200> •\u1112\u1161\u11ab\u110b\u1175\u11ab<200> •Hello<200>,• •how <200> •are<200> •you<200> •</data>
155 155
156 156
157 # Words containing non-BMP letters 157 # Words containing non-BMP letters
158 <data>•abc\U00010300<200> •abc\N{DESERET SMALL LETTER ENG}<200> •abc\N{MATHEMATICAL BOLD SMALL Z}<200> •abc\N{MATHEMATICAL SANS-SERIF BOLD ITALIC PI SYMBOL}<200> •</data> 158 <data>•abc\U00010300<200> •abc\N{DESERET SMALL LETTER ENG}<200> •abc\N{MATHEMATICAL BOLD SMALL Z}<200> •abc\N{MATHEMATICAL SANS-SERIF BOLD ITALIC PI SYMBOL}<200> •</data>
159 159
160 # Unassigned code points 160 # Unassigned code points
161 <data>•abc<200>\U0001D800•def<200>\U0001D3FF• •</data> 161 <data>•abc<200>\U0001D800•def<200>\U0001D3FF• •</data>
162 162
163 # Hiragana & Katakana stay together, but separates from each other and Latin. 163 # Hiragana & Katakana stay together, but separates from each other and Latin.
164 <data>•abc<200>\N{HIRAGANA LETTER SMALL A}<300>\N{HIRAGANA LETTER VU}\N{COMBININ G ACUTE ACCENT}<300>\N{HIRAGANA ITERATION MARK}<300>\N{KATAKANA LETTER SMALL A}\ N{KATAKANA ITERATION MARK}\N{HALFWIDTH KATAKANA LETTER WO}\N{HALFWIDTH KATAKANA LETTER N}<300>def<200>#•</data> 164 # *** what to do about theoretical combos of chars? i.e. hiragana + accent
165 #<data>•abc<200>\N{HIRAGANA LETTER SMALL A}<300>\N{HIRAGANA LETTER VU}\N{COMBINI NG ACUTE ACCENT}<300>\N{HIRAGANA ITERATION MARK}<300>\N{KATAKANA LETTER SMALL A} \N{KATAKANA ITERATION MARK}\N{HALFWIDTH KATAKANA LETTER WO}\N{HALFWIDTH KATAKANA LETTER N}<300>def<200>#•</data>
166
167 # test normalization/dictionary handling of halfwidth katakana: same dictionary phrase in fullwidth and halfwidth
168 <data>•芽キャベツ<400>芽キャベツ<400></data>
169
170 # more Japanese tests
171 # TODO: Currently, U+30FC and other characters (script=common) in the Hiragana
172 # and the Katakana block are not treated correctly. Enable this later.
173 #<data>•どー<400>せ<400>日本語<400>を<400>勉強<400>する<400>理由<400>について<400> •て<400>こと<400> は<400>我<400>でも<400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data>
174 <data>•日本語<400>を<400>勉強<400>する<400>理由<400>について<400> •て<400>こと<400>は<400>我<400>でも <400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data>
175
176 # Testing of word boundary for dictionary word containing both kanji and kana
177 <data>•中だるみ<400>蔵王の森<400>ウ離島<400></data>
178
179 # Testing of Chinese segmentation (taken from a Chinese news article)
180 <data>•400<100>余<400>名<400>中央<400>委员<400>和<400>中央<400>候补<400>委员<400>都<400>领<400> 到了<400>“•推荐<400>票<400>”•,•有<400>资格<400>在<400>200<100>多<400>名<400>符合<400>条件<400>的 <400>63<100>岁<400>以下<400>中共<400>正<400>部<400>级<400>干部<400>中<400>,•选出<400>他们<400>属 意<400>的<400>中央<400>政治局<400>委员<400>以<400>向<400>政治局<400>常委<400>会<400>举荐<400>。•</da ta>
165 181
166 # Words with interior formatting characters 182 # Words with interior formatting characters
167 <data>•def\N{COMBINING ACUTE ACCENT}\N{SYRIAC ABBREVIATION MARK}ghi<200> •</data> 183 <data>•def\N{COMBINING ACUTE ACCENT}\N{SYRIAC ABBREVIATION MARK}ghi<200> •</data>
168 184
169 # to test for bug #4097779 185 # to test for bug #4097779
170 <data>•aa\N{COMBINING GRAVE ACCENT}a<200> •</data> 186 <data>•aa\N{COMBINING GRAVE ACCENT}a<200> •</data>
171 187
188 # fullwidth numeric, midletter characters etc should be treated like their halfwidth counterparts
189 <data>•ISN'T<200> •19<100>日<400></data>
172 190
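Review note: the new comments at lines 167 and 188 both rely on fullwidth and halfwidth forms being treated like their ordinary counterparts. The patch's own mechanism is not visible in this diff, so the sketch below only illustrates the equivalence using Unicode NFKC normalization from the Python standard library. The helper name fold_width is hypothetical, and NFKC folds other compatibility characters too, so it is broader than what the break rules may actually do.

```python
import unicodedata


def fold_width(text: str) -> str:
    """Fold fullwidth/halfwidth compatibility forms using NFKC.

    Illustration only: this is an assumption about how the equivalence
    could be checked, not the mechanism used by the patch under review.
    """
    return unicodedata.normalize("NFKC", text)


# Fullwidth "ISN'T" followed by fullwidth "19", written as escapes.
fullwidth_sample = "\uff29\uff33\uff2e\uff07\uff34 \uff11\uff19"

# Halfwidth katakana for "kyabetsu" (KI, small YA, HE + voiced mark, TU).
halfwidth_kana = "\uff77\uff6c\uff8d\uff9e\uff82"

print(fold_width(fullwidth_sample))
print(fold_width(halfwidth_kana) == "\u30ad\u30e3\u30d9\u30c4")
```

After folding, both spellings of a phrase compare equal, which is the property the "same dictionary phrase in fullwidth and halfwidth" test appears to depend on.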
173 # to test for bug #4098467 191 # to test for bug #4098467
174 # What follows is a string of Korean characters (I found it in the Yellow Pages 192 # What follows is a string of Korean characters (I found it in the Yellow Pages
175 # ad for the Korean Presbyterian Church of San Francisco, and I hope I transcribed 193 # ad for the Korean Presbyterian Church of San Francisco, and I hope I transcribed
176 # it correctly), first as precomposed syllables, and then as conjoining jamo. 194 # it correctly), first as precomposed syllables, and then as conjoining jamo.
177 # Both sequences should be semantically identical and break the same way. 195 # Both sequences should be semantically identical and break the same way.
178 # precomposed syllables... 196 # precomposed syllables...
179 <data>•\uc0c1\ud56d<200> •\ud55c\uc778<200> •\uc5f0\ud569<200> •\uc7a5\ub85c\uad 50\ud68c<200> •\u1109\u1161\u11bc\u1112\u1161\u11bc<200> •\u1112\u1161\u11ab\u11 0b\u1175\u11ab<200> •\u110b\u1167\u11ab\u1112\u1161\u11b8<200> •\u110c\u1161\u11 bc\u1105\u1169\u1100\u116d\u1112\u116c<200> •</data> 197 <data>•\uc0c1\ud56d<200> •\ud55c\uc778<200> •\uc5f0\ud569<200> •\uc7a5\ub85c\uad 50\ud68c<200> •\u1109\u1161\u11bc\u1112\u1161\u11bc<200> •\u1112\u1161\u11ab\u11 0b\u1175\u11ab<200> •\u110b\u1167\u11ab\u1112\u1161\u11b8<200> •\u110c\u1161\u11 bc\u1105\u1169\u1100\u116d\u1112\u116c<200> •</data>
180 198
181 <data>•abc<200>\u4e01<400>\u4e02<400>\u3005<200>\u4e03<400>\u4e03<400>abc<200> • </data> 199 # more Korean tests (Jamo not tested here, not counted as dictionary characters)
200 # Disable them now because we don't include a Korean dictionary.
201 #<data>•\ud55c\uad6d<200>\ub300\ud559\uad50<200>\uc790\uc5f0<200>\uacfc\ud559<20 0>\ub300\ud559<200>\ubb3c\ub9ac\ud559\uacfc<200></data>
202 #<data>•\ud604\uc7ac<200>\ub294<200> •\uac80\ucc30<200>\uc774<200> •\ubd84\uc2dd <200>\ud68c\uacc4<200>\ubb38\uc81c<200>\ub97c<200> •\uc870\uc0ac<200>\ud560<200> •\uac00\ub2a5\uc131<200>\uc740<200> •\uc5c6\ub2e4<200>\u002e•</data>
182 203
183 <data>•\u06c9\uc799\ufffa<200></data> 204 <data>•abc<200>\u4e01<400>\u4e02<400>\u3005<400>\u4e03\u4e03<400>abc<200> •</data>
205
206 <data>•\u06c9<200>\uc799<200>\ufffa•</data>
207
184 208
185 # 209 #
186 # Try some words from other scripts. 210 # Try some words from other scripts.
187 # 211 #
188 212
189 # Try some words from other scripts. 213 # Try some words from other scripts.
190 # Greek, Cyrillic, Hebrew, Arabic, Arabic, Georgian, Latin 214 # Greek, Cyrillic, Hebrew, Arabic, Arabic, Georgian, Latin
191 # 215 #
192 <data>•ΑΒΓ<200> •БВГ<200> •אבג֓<200> •ابت<200> •١٢٣<100> •\u10A0\u10A1\u10A2<200 > •ABC<200> •</data> 216 <data>•ΑΒΓ<200> •БВГ<200> •אבג֓<200> •ابت<200> •١٢٣<100> •\u10A0\u10A1\u10A2<200 > •ABC<200> •</data>
193 217
(...skipping 290 matching lines...)
484 # to test for bug #4098467 508 # to test for bug #4098467
485 # What follows is a string of Korean characters (I found it in the Yellow Pages 509 # What follows is a string of Korean characters (I found it in the Yellow Pages
486 # ad for the Korean Presbyterian Church of San Francisco, and I hope I transcribed 510 # ad for the Korean Presbyterian Church of San Francisco, and I hope I transcribed
487 # it correctly), first as precomposed syllables, and then as conjoining jamo. 511 # it correctly), first as precomposed syllables, and then as conjoining jamo.
488 # Both sequences should be semantically identical and break the same way. 512 # Both sequences should be semantically identical and break the same way.
489 # precomposed syllables... (I == Rich Gillam?) 513 # precomposed syllables... (I == Rich Gillam?)
490 # 514 #
491 <data>•\uc0c1•\ud56d •\ud55c•\uc778 •\uc5f0•\ud569 •\uc7a5•\ub85c•\uad50•\ud68c• </data> 515 <data>•\uc0c1•\ud56d •\ud55c•\uc778 •\uc5f0•\ud569 •\uc7a5•\ub85c•\uad50•\ud68c• </data>
492 516
493 # conjoining jamo... 517 # conjoining jamo...
494 # TODO: rules update needed 518 <data>•\u1109\u1161\u11bc•\u1112\u1161\u11bc •\u1112\u1161\u11ab•\u110b\u1175\u1 1ab •\u110b\u1167\u11ab•\u1112\u1161\u11b8 •\u110c\u1161\u11bc•\u1105\u1169•\u11 00\u116d•\u1112\u116c•</data>
495 #<data>•\u1109\u1161\u11bc•\u1112\u1161\u11bc •\u1112\u1161\u11ab•\u110b\u1175\u 11ab #•\u110b\u1167\u11ab•\u1112\u1161\u11b8 •\u110c\u1161\u11bc•\u1105\u1169•\u 1100\u116d•\u1112\u116c•</data>
496 519
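Review note: the comment block above says the precomposed and conjoining-jamo sequences "should be semantically identical". A quick way to sanity-check the transcription is to compare the two spellings under Unicode normalization. The sketch below uses only the first two words from the test data; the constant and function names are hypothetical, and it checks canonical equivalence of the text only, not the break positions.

```python
import unicodedata

# First two words of the test string, as precomposed Hangul syllables.
PRECOMPOSED = "\uc0c1\ud56d \ud55c\uc778"

# The same two words spelled with conjoining jamo, copied from the jamo data line.
JAMO = (
    "\u1109\u1161\u11bc\u1112\u1161\u11bc"
    " "
    "\u1112\u1161\u11ab\u110b\u1175\u11ab"
)


def canonically_equal(a: str, b: str) -> bool:
    """Return True when both strings have the same NFC form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(canonically_equal(PRECOMPOSED, JAMO))
```

Decomposing the precomposed form with NFD should likewise reproduce the jamo spelling, which makes it easy to regenerate one data line from the other.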
497 # to test for bug #4117554: Fullwidth .!? should be treated as postJwrd 520 # to test for bug #4117554: Fullwidth .!? should be treated as postJwrd
498 <data>•\u4e01\uff0e•\u4e02\uff01•\u4e03\uff1f•</data> 521 <data>•\u4e01\uff0e•\u4e02\uff01•\u4e03\uff1f•</data>
499 522
500 # Surrogate line break tests. 523 # Surrogate line break tests.
501 # 524 #
502 <data>•\u4e01•\ud840\udc01•\u4e02•abc •\ue000 •\udb80\udc01•</data> 525 <data>•\u4e01•\ud840\udc01•\u4e02•abc •\ue000 •\udb80\udc01•</data>
503 526
504 # Regression for bug 836 527 # Regression for bug 836
505 # Note: Unicode 5.1 changed this behavior 528 # Note: Unicode 5.1 changed this behavior
(...skipping 62 matching lines...)
568 591
569 # 592 #
570 # Trac ticket 5595 Test Case 593 # Trac ticket 5595 Test Case
571 <data>•บท<200>ที่๑พายุ<200>ไซโคลน<200>โด<200>โรธี<200>อาศัย<200>อยู่<200>ท่ามกลา ง<200>\ 594 <data>•บท<200>ที่๑พายุ<200>ไซโคลน<200>โด<200>โรธี<200>อาศัย<200>อยู่<200>ท่ามกลา ง<200>\
572 ทุ่งใหญ่<200>ใน<200>แคนซัส<200>กับ<200>ลุง<200>เฮ<200>นรี<200>ชาวไร่<200>และ<200 >ป้า<200>เอ็ม<200>\ 595 ทุ่งใหญ่<200>ใน<200>แคนซัส<200>กับ<200>ลุง<200>เฮ<200>นรี<200>ชาวไร่<200>และ<200 >ป้า<200>เอ็ม<200>\
573 ภรรยา<200>ชาวไร่<200>บ้าน<200>ของ<200>พวก<200>เขา<200>หลัง<200>เล็ก<200>เพราะ<20 0>ไม้<200>\ 596 ภรรยา<200>ชาวไร่<200>บ้าน<200>ของ<200>พวก<200>เขา<200>หลัง<200>เล็ก<200>เพราะ<20 0>ไม้<200>\
574 สร้าง<200>บ้าน<200>ต้อง<200>ขน<200>มา<200>ด้วย<200>เกวียน<200>เป็น<200>ระยะ<200> ทาง<200>หลาย<200>\ 597 สร้าง<200>บ้าน<200>ต้อง<200>ขน<200>มา<200>ด้วย<200>เกวียน<200>เป็น<200>ระยะ<200> ทาง<200>หลาย<200>\
575 ไมล์<200></data> 598 ไมล์<200></data>
576 599
577 600
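Review note: the part of the file header that defines the data syntax is skipped in this view. Judging only from the data lines visible above, "•" appears to mark an expected boundary and "<NNN>" an expected boundary that carries a numeric status (100 after digits, 200 after letters, 300 after kana, 400 after ideographs, which matches ICU's word-break status values). The sketch below is a minimal reader written under those assumptions; parse_data is a hypothetical helper, not the parser in rbbitst.cpp, and it handles only the \uXXXX, \UXXXXXXXX and \N{NAME} escapes seen in this diff.

```python
import re
import unicodedata

# One token per match: a boundary mark, a status tag, an escape, or any single character.
_TOKEN = re.compile(
    r"•|<(\d+)>|\\U[0-9A-Fa-f]{8}|\\u[0-9A-Fa-f]{4}|\\N\{[^}]+\}|.",
    re.S,
)


def parse_data(line: str):
    """Split one <data>...</data> line into (text, breaks).

    breaks maps a code point offset in text to the status value given
    in <NNN>, or to None when the boundary was written as a bare "•".
    """
    m = re.fullmatch(r"\s*<data>(.*)</data>\s*", line, re.S)
    if m is None:
        raise ValueError("not a <data> line")
    chars = []
    breaks = {}
    for tok in _TOKEN.finditer(m.group(1)):
        s = tok.group(0)
        if s == "•":
            breaks[len(chars)] = None
        elif tok.group(1) is not None:
            breaks[len(chars)] = int(tok.group(1))
        elif s.startswith(("\\u", "\\U")):
            # Surrogate escapes are kept as separate code units here.
            chars.append(chr(int(s[2:], 16)))
        elif s.startswith("\\N{"):
            chars.append(unicodedata.lookup(s[3:-1]))
        else:
            chars.append(s)
    return "".join(chars), breaks


text, breaks = parse_data("<data>•abc<200> •</data>")
print(repr(text), breaks)
```

A reader like this makes it straightforward to compare the expected offsets and status values against what a break iterator reports for the same text.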