OLD | NEW |
1 # Copyright (c) 2001-2009 International Business Machines | 1 # Copyright (c) 2001-2009 International Business Machines |
2 # Corporation and others. All Rights Reserved. | 2 # Corporation and others. All Rights Reserved. |
3 # | 3 # |
4 # RBBI Test Data | 4 # RBBI Test Data |
5 # | 5 # |
6 # File: rbbitst.txt | 6 # File: rbbitst.txt |
7 # | 7 # |
8 # The format of this file looks vaguely like some kind of xml-ish markup, | 8 # The format of this file looks vaguely like some kind of xml-ish markup, |
9 # but it is NOT. The syntax is this.. | 9 # but it is NOT. The syntax is this.. |
10 # | 10 # |
(...skipping 143 matching lines...)
154 <data>•\uc5f0\ud569<200> •\uc7a5\ub85c\uad50\ud68c<200> •\u1109\u1161\u11bc\u111
2\u1161\u11bc<200> •\u1112\u1161\u11ab\u110b\u1175\u11ab<200> •Hello<200>,• •how
<200> •are<200> •you<200> •</data> | 154 <data>•\uc5f0\ud569<200> •\uc7a5\ub85c\uad50\ud68c<200> •\u1109\u1161\u11bc\u111
2\u1161\u11bc<200> •\u1112\u1161\u11ab\u110b\u1175\u11ab<200> •Hello<200>,• •how
<200> •are<200> •you<200> •</data> |
155 | 155 |
156 | 156 |
157 # Words containing non-BMP letters | 157 # Words containing non-BMP letters |
158 <data>•abc\U00010300<200> •abc\N{DESERET SMALL LETTER ENG}<200> •abc\N{MATHEMATI
CAL BOLD SMALL Z}<200> •abc\N{MATHEMATICAL SANS-SERIF BOLD ITALIC PI SYMBOL}<200
> •</data> | 158 <data>•abc\U00010300<200> •abc\N{DESERET SMALL LETTER ENG}<200> •abc\N{MATHEMATI
CAL BOLD SMALL Z}<200> •abc\N{MATHEMATICAL SANS-SERIF BOLD ITALIC PI SYMBOL}<200
> •</data> |
159 | 159 |
160 # Unassigned code points | 160 # Unassigned code points |
161 <data>•abc<200>\U0001D800•def<200>\U0001D3FF• •</data> | 161 <data>•abc<200>\U0001D800•def<200>\U0001D3FF• •</data> |
162 | 162 |
163 # Hiragana & Katakana stay together, but separates from each other and Latin. | 163 # Hiragana & Katakana stay together, but separates from each other and Latin. |
164 <data>•abc<200>\N{HIRAGANA LETTER SMALL A}<300>\N{HIRAGANA LETTER VU}\N{COMBININ
G ACUTE ACCENT}<300>\N{HIRAGANA ITERATION MARK}<300>\N{KATAKANA LETTER SMALL A}\
N{KATAKANA ITERATION MARK}\N{HALFWIDTH KATAKANA LETTER WO}\N{HALFWIDTH KATAKANA
LETTER N}<300>def<200>#•</data> | 164 # *** what to do about theoretical combos of chars? i.e. hiragana + accent |
| 165 #<data>•abc<200>\N{HIRAGANA LETTER SMALL A}<300>\N{HIRAGANA LETTER VU}\N{COMBINI
NG ACUTE ACCENT}<300>\N{HIRAGANA ITERATION MARK}<300>\N{KATAKANA LETTER SMALL A}
\N{KATAKANA ITERATION MARK}\N{HALFWIDTH KATAKANA LETTER WO}\N{HALFWIDTH KATAKANA
LETTER N}<300>def<200>#•</data> |
| 166 |
| 167 # test normalization/dictionary handling of halfwidth katakana: same dictionary
phrase in fullwidth and halfwidth |
| 168 <data>•芽キャベツ<400>芽キャベツ<400></data> |
| 169 |
| 170 # more Japanese tests |
| 171 # TODO: Currently, U+30FC and other characters (script=common) in the Hiragana |
| 172 # and the Katakana block are not treated correctly. Enable this later. |
| 173 #<data>•どー<400>せ<400>日本語<400>を<400>勉強<400>する<400>理由<400>について<400> •て<400>こと<400>
は<400>我<400>でも<400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data> |
| 174 <data>•日本語<400>を<400>勉強<400>する<400>理由<400>について<400> •て<400>こと<400>は<400>我<400>でも
<400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data> |
| 175 |
| 176 # Testing of word boundary for dictionary word containing both kanji and kana |
| 177 <data>•中だるみ<400>蔵王の森<400>ウ離島<400></data> |
| 178 |
| 179 # Testing of Chinese segmentation (taken from a Chinese news article) |
| 180 <data>•400<100>余<400>名<400>中央<400>委员<400>和<400>中央<400>候补<400>委员<400>都<400>领<400>
到了<400>“•推荐<400>票<400>”•,•有<400>资格<400>在<400>200<100>多<400>名<400>符合<400>条件<400>的
<400>63<100>岁<400>以下<400>中共<400>正<400>部<400>级<400>干部<400>中<400>,•选出<400>他们<400>属
意<400>的<400>中央<400>政治局<400>委员<400>以<400>向<400>政治局<400>常委<400>会<400>举荐<400>。•</da
ta> |
165 | 181 |
166 # Words with interior formatting characters | 182 # Words with interior formatting characters |
167 <data>•def\N{COMBINING ACUTE ACCENT}\N{SYRIAC ABBREVIATION MARK}ghi<200> •</data
> | 183 <data>•def\N{COMBINING ACUTE ACCENT}\N{SYRIAC ABBREVIATION MARK}ghi<200> •</data
> |
168 | 184 |
169 # to test for bug #4097779 | 185 # to test for bug #4097779 |
170 <data>•aa\N{COMBINING GRAVE ACCENT}a<200> •</data> | 186 <data>•aa\N{COMBINING GRAVE ACCENT}a<200> •</data> |
171 | 187 |
| 188 # fullwidth numeric, midletter characters etc should be treated like their halfw
idth counterparts |
| 189 <data>•ISN'T<200> •19<100>日<400></data> |
172 | 190 |
173 # to test for bug #4098467 | 191 # to test for bug #4098467 |
174 # What follows is a string of Korean characters (I found it in the Yellow P
ages | 192 # What follows is a string of Korean characters (I found it in the Yellow P
ages |
175 # ad for the Korean Presbyterian Church of San Francisco, and I hope I tran
scribed | 193 # ad for the Korean Presbyterian Church of San Francisco, and I hope I tran
scribed |
176 # it correctly), first as precomposed syllables, and then as conjoining jam
o. | 194 # it correctly), first as precomposed syllables, and then as conjoining jam
o. |
177 # Both sequences should be semantically identical and break the same way. | 195 # Both sequences should be semantically identical and break the same way. |
178 # precomposed syllables... | 196 # precomposed syllables... |
179 <data>•\uc0c1\ud56d<200> •\ud55c\uc778<200> •\uc5f0\ud569<200> •\uc7a5\ub85c\uad
50\ud68c<200> •\u1109\u1161\u11bc\u1112\u1161\u11bc<200> •\u1112\u1161\u11ab\u11
0b\u1175\u11ab<200> •\u110b\u1167\u11ab\u1112\u1161\u11b8<200> •\u110c\u1161\u11
bc\u1105\u1169\u1100\u116d\u1112\u116c<200> •</data> | 197 <data>•\uc0c1\ud56d<200> •\ud55c\uc778<200> •\uc5f0\ud569<200> •\uc7a5\ub85c\uad
50\ud68c<200> •\u1109\u1161\u11bc\u1112\u1161\u11bc<200> •\u1112\u1161\u11ab\u11
0b\u1175\u11ab<200> •\u110b\u1167\u11ab\u1112\u1161\u11b8<200> •\u110c\u1161\u11
bc\u1105\u1169\u1100\u116d\u1112\u116c<200> •</data> |
180 | 198 |
181 <data>•abc<200>\u4e01<400>\u4e02<400>\u3005<200>\u4e03<400>\u4e03<400>abc<200> •
</data> | 199 # more Korean tests (Jamo not tested here, not counted as dictionary characters) |
| 200 # Disable them now because we don't include a Korean dictionary. |
| 201 #<data>•\ud55c\uad6d<200>\ub300\ud559\uad50<200>\uc790\uc5f0<200>\uacfc\ud559<20
0>\ub300\ud559<200>\ubb3c\ub9ac\ud559\uacfc<200></data> |
| 202 #<data>•\ud604\uc7ac<200>\ub294<200> •\uac80\ucc30<200>\uc774<200> •\ubd84\uc2dd
<200>\ud68c\uacc4<200>\ubb38\uc81c<200>\ub97c<200> •\uc870\uc0ac<200>\ud560<200>
•\uac00\ub2a5\uc131<200>\uc740<200> •\uc5c6\ub2e4<200>\u002e•</data> |
182 | 203 |
183 <data>•\u06c9\uc799\ufffa<200></data> | 204 <data>•abc<200>\u4e01<400>\u4e02<400>\u3005<400>\u4e03\u4e03<400>abc<200> •</dat
a> |
| 205 |
| 206 <data>•\u06c9<200>\uc799<200>\ufffa•</data> |
| 207 |
184 | 208 |
185 # | 209 # |
186 # Try some words from other scripts. | 210 # Try some words from other scripts. |
187 # | 211 # |
188 | 212 |
189 # Try some words from other scripts. | 213 # Try some words from other scripts. |
190 # Greek, Cyrillic, Hebrew, Arabic, Arabic, Georgian, Latin | 214 # Greek, Cyrillic, Hebrew, Arabic, Arabic, Georgian, Latin |
191 # | 215 # |
192 <data>•ΑΒΓ<200> •БВГ<200> •אבג֓<200> •ابت<200> •١٢٣<100> •\u10A0\u10A1\u10A2<200
> •ABC<200> •</data> | 216 <data>•ΑΒΓ<200> •БВГ<200> •אבג֓<200> •ابت<200> •١٢٣<100> •\u10A0\u10A1\u10A2<200
> •ABC<200> •</data> |
193 | 217 |
(...skipping 290 matching lines...)
484 # to test for bug #4098467 | 508 # to test for bug #4098467 |
485 # What follows is a string of Korean characters (I found it in the Yellow P
ages | 509 # What follows is a string of Korean characters (I found it in the Yellow P
ages |
486 # ad for the Korean Presbyterian Church of San Francisco, and I hope I tran
scribed | 510 # ad for the Korean Presbyterian Church of San Francisco, and I hope I tran
scribed |
487 # it correctly), first as precomposed syllables, and then as conjoining jam
o. | 511 # it correctly), first as precomposed syllables, and then as conjoining jam
o. |
488 # Both sequences should be semantically identical and break the same way. | 512 # Both sequences should be semantically identical and break the same way. |
489 # precomposed syllables... (I == Rich Gillam?) | 513 # precomposed syllables... (I == Rich Gillam?) |
490 # | 514 # |
491 <data>•\uc0c1•\ud56d •\ud55c•\uc778 •\uc5f0•\ud569 •\uc7a5•\ub85c•\uad50•\ud68c•
</data> | 515 <data>•\uc0c1•\ud56d •\ud55c•\uc778 •\uc5f0•\ud569 •\uc7a5•\ub85c•\uad50•\ud68c•
</data> |
492 | 516 |
493 # conjoining jamo... | 517 # conjoining jamo... |
494 # TODO: rules update needed | 518 <data>•\u1109\u1161\u11bc•\u1112\u1161\u11bc •\u1112\u1161\u11ab•\u110b\u1175\u1
1ab •\u110b\u1167\u11ab•\u1112\u1161\u11b8 •\u110c\u1161\u11bc•\u1105\u1169•\u11
00\u116d•\u1112\u116c•</data> |
495 #<data>•\u1109\u1161\u11bc•\u1112\u1161\u11bc •\u1112\u1161\u11ab•\u110b\u1175\u
11ab #•\u110b\u1167\u11ab•\u1112\u1161\u11b8 •\u110c\u1161\u11bc•\u1105\u1169•\u
1100\u116d•\u1112\u116c•</data> | |
496 | 519 |
497 # to test for bug #4117554: Fullwidth .!? should be treated as postJwrd | 520 # to test for bug #4117554: Fullwidth .!? should be treated as postJwrd |
498 <data>•\u4e01\uff0e•\u4e02\uff01•\u4e03\uff1f•</data> | 521 <data>•\u4e01\uff0e•\u4e02\uff01•\u4e03\uff1f•</data> |
499 | 522 |
500 # Surrogate line break tests. | 523 # Surrogate line break tests. |
501 # | 524 # |
502 <data>•\u4e01•\ud840\udc01•\u4e02•abc •\ue000 •\udb80\udc01•</data> | 525 <data>•\u4e01•\ud840\udc01•\u4e02•abc •\ue000 •\udb80\udc01•</data> |
503 | 526 |
504 # Regression for bug 836 | 527 # Regression for bug 836 |
505 # Note: Unicode 5.1 changed this behavior | 528 # Note: Unicode 5.1 changed this behavior |
(...skipping 62 matching lines...)
568 | 591 |
569 # | 592 # |
570 # Trac ticket 5595 Test Case | 593 # Trac ticket 5595 Test Case |
571 <data>•บท<200>ที่๑พายุ<200>ไซโคลน<200>โด<200>โรธี<200>อาศัย<200>อยู่<200>ท่ามกลา
ง<200>\ | 594 <data>•บท<200>ที่๑พายุ<200>ไซโคลน<200>โด<200>โรธี<200>อาศัย<200>อยู่<200>ท่ามกลา
ง<200>\ |
572 ทุ่งใหญ่<200>ใน<200>แคนซัส<200>กับ<200>ลุง<200>เฮ<200>นรี<200>ชาวไร่<200>และ<200
>ป้า<200>เอ็ม<200>\ | 595 ทุ่งใหญ่<200>ใน<200>แคนซัส<200>กับ<200>ลุง<200>เฮ<200>นรี<200>ชาวไร่<200>และ<200
>ป้า<200>เอ็ม<200>\ |
573 ภรรยา<200>ชาวไร่<200>บ้าน<200>ของ<200>พวก<200>เขา<200>หลัง<200>เล็ก<200>เพราะ<20
0>ไม้<200>\ | 596 ภรรยา<200>ชาวไร่<200>บ้าน<200>ของ<200>พวก<200>เขา<200>หลัง<200>เล็ก<200>เพราะ<20
0>ไม้<200>\ |
574 สร้าง<200>บ้าน<200>ต้อง<200>ขน<200>มา<200>ด้วย<200>เกวียน<200>เป็น<200>ระยะ<200>
ทาง<200>หลาย<200>\ | 597 สร้าง<200>บ้าน<200>ต้อง<200>ขน<200>มา<200>ด้วย<200>เกวียน<200>เป็น<200>ระยะ<200>
ทาง<200>หลาย<200>\ |
575 ไมล์<200></data> | 598 ไมล์<200></data> |
576 | 599 |
577 | 600 |
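
For readers checking the expected break positions in the <data> lines above, here is a minimal sketch of how one such line could be decoded. It assumes that a bullet (•) marks a break with status 0 and that <nnn> marks a break with rule status nnn; that reading comes from the file's header comment, which is elided in this diff. The function name parse_data is hypothetical and is not part of the ICU test harness. Escapes such as \uXXXX and \N{...} are left unexpanded, so offsets count raw characters rather than code points of the unescaped text.

```python
import re

# Assumed marker syntax: a bullet is a break with status 0,
# <nnn> is a break with rule status nnn. The <data> tags themselves
# never match because they contain no digits.
MARKER = re.compile(r"•|<(\d+)>")


def parse_data(line):
    """Return (plain_text, [(offset, status), ...]) for one <data>...</data> line.

    Hypothetical helper for illustration only. Escapes are NOT expanded,
    so offsets are positions in the raw text with the markers removed.
    """
    body = re.fullmatch(r"\s*<data>(.*)</data>\s*", line, re.DOTALL)
    if body is None:
        raise ValueError("not a <data> line")
    raw = body.group(1)

    pieces = []   # text fragments between markers
    breaks = []   # (offset into the marker-free text, status)
    length = 0    # running length of the marker-free text
    pos = 0
    for m in MARKER.finditer(raw):
        fragment = raw[pos:m.start()]
        pieces.append(fragment)
        length += len(fragment)
        status = int(m.group(1)) if m.group(1) else 0
        breaks.append((length, status))
        pos = m.end()
    pieces.append(raw[pos:])
    return "".join(pieces), breaks


if __name__ == "__main__":
    text, breaks = parse_data("<data>•abc<200> •def<200> •</data>")
    print(repr(text), breaks)
```

A comparison against a real break iterator would additionally need to unescape the text first and handle the trailing-backslash line continuations seen in the Thai test case; both are omitted here.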