| OLD | NEW |
| 1 # Copyright (c) 2001-2009 International Business Machines | 1 # Copyright (c) 2001-2009 International Business Machines |
| 2 # Corporation and others. All Rights Reserved. | 2 # Corporation and others. All Rights Reserved. |
| 3 # | 3 # |
| 4 # RBBI Test Data | 4 # RBBI Test Data |
| 5 # | 5 # |
| 6 # File: rbbitst.txt | 6 # File: rbbitst.txt |
| 7 # | 7 # |
| 8 # The format of this file looks vaguely like some kind of xml-ish markup, | 8 # The format of this file looks vaguely like some kind of xml-ish markup, |
| 9 # but it is NOT. The syntax is this.. | 9 # but it is NOT. The syntax is this.. |
| 10 # | 10 # |
| (...skipping 143 matching lines...) |
| 154 <data>•\uc5f0\ud569<200> •\uc7a5\ub85c\uad50\ud68c<200> •\u1109\u1161\u11bc\u1112\u1161\u11bc<200> •\u1112\u1161\u11ab\u110b\u1175\u11ab<200> •Hello<200>,• •how<200> •are<200> •you<200> •</data> | 154 <data>•\uc5f0\ud569<200> •\uc7a5\ub85c\uad50\ud68c<200> •\u1109\u1161\u11bc\u1112\u1161\u11bc<200> •\u1112\u1161\u11ab\u110b\u1175\u11ab<200> •Hello<200>,• •how<200> •are<200> •you<200> •</data> |
| 155 | 155 |
| 156 | 156 |
| 157 # Words containing non-BMP letters | 157 # Words containing non-BMP letters |
| 158 <data>•abc\U00010300<200> •abc\N{DESERET SMALL LETTER ENG}<200> •abc\N{MATHEMATICAL BOLD SMALL Z}<200> •abc\N{MATHEMATICAL SANS-SERIF BOLD ITALIC PI SYMBOL}<200> •</data> | 158 <data>•abc\U00010300<200> •abc\N{DESERET SMALL LETTER ENG}<200> •abc\N{MATHEMATICAL BOLD SMALL Z}<200> •abc\N{MATHEMATICAL SANS-SERIF BOLD ITALIC PI SYMBOL}<200> •</data> |
| 159 | 159 |
| 160 # Unassigned code points | 160 # Unassigned code points |
| 161 <data>•abc<200>\U0001D800•def<200>\U0001D3FF• •</data> | 161 <data>•abc<200>\U0001D800•def<200>\U0001D3FF• •</data> |
| 162 | 162 |
| 163 # Hiragana & Katakana stay together, but separates from each other and Latin. | 163 # Hiragana & Katakana stay together, but separates from each other and Latin. |
| 164 <data>•abc<200>\N{HIRAGANA LETTER SMALL A}<300>\N{HIRAGANA LETTER VU}\N{COMBINING ACUTE ACCENT}<300>\N{HIRAGANA ITERATION MARK}<300>\N{KATAKANA LETTER SMALL A}\N{KATAKANA ITERATION MARK}\N{HALFWIDTH KATAKANA LETTER WO}\N{HALFWIDTH KATAKANA LETTER N}<300>def<200>#•</data> | 164 # *** what to do about theoretical combos of chars? i.e. hiragana + accent |
| 165 #<data>•abc<200>\N{HIRAGANA LETTER SMALL A}<300>\N{HIRAGANA LETTER VU}\N{COMBINING ACUTE ACCENT}<300>\N{HIRAGANA ITERATION MARK}<300>\N{KATAKANA LETTER SMALL A}\N{KATAKANA ITERATION MARK}\N{HALFWIDTH KATAKANA LETTER WO}\N{HALFWIDTH KATAKANA LETTER N}<300>def<200>#•</data> |
| 166 |
| 167 # test normalization/dictionary handling of halfwidth katakana: same dictionary phrase in fullwidth and halfwidth |
| 168 <data>•芽キャベツ<400>芽キャベツ<400></data> |
| 169 |
| 170 # more Japanese tests |
| 171 # TODO: Currently, U+30FC and other characters (script=common) in the Hiragana |
| 172 # and the Katakana block are not treated correctly. Enable this later. |
| 173 #<data>•どー<400>せ<400>日本語<400>を<400>勉強<400>する<400>理由<400>について<400> •て<400>こと<400>は<400>我<400>でも<400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data> |
| 174 <data>•日本語<400>を<400>勉強<400>する<400>理由<400>について<400> •て<400>こと<400>は<400>我<400>でも<400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data> |
| 175 |
| 176 # Testing of word boundary for dictionary word containing both kanji and kana |
| 177 <data>•中だるみ<400>蔵王の森<400>ウ離島<400></data> |
| 178 |
| 179 # Testing of Chinese segmentation (taken from a Chinese news article) |
| 180 <data>•400<100>余<400>名<400>中央<400>委员<400>和<400>中央<400>候补<400>委员<400>都<400>领<400>到了<400>“•推荐<400>票<400>”•,•有<400>资格<400>在<400>200<100>多<400>名<400>符合<400>条件<400>的<400>63<100>岁<400>以下<400>中共<400>正<400>部<400>级<400>干部<400>中<400>,•选出<400>他们<400>属意<400>的<400>中央<400>政治局<400>委员<400>以<400>向<400>政治局<400>常委<400>会<400>举荐<400>。•</data> |
| 165 | 181 |
| 166 # Words with interior formatting characters | 182 # Words with interior formatting characters |
| 167 <data>•def\N{COMBINING ACUTE ACCENT}\N{SYRIAC ABBREVIATION MARK}ghi<200> •</data> | 183 <data>•def\N{COMBINING ACUTE ACCENT}\N{SYRIAC ABBREVIATION MARK}ghi<200> •</data> |
| 168 | 184 |
| 169 # to test for bug #4097779 | 185 # to test for bug #4097779 |
| 170 <data>•aa\N{COMBINING GRAVE ACCENT}a<200> •</data> | 186 <data>•aa\N{COMBINING GRAVE ACCENT}a<200> •</data> |
| 171 | 187 |
| 188 # fullwidth numeric, midletter characters etc should be treated like their halfwidth counterparts |
| 189 <data>•ISN'T<200> •19<100>日<400></data> |
| 172 | 190 |
| 173 # to test for bug #4098467 | 191 # to test for bug #4098467 |
| 174 # What follows is a string of Korean characters (I found it in the Yellow Pages | 192 # What follows is a string of Korean characters (I found it in the Yellow Pages |
| 175 # ad for the Korean Presbyterian Church of San Francisco, and I hope I transcribed | 193 # ad for the Korean Presbyterian Church of San Francisco, and I hope I transcribed |
| 176 # it correctly), first as precomposed syllables, and then as conjoining jamo. | 194 # it correctly), first as precomposed syllables, and then as conjoining jamo. |
| 177 # Both sequences should be semantically identical and break the same way. | 195 # Both sequences should be semantically identical and break the same way. |
| 178 # precomposed syllables... | 196 # precomposed syllables... |
| 179 <data>•\uc0c1\ud56d<200> •\ud55c\uc778<200> •\uc5f0\ud569<200> •\uc7a5\ub85c\uad50\ud68c<200> •\u1109\u1161\u11bc\u1112\u1161\u11bc<200> •\u1112\u1161\u11ab\u110b\u1175\u11ab<200> •\u110b\u1167\u11ab\u1112\u1161\u11b8<200> •\u110c\u1161\u11bc\u1105\u1169\u1100\u116d\u1112\u116c<200> •</data> | 197 <data>•\uc0c1\ud56d<200> •\ud55c\uc778<200> •\uc5f0\ud569<200> •\uc7a5\ub85c\uad50\ud68c<200> •\u1109\u1161\u11bc\u1112\u1161\u11bc<200> •\u1112\u1161\u11ab\u110b\u1175\u11ab<200> •\u110b\u1167\u11ab\u1112\u1161\u11b8<200> •\u110c\u1161\u11bc\u1105\u1169\u1100\u116d\u1112\u116c<200> •</data> |
| 180 | 198 |
| 181 <data>•abc<200>\u4e01<400>\u4e02<400>\u3005<200>\u4e03<400>\u4e03<400>abc<200> •</data> | 199 # more Korean tests (Jamo not tested here, not counted as dictionary characters) |
| 200 # Disable them now because we don't include a Korean dictionary. |
| 201 #<data>•\ud55c\uad6d<200>\ub300\ud559\uad50<200>\uc790\uc5f0<200>\uacfc\ud559<200>\ub300\ud559<200>\ubb3c\ub9ac\ud559\uacfc<200></data> |
| 202 #<data>•\ud604\uc7ac<200>\ub294<200> •\uac80\ucc30<200>\uc774<200> •\ubd84\uc2dd
<200>\ud68c\uacc4<200>\ubb38\uc81c<200>\ub97c<200> •\uc870\uc0ac<200>\ud560<200>
•\uac00\ub2a5\uc131<200>\uc740<200> •\uc5c6\ub2e4<200>\u002e•</data> |
| 182 | 203 |
| 183 <data>•\u06c9\uc799\ufffa<200></data> | 204 <data>•abc<200>\u4e01<400>\u4e02<400>\u3005<400>\u4e03\u4e03<400>abc<200> •</data> |
| 205 |
| 206 <data>•\u06c9<200>\uc799<200>\ufffa•</data> |
| 207 |
| 184 | 208 |
| 185 # | 209 # |
| 186 # Try some words from other scripts. | 210 # Try some words from other scripts. |
| 187 # | 211 # |
| 188 | 212 |
| 189 # Try some words from other scripts. | 213 # Try some words from other scripts. |
| 190 # Greek, Cyrillic, Hebrew, Arabic, Arabic, Georgian, Latin | 214 # Greek, Cyrillic, Hebrew, Arabic, Arabic, Georgian, Latin |
| 191 # | 215 # |
| 192 <data>•ΑΒΓ<200> •БВГ<200> •אבג֓<200> •ابت<200> •١٢٣<100> •\u10A0\u10A1\u10A2<200> •ABC<200> •</data> | 216 <data>•ΑΒΓ<200> •БВГ<200> •אבג֓<200> •ابت<200> •١٢٣<100> •\u10A0\u10A1\u10A2<200> •ABC<200> •</data> |
| 193 | 217 |
| (...skipping 290 matching lines...) |
| 484 # to test for bug #4098467 | 508 # to test for bug #4098467 |
| 485 # What follows is a string of Korean characters (I found it in the Yellow Pages | 509 # What follows is a string of Korean characters (I found it in the Yellow Pages |
| 486 # ad for the Korean Presbyterian Church of San Francisco, and I hope I transcribed | 510 # ad for the Korean Presbyterian Church of San Francisco, and I hope I transcribed |
| 487 # it correctly), first as precomposed syllables, and then as conjoining jamo. | 511 # it correctly), first as precomposed syllables, and then as conjoining jamo. |
| 488 # Both sequences should be semantically identical and break the same way. | 512 # Both sequences should be semantically identical and break the same way. |
| 489 # precomposed syllables... (I == Rich Gillam?) | 513 # precomposed syllables... (I == Rich Gillam?) |
| 490 # | 514 # |
| 491 <data>•\uc0c1•\ud56d •\ud55c•\uc778 •\uc5f0•\ud569 •\uc7a5•\ub85c•\uad50•\ud68c•</data> | 515 <data>•\uc0c1•\ud56d •\ud55c•\uc778 •\uc5f0•\ud569 •\uc7a5•\ub85c•\uad50•\ud68c•</data> |
| 492 | 516 |
| 493 # conjoining jamo... | 517 # conjoining jamo... |
| 494 # TODO: rules update needed | 518 <data>•\u1109\u1161\u11bc•\u1112\u1161\u11bc •\u1112\u1161\u11ab•\u110b\u1175\u11ab •\u110b\u1167\u11ab•\u1112\u1161\u11b8 •\u110c\u1161\u11bc•\u1105\u1169•\u1100\u116d•\u1112\u116c•</data> |
| 495 #<data>•\u1109\u1161\u11bc•\u1112\u1161\u11bc •\u1112\u1161\u11ab•\u110b\u1175\u11ab #•\u110b\u1167\u11ab•\u1112\u1161\u11b8 •\u110c\u1161\u11bc•\u1105\u1169•\u1100\u116d•\u1112\u116c•</data> | |
| 496 | 519 |
| 497 # to test for bug #4117554: Fullwidth .!? should be treated as postJwrd | 520 # to test for bug #4117554: Fullwidth .!? should be treated as postJwrd |
| 498 <data>•\u4e01\uff0e•\u4e02\uff01•\u4e03\uff1f•</data> | 521 <data>•\u4e01\uff0e•\u4e02\uff01•\u4e03\uff1f•</data> |
| 499 | 522 |
| 500 # Surrogate line break tests. | 523 # Surrogate line break tests. |
| 501 # | 524 # |
| 502 <data>•\u4e01•\ud840\udc01•\u4e02•abc •\ue000 •\udb80\udc01•</data> | 525 <data>•\u4e01•\ud840\udc01•\u4e02•abc •\ue000 •\udb80\udc01•</data> |
| 503 | 526 |
| 504 # Regression for bug 836 | 527 # Regression for bug 836 |
| 505 # Note: Unicode 5.1 changed this behavior | 528 # Note: Unicode 5.1 changed this behavior |
| (...skipping 62 matching lines...) |
| 568 | 591 |
| 569 # | 592 # |
| 570 # Trac ticket 5595 Test Case | 593 # Trac ticket 5595 Test Case |
| 571 <data>•บท<200>ที่๑พายุ<200>ไซโคลน<200>โด<200>โรธี<200>อาศัย<200>อยู่<200>ท่ามกลาง<200>\ | 594 <data>•บท<200>ที่๑พายุ<200>ไซโคลน<200>โด<200>โรธี<200>อาศัย<200>อยู่<200>ท่ามกลาง<200>\ |
| 572 ทุ่งใหญ่<200>ใน<200>แคนซัส<200>กับ<200>ลุง<200>เฮ<200>นรี<200>ชาวไร่<200>และ<200>ป้า<200>เอ็ม<200>\ | 595 ทุ่งใหญ่<200>ใน<200>แคนซัส<200>กับ<200>ลุง<200>เฮ<200>นรี<200>ชาวไร่<200>และ<200>ป้า<200>เอ็ม<200>\ |
| 573 ภรรยา<200>ชาวไร่<200>บ้าน<200>ของ<200>พวก<200>เขา<200>หลัง<200>เล็ก<200>เพราะ<200>ไม้<200>\ | 596 ภรรยา<200>ชาวไร่<200>บ้าน<200>ของ<200>พวก<200>เขา<200>หลัง<200>เล็ก<200>เพราะ<200>ไม้<200>\ |
| 574 สร้าง<200>บ้าน<200>ต้อง<200>ขน<200>มา<200>ด้วย<200>เกวียน<200>เป็น<200>ระยะ<200>ทาง<200>หลาย<200>\ | 597 สร้าง<200>บ้าน<200>ต้อง<200>ขน<200>มา<200>ด้วย<200>เกวียน<200>เป็น<200>ระยะ<200>ทาง<200>หลาย<200>\ |
| 575 ไมล์<200></data> | 598 ไมล์<200></data> |
| 576 | 599 |
| 577 | 600 |