icu46/source/test/testdata/rbbitst.txt - Issue 6370014: CJK segmentation patch for ICU 4.6...

Side by Side Diff: icu46/source/test/testdata/rbbitst.txt

Issue 6370014: CJK segmentation patch for ICU 4.6... (Closed) Base URL: svn://chrome-svn/chrome/trunk/deps/third_party/

Patch Set: Created 9 years, 10 months ago

OLD	NEW
1 # Copyright (c) 2001-2009 International Business Machines	1 # Copyright (c) 2001-2009 International Business Machines

2 # Corporation and others. All Rights Reserved.	2 # Corporation and others. All Rights Reserved.

3 #	3 #

4 # RBBI Test Data	4 # RBBI Test Data

5 #	5 #

6 # File: rbbitst.txt	6 # File: rbbitst.txt

7 #	7 #

8 # The format of this file looks vaguely like some kind of xml-ish markup,	8 # The format of this file looks vaguely like some kind of xml-ish markup,

9 # but it is NOT. The syntax is this..	9 # but it is NOT. The syntax is this..

10 #	10 #

(...skipping 143 matching lines...)
154 <data>•\uc5f0\ud569<200> •\uc7a5\ub85c\uad50\ud68c<200> •\u1109\u1161\u11bc\u1112\u1161\u11bc<200> •\u1112\u1161\u11ab\u110b\u1175\u11ab<200> •Hello<200>,• •how<200> •are<200> •you<200> •</data>	154 <data>•\uc5f0\ud569<200> •\uc7a5\ub85c\uad50\ud68c<200> •\u1109\u1161\u11bc\u1112\u1161\u11bc<200> •\u1112\u1161\u11ab\u110b\u1175\u11ab<200> •Hello<200>,• •how<200> •are<200> •you<200> •</data>

155	155

156	156

157 # Words containing non-BMP letters	157 # Words containing non-BMP letters

158 <data>•abc\U00010300<200> •abc\N{DESERET SMALL LETTER ENG}<200> •abc\N{MATHEMATICAL BOLD SMALL Z}<200> •abc\N{MATHEMATICAL SANS-SERIF BOLD ITALIC PI SYMBOL}<200> •</data>	158 <data>•abc\U00010300<200> •abc\N{DESERET SMALL LETTER ENG}<200> •abc\N{MATHEMATICAL BOLD SMALL Z}<200> •abc\N{MATHEMATICAL SANS-SERIF BOLD ITALIC PI SYMBOL}<200> •</data>

159	159

160 # Unassigned code points	160 # Unassigned code points

161 <data>•abc<200>\U0001D800•def<200>\U0001D3FF• •</data>	161 <data>•abc<200>\U0001D800•def<200>\U0001D3FF• •</data>

162	162

163 # Hiragana & Katakana stay together, but separates from each other and Latin.	163 # Hiragana & Katakana stay together, but separates from each other and Latin.

164 <data>•abc<200>\N{HIRAGANA LETTER SMALL A}<300>\N{HIRAGANA LETTER VU}\N{COMBINING ACUTE ACCENT}<300>\N{HIRAGANA ITERATION MARK}<300>\N{KATAKANA LETTER SMALL A}\N{KATAKANA ITERATION MARK}\N{HALFWIDTH KATAKANA LETTER WO}\N{HALFWIDTH KATAKANA LETTER N}<300>def<200>#•</data>	164 # *** what to do about theoretical combos of chars? i.e. hiragana + accent

	165 #<data>•abc<200>\N{HIRAGANA LETTER SMALL A}<300>\N{HIRAGANA LETTER VU}\N{COMBINING ACUTE ACCENT}<300>\N{HIRAGANA ITERATION MARK}<300>\N{KATAKANA LETTER SMALL A}\N{KATAKANA ITERATION MARK}\N{HALFWIDTH KATAKANA LETTER WO}\N{HALFWIDTH KATAKANA LETTER N}<300>def<200>#•</data>

	166

	167 # test normalization/dictionary handling of halfwidth katakana: same dictionary phrase in fullwidth and halfwidth

	168 <data>•芽キャベツ<400>芽キャﾍﾞツ<400></data>

	169

	170 # more Japanese tests

	171 # TODO: Currently, U+30FC and other characters (script=common) in the Hiragana

	172 # and the Katakana block are not treated correctly. Enable this later.

	173 #<data>•どー<400>せ<400>日本語<400>を<400>勉強<400>する<400>理由<400>について<400>　•て<400>こと<400>は<400>我<400>でも<400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data>

	174 <data>•日本語<400>を<400>勉強<400>する<400>理由<400>について<400>　•て<400>こと<400>は<400>我<400>でも<400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data>

	175

	176 # Testing of word boundary for dictionary word containing both kanji and kana

	177 <data>•中だるみ<400>蔵王の森<400>ウ離島<400></data>

	178

	179 # Testing of Chinese segmentation (taken from a Chinese news article)

	180 <data>•400<100>余<400>名<400>中央<400>委员<400>和<400>中央<400>候补<400>委员<400>都<400>领<400>到了<400>“•推荐<400>票<400>”•，•有<400>资格<400>在<400>200<100>多<400>名<400>符合<400>条件<400>的<400>63<100>岁<400>以下<400>中共<400>正<400>部<400>级<400>干部<400>中<400>，•选出<400>他们<400>属意<400>的<400>中央<400>政治局<400>委员<400>以<400>向<400>政治局<400>常委<400>会<400>举荐<400>。•</data>

165	181

166 # Words with interior formatting characters	182 # Words with interior formatting characters

167 <data>•def\N{COMBINING ACUTE ACCENT}\N{SYRIAC ABBREVIATION MARK}ghi<200> •</data>	183 <data>•def\N{COMBINING ACUTE ACCENT}\N{SYRIAC ABBREVIATION MARK}ghi<200> •</data>

168	184

169 # to test for bug #4097779	185 # to test for bug #4097779

170 <data>•aa\N{COMBINING GRAVE ACCENT}a<200> •</data>	186 <data>•aa\N{COMBINING GRAVE ACCENT}a<200> •</data>

171	187

	188 # fullwidth numeric, midletter characters etc should be treated like their halfwidth counterparts

	189 <data>•ＩＳＮ'Ｔ<200> •１９<100>日<400></data>

172	190

173 # to test for bug #4098467	191 # to test for bug #4098467

174 # What follows is a string of Korean characters (I found it in the Yellow Pages	192 # What follows is a string of Korean characters (I found it in the Yellow Pages

175 # ad for the Korean Presbyterian Church of San Francisco, and I hope I transcribed	193 # ad for the Korean Presbyterian Church of San Francisco, and I hope I transcribed

176 # it correctly), first as precomposed syllables, and then as conjoining jamo.	194 # it correctly), first as precomposed syllables, and then as conjoining jamo.

177 # Both sequences should be semantically identical and break the same way.	195 # Both sequences should be semantically identical and break the same way.

178 # precomposed syllables...	196 # precomposed syllables...

179 <data>•\uc0c1\ud56d<200> •\ud55c\uc778<200> •\uc5f0\ud569<200> •\uc7a5\ub85c\uad50\ud68c<200> •\u1109\u1161\u11bc\u1112\u1161\u11bc<200> •\u1112\u1161\u11ab\u110b\u1175\u11ab<200> •\u110b\u1167\u11ab\u1112\u1161\u11b8<200> •\u110c\u1161\u11bc\u1105\u1169\u1100\u116d\u1112\u116c<200> •</data>	197 <data>•\uc0c1\ud56d<200> •\ud55c\uc778<200> •\uc5f0\ud569<200> •\uc7a5\ub85c\uad50\ud68c<200> •\u1109\u1161\u11bc\u1112\u1161\u11bc<200> •\u1112\u1161\u11ab\u110b\u1175\u11ab<200> •\u110b\u1167\u11ab\u1112\u1161\u11b8<200> •\u110c\u1161\u11bc\u1105\u1169\u1100\u116d\u1112\u116c<200> •</data>

180	198

181 <data>•abc<200>\u4e01<400>\u4e02<400>\u3005<200>\u4e03<400>\u4e03<400>abc<200> •</data>	199 # more Korean tests (Jamo not tested here, not counted as dictionary characters)

	200 # Disable them now because we don't include a Korean dictionary.

	201 #<data>•\ud55c\uad6d<200>\ub300\ud559\uad50<200>\uc790\uc5f0<200>\uacfc\ud559<200>\ub300\ud559<200>\ubb3c\ub9ac\ud559\uacfc<200></data>

	202 #<data>•\ud604\uc7ac<200>\ub294<200> •\uac80\ucc30<200>\uc774<200> •\ubd84\uc2dd<200>\ud68c\uacc4<200>\ubb38\uc81c<200>\ub97c<200> •\uc870\uc0ac<200>\ud560<200> •\uac00\ub2a5\uc131<200>\uc740<200> •\uc5c6\ub2e4<200>\u002e•</data>

182	203

183 <data>•\u06c9\uc799\ufffa<200></data>	204 <data>•abc<200>\u4e01<400>\u4e02<400>\u3005<400>\u4e03\u4e03<400>abc<200> •</data>

	205

	206 <data>•\u06c9<200>\uc799<200>\ufffa•</data>

	207

184	208

185 #	209 #

186 # Try some words from other scripts.	210 # Try some words from other scripts.

187 #	211 #

188	212

189 # Try some words from other scripts.	213 # Try some words from other scripts.

190 # Greek, Cyrillic, Hebrew, Arabic, Arabic, Georgian, Latin	214 # Greek, Cyrillic, Hebrew, Arabic, Arabic, Georgian, Latin

191 #	215 #

192 <data>•ΑΒΓ<200> •БВГ<200> •אבג֓<200> •ابت<200> •١٢٣<100> •\u10A0\u10A1\u10A2<200> •ABC<200> •</data>	216 <data>•ΑΒΓ<200> •БВГ<200> •אבג֓<200> •ابت<200> •١٢٣<100> •\u10A0\u10A1\u10A2<200> •ABC<200> •</data>

193	217

(...skipping 290 matching lines...)
484 # to test for bug #4098467	508 # to test for bug #4098467

485 # What follows is a string of Korean characters (I found it in the Yellow Pages	509 # What follows is a string of Korean characters (I found it in the Yellow Pages

486 # ad for the Korean Presbyterian Church of San Francisco, and I hope I transcribed	510 # ad for the Korean Presbyterian Church of San Francisco, and I hope I transcribed

487 # it correctly), first as precomposed syllables, and then as conjoining jamo.	511 # it correctly), first as precomposed syllables, and then as conjoining jamo.

488 # Both sequences should be semantically identical and break the same way.	512 # Both sequences should be semantically identical and break the same way.

489 # precomposed syllables... (I == Rich Gillam?)	513 # precomposed syllables... (I == Rich Gillam?)

490 #	514 #

491 <data>•\uc0c1•\ud56d •\ud55c•\uc778 •\uc5f0•\ud569 •\uc7a5•\ub85c•\uad50•\ud68c•</data>	515 <data>•\uc0c1•\ud56d •\ud55c•\uc778 •\uc5f0•\ud569 •\uc7a5•\ub85c•\uad50•\ud68c•</data>

492	516

493 # conjoining jamo...	517 # conjoining jamo...

494 # TODO: rules update needed	518 <data>•\u1109\u1161\u11bc•\u1112\u1161\u11bc •\u1112\u1161\u11ab•\u110b\u1175\u11ab •\u110b\u1167\u11ab•\u1112\u1161\u11b8 •\u110c\u1161\u11bc•\u1105\u1169•\u1100\u116d•\u1112\u116c•</data>

495 #<data>•\u1109\u1161\u11bc•\u1112\u1161\u11bc •\u1112\u1161\u11ab•\u110b\u1175\u11ab #•\u110b\u1167\u11ab•\u1112\u1161\u11b8 •\u110c\u1161\u11bc•\u1105\u1169•\u1100\u116d•\u1112\u116c•</data>

496	519

497 # to test for bug #4117554: Fullwidth .!? should be treated as postJwrd	520 # to test for bug #4117554: Fullwidth .!? should be treated as postJwrd

498 <data>•\u4e01\uff0e•\u4e02\uff01•\u4e03\uff1f•</data>	521 <data>•\u4e01\uff0e•\u4e02\uff01•\u4e03\uff1f•</data>

499	522

500 # Surrogate line break tests.	523 # Surrogate line break tests.

501 #	524 #

502 <data>•\u4e01•\ud840\udc01•\u4e02•abc •\ue000 •\udb80\udc01•</data>	525 <data>•\u4e01•\ud840\udc01•\u4e02•abc •\ue000 •\udb80\udc01•</data>

503	526

504 # Regression for bug 836	527 # Regression for bug 836

505 # Note: Unicode 5.1 changed this behavior	528 # Note: Unicode 5.1 changed this behavior

(...skipping 62 matching lines...)
568	591

569 #	592 #

570 # Trac ticket 5595 Test Case	593 # Trac ticket 5595 Test Case

571 <data>•บท<200>ที่๑พายุ<200>ไซโคลน<200>โด<200>โรธี<200>อาศัย<200>อยู่<200>ท่ามกลาง<200>\	594 <data>•บท<200>ที่๑พายุ<200>ไซโคลน<200>โด<200>โรธี<200>อาศัย<200>อยู่<200>ท่ามกลาง<200>\

572 ทุ่งใหญ่<200>ใน<200>แคนซัส<200>กับ<200>ลุง<200>เฮ<200>นรี<200>ชาวไร่<200>และ<200>ป้า<200>เอ็ม<200>\	595 ทุ่งใหญ่<200>ใน<200>แคนซัส<200>กับ<200>ลุง<200>เฮ<200>นรี<200>ชาวไร่<200>และ<200>ป้า<200>เอ็ม<200>\

573 ภรรยา<200>ชาวไร่<200>บ้าน<200>ของ<200>พวก<200>เขา<200>หลัง<200>เล็ก<200>เพราะ<200>ไม้<200>\	596 ภรรยา<200>ชาวไร่<200>บ้าน<200>ของ<200>พวก<200>เขา<200>หลัง<200>เล็ก<200>เพราะ<200>ไม้<200>\

574 สร้าง<200>บ้าน<200>ต้อง<200>ขน<200>มา<200>ด้วย<200>เกวียน<200>เป็น<200>ระยะ<200>ทาง<200>หลาย<200>\	597 สร้าง<200>บ้าน<200>ต้อง<200>ขน<200>มา<200>ด้วย<200>เกวียน<200>เป็น<200>ระยะ<200>ทาง<200>หลาย<200>\

575 ไมล์<200></data>	598 ไมล์<200></data>

576	599

577	600
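
Reading aid for the <data> lines in this diff: a minimal Python sketch, not ICU code, of how the break markers could be turned into a list of expected boundaries. It assumes that "•" marks an expected break with rule status 0 and that "<NNN>" marks an expected break with rule status NNN. The name parse_expected_breaks is hypothetical. Escapes such as \uXXXX and \N{...} are left unexpanded, and offsets are counted in Python code points rather than the UTF-16 units ICU reports.

```python
import re

# Markers as they appear inside <data>...</data> in this diff:
#   "•"      expected break (rule status assumed to be 0)
#   "<NNN>"  expected break with rule status NNN (e.g. 100, 200, 300, 400)
MARKER = re.compile(r"•|<(\d+)>")


def parse_expected_breaks(data):
    """Split a <data> payload into plain text and expected break positions.

    Returns (text, breaks), where breaks is a list of (offset, status)
    pairs and offset indexes into the returned text.
    """
    text_parts = []
    breaks = []
    offset = 0  # length of the plain text collected so far
    pos = 0     # read position in the raw payload
    for match in MARKER.finditer(data):
        chunk = data[pos:match.start()]
        text_parts.append(chunk)
        offset += len(chunk)
        status = int(match.group(1)) if match.group(1) else 0
        breaks.append((offset, status))
        pos = match.end()
    text_parts.append(data[pos:])  # trailing text after the last marker
    return "".join(text_parts), breaks


# Payload of new line 177 (dictionary words mixing kanji and kana).
text, breaks = parse_expected_breaks("•中だるみ<400>蔵王の森<400>ウ離島<400>")
print(text)    # 中だるみ蔵王の森ウ離島
print(breaks)  # [(0, 0), (4, 400), (8, 400), (11, 400)]
```

Read this way, new line 177 expects three words, each ending at a boundary tagged 400, after an untagged boundary at offset 0.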
