source/test/testdata/rbbitst.txt - Issue 845603002: Update ICU to 54.1 step 1

Side by Side Diff: source/test/testdata/rbbitst.txt

Issue 845603002: Update ICU to 54.1 step 1 (Closed) Base URL: https://chromium.googlesource.com/chromium/deps/icu.git@master

Patch Set: remove unused directories Created 5 years, 11 months ago


OLD	NEW
1 # Copyright (c) 2001-2013 International Business Machines	1 # Copyright (c) 2001-2014 International Business Machines

2 # Corporation and others. All Rights Reserved.	2 # Corporation and others. All Rights Reserved.

3 #	3 #

4 # RBBI Test Data	4 # RBBI Test Data

5 #	5 #

6 # File: rbbitst.txt	6 # File: rbbitst.txt

7 #	7 #

8 # The format of this file looks vaguely like some kind of xml-ish markup,	8 # The format of this file looks vaguely like some kind of xml-ish markup,

9 # but it is NOT. The syntax is this..	9 # but it is NOT. The syntax is this..

10 #	10 #

11 # <word> any following data is for word break testing	11 # <word> any following data is for word break testing

(...skipping 14 matching lines...)
26 # There are two copies of this file in the source repository,	26 # There are two copies of this file in the source repository,

27 # [ICU4C] source/test/testdata/rbbitst.txt	27 # [ICU4C] source/test/testdata/rbbitst.txt

28 # [ICU4J] main/tests/core/src/com/ibm/icu/dev/test/rbbi/rbbitst.txt	28 # [ICU4J] main/tests/core/src/com/ibm/icu/dev/test/rbbi/rbbitst.txt

29 #	29 #

30 # ICU4C's copy is the master. If any changes are made to ICU4J's copy, make sure they	30 # ICU4C's copy is the master. If any changes are made to ICU4J's copy, make sure they

31 # are merged back into ICU4C's copy of the file, lest they get overwritten later.	31 # are merged back into ICU4C's copy of the file, lest they get overwritten later.

32 # TODO: figure out how to have a single copy of the file for use by both C and Java.	32 # TODO: figure out how to have a single copy of the file for use by both C and Java.

33	33

34	34

35 # Temp debugging tests	35 # Temp debugging tests

36 <word>	36 <sent>

37 <data>•Isn't<200></data>	37 <data>•\u00c0.•</data>

38 <char>

39 <data>•\U00010020•\U00010000\N{COMBINING MACRON}•</data>

40	38

	39 #<data>•\u5487\u67ff\ue591\u5017\u61b3\u60a1\u9510\u8165:"JAVA\u821c\u8165\u7fc8\u51ce\u306d,\u2494\u56d8\u4ec0\u60b1\u8560\u51ba\u611d\u57b6\u2510\u5d46".\u2029•</data>

41 ########################################################################################	40 ########################################################################################

42 #	41 #

43 #	42 #

44 # G r a p h e m e C l u s t e r T e s t s	43 # G r a p h e m e C l u s t e r T e s t s

45 #	44 #

46 #	45 #

47 ##########################################################################################	46 ##########################################################################################

48 <char>	47 <char>

49	48

50 <data>•a•b•c• •,•\u0666•</data> # Quick Test	49 <data>•a•b•c• •,•\u0666•</data> # Quick Test

(...skipping 121 matching lines...)
172 <data>•abc<200>\U0001D800•def<200>\U0001D3FF• •</data>	171 <data>•abc<200>\U0001D800•def<200>\U0001D3FF• •</data>

173	172

174 # Hiragana & Katakana stay together, but separates from each other and Latin.	173 # Hiragana & Katakana stay together, but separates from each other and Latin.

175 # *** what to do about theoretical combos of chars? i.e. hiragana + accent	174 # *** what to do about theoretical combos of chars? i.e. hiragana + accent

176 #<data>•abc<200>\N{HIRAGANA LETTER SMALL A}<400>\N{HIRAGANA LETTER VU}\N{COMBINING ACUTE ACCENT}<400>\N{HIRAGANA ITERATION MARK}<400>\N{KATAKANA LETTER SMALL A}\N{KATAKANA ITERATION MARK}\N{HALFWIDTH KATAKANA LETTER WO}\N{HALFWIDTH KATAKANA LETTER N}<400>def<200>#•</data>	175 #<data>•abc<200>\N{HIRAGANA LETTER SMALL A}<400>\N{HIRAGANA LETTER VU}\N{COMBINING ACUTE ACCENT}<400>\N{HIRAGANA ITERATION MARK}<400>\N{KATAKANA LETTER SMALL A}\N{KATAKANA ITERATION MARK}\N{HALFWIDTH KATAKANA LETTER WO}\N{HALFWIDTH KATAKANA LETTER N}<400>def<200>#•</data>

177	176

178 # test normalization/dictionary handling of halfwidth katakana: same dictionary phrase in fullwidth and halfwidth	177 # test normalization/dictionary handling of halfwidth katakana: same dictionary phrase in fullwidth and halfwidth

179 <data>•芽キャベツ<400>芽キャﾍﾞツ<400></data>	178 <data>•芽キャベツ<400>芽キャﾍﾞツ<400></data>

180	179

181 # more Japanese tests	180 # more Japanese tests

182 # TODO: Currently, U+30FC and other characters (script=common) in the Hiragana	181 # TODO: some script=common characters in the Hiragana and the Katakana block may not be treated correctly

183 # and the Katakana block are not treated correctly. Enable this later.	182 # (was formerly true for U+30FC); need to check and fix if so.

184 #<data>•どー<400>せ<400>日本語<400>を<400>勉強<400>する<400>理由<400>について<400>　•て<400>こと<400>は<400>我<400>でも<400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data>	183 #<data>•どー<400>せ<400>日本語<400>を<400>勉強<400>する<400>理由<400>について<400>　•て<400>こと<400>は<400>我<400>でも<400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data>

185 <data>•日本語<400>を<400>勉強<400>する<400>理由<400>について<400>　•て<400>こと<400>は<400>我<400>でも<400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data>	184 <data>•日本語<400>を<400>勉強<400>する<400>理由<400>について<400>　•て<400>こと<400>は<400>我<400>でも<400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data>

186	185

187 # Testing of word boundary for dictionary word containing both kanji and kana	186 # Testing of word boundary for dictionary word containing both kanji and kana

188 <data>•中だるみ<400>蔵王の森<400>ウ離島<400></data>	187 <data>•中だるみ<400>蔵王の森<400>ウ離島<400></data>

189	188

190 # Testing of Chinese segmentation (taken from a Chinese news article)	189 # Testing of Chinese segmentation (taken from a Chinese news article)

191 <data>•400<100>余<400>名<400>中央<400>委员<400>和<400>中央<400>候补<400>委员<400>都<400>领<400>到了<400>“•推荐<400>票<400>”•，•有<400>资格<400>在<400>200<100>多<400>名<400>符合<400>条件<400>的<400>63<100>岁<400>以下<400>中共<400>正<400>部<400>级<400>干部<400>中<400>，•选出<400>他们<400>属意<400>的<400>中央<400>政治局<400>委员<400>以<400>向<400>政治局<400>常委<400>会<400>举荐<400>。•</data>	190 <data>•400<100>余<400>名<400>中央<400>委员<400>和<400>中央<400>候补<400>委员<400>都<400>领<400>到了<400>“•推荐<400>票<400>”•，•有<400>资格<400>在<400>200<100>多<400>名<400>符合<400>条件<400>的<400>63<100>岁<400>以下<400>中共<400>正<400>部<400>级<400>干部<400>中<400>，•选出<400>他们<400>属意<400>的<400>中央<400>政治局<400>委员<400>以<400>向<400>政治局<400>常委<400>会<400>举荐<400>。•</data>

192	191

193 # Words with interior formatting characters	192 # Words with interior formatting characters

(...skipping 392 matching lines...)
586 <title>	585 <title>

587 <data>•Here •is •a •short •sample •sentence. •And •another.•</data>	586 <data>•Here •is •a •short •sample •sentence. •And •another.•</data>

588 <data>•HERE •IS •A •SHORT •SAMPLE •SENTENCE. •AND •ANOTHER.•</data>	587 <data>•HERE •IS •A •SHORT •SAMPLE •SENTENCE. •AND •ANOTHER.•</data>

589 <data>• •Start •and •end •with •spaces •</data>	588 <data>• •Start •and •end •with •spaces •</data>

590 <data>•Include 123 456 ^& •some 54332 •numbers 4445•abc123•abc •ending 1223 •</data>	589 <data>•Include 123 456 ^& •some 54332 •numbers 4445•abc123•abc •ending 1223 •</data>

591	590

592 <data>•Combining\u0301 \u0301•ma\u0306rks •bye •</data>	591 <data>•Combining\u0301 \u0301•ma\u0306rks •bye •</data>

593 <data>•123 •Start •with •a •number.•</data>	592 <data>•123 •Start •with •a •number.•</data>

594	593

595 <data>•'•start •with •a •case-•ignorable •cha'r'a'cter•</data>	594 <data>•'•start •with •a •case-•ignorable •cha'r'a'cter•</data>

596	595 <data>•' '' •start •with •case-•ignorable & •case-•insensitive •cha'r'a'cter•</data>

	596 <data>• ''•aaa' •bbb '•ccc' '•ddd''' '''•eee '''•fff''' •ggg ''•</data>

	597 # Note: apostrophe is case-ignorable. space is not cased.

597	598

598 ##########################################################################################	599 ##########################################################################################

599 #	600 #

600 # Thai Tests	601 # Thai Tests

601 #	602 #

602 ##########################################################################################	603 ##########################################################################################

603 <locale th>	604 <locale th>

604 <word>	605 <word>

605 #	606 #

606 # Test data originally from the test code source file	607 # Test data originally from the test code source file

(...skipping 77 matching lines...)
684 \u0e41\u0e25\u0e30•\	685 \u0e41\u0e25\u0e30•\

685 \u0e40\u0e03\u0e35•\	686 \u0e40\u0e03\u0e35•\

686 \u0e22\u0e07•\	687 \u0e22\u0e07•\

687 \u0e43\u0e2b\u0e21\u0e48•</data>	688 \u0e43\u0e2b\u0e21\u0e48•</data>

688	689

689 # Test for #10296	690 # Test for #10296

690 <line>	691 <line>

691 <data>•ใช•มั้ย•</data>	692 <data>•ใช•มั้ย•</data>

692 <data>•มั๊ยล่ะ•ที่รัก•</data>	693 <data>•มั๊ยล่ะ•ที่รัก•</data>

693	694

	695 # Test for #10593

	696 <line>

	697 <data>•เล่น•ผ่าน•ทาง•บลูทูธ•บน•อุปกรณ์•</data>

	698

	699 # Test for city names #10691

	700 <line>

	701 <data>•ไป•ที่•ซานฟรานซิสโก•</data>

	702

	703 # Test for #10630, #10631

	704 <line>

	705 <data>•แท็ก•แอปพลิเคชัน•เป็น•พิเศษ•</data>

	706

694 ##########################################################################################	707 ##########################################################################################

695 #	708 #

696 # Lao Tests	709 # Lao Tests

697 #	710 #

698 ##########################################################################################	711 ##########################################################################################

699 <locale en>	712 <locale en>

700 # Basic check for #7647	713 # Basic check for #7647

701 <line>	714 <line>

702 <data>•ສະບາຍດີ•</data>	715 <data>•ສະບາຍດີ•</data>

703 <data>•ດີ•ຂອບໃຈ•</data>	716 <data>•ດີ•ຂອບໃຈ•</data>

704 <data>•ເຈົ້າ•ເວົ້າ•ພາສາ•ອັງກິດ•ໄດ້•ບໍ່•</data>	717 <data>•ເຈົ້າ•ເວົ້າ•ພາສາ•ອັງກິດ•ໄດ້•ບໍ່•</data>

705 <data>•ກະລຸນາ•ເວົ້າ•ຊ້າ•ໆ•</data>	718 <data>•ກະລຸນາ•ເວົ້າ•ຊ້າ•ໆ•</data>

706	719

707 ##########################################################################################	720 ##########################################################################################

708 #	721 #

	722 # Burmese/Myanmar Tests

	723 #

	724 ##########################################################################################

	725 <locale en>

	726 # Basic sanity check for #10326 (some text from http://www.unicode.org/udhr/d/udhr_mya.txt)

	727 <line>

	728 <data>•လူ•တိုင်း•သည် •တူညီ •လွတ်လပ်•သော •ဂုဏ်•သိ•က္•ခါ•ဖြ•င့် •လည်းကောင်း၊ •</data>

	729 <data>•တူညီ•လွတ်လပ်•သော •အ•ခွ•င့်•အရေး•များ•ဖြ•င့် •လည်းကောင်း၊ •မွေး•ဖွား•လာ•သူများ •ဖြစ်သည်။•</data>

	730 <data>•ထို•သူ•တို့၌ •ပိုင်းခြား •ဝေဖန်•တတ်•သော •ဉာဏ်•နှ•င့် •ကျ•င့်•ဝတ် •သိတတ်•သော •စိတ်•တို့•ရှိ•ကြ၍ •</data>

	731 <data>•ထို•သူ•တို့သည် •အချင်းချင်း •မေတ္တာ•ထား၍ •ဆက်ဆံ•ကျ•င့်•သုံး•</data>

	732

	733 ##########################################################################################

	734 #

709 # Khmer Tests	735 # Khmer Tests

710 #	736 #

711 ##########################################################################################	737 ##########################################################################################

712	738

713 # Test data originally from http://bugs.icu-project.org/trac/search?q=r30327	739 # Test data originally from http://bugs.icu-project.org/trac/search?q=r30327

714 # from the file testdata/wordsegments.txt	740 # from the file testdata/wordsegments.txt

715 <locale en>	741 <locale en>

716 <word>	742 <word>

717	743

718 <data>•តើ<200>លោក<200>មក<200>ពី<200>ប្រទេស<200>ណា<200></data>	744 <data>•តើ<200>លោក<200>មក<200>ពី<200>ប្រទេស<200>ណា<200></data>

(...skipping 70 matching lines...)
789 <data>•abc/\u05D9 •def•</data>	815 <data>•abc/\u05D9 •def•</data>

790 <data>•\u05E7\u05D7/\u05D9 •\u05DE\u05E2\u05D9\u05DC•</data>	816 <data>•\u05E7\u05D7/\u05D9 •\u05DE\u05E2\u05D9\u05DC•</data>

791 <data>•\u05D3\u05E8\u05D5\u05E9\u05D9\u05DD •\u05E9\u05D7\u05E7\u05E0\u05D9\u05DD/\u05D9\u05D5\u05EA•</data>	817 <data>•\u05D3\u05E8\u05D5\u05E9\u05D9\u05DD •\u05E9\u05D7\u05E7\u05E0\u05D9\u05DD/\u05D9\u05D5\u05EA•</data>

792	818

793	819

794 <locale root>	820 <locale root>

795 <word>	821 <word>

796 <data>•私<400>達<400>に<400>一<400>〇<400>〇〇<400>の<400>コンピュータ<400>が<400>ある<400>。<0>奈々<400>は<400>ワード<400>で<400>ある<400>。•</data>	822 <data>•私<400>達<400>に<400>一<400>〇<400>〇〇<400>の<400>コンピュータ<400>が<400>ある<400>。<0>奈々<400>は<400>ワード<400>で<400>ある<400>。•</data>

797 # The following test is for #10300	823 # The following test is for #10300

798 <data>•例えば<400>オーストラリア<400>。•</data>	824 <data>•例えば<400>オーストラリア<400>。•</data>

	825 # The following test is for #10571

	826 <data>•一部<400>の<400>地域<400>では<400>、<0>ブラジル<400>、<0>インドネシア<400>、<0>オーストリア<400>、<0>ニュージーランド<400>で<400>ある<400>。•</data>

799	827

800 # UBreakIteratorType UBRK_SENTENCE, Locale "el"	828 # UBreakIteratorType UBRK_SENTENCE, Locale "el"

801 # Add break after Greek question mark (cldrbug #2069).	829 # Add break after Greek question mark (cldrbug #2069).

802 # "\u0391\u03B2, \u03B3\u03B4; \u0395 \u03B6\u03B7\u037E \u0398 \u03B9\u03BA. "	830 # "\u0391\u03B2, \u03B3\u03B4; \u0395 \u03B6\u03B7\u037E \u0398 \u03B9\u03BA. "

803 # "\u039B\u03BC \u03BD\u03BE! \u039F\u03C0, \u03A1\u03C2? \u03A3"	831 # "\u039B\u03BC \u03BD\u03BE! \u039F\u03C0, \u03A1\u03C2? \u03A3"

804 # which is "Αβ, γδ; Ε ζη; Θ ικ. Λμ νξ! Οπ, Ρς? Σ"	832 # which is "Αβ, γδ; Ε ζη; Θ ικ. Λμ νξ! Οπ, Ρς? Σ"

805	833

806 <locale root>	834 <locale root>

807 <sent>	835 <sent>

808 <data>•Αβ, γδ; Ε ζη; Θ ικ. •Λμ νξ! •Οπ, Ρς? •Σ<100></data>	836 <data>•Αβ, γδ; Ε ζη; Θ ικ. •Λμ νξ! •Οπ, Ρς? •Σ<100></data>

809	837

810 <locale el>	838 <locale el>

811 <sent>	839 <sent>

812 <data>•Αβ, γδ; •Ε ζη; •Θ ικ. •Λμ νξ! •Οπ, Ρς? •Σ<100></data>	840 <data>•Αβ, γδ; •Ε ζη; •Θ ικ. •Λμ νξ! •Οπ, Ρς? •Σ<100></data>

813	841

814 # UBreakIteratorType UBRK_WORD, Locale "en_US_POSIX"	842 # UBreakIteratorType UBRK_WORD, Locale "en_US_POSIX"

815 # Words don't include colon or period (cldrbug #1969).	843 # Words don't include colon or period (cldrbug #1969).

816	844

817 <locale en_US>	845 <locale en_US>

818 <word>	846 <word>

819 <data>•Can't<200> •have<200> •breaks<200> •in<200> •xx:yy<200> •or<200> •struct.field<200> \	847 <data>•Can't<200> •have<200> •breaks<200> •in<200> •xx:yy<200> •or<200> •struct.field<200> \

820 •for<200> •CS<200>-•types<200>.•</data>	848 •for<200> •CS<200>-•types<200>.•</data>

821 <data>•\uFF92\uFF76\uFF9E<400> •</data>	849 <data>•\uFF92\uFF76\uFF9E<400> •</data>

822	850

823 <locale en_US_POSIX>	851 <locale en_US_POSIX>

824 <word>	852 <word>

825 <data>•Can't<200> •have<200> •breaks<200> •in<200> •xx:yy<200> •or<200> •struct<200>.•field<200> \	853 <data>•Can't<200> •have<200> •breaks<200> •in<200> •xx<200>:•yy<200> •or<200> •struct<200>.•field<200> \

826 •for<200> •CS<200>-•types<200>.•</data>	854 •for<200> •CS<200>-•types<200>.•</data>

827 <data>•\u06c9<200>\uc799\ufffa•</data>	855 <data>•\u06c9<200>\uc799\ufffa•</data>

828 <data>•\uFF92\uFF76\uFF9E<400> •</data>	856 <data>•\uFF92\uFF76\uFF9E<400> •</data>

829	857

830	858

831 # UBreakIteratorType UBRK_CHARACTER, Locale "th"	859 # UBreakIteratorType UBRK_CHARACTER, Locale "th"

832 # Clusters should not include spacing Thai/Lao vowels (prefix or postfix), except for [SARA] AM (cldrbug #2161).	860 # Clusters should not include spacing Thai/Lao vowels (prefix or postfix), except for [SARA] AM (cldrbug #2161).

833 # Update: As of Unicode 6.1 root has same behavior as th for this.	861 # Update: As of Unicode 6.1 root has same behavior as th for this.

834 #	862 #

835 # "\u0E01\u0E23\u0E30\u0E17\u0E48\u0E2D\u0E21\u0E23\u0E08\u0E19\u0E32 "	863 # "\u0E01\u0E23\u0E30\u0E17\u0E48\u0E2D\u0E21\u0E23\u0E08\u0E19\u0E32 "

(...skipping 27 matching lines...)
863	891

864 <data>•abc •- •def •abc •-def •abc- •def •</data> # With ASCII hyphen	892 <data>•abc •- •def •abc •-def •abc- •def •</data> # With ASCII hyphen

865 <data>•abc •‐ •def •abc •‐def •abc‐ •def •</data> # With Unicode u2010 hyphen	893 <data>•abc •‐ •def •abc •‐def •abc‐ •def •</data> # With Unicode u2010 hyphen

866	894

867 # Test for #10176 (in fi)	895 # Test for #10176 (in fi)

868 <line>	896 <line>

869 <data>•abc/•s •def•</data>	897 <data>•abc/•s •def•</data>

870 <data>•abc/\u05D9 •def•</data>	898 <data>•abc/\u05D9 •def•</data>

871 <data>•\u05E7\u05D7/\u05D9 •\u05DE\u05E2\u05D9\u05DC•</data>	899 <data>•\u05E7\u05D7/\u05D9 •\u05DE\u05E2\u05D9\u05DC•</data>

872 <data>•\u05D3\u05E8\u05D5\u05E9\u05D9\u05DD •\u05E9\u05D7\u05E7\u05E0\u05D9\u05DD/\u05D9\u05D5\u05EA•</data>	900 <data>•\u05D3\u05E8\u05D5\u05E9\u05D9\u05DD •\u05E9\u05D7\u05E7\u05E0\u05D9\u05DD/\u05D9\u05D5\u05EA•</data>
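
The `<data>` lines in this diff encode both the test text and the expected break positions. As a rough illustration, and not ICU's actual test harness, the Python sketch below decodes one such line. It assumes, based only on the lines visible above, that `•` marks a boundary with rule status 0, that `<NNN>` marks a boundary with rule status NNN, and that `\uXXXX`, `\UXXXXXXXX`, and `\N{NAME}` are character escapes. The names `parse_data_line` and `TOKEN` are hypothetical, offsets are counted in code points rather than UTF-16 units, and line continuations with a trailing `\` are not handled.

```python
import re
import unicodedata

# Token grammar inferred from the <data> lines shown in the diff (an assumption,
# not the authoritative syntax, which sits in the skipped header lines).
TOKEN = re.compile(
    r"(?P<bullet>•)"                  # boundary, rule status 0
    r"|<(?P<status>\d+)>"             # boundary with an explicit rule status
    r"|\\U(?P<u8>[0-9A-Fa-f]{8})"     # \UXXXXXXXX escape
    r"|\\u(?P<u4>[0-9A-Fa-f]{4})"     # \uXXXX escape
    r"|\\N\{(?P<name>[^}]+)\}"        # \N{CHARACTER NAME} escape
    r"|(?P<char>.)",                  # any other character is literal text
    re.DOTALL,
)


def parse_data_line(line):
    """Return (text, breaks) for one single-line <data>...</data> entry.

    breaks is a list of (offset, status) pairs; offsets are code point
    indices into text.
    """
    m = re.fullmatch(r"\s*<data>(.*)</data>\s*(?:#.*)?", line, re.DOTALL)
    if m is None:
        raise ValueError("not a single-line <data> entry")
    chars = []
    breaks = []
    for tok in TOKEN.finditer(m.group(1)):
        if tok.group("bullet"):
            breaks.append((len(chars), 0))
        elif tok.group("status") is not None:
            breaks.append((len(chars), int(tok.group("status"))))
        elif tok.group("u8"):
            chars.append(chr(int(tok.group("u8"), 16)))
        elif tok.group("u4"):
            chars.append(chr(int(tok.group("u4"), 16)))
        elif tok.group("name"):
            chars.append(unicodedata.lookup(tok.group("name")))
        else:
            chars.append(tok.group("char"))
    return "".join(chars), breaks
```

For the old debugging entry `<data>•Isn't<200></data>`, this returns the text `Isn't` with boundaries `[(0, 0), (5, 200)]`, meaning one word that starts at offset 0 and ends at offset 5 with status 200.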
