Side by Side Diff: source/test/testdata/rbbitst.txt

Issue 845603002: Update ICU to 54.1 step 1 (Closed) Base URL: https://chromium.googlesource.com/chromium/deps/icu.git@master
Patch Set: remove unused directories Created 5 years, 11 months ago
OLD | NEW
1 # Copyright (c) 2001-2013 International Business Machines 1 # Copyright (c) 2001-2014 International Business Machines
2 # Corporation and others. All Rights Reserved. 2 # Corporation and others. All Rights Reserved.
3 # 3 #
4 # RBBI Test Data 4 # RBBI Test Data
5 # 5 #
6 # File: rbbitst.txt 6 # File: rbbitst.txt
7 # 7 #
8 # The format of this file looks vaguely like some kind of xml-ish markup, 8 # The format of this file looks vaguely like some kind of xml-ish markup,
9 # but it is NOT. The syntax is this.. 9 # but it is NOT. The syntax is this..
10 # 10 #
11 # <word> any following data is for word break testing 11 # <word> any following data is for word break testing
(...skipping 14 matching lines...)
26 # There are two copies of this file in the source repository, 26 # There are two copies of this file in the source repository,
27 # [ICU4C] source/test/testdata/rbbitst.txt 27 # [ICU4C] source/test/testdata/rbbitst.txt
28 # [ICU4J] main/tests/core/src/com/ibm/icu/dev/test/rbbi/rbbitst.txt 28 # [ICU4J] main/tests/core/src/com/ibm/icu/dev/test/rbbi/rbbitst.txt
29 # 29 #
30 # ICU4C's copy is the master. If any changes are made to ICU4J's copy, make sure they 30 # ICU4C's copy is the master. If any changes are made to ICU4J's copy, make sure they
31 # are merged back into ICU4C's copy of the file, lest they get overwritten later. 31 # are merged back into ICU4C's copy of the file, lest they get overwritten later.
32 # TODO: figure out how to have a single copy of the file for use by both C and Java. 32 # TODO: figure out how to have a single copy of the file for use by both C and Java.
33 33
34 34
35 # Temp debugging tests 35 # Temp debugging tests
36 <word> 36 <sent>
37 <data>•Isn't<200></data> 37 <data>•\u00c0.•</data>
38 <char>
39 <data>•\U00010020•\U00010000\N{COMBINING MACRON}•</data>
40 38
39 #<data>•\u5487\u67ff\ue591\u5017\u61b3\u60a1\u9510\u8165:"JAVA\u821c\u8165\u7fc8\u51ce\u306d,\u2494\u56d8\u4ec0\u60b1\u8560\u51ba\u611d\u57b6\u2510\u5d46".\u2029•</data>
41 ######################################################################################## 40 ########################################################################################
42 # 41 #
43 # 42 #
44 # G r a p h e m e C l u s t e r T e s t s 43 # G r a p h e m e C l u s t e r T e s t s
45 # 44 #
46 # 45 #
47 ########################################################################################## 46 ##########################################################################################
48 <char> 47 <char>
49 48
50 <data>•a•b•c• •,•\u0666•</data> # Quick Test 49 <data>•a•b•c• •,•\u0666•</data> # Quick Test
(...skipping 121 matching lines...)
172 <data>•abc<200>\U0001D800•def<200>\U0001D3FF• •</data> 171 <data>•abc<200>\U0001D800•def<200>\U0001D3FF• •</data>
173 172
174 # Hiragana & Katakana stay together, but separates from each other and Latin. 173 # Hiragana & Katakana stay together, but separates from each other and Latin.
175 # *** what to do about theoretical combos of chars? i.e. hiragana + accent 174 # *** what to do about theoretical combos of chars? i.e. hiragana + accent
176 #<data>•abc<200>\N{HIRAGANA LETTER SMALL A}<400>\N{HIRAGANA LETTER VU}\N{COMBINING ACUTE ACCENT}<400>\N{HIRAGANA ITERATION MARK}<400>\N{KATAKANA LETTER SMALL A}\N{KATAKANA ITERATION MARK}\N{HALFWIDTH KATAKANA LETTER WO}\N{HALFWIDTH KATAKANA LETTER N}<400>def<200>#•</data> 175 #<data>•abc<200>\N{HIRAGANA LETTER SMALL A}<400>\N{HIRAGANA LETTER VU}\N{COMBINING ACUTE ACCENT}<400>\N{HIRAGANA ITERATION MARK}<400>\N{KATAKANA LETTER SMALL A}\N{KATAKANA ITERATION MARK}\N{HALFWIDTH KATAKANA LETTER WO}\N{HALFWIDTH KATAKANA LETTER N}<400>def<200>#•</data>
177 176
178 # test normalization/dictionary handling of halfwidth katakana: same dictionary phrase in fullwidth and halfwidth 177 # test normalization/dictionary handling of halfwidth katakana: same dictionary phrase in fullwidth and halfwidth
179 <data>•芽キャベツ<400>芽キャベツ<400></data> 178 <data>•芽キャベツ<400>芽キャベツ<400></data>
180 179
181 # more Japanese tests 180 # more Japanese tests
182 # TODO: Currently, U+30FC and other characters (script=common) in the Hiragana 181 # TODO: some script=common characters in the Hiragana and the Katakana block may not be treated correctly
183 # and the Katakana block are not treated correctly. Enable this later. 182 # (was formerly true for U+30FC); need to check and fix if so.
184 #<data>•どー<400>せ<400>日本語<400>を<400>勉強<400>する<400>理由<400>について<400> •て<400>こと<400> は<400>我<400>でも<400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data> 183 #<data>•どー<400>せ<400>日本語<400>を<400>勉強<400>する<400>理由<400>について<400> •て<400>こと<400> は<400>我<400>でも<400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data>
185 <data>•日本語<400>を<400>勉強<400>する<400>理由<400>について<400> •て<400>こと<400>は<400>我<400>でも <400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data> 184 <data>•日本語<400>を<400>勉強<400>する<400>理由<400>について<400> •て<400>こと<400>は<400>我<400>でも <400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data>
186 185
187 # Testing of word boundary for dictionary word containing both kanji and kana 186 # Testing of word boundary for dictionary word containing both kanji and kana
188 <data>•中だるみ<400>蔵王の森<400>ウ離島<400></data> 187 <data>•中だるみ<400>蔵王の森<400>ウ離島<400></data>
189 188
190 # Testing of Chinese segmentation (taken from a Chinese news article) 189 # Testing of Chinese segmentation (taken from a Chinese news article)
191 <data>•400<100>余<400>名<400>中央<400>委员<400>和<400>中央<400>候补<400>委员<400>都<400>领<400>到了<400>“•推荐<400>票<400>”•,•有<400>资格<400>在<400>200<100>多<400>名<400>符合<400>条件<400>的<400>63<100>岁<400>以下<400>中共<400>正<400>部<400>级<400>干部<400>中<400>,•选出<400>他们<400>属意<400>的<400>中央<400>政治局<400>委员<400>以<400>向<400>政治局<400>常委<400>会<400>举荐<400>。•</data> 190 <data>•400<100>余<400>名<400>中央<400>委员<400>和<400>中央<400>候补<400>委员<400>都<400>领<400>到了<400>“•推荐<400>票<400>”•,•有<400>资格<400>在<400>200<100>多<400>名<400>符合<400>条件<400>的<400>63<100>岁<400>以下<400>中共<400>正<400>部<400>级<400>干部<400>中<400>,•选出<400>他们<400>属意<400>的<400>中央<400>政治局<400>委员<400>以<400>向<400>政治局<400>常委<400>会<400>举荐<400>。•</data>
192 191
193 # Words with interior formatting characters 192 # Words with interior formatting characters
(...skipping 392 matching lines...)
586 <title> 585 <title>
587 <data>•Here •is •a •short •sample •sentence. •And •another.•</data> 586 <data>•Here •is •a •short •sample •sentence. •And •another.•</data>
588 <data>•HERE •IS •A •SHORT •SAMPLE •SENTENCE. •AND •ANOTHER.•</data> 587 <data>•HERE •IS •A •SHORT •SAMPLE •SENTENCE. •AND •ANOTHER.•</data>
589 <data>• •Start •and •end •with •spaces •</data> 588 <data>• •Start •and •end •with •spaces •</data>
590 <data>•Include 123 456 ^& •some 54332 •numbers 4445•abc123•abc •ending 1223 •</data> 589 <data>•Include 123 456 ^& •some 54332 •numbers 4445•abc123•abc •ending 1223 •</data>
591 590
592 <data>•Combining\u0301 \u0301•ma\u0306rks •bye •</data> 591 <data>•Combining\u0301 \u0301•ma\u0306rks •bye •</data>
593 <data>•123 •Start •with •a •number.•</data> 592 <data>•123 •Start •with •a •number.•</data>
594 593
595 <data>•'•start •with •a •case-•ignorable •cha'r'a'cter•</data> 594 <data>•'•start •with •a •case-•ignorable •cha'r'a'cter•</data>
596 595 <data>•' '' •start •with •case-•ignorable & •case-•insensitive •cha'r'a'cter•</data>
596 <data>• ''•aaa' •bbb '•ccc' '•ddd''' '''•eee '''•fff''' •ggg ''•</data>
597 # Note: apostrophe is case-ignorable. space is not cased.
597 598
598 ########################################################################################## 599 ##########################################################################################
599 # 600 #
600 # Thai Tests 601 # Thai Tests
601 # 602 #
602 ########################################################################################## 603 ##########################################################################################
603 <locale th> 604 <locale th>
604 <word> 605 <word>
605 # 606 #
606 # Test data originally from the test code source file 607 # Test data originally from the test code source file
(...skipping 77 matching lines...)
684 \u0e41\u0e25\u0e30•\ 685 \u0e41\u0e25\u0e30•\
685 \u0e40\u0e03\u0e35•\ 686 \u0e40\u0e03\u0e35•\
686 \u0e22\u0e07•\ 687 \u0e22\u0e07•\
687 \u0e43\u0e2b\u0e21\u0e48•</data> 688 \u0e43\u0e2b\u0e21\u0e48•</data>
688 689
689 # Test for #10296 690 # Test for #10296
690 <line> 691 <line>
691 <data>•ใช•มั้ย•</data> 692 <data>•ใช•มั้ย•</data>
692 <data>•มั๊ยล่ะ•ที่รัก•</data> 693 <data>•มั๊ยล่ะ•ที่รัก•</data>
693 694
695 # Test for #10593
696 <line>
697 <data>•เล่น•ผ่าน•ทาง•บลูทูธ•บน•อุปกรณ์•</data>
698
699 # Test for city names #10691
700 <line>
701 <data>•ไป•ที่•ซานฟรานซิสโก•</data>
702
703 # Test for #10630, #10631
704 <line>
705 <data>•แท็ก•แอปพลิเคชัน•เป็น•พิเศษ•</data>
706
694 ########################################################################################## 707 ##########################################################################################
695 # 708 #
696 # Lao Tests 709 # Lao Tests
697 # 710 #
698 ########################################################################################## 711 ##########################################################################################
699 <locale en> 712 <locale en>
700 # Basic check for #7647 713 # Basic check for #7647
701 <line> 714 <line>
702 <data>•ສະບາຍດີ•</data> 715 <data>•ສະບາຍດີ•</data>
703 <data>•ດີ•ຂອບໃຈ•</data> 716 <data>•ດີ•ຂອບໃຈ•</data>
704 <data>•ເຈົ້າ•ເວົ້າ•ພາສາ•ອັງກິດ•ໄດ້•ບໍ່•</data> 717 <data>•ເຈົ້າ•ເວົ້າ•ພາສາ•ອັງກິດ•ໄດ້•ບໍ່•</data>
705 <data>•ກະລຸນາ•ເວົ້າ•ຊ້າ•ໆ•</data> 718 <data>•ກະລຸນາ•ເວົ້າ•ຊ້າ•ໆ•</data>
706 719
707 ########################################################################################## 720 ##########################################################################################
708 # 721 #
722 # Burmese/Myanmar Tests
723 #
724 ##########################################################################################
725 <locale en>
726 # Basic sanity check for #10326 (some text from http://www.unicode.org/udhr/d/udhr_mya.txt)
727 <line>
728 <data>•လူ•တိုင်း•သည် •တူညီ •လွတ်လပ်•သော •ဂုဏ်•သိ•က္•ခါ•ဖြ•င့် •လည်းကောင်း၊ •</data>
729 <data>•တူညီ•လွတ်လပ်•သော •အ•ခွ•င့်•အရေး•များ•ဖြ•င့် •လည်းကောင်း၊ •မွေး•ဖွား•လာ•သူများ •ဖြစ်သည်။•</data>
730 <data>•ထို•သူ•တို့၌ •ပိုင်းခြား •ဝေဖန်•တတ်•သော •ဉာဏ်•နှ•င့် •ကျ•င့်•ဝတ် •သိတတ်•သော •စိတ်•တို့•ရှိ•ကြ၍ •</data>
731 <data>•ထို•သူ•တို့သည် •အချင်းချင်း •မေတ္တာ•ထား၍ •ဆက်ဆံ•ကျ•င့်•သုံး•</data>
732
733 ##########################################################################################
734 #
709 # Khmer Tests 735 # Khmer Tests
710 # 736 #
711 ########################################################################################## 737 ##########################################################################################
712 738
713 # Test data originally from http://bugs.icu-project.org/trac/search?q=r30327 739 # Test data originally from http://bugs.icu-project.org/trac/search?q=r30327
714 # from the file testdata/wordsegments.txt 740 # from the file testdata/wordsegments.txt
715 <locale en> 741 <locale en>
716 <word> 742 <word>
717 743
718 <data>•តើ<200>លោក<200>មក<200>ពី<200>ប្រទេស<200>ណា<200></data> 744 <data>•តើ<200>លោក<200>មក<200>ពី<200>ប្រទេស<200>ណា<200></data>
(...skipping 70 matching lines...)
789 <data>•abc/\u05D9 •def•</data> 815 <data>•abc/\u05D9 •def•</data>
790 <data>•\u05E7\u05D7/\u05D9 •\u05DE\u05E2\u05D9\u05DC•</data> 816 <data>•\u05E7\u05D7/\u05D9 •\u05DE\u05E2\u05D9\u05DC•</data>
791 <data>•\u05D3\u05E8\u05D5\u05E9\u05D9\u05DD •\u05E9\u05D7\u05E7\u05E0\u05D9\u05DD/\u05D9\u05D5\u05EA•</data> 817 <data>•\u05D3\u05E8\u05D5\u05E9\u05D9\u05DD •\u05E9\u05D7\u05E7\u05E0\u05D9\u05DD/\u05D9\u05D5\u05EA•</data>
792 818
793 819
794 <locale root> 820 <locale root>
795 <word> 821 <word>
796 <data>•私<400>達<400>に<400>一<400>〇<400>〇〇<400>の<400>コンピュータ<400>が<400>ある<400>。<0>奈々<400>は<400>ワード<400>で<400>ある<400>。•</data> 822 <data>•私<400>達<400>に<400>一<400>〇<400>〇〇<400>の<400>コンピュータ<400>が<400>ある<400>。<0>奈々<400>は<400>ワード<400>で<400>ある<400>。•</data>
797 # The following test is for #10300 823 # The following test is for #10300
798 <data>•例えば<400>オーストラリア<400>。•</data> 824 <data>•例えば<400>オーストラリア<400>。•</data>
825 # The following test is for #10571
826 <data>•一部<400>の<400>地域<400>では<400>、<0>ブラジル<400>、<0>インドネシア<400>、<0>オーストリア<400>、<0>ニュージーランド<400>で<400>ある<400>。•</data>
799 827
800 # UBreakIteratorType UBRK_SENTENCE, Locale "el" 828 # UBreakIteratorType UBRK_SENTENCE, Locale "el"
801 # Add break after Greek question mark (cldrbug #2069). 829 # Add break after Greek question mark (cldrbug #2069).
802 # "\u0391\u03B2, \u03B3\u03B4; \u0395 \u03B6\u03B7\u037E \u0398 \u03B9\u03BA. " 830 # "\u0391\u03B2, \u03B3\u03B4; \u0395 \u03B6\u03B7\u037E \u0398 \u03B9\u03BA. "
803 # "\u039B\u03BC \u03BD\u03BE! \u039F\u03C0, \u03A1\u03C2? \u03A3" 831 # "\u039B\u03BC \u03BD\u03BE! \u039F\u03C0, \u03A1\u03C2? \u03A3"
804 # which is "Αβ, γδ; Ε ζη; Θ ικ. Λμ νξ! Οπ, Ρς? Σ" 832 # which is "Αβ, γδ; Ε ζη; Θ ικ. Λμ νξ! Οπ, Ρς? Σ"
805 833
806 <locale root> 834 <locale root>
807 <sent> 835 <sent>
808 <data>•Αβ, γδ; Ε ζη; Θ ικ. •Λμ νξ! •Οπ, Ρς? •Σ<100></data> 836 <data>•Αβ, γδ; Ε ζη; Θ ικ. •Λμ νξ! •Οπ, Ρς? •Σ<100></data>
809 837
810 <locale el> 838 <locale el>
811 <sent> 839 <sent>
812 <data>•Αβ, γδ; •Ε ζη; •Θ ικ. •Λμ νξ! •Οπ, Ρς? •Σ<100></data> 840 <data>•Αβ, γδ; •Ε ζη; •Θ ικ. •Λμ νξ! •Οπ, Ρς? •Σ<100></data>
813 841
814 # UBreakIteratorType UBRK_WORD, Locale "en_US_POSIX" 842 # UBreakIteratorType UBRK_WORD, Locale "en_US_POSIX"
815 # Words don't include colon or period (cldrbug #1969). 843 # Words don't include colon or period (cldrbug #1969).
816 844
817 <locale en_US> 845 <locale en_US>
818 <word> 846 <word>
819 <data>•Can't<200> •have<200> •breaks<200> •in<200> •xx:yy<200> •or<200> •struct.field<200> \ 847 <data>•Can't<200> •have<200> •breaks<200> •in<200> •xx:yy<200> •or<200> •struct.field<200> \
820 •for<200> •CS<200>-•types<200>.•</data> 848 •for<200> •CS<200>-•types<200>.•</data>
821 <data>•\uFF92\uFF76\uFF9E<400> •</data> 849 <data>•\uFF92\uFF76\uFF9E<400> •</data>
822 850
823 <locale en_US_POSIX> 851 <locale en_US_POSIX>
824 <word> 852 <word>
825 <data>•Can't<200> •have<200> •breaks<200> •in<200> •xx:yy<200> •or<200> •struct<200>.•field<200> \ 853 <data>•Can't<200> •have<200> •breaks<200> •in<200> •xx<200>:•yy<200> •or<200> •struct<200>.•field<200> \
826 •for<200> •CS<200>-•types<200>.•</data> 854 •for<200> •CS<200>-•types<200>.•</data>
827 <data>•\u06c9<200>\uc799\ufffa•</data> 855 <data>•\u06c9<200>\uc799\ufffa•</data>
828 <data>•\uFF92\uFF76\uFF9E<400> •</data> 856 <data>•\uFF92\uFF76\uFF9E<400> •</data>
829 857
830 858
831 # UBreakIteratorType UBRK_CHARACTER, Locale "th" 859 # UBreakIteratorType UBRK_CHARACTER, Locale "th"
832 # Clusters should not include spacing Thai/Lao vowels (prefix or postfix), except for [SARA] AM (cldrbug #2161). 860 # Clusters should not include spacing Thai/Lao vowels (prefix or postfix), except for [SARA] AM (cldrbug #2161).
833 # Update: As of Unicode 6.1 root has same behavior as th for this. 861 # Update: As of Unicode 6.1 root has same behavior as th for this.
834 # 862 #
835 # "\u0E01\u0E23\u0E30\u0E17\u0E48\u0E2D\u0E21\u0E23\u0E08\u0E19\u0E32 " 863 # "\u0E01\u0E23\u0E30\u0E17\u0E48\u0E2D\u0E21\u0E23\u0E08\u0E19\u0E32 "
(...skipping 27 matching lines...)
863 891
864 <data>•abc •- •def •abc •-def •abc- •def •</data> # With ASCII hyphen 892 <data>•abc •- •def •abc •-def •abc- •def •</data> # With ASCII hyphen
865 <data>•abc •‐ •def •abc •‐def •abc‐ •def •</data> # With Unicode u2010 hyphen 893 <data>•abc •‐ •def •abc •‐def •abc‐ •def •</data> # With Unicode u2010 hyphen
866 894
867 # Test for #10176 (in fi) 895 # Test for #10176 (in fi)
868 <line> 896 <line>
869 <data>•abc/•s •def•</data> 897 <data>•abc/•s •def•</data>
870 <data>•abc/\u05D9 •def•</data> 898 <data>•abc/\u05D9 •def•</data>
871 <data>•\u05E7\u05D7/\u05D9 •\u05DE\u05E2\u05D9\u05DC•</data> 899 <data>•\u05E7\u05D7/\u05D9 •\u05DE\u05E2\u05D9\u05DC•</data>
872 <data>•\u05D3\u05E8\u05D5\u05E9\u05D9\u05DD •\u05E9\u05D7\u05E7\u05E0\u05D9\u05DD/\u05D9\u05D5\u05EA•</data> 900 <data>•\u05D3\u05E8\u05D5\u05E9\u05D9\u05DD •\u05E9\u05D7\u05E7\u05E0\u05D9\u05DD/\u05D9\u05D5\u05EA•</data>
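
Reading the <data> lines in this diff is easier with a small decoder. The sketch below is not part of the patch and is not how ICU's test harness is implemented. It assumes the marker convention that the header of rbbitst.txt (in the lines skipped above) is understood to define: a bullet (•) marks an expected break with rule status 0, and <n> marks an expected break with rule status n (for word breaks, 100 = number, 200 = letter, 400 = ideographic). The function name parse_data is hypothetical. Escapes such as \u0301 or \N{...} are deliberately left undecoded, and offsets count Python code points rather than the UTF-16 units ICU uses.

```python
import re

# A bullet means "break here, status 0"; <n> means "break here, status n".
# An empty <> is treated as status 0 (assumption based on the file header).
_MARKER = re.compile(r"•|<(\d*)>")


def parse_data(body):
    """Split one <data> body into plain text and expected breaks.

    Returns (text, breaks) where breaks is a list of (offset, status)
    pairs and offset indexes into the returned text.
    Hypothetical helper: escapes like \\uXXXX are NOT decoded.
    """
    parts = []
    breaks = []
    offset = 0  # length of plain text emitted so far
    last = 0    # index in body just past the previous marker
    for m in _MARKER.finditer(body):
        chunk = body[last:m.start()]
        parts.append(chunk)
        offset += len(chunk)
        status = int(m.group(1)) if m.group(1) else 0
        breaks.append((offset, status))
        last = m.end()
    parts.append(body[last:])
    return "".join(parts), breaks


text, breaks = parse_data("•Can't<200> •have<200>.•")
print(text)    # Can't have.
print(breaks)  # [(0, 0), (5, 200), (6, 0), (10, 200), (11, 0)]
```

Applied to the en_US_POSIX case in this patch, the same decoder shows the behavior change directly: the old expectation has no break inside xx:yy, while the new one yields an extra status-200 break after xx and a status-0 break after the colon.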