OLD | NEW |
1 # Copyright (c) 2001-2013 International Business Machines | 1 # Copyright (c) 2001-2014 International Business Machines |
2 # Corporation and others. All Rights Reserved. | 2 # Corporation and others. All Rights Reserved. |
3 # | 3 # |
4 # RBBI Test Data | 4 # RBBI Test Data |
5 # | 5 # |
6 # File: rbbitst.txt | 6 # File: rbbitst.txt |
7 # | 7 # |
8 # The format of this file looks vaguely like some kind of xml-ish markup, | 8 # The format of this file looks vaguely like some kind of xml-ish markup, |
9 # but it is NOT. The syntax is this.. | 9 # but it is NOT. The syntax is this.. |
10 # | 10 # |
11 # <word> any following data is for word break testing | 11 # <word> any following data is for word break testing |
(...skipping 14 matching lines...)
26 # There are two copies of this file in the source repository, | 26 # There are two copies of this file in the source repository, |
27 # [ICU4C] source/test/testdata/rbbitst.txt | 27 # [ICU4C] source/test/testdata/rbbitst.txt |
28 # [ICU4J] main/tests/core/src/com/ibm/icu/dev/test/rbbi/rbbitst.txt | 28 # [ICU4J] main/tests/core/src/com/ibm/icu/dev/test/rbbi/rbbitst.txt |
29 # | 29 # |
30 # ICU4C's copy is the master. If any changes are made to ICU4J's copy, make sure they | 30 # ICU4C's copy is the master. If any changes are made to ICU4J's copy, make sure they |
31 # are merged back into ICU4C's copy of the file, lest they get overwritten later. | 31 # are merged back into ICU4C's copy of the file, lest they get overwritten later. |
32 # TODO: figure out how to have a single copy of the file for use by both C and Java. | 32 # TODO: figure out how to have a single copy of the file for use by both C and Java. |
33 | 33 |
34 | 34 |
35 # Temp debugging tests | 35 # Temp debugging tests |
36 <word> | 36 <sent> |
37 <data>•Isn't<200></data> | 37 <data>•\u00c0.•</data> |
38 <char> | |
39 <data>•\U00010020•\U00010000\N{COMBINING MACRON}•</data> | |
40 | 38 |
| 39 #<data>•\u5487\u67ff\ue591\u5017\u61b3\u60a1\u9510\u8165:"JAVA\u821c\u8165\u7fc8\u51ce\u306d,\u2494\u56d8\u4ec0\u60b1\u8560\u51ba\u611d\u57b6\u2510\u5d46".\u2029•</data> |
41 ######################################################################################## | 40 ######################################################################################## |
42 # | 41 # |
43 # | 42 # |
44 # G r a p h e m e C l u s t e r T e s t s | 43 # G r a p h e m e C l u s t e r T e s t s |
45 # | 44 # |
46 # | 45 # |
47 ########################################################################################## | 46 ########################################################################################## |
48 <char> | 47 <char> |
49 | 48 |
50 <data>•a•b•c• •,•\u0666•</data> # Quick Test | 49 <data>•a•b•c• •,•\u0666•</data> # Quick Test |
(...skipping 121 matching lines...)
172 <data>•abc<200>\U0001D800•def<200>\U0001D3FF• •</data> | 171 <data>•abc<200>\U0001D800•def<200>\U0001D3FF• •</data> |
173 | 172 |
174 # Hiragana & Katakana stay together, but separates from each other and Latin. | 173 # Hiragana & Katakana stay together, but separates from each other and Latin. |
175 # *** what to do about theoretical combos of chars? i.e. hiragana + accent | 174 # *** what to do about theoretical combos of chars? i.e. hiragana + accent |
176 #<data>•abc<200>\N{HIRAGANA LETTER SMALL A}<400>\N{HIRAGANA LETTER VU}\N{COMBINING ACUTE ACCENT}<400>\N{HIRAGANA ITERATION MARK}<400>\N{KATAKANA LETTER SMALL A}\N{KATAKANA ITERATION MARK}\N{HALFWIDTH KATAKANA LETTER WO}\N{HALFWIDTH KATAKANA LETTER N}<400>def<200>#•</data> | 175 #<data>•abc<200>\N{HIRAGANA LETTER SMALL A}<400>\N{HIRAGANA LETTER VU}\N{COMBINING ACUTE ACCENT}<400>\N{HIRAGANA ITERATION MARK}<400>\N{KATAKANA LETTER SMALL A}\N{KATAKANA ITERATION MARK}\N{HALFWIDTH KATAKANA LETTER WO}\N{HALFWIDTH KATAKANA LETTER N}<400>def<200>#•</data> |
177 | 176 |
178 # test normalization/dictionary handling of halfwidth katakana: same dictionary phrase in fullwidth and halfwidth | 177 # test normalization/dictionary handling of halfwidth katakana: same dictionary phrase in fullwidth and halfwidth |
179 <data>•芽キャベツ<400>芽キャベツ<400></data> | 178 <data>•芽キャベツ<400>芽キャベツ<400></data> |
180 | 179 |
181 # more Japanese tests | 180 # more Japanese tests |
182 # TODO: Currently, U+30FC and other characters (script=common) in the Hiragana | 181 # TODO: some script=common characters in the Hiragana and the Katakana block may not be treated correctly |
183 # and the Katakana block are not treated correctly. Enable this later. | 182 # (was formerly true for U+30FC); need to check and fix if so. |
184 #<data>•どー<400>せ<400>日本語<400>を<400>勉強<400>する<400>理由<400>について<400> •て<400>こと<400>は<400>我<400>でも<400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data> | 183 #<data>•どー<400>せ<400>日本語<400>を<400>勉強<400>する<400>理由<400>について<400> •て<400>こと<400>は<400>我<400>でも<400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data> |
185 <data>•日本語<400>を<400>勉強<400>する<400>理由<400>について<400> •て<400>こと<400>は<400>我<400>でも<400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data> | 184 <data>•日本語<400>を<400>勉強<400>する<400>理由<400>について<400> •て<400>こと<400>は<400>我<400>でも<400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data> |
186 | 185 |
187 # Testing of word boundary for dictionary word containing both kanji and kana | 186 # Testing of word boundary for dictionary word containing both kanji and kana |
188 <data>•中だるみ<400>蔵王の森<400>ウ離島<400></data> | 187 <data>•中だるみ<400>蔵王の森<400>ウ離島<400></data> |
189 | 188 |
190 # Testing of Chinese segmentation (taken from a Chinese news article) | 189 # Testing of Chinese segmentation (taken from a Chinese news article) |
191 <data>•400<100>余<400>名<400>中央<400>委员<400>和<400>中央<400>候补<400>委员<400>都<400>领<400>到了<400>“•推荐<400>票<400>”•,•有<400>资格<400>在<400>200<100>多<400>名<400>符合<400>条件<400>的<400>63<100>岁<400>以下<400>中共<400>正<400>部<400>级<400>干部<400>中<400>,•选出<400>他们<400>属意<400>的<400>中央<400>政治局<400>委员<400>以<400>向<400>政治局<400>常委<400>会<400>举荐<400>。•</data> | 190 <data>•400<100>余<400>名<400>中央<400>委员<400>和<400>中央<400>候补<400>委员<400>都<400>领<400>到了<400>“•推荐<400>票<400>”•,•有<400>资格<400>在<400>200<100>多<400>名<400>符合<400>条件<400>的<400>63<100>岁<400>以下<400>中共<400>正<400>部<400>级<400>干部<400>中<400>,•选出<400>他们<400>属意<400>的<400>中央<400>政治局<400>委员<400>以<400>向<400>政治局<400>常委<400>会<400>举荐<400>。•</data> |
192 | 191 |
193 # Words with interior formatting characters | 192 # Words with interior formatting characters |
(...skipping 392 matching lines...)
586 <title> | 585 <title> |
587 <data>•Here •is •a •short •sample •sentence. •And •another.•</data> | 586 <data>•Here •is •a •short •sample •sentence. •And •another.•</data> |
588 <data>•HERE •IS •A •SHORT •SAMPLE •SENTENCE. •AND •ANOTHER.•</data> | 587 <data>•HERE •IS •A •SHORT •SAMPLE •SENTENCE. •AND •ANOTHER.•</data> |
589 <data>• •Start •and •end •with •spaces •</data> | 588 <data>• •Start •and •end •with •spaces •</data> |
590 <data>•Include 123 456 ^& •some 54332 •numbers 4445•abc123•abc •ending 1223 •</data> | 589 <data>•Include 123 456 ^& •some 54332 •numbers 4445•abc123•abc •ending 1223 •</data> |
591 | 590 |
592 <data>•Combining\u0301 \u0301•ma\u0306rks •bye •</data> | 591 <data>•Combining\u0301 \u0301•ma\u0306rks •bye •</data> |
593 <data>•123 •Start •with •a •number.•</data> | 592 <data>•123 •Start •with •a •number.•</data> |
594 | 593 |
595 <data>•'•start •with •a •case-•ignorable •cha'r'a'cter•</data> | 594 <data>•'•start •with •a •case-•ignorable •cha'r'a'cter•</data> |
596 | 595 <data>•' '' •start •with •case-•ignorable & •case-•insensitive •cha'r'a'cter•</data> |
| 596 <data>• ''•aaa' •bbb '•ccc' '•ddd''' '''•eee '''•fff''' •ggg ''•</data> |
| 597 # Note: apostrophe is case-ignorable. space is not cased. |
597 | 598 |
598 ########################################################################################## | 599 ########################################################################################## |
599 # | 600 # |
600 # Thai Tests | 601 # Thai Tests |
601 # | 602 # |
602 ########################################################################################## | 603 ########################################################################################## |
603 <locale th> | 604 <locale th> |
604 <word> | 605 <word> |
605 # | 606 # |
606 # Test data originally from the test code source file | 607 # Test data originally from the test code source file |
(...skipping 77 matching lines...)
684 \u0e41\u0e25\u0e30•\ | 685 \u0e41\u0e25\u0e30•\ |
685 \u0e40\u0e03\u0e35•\ | 686 \u0e40\u0e03\u0e35•\ |
686 \u0e22\u0e07•\ | 687 \u0e22\u0e07•\ |
687 \u0e43\u0e2b\u0e21\u0e48•</data> | 688 \u0e43\u0e2b\u0e21\u0e48•</data> |
688 | 689 |
689 # Test for #10296 | 690 # Test for #10296 |
690 <line> | 691 <line> |
691 <data>•ใช•มั้ย•</data> | 692 <data>•ใช•มั้ย•</data> |
692 <data>•มั๊ยล่ะ•ที่รัก•</data> | 693 <data>•มั๊ยล่ะ•ที่รัก•</data> |
693 | 694 |
| 695 # Test for #10593 |
| 696 <line> |
| 697 <data>•เล่น•ผ่าน•ทาง•บลูทูธ•บน•อุปกรณ์•</data> |
| 698 |
| 699 # Test for city names #10691 |
| 700 <line> |
| 701 <data>•ไป•ที่•ซานฟรานซิสโก•</data> |
| 702 |
| 703 # Test for #10630, #10631 |
| 704 <line> |
| 705 <data>•แท็ก•แอปพลิเคชัน•เป็น•พิเศษ•</data> |
| 706 |
694 ########################################################################################## | 707 ########################################################################################## |
695 # | 708 # |
696 # Lao Tests | 709 # Lao Tests |
697 # | 710 # |
698 ########################################################################################## | 711 ########################################################################################## |
699 <locale en> | 712 <locale en> |
700 # Basic check for #7647 | 713 # Basic check for #7647 |
701 <line> | 714 <line> |
702 <data>•ສະບາຍດີ•</data> | 715 <data>•ສະບາຍດີ•</data> |
703 <data>•ດີ•ຂອບໃຈ•</data> | 716 <data>•ດີ•ຂອບໃຈ•</data> |
704 <data>•ເຈົ້າ•ເວົ້າ•ພາສາ•ອັງກິດ•ໄດ້•ບໍ່•</data> | 717 <data>•ເຈົ້າ•ເວົ້າ•ພາສາ•ອັງກິດ•ໄດ້•ບໍ່•</data> |
705 <data>•ກະລຸນາ•ເວົ້າ•ຊ້າ•ໆ•</data> | 718 <data>•ກະລຸນາ•ເວົ້າ•ຊ້າ•ໆ•</data> |
706 | 719 |
707 ########################################################################################## | 720 ########################################################################################## |
708 # | 721 # |
| 722 # Burmese/Myanmar Tests |
| 723 # |
| 724 ########################################################################################## |
| 725 <locale en> |
| 726 # Basic sanity check for #10326 (some text from http://www.unicode.org/udhr/d/udhr_mya.txt) |
| 727 <line> |
| 728 <data>•လူ•တိုင်း•သည် •တူညီ •လွတ်လပ်•သော •ဂုဏ်•သိ•က္•ခါ•ဖြ•င့် •လည်းကောင်း၊ •</data> |
| 729 <data>•တူညီ•လွတ်လပ်•သော •အ•ခွ•င့်•အရေး•များ•ဖြ•င့် •လည်းကောင်း၊ •မွေး•ဖွား•လာ•သူများ •ဖြစ်သည်။•</data> |
| 730 <data>•ထို•သူ•တို့၌ •ပိုင်းခြား •ဝေဖန်•တတ်•သော •ဉာဏ်•နှ•င့် •ကျ•င့်•ဝတ် •သိတတ်•သော •စိတ်•တို့•ရှိ•ကြ၍ •</data> |
| 731 <data>•ထို•သူ•တို့သည် •အချင်းချင်း •မေတ္တာ•ထား၍ •ဆက်ဆံ•ကျ•င့်•သုံး•</data> |
| 732 |
| 733 ########################################################################################## |
| 734 # |
709 # Khmer Tests | 735 # Khmer Tests |
710 # | 736 # |
711 ########################################################################################## | 737 ########################################################################################## |
712 | 738 |
713 # Test data originally from http://bugs.icu-project.org/trac/search?q=r30327 | 739 # Test data originally from http://bugs.icu-project.org/trac/search?q=r30327 |
714 # from the file testdata/wordsegments.txt | 740 # from the file testdata/wordsegments.txt |
715 <locale en> | 741 <locale en> |
716 <word> | 742 <word> |
717 | 743 |
718 <data>•តើ<200>លោក<200>មក<200>ពី<200>ប្រទេស<200>ណា<200></data> | 744 <data>•តើ<200>លោក<200>មក<200>ពី<200>ប្រទេស<200>ណា<200></data> |
(...skipping 70 matching lines...)
789 <data>•abc/\u05D9 •def•</data> | 815 <data>•abc/\u05D9 •def•</data> |
790 <data>•\u05E7\u05D7/\u05D9 •\u05DE\u05E2\u05D9\u05DC•</data> | 816 <data>•\u05E7\u05D7/\u05D9 •\u05DE\u05E2\u05D9\u05DC•</data> |
791 <data>•\u05D3\u05E8\u05D5\u05E9\u05D9\u05DD •\u05E9\u05D7\u05E7\u05E0\u05D9\u05DD/\u05D9\u05D5\u05EA•</data> | 817 <data>•\u05D3\u05E8\u05D5\u05E9\u05D9\u05DD •\u05E9\u05D7\u05E7\u05E0\u05D9\u05DD/\u05D9\u05D5\u05EA•</data> |
792 | 818 |
793 | 819 |
794 <locale root> | 820 <locale root> |
795 <word> | 821 <word> |
796 <data>•私<400>達<400>に<400>一<400>〇<400>〇〇<400>の<400>コンピュータ<400>が<400>ある<400>。<0>奈々<400>は<400>ワード<400>で<400>ある<400>。•</data> | 822 <data>•私<400>達<400>に<400>一<400>〇<400>〇〇<400>の<400>コンピュータ<400>が<400>ある<400>。<0>奈々<400>は<400>ワード<400>で<400>ある<400>。•</data> |
797 # The following test is for #10300 | 823 # The following test is for #10300 |
798 <data>•例えば<400>オーストラリア<400>。•</data> | 824 <data>•例えば<400>オーストラリア<400>。•</data> |
| 825 # The following test is for #10571 |
| 826 <data>•一部<400>の<400>地域<400>では<400>、<0>ブラジル<400>、<0>インドネシア<400>、<0>オーストリア<400>、<0>ニュージーランド<400>で<400>ある<400>。•</data> |
799 | 827 |
800 # UBreakIteratorType UBRK_SENTENCE, Locale "el" | 828 # UBreakIteratorType UBRK_SENTENCE, Locale "el" |
801 # Add break after Greek question mark (cldrbug #2069). | 829 # Add break after Greek question mark (cldrbug #2069). |
802 # "\u0391\u03B2, \u03B3\u03B4; \u0395 \u03B6\u03B7\u037E \u0398 \u03B9\u03BA. " | 830 # "\u0391\u03B2, \u03B3\u03B4; \u0395 \u03B6\u03B7\u037E \u0398 \u03B9\u03BA. " |
803 # "\u039B\u03BC \u03BD\u03BE! \u039F\u03C0, \u03A1\u03C2? \u03A3" | 831 # "\u039B\u03BC \u03BD\u03BE! \u039F\u03C0, \u03A1\u03C2? \u03A3" |
804 # which is "Αβ, γδ; Ε ζη; Θ ικ. Λμ νξ! Οπ, Ρς? Σ" | 832 # which is "Αβ, γδ; Ε ζη; Θ ικ. Λμ νξ! Οπ, Ρς? Σ" |
805 | 833 |
806 <locale root> | 834 <locale root> |
807 <sent> | 835 <sent> |
808 <data>•Αβ, γδ; Ε ζη; Θ ικ. •Λμ νξ! •Οπ, Ρς? •Σ<100></data> | 836 <data>•Αβ, γδ; Ε ζη; Θ ικ. •Λμ νξ! •Οπ, Ρς? •Σ<100></data> |
809 | 837 |
810 <locale el> | 838 <locale el> |
811 <sent> | 839 <sent> |
812 <data>•Αβ, γδ; •Ε ζη; •Θ ικ. •Λμ νξ! •Οπ, Ρς? •Σ<100></data> | 840 <data>•Αβ, γδ; •Ε ζη; •Θ ικ. •Λμ νξ! •Οπ, Ρς? •Σ<100></data> |
813 | 841 |
814 # UBreakIteratorType UBRK_WORD, Locale "en_US_POSIX" | 842 # UBreakIteratorType UBRK_WORD, Locale "en_US_POSIX" |
815 # Words don't include colon or period (cldrbug #1969). | 843 # Words don't include colon or period (cldrbug #1969). |
816 | 844 |
817 <locale en_US> | 845 <locale en_US> |
818 <word> | 846 <word> |
819 <data>•Can't<200> •have<200> •breaks<200> •in<200> •xx:yy<200> •or<200> •struct.field<200> \ | 847 <data>•Can't<200> •have<200> •breaks<200> •in<200> •xx:yy<200> •or<200> •struct.field<200> \ |
820 •for<200> •CS<200>-•types<200>.•</data> | 848 •for<200> •CS<200>-•types<200>.•</data> |
821 <data>•\uFF92\uFF76\uFF9E<400> •</data> | 849 <data>•\uFF92\uFF76\uFF9E<400> •</data> |
822 | 850 |
823 <locale en_US_POSIX> | 851 <locale en_US_POSIX> |
824 <word> | 852 <word> |
825 <data>•Can't<200> •have<200> •breaks<200> •in<200> •xx:yy<200> •or<200> •struct<200>.•field<200> \ | 853 <data>•Can't<200> •have<200> •breaks<200> •in<200> •xx<200>:•yy<200> •or<200> •struct<200>.•field<200> \ |
826 •for<200> •CS<200>-•types<200>.•</data> | 854 •for<200> •CS<200>-•types<200>.•</data> |
827 <data>•\u06c9<200>\uc799\ufffa•</data> | 855 <data>•\u06c9<200>\uc799\ufffa•</data> |
828 <data>•\uFF92\uFF76\uFF9E<400> •</data> | 856 <data>•\uFF92\uFF76\uFF9E<400> •</data> |
829 | 857 |
830 | 858 |
831 # UBreakIteratorType UBRK_CHARACTER, Locale "th" | 859 # UBreakIteratorType UBRK_CHARACTER, Locale "th" |
832 # Clusters should not include spacing Thai/Lao vowels (prefix or postfix), except for [SARA] AM (cldrbug #2161). | 860 # Clusters should not include spacing Thai/Lao vowels (prefix or postfix), except for [SARA] AM (cldrbug #2161). |
833 # Update: As of Unicode 6.1 root has same behavior as th for this. | 861 # Update: As of Unicode 6.1 root has same behavior as th for this. |
834 # | 862 # |
835 # "\u0E01\u0E23\u0E30\u0E17\u0E48\u0E2D\u0E21\u0E23\u0E08\u0E19\u0E32 " | 863 # "\u0E01\u0E23\u0E30\u0E17\u0E48\u0E2D\u0E21\u0E23\u0E08\u0E19\u0E32 " |
(...skipping 27 matching lines...)
863 | 891 |
864 <data>•abc •- •def •abc •-def •abc- •def •</data> # With ASCII hyphen | 892 <data>•abc •- •def •abc •-def •abc- •def •</data> # With ASCII hyphen |
865 <data>•abc •‐ •def •abc •‐def •abc‐ •def •</data> # With Unicode u2010 hyphen | 893 <data>•abc •‐ •def •abc •‐def •abc‐ •def •</data> # With Unicode u2010 hyphen |
866 | 894 |
867 # Test for #10176 (in fi) | 895 # Test for #10176 (in fi) |
868 <line> | 896 <line> |
869 <data>•abc/•s •def•</data> | 897 <data>•abc/•s •def•</data> |
870 <data>•abc/\u05D9 •def•</data> | 898 <data>•abc/\u05D9 •def•</data> |
871 <data>•\u05E7\u05D7/\u05D9 •\u05DE\u05E2\u05D9\u05DC•</data> | 899 <data>•\u05E7\u05D7/\u05D9 •\u05DE\u05E2\u05D9\u05DC•</data> |
872 <data>•\u05D3\u05E8\u05D5\u05E9\u05D9\u05DD •\u05E9\u05D7\u05E7\u05E0\u05D9\u05DD/\u05D9\u05D5\u05EA•</data> | 900 <data>•\u05D3\u05E8\u05D5\u05E9\u05D9\u05DD •\u05E9\u05D7\u05E7\u05E0\u05D9\u05DD/\u05D9\u05D5\u05EA•</data> |