Index: source/data/unidata/changes.txt |
diff --git a/source/data/unidata/changes.txt b/source/data/unidata/changes.txt |
index 23f29bf2e3b88993a78621a62378fdf613a087c6..5febb0745ed01768dab97de15ec55234a8b37cfb 100644 |
--- a/source/data/unidata/changes.txt |
+++ b/source/data/unidata/changes.txt |
@@ -1,4 +1,4 @@ |
-* Copyright (C) 2004-2014, International Business Machines |
+* Copyright (C) 2004-2015, International Business Machines |
* Corporation and others. All Rights Reserved. |
* |
* file name: changes.txt |
@@ -13,36 +13,338 @@ |
---------------------------------------------------------------------------- *** |
-Unicode 8.0 update for ICU ?? |
+* New ISO 15924 script codes |
-* UCA issue from 7.0 |
+Starting with ICU 55, we do not add UScriptCode constants any more until their scripts |
+are encoded in Unicode, or can be assumed to be encoded in the next Unicode version. |
+Script enum constant names should follow the Unicode script property value aliases, |
+which are assigned only when the scripts are encoded. |
+When we add script constants early and guess the names wrong, we end up with confusing |
+enum constants, and have sometimes had to add aliases. |
-- U+1DE9 COMBINING LATIN SMALL LETTER BETA |
- sorts with Greek Beta, should sort with Latin B? |
- + Ken says: |
- No, it was deliberate: |
+Exception: Script codes like Latf and Aran that are not subject to separate encoding |
+can be added at any time. |
- 03B2;GREEK SMALL LETTER BETA;Ll;;;;0392;;0392 |
- 1D5D;MODIFIER LETTER SMALL BETA;Lm;<super> 03B2;;;;; |
- 1DE9;COMBINING LATIN SMALL LETTER BETA;Mn;<sort> 03B2;;;;; |
- 1D66;GREEK SUBSCRIPT SMALL LETTER BETA;Ll;<sub> 03B2;;;;; |
+Script codes not yet in ICU: http://www.unicode.org/iso15924/codechanges.html |
- Note the relationship to U+1D5D. |
+Added 2014-11-15, see http://bugs.icu-project.org/trac/ticket/11561 |
+- Adlm 166 Adlam |
+- Aran 161 Arabic (Nastaliq variant) |
+- Kitl 505 Khitan large script |
+- Kits 288 Khitan small script |
+- Marc 332 Marchen |
+- Osge 219 Osage |
- When the disunified *Latin* beta base letter shows up in Unicode 8.0: |
+Aran can be added as USCRIPT_ARABIC_NASTALIQ at any time. |
- U+A7B4 LATIN CAPITAL LETTER BETA |
- U+A7B5 LATIN SMALL LETTER BETA |
+Adlam, Marchen, and Osage are expected to go into Unicode 9; |
+we should assign Unicode script property value aliases for them |
+soon after Unicode 8 is released, and add them in ICU 56. |
- we could re-evaluate what U+1DE9 equates to, for collation, |
- but currently there isn’t any Latin beta to serve that function |
- in Unicode 7.0. |
+Khitan scripts will be encoded later. |
-- ICU_ROOT=~/svn.icu/trunk |
-- ICU_SRC_DIR=$ICU_ROOT/src |
+---------------------------------------------------------------------------- *** |
+ |
+Unicode 8.0 update for ICU 56 |
+ |
+* Command-line environment setup |
+ |
+ICU_ROOT=~/svn.icu/trunk |
+ICU_SRC_DIR=$ICU_ROOT/src |
+ICUDT=icudt56b |
+export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib |
+SRC_DATA_IN=$ICU_SRC_DIR/source/data/in |
+UNIDATA=$ICU_SRC_DIR/source/data/unidata |
+ |
+http://www.unicode.org/review/pri297/ -- beta review |
+http://www.unicode.org/reports/uax-proposed-updates.html |
+http://unicode.org/versions/beta-8.0.0.html |
+http://www.unicode.org/versions/Unicode8.0.0/ |
+http://www.unicode.org/reports/tr44/tr44-15.html |
+ |
+*** ICU Trac |
+ |
+- ticket:11574: Unicode 8 |
+- C++ branches/markus/uni80 at r37351 from trunk at r37343 |
+- Java branches/markus/uni80 at r37352 from trunk at r37338 |
+ |
+*** CLDR Trac |
+ |
+- cldrbug 8311: UCA 8 |
+- branches/markus/uni80 at r11518 from trunk at r11517 |
+ |
+- cldrbug 8109: Unicode 8.0 script metadata |
+- cldrbug 8418: Updated segmentation for Unicode 8.0 |
+ |
+*** Unicode version numbers |
+- makedata.mak |
+- uchar.h |
+- com.ibm.icu.util.VersionInfo |
+- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ |
+ |
+- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h |
+ so that the makefiles see the new version number. |
+ |
+*** data files & enums & parser code |
+ |
+* file preparation |
+ |
+- download UCD & IDNA files |
+- make sure that the Unicode data folder passed into preparseucd.py |
+ includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder) |
+- only for manual diffs: remove version suffixes from the file names |
+ ~/unidata/uni70/20140403$ ../../desuffixucd.py . |
+ (see https://sites.google.com/site/unicodetools/inputdata) |
+- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip |
+- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src |
+- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. |
+ |
+- also: from http://unicode.org/Public/security/8.0.0/ download new |
+ confusables.txt & confusablesWholeScript.txt |
+ and copy to $UNIDATA |
+ ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA |
+ ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA |
+ |
+* initial preparseucd.py changes |
+- remove new Unicode scripts from the |
+ only-in-ISO-15924 list according to the error message: |
+ ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw'] |
+ from _scripts_only_in_iso15924 |
+ -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() |
+ and in com.ibm.icu.dev.test.lang.TestUScript.java |
+- property and file name change: |
+ IndicMatraCategory -> IndicPositionalCategory |
+- UnicodeData.txt unusual numeric values (fractions not in lowest terms) |
+ 109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;; |
+ 109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;; |
+ 109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;; |
+ 109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;; |
+ 109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;; |
+ 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;; |
+ 109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;; |
+ 109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;; |
+ 109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;; |
+ 109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;; |
+ -> change preparseucd.py to map them to reduced fractions (e.g., 2/12 -> 1/6), |
+ which are the values listed in DerivedNumericValues.txt; |
+ this keeps storage in the data file simple |
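
The reduction described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in preparseucd.py; the function name reduce_numeric_value is hypothetical.

```python
from fractions import Fraction


def reduce_numeric_value(nv: str) -> str:
    """Return a UnicodeData.txt numeric value string in lowest terms.

    Hypothetical helper for illustration; integers pass through unchanged.
    """
    if "/" not in nv:
        return nv
    numerator, denominator = nv.split("/")
    value = Fraction(int(numerator), int(denominator))  # Fraction normalizes
    if value.denominator == 1:
        return str(value.numerator)
    return f"{value.numerator}/{value.denominator}"


# The twelfths from the Meroitic Cursive block, as they appear in UnicodeData.txt.
for raw in ("1/12", "2/12", "6/12", "10/12"):
    print(raw, "->", reduce_numeric_value(raw))
```

Values that are already in lowest terms (1/12, 5/12, 7/12) come back unchanged, so the mapping can be applied to every numeric value without special cases.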
+ |
+* PropertyValueAliases.txt changes |
+- 10 new Block (blk) values: |
+ blk; Ahom ; Ahom |
+ blk; Anatolian_Hieroglyphs ; Anatolian_Hieroglyphs |
+ blk; Cherokee_Sup ; Cherokee_Supplement |
+ blk; CJK_Ext_E ; CJK_Unified_Ideographs_Extension_E |
+ blk; Early_Dynastic_Cuneiform ; Early_Dynastic_Cuneiform |
+ blk; Hatran ; Hatran |
+ blk; Multani ; Multani |
+ blk; Old_Hungarian ; Old_Hungarian |
+ blk; Sup_Symbols_And_Pictographs ; Supplemental_Symbols_And_Pictographs |
+ blk; Sutton_SignWriting ; Sutton_SignWriting |
+ -> add to uchar.h |
+ use long property names for enum constants |
+ -> add to UCharacter.UnicodeBlock IDs |
+ Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) |
+ replace public static final int \1_ID = \2; \3 |
+ -> add to UCharacter.UnicodeBlock objects |
+ Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) |
+ replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 |
+- 6 new Script (sc) values: |
+ sc ; Ahom ; Ahom |
+ sc ; Hatr ; Hatran |
+ sc ; Hluw ; Anatolian_Hieroglyphs |
+ sc ; Hung ; Old_Hungarian |
+ sc ; Mult ; Multani |
+ sc ; Sgnw ; SignWriting |
+ -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript |
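
For anyone not using Eclipse, the two find/replace patterns listed for the new Block values can be reproduced with Python's re module. This is a sketch under the assumption that each uchar.h enum line has the shape "UBLOCK_NAME = number, /*[start]*/"; the sample line below is illustrative, and the function names are hypothetical.

```python
import re

# Same patterns as the Eclipse find expressions in the notes.
ID_PATTERN = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
OBJECT_PATTERN = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")


def to_block_id_constant(line: str) -> str:
    """Turn a uchar.h enum line into a UCharacter.UnicodeBlock ID constant."""
    return ID_PATTERN.sub(r"public static final int \1_ID = \2; \3", line.strip())


def to_block_object_constant(line: str) -> str:
    """Turn a uchar.h enum line into a UCharacter.UnicodeBlock object constant."""
    return OBJECT_PATTERN.sub(
        r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
        line.strip(),
    )


# Illustrative input in the assumed uchar.h format.
sample = "    UBLOCK_AHOM = 253, /*[11700]*/"
print(to_block_id_constant(sample))
print(to_block_object_constant(sample))
```

Lines that do not match the pattern are returned unchanged (apart from stripped whitespace), so the functions can be run over a whole copied enum section.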
+ |
+* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata |
+ (not strictly necessary for NOT_ENCODED scripts) |
+ ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt |
+ |
+* generate normalization data files |
+ cd $ICU_ROOT/dbg |
+ bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource |
+ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt |
+ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt |
+ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt |
+ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt |
+ |
+* build ICU (make install) |
+ so that the tools build can pick up the new definitions from the installed header files. |
+ |
+ $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt |
+ |
+* build Unicode tools using CMake+make |
+ |
+~/svn.icutools/trunk/src/unicode/c/icudefs.txt: |
+ |
+ # Location (--prefix) of where ICU was installed. |
+ set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst) |
+ # Location of the ICU source tree. |
+ set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src) |
+ |
+ ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c |
+ ~/svn.icutools/trunk/dbg/unicode/c$ make |
+ |
+* generate core properties data files |
+- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR |
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR |
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR |
+- rebuild ICU (make install) & tools |
+- run genuca again (see step above) so that it picks up the new nfc.nrm |
+- rebuild ICU (make install) & tools |
+ |
+* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to |
+ sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) |
+- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters |
+- Unicode 6.0..8.0: U+2260, U+226E, U+226F |
+- nothing new in 8.0, no test file to update |
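
The grep step above can also be done with a small script that filters out the ASCII ranges automatically. This is a minimal sketch, assuming the usual "code or range ; status ; ... # comment" line layout of IdnaMappingTable.txt; the sample lines are illustrative and the function name is hypothetical.

```python
def non_ascii_std3_valid(lines):
    """Return (first, last) code point ranges above U+007F whose
    status field is disallowed_STD3_valid."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        if last > 0x7F:
            # Clip a range that starts in ASCII to its non-ASCII part.
            found.append((max(first, 0x80), last))
    return found


# Illustrative lines in the assumed file format.
sample = [
    "0000..002C    ; disallowed_STD3_valid    # 1.1  <control-0000>..COMMA",
    "2260          ; disallowed_STD3_valid    # 1.1  NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid    # 1.1  NOT LESS-THAN..NOT GREATER-THAN",
    "00C0          ; mapped ; 00E0            # 1.1  LATIN CAPITAL LETTER A WITH GRAVE",
]
for first, last in non_ascii_std3_valid(sample):
    print(f"U+{first:04X}..U+{last:04X}")
```

To check a real file, pass an open file object instead of the sample list; any range not already covered by the tests would then call for an update to uts46test.cpp and UTS46Test.java.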
+ |
+* run & fix ICU4C tests |
+- bad Cherokee case folding due to a difference in fallbacks: |
+ UCD case folding falls back to no mapping, |
+ ICU runtime case folding falls back to lowercasing; |
+ fixed casepropsbuilder.cpp to generate scf (simple case folding) mappings to self |
+ when there is an slc (simple lowercase) mapping but no scf |
+- Andy handles RBBI & spoof check test failures |
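
The Cherokee case folding fix noted above can be illustrated with a toy model. This is not the casepropsbuilder.cpp code; it assumes simple dictionaries for the slc and scf mappings, and the function names are hypothetical. The sample pair is U+13A0 CHEROKEE LETTER A and U+AB70 CHEROKEE SMALL LETTER A, where the lowercase letter folds to the uppercase one.

```python
def runtime_fold(cp, slc, scf):
    """Model of a runtime lookup that falls back to lowercasing
    when there is no explicit case folding mapping."""
    if cp in scf:
        return scf[cp]
    return slc.get(cp, cp)


def add_scf_to_self(slc, scf):
    """Builder-side fix: give every code point that has an slc mapping
    but no scf mapping an explicit scf mapping to itself."""
    fixed = dict(scf)
    for cp in slc:
        if cp not in fixed:
            fixed[cp] = cp
    return fixed


slc = {0x13A0: 0xAB70}  # uppercase -> lowercase
scf = {0xAB70: 0x13A0}  # lowercase folds to uppercase; no entry for U+13A0

# Without the fix, the fallback folds U+13A0 to its lowercase form.
print(hex(runtime_fold(0x13A0, slc, scf)))

# With the explicit mapping to self, both letters fold to U+13A0.
fixed_scf = add_scf_to_self(slc, scf)
print(hex(runtime_fold(0x13A0, slc, fixed_scf)))
print(hex(runtime_fold(0xAB70, slc, fixed_scf)))
```

Writing the mapping to self into the data keeps the runtime fallback unchanged for all other characters.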
+ |
+* collation: CLDR collation root, UCA DUCET |
+ |
+- UCA DUCET goes into Mark's Unicode tools, see |
+ https://sites.google.com/site/unicodetools/home#TOC-UCA |
+- CLDR root data files are checked into (CLDR UCA branch)/common/uca/ |
+- cd (CLDR UCA branch)/common/uca/ |
+- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt |
+ cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt |
+- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt |
+ cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt |
+ (note: the destination file name in the next command drops the underscore before "Rules") |
+ cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt |
+- restore TODO diffs in UCARules.txt |
+ meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt |
+- update (ICU4C)/source/test/testdata/CollationTest_*.txt |
+ and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt |
+ from the CLDR root files (..._CLDR_..._SHORT.txt) |
+ cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt |
+ cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt |
+ cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data |
+- if CLDR common/uca/unihan-index.txt changes, then update |
+ CLDR common/collation/root.xml <collation type="private-unihan"> |
+ and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt |
+- run genuca, see command line above; |
+ deal with |
+ Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt |
+ (add the character to genuca.cpp sampleCharsToScripts[]) |
+ + look up the script for the new sample characters |
+ (e.g., in FractionalUCA.txt) |
+ + *add* mappings to sampleCharsToScripts[], do not replace them |
+ (in case the script sample characters flip-flop) |
+ + insert new scripts in DUCET script order, see the top_byte table |
+ at the beginning of FractionalUCA.txt |
+- rebuild ICU4C |
+ |
+* run & fix ICU4C tests, now with new CLDR collation root data |
+- run all tests with the collation test data *_SHORT.txt or the full files |
+ (the full ones have comments, useful for debugging) |
+- note on intltest: if collate/UCAConformanceTest fails, then |
+ utility/MultithreadTest/TestCollators will fail as well; |
+ fix the conformance test before looking into the multi-thread test |
+- fixed bug in CollationWeights::getWeightRanges() |
+ exposed by new data and CollationTest::TestRootElements |
+ |
+* update Java data files |
+- refresh just the UCD/UCA-related/derived files, just to be safe |
+- see (ICU4C)/source/data/icu4j-readme.txt |
+- mkdir /tmp/icu4j |
+- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
+ output: |
+ ... |
+ Unicode .icu files built to ./out/build/icudt56l |
+ echo timestamp > uni-core-data |
+ mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b |
+ mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b |
+ echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt |
+ LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b |
+ mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b" |
+ jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/ |
+ mkdir -p /tmp/icu4j/main/shared/data |
+ cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data |
+ jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/ |
+ mkdir -p /tmp/icu4j/main/shared/data |
+ cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data |
+ make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data' |
+- copy the big-endian Unicode data files to another location, |
+ separate from the other data files, |
+ and then refresh ICU4J |
+ cd ~/svn.icu/trunk/dbg/data/out/icu4j |
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll |
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr |
+ cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
+ cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
+ rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu |
+ cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
+ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll |
+ cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr |
+ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT |
+* When refreshing all of ICU4J data from ICU4C |
+- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
+- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data |
+or |
+- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install |
+ |
+* update CollationFCD.java |
+ + copy & paste the initializers of lcccIndex[] etc. from |
+ ICU4C/source/i18n/collationfcd.cpp to |
+ ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java |
+ |
+* refresh Java test .txt files |
+- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode |
+ cd $ICU_SRC_DIR/source/data/unidata |
+ cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode |
+ cd ../../test/testdata |
+ cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode |
+ cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode |
+ |
+* run & fix ICU4J tests |
+ |
+*** LayoutEngine script information |
+ |
+* ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more, |
+ because the layout engine was deprecated in ICU 54. |
+ Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java |
+ to write lines that we used to add manually. |
+ |
+* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. |
+ This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp |
+ in the working directory. |
+ |
+ (It also generates ScriptRunData.cpp, which is no longer needed.) |
+ |
+ It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages |
+ (a plain text file) |
+ which maps ICU versions to the numbers of script/language constants |
+ that were added then. |
+ (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.) |
+ |
+ The generated files have a current copyright date and "@deprecated" statement. |
+ |
+* Review changes, fix Java tool if necessary, and copy to ICU4C |
+ cd ~/svn.icu4j/trunk/src |
+ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout |
+ cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout |
+ cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout |
+ |
+*** API additions |
+- send notice to icu-design about new born-@stable API (enum constants etc.) |
+ |
+*** merge the Unicode update branches back onto the trunk |
+- do not merge icudata.jar and testdata.jar; |
+ instead, rebuild them from the merged & tested ICU4C |
+- make sure that changes to Unicode tools & ICU tools are checked in |
+ http://www.unicode.org/utility/trac/log/trunk/unicodetools |
+ http://bugs.icu-project.org/trac/log/tools/trunk |
---------------------------------------------------------------------------- *** |