source/data/unidata/changes.txt - Issue 1621843002: ICU 56 update step 1

Unified Diff: source/data/unidata/changes.txt

Issue 1621843002: ICU 56 update step 1 (Closed) Base URL: https://chromium.googlesource.com/chromium/deps/icu.git@561

Patch Set: Created 4 years, 11 months ago

Index: source/data/unidata/changes.txt

diff --git a/source/data/unidata/changes.txt b/source/data/unidata/changes.txt

index 23f29bf2e3b88993a78621a62378fdf613a087c6..5febb0745ed01768dab97de15ec55234a8b37cfb 100644

--- a/source/data/unidata/changes.txt

+++ b/source/data/unidata/changes.txt

@@ -1,4 +1,4 @@

* file name: changes.txt

@@ -13,36 +13,338 @@

---------------------------------------------------------------------------- ***

-Unicode 8.0 update for ICU ??

+* New ISO 15924 script codes

-* UCA issue from 7.0

+Starting with ICU 55, we do not add UScriptCode constants any more until their scripts

+are encoded in Unicode, or can be assumed to be encoded in the next Unicode version.

+Script enum constant names should follow the Unicode script property value aliases,

+which are assigned only when the scripts are encoded.

+When we encode scripts early and guess wrong, then we have confusing enum constants

+and have sometimes added aliases.

-- U+1DE9 COMBINING LATIN SMALL LETTER BETA

- sorts with Greek Beta, should sort with Latin B?

- + Ken says:

- No, it was deliberate:

+Exception: Script codes like Latf and Aran that are not subject to separate encoding

+can be added at any time.

- 03B2;GREEK SMALL LETTER BETA;Ll;;;;0392;;0392

- 1D5D;MODIFIER LETTER SMALL BETA;Lm;<super> 03B2;;;;;

- 1DE9;COMBINING LATIN SMALL LETTER BETA;Mn;<sort> 03B2;;;;;

- 1D66;GREEK SUBSCRIPT SMALL LETTER BETA;Ll;<sub> 03B2;;;;;

+Script codes not yet in ICU: http://www.unicode.org/iso15924/codechanges.html

- Note the relationship to U+1D5D.

+Added 2014-11-15, see http://bugs.icu-project.org/trac/ticket/11561

+- Adlm 166 Adlam

+- Aran 161 Arabic (Nastaliq variant)

+- Kitl 505 Khitan large script

+- Kits 288 Khitan small script

+- Marc 332 Marchen

+- Osge 219 Osage

- When the disunified *Latin* beta base letter shows up in Unicode 8.0:

+Aran can be added as USCRIPT_ARABIC_NASTALIQ at any time.

- U+A7B4 LATIN CAPITAL LETTER BETA

- U+A7B5 LATIN SMALL LETTER BETA

+Adlam, Marchen, and Osage are expected to go into Unicode 9;

+we should assign Unicode script property value aliases for them

+soon after Unicode 8 is released, and add them in ICU 56.

- we could re-evaluate what U+1DE9 equates to, for collation,

- but currently there isn’t any Latin beta to serve that function

- in Unicode 7.0.

+Khitan scripts will be encoded later.

-- ICU_ROOT=~/svn.icu/trunk

-- ICU_SRC_DIR=$ICU_ROOT/src

+---------------------------------------------------------------------------- ***

+Unicode 8.0 update for ICU 56

+* Command-line environment setup

+ICU_ROOT=~/svn.icu/trunk

+ICU_SRC_DIR=$ICU_ROOT/src

+ICUDT=icudt56b

+export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib

+SRC_DATA_IN=$ICU_SRC_DIR/source/data/in

+UNIDATA=$ICU_SRC_DIR/source/data/unidata

+http://www.unicode.org/review/pri297/ -- beta review

+http://www.unicode.org/reports/uax-proposed-updates.html

+http://unicode.org/versions/beta-8.0.0.html

+http://www.unicode.org/versions/Unicode8.0.0/

+http://www.unicode.org/reports/tr44/tr44-15.html

+*** ICU Trac

+- ticket:11574: Unicode 8

+- C++ branches/markus/uni80 at r37351 from trunk at r37343

+- Java branches/markus/uni80 at r37352 from trunk at r37338

+*** CLDR Trac

+- cldrbug 8311: UCA 8

+- branches/markus/uni80 at r11518 from trunk at r11517

+- cldrbug 8109: Unicode 8.0 script metadata

+- cldrbug 8418: Updated segmentation for Unicode 8.0

+*** Unicode version numbers

+- makedata.mak

+- uchar.h

+- com.ibm.icu.util.VersionInfo

+- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

+- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h

+ so that the makefiles see the new version number.

+*** data files & enums & parser code

+* file preparation

+- download UCD & IDNA files

+- make sure that the Unicode data folder passed into preparseucd.py

+ includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)

+- only for manual diffs: remove version suffixes from the file names

+ ~/unidata/uni70/20140403$ ../../desuffixucd.py .

+ (see https://sites.google.com/site/unicodetools/inputdata)

+- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

+- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src

+- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

+- also: from http://unicode.org/Public/security/8.0.0/ download new

+ confusables.txt & confusablesWholeScript.txt

+ and copy to $UNIDATA

+ ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA

+ ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA

+* initial preparseucd.py changes

+- remove new Unicode scripts from the

+ only-in-ISO-15924 list according to the error message:

+ ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw']

+ from _scripts_only_in_iso15924

+ -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()

+ and in com.ibm.icu.dev.test.lang.TestUScript.java

+- property and file name change:

+ IndicMatraCategory -> IndicPositionalCategory

+- UnicodeData.txt unusual numeric values (unreduced fractions)

+ 109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;;

+ 109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;

+ 109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;;

+ 109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;;

+ 109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;;

+ 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;

+ 109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;;

+ 109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;;

+ 109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;;

+ 109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;;

+ -> change preparseucd.py to map them to reduced fractions (e.g., 2/12 -> 1/6)

+ which are listed in DerivedNumericValues.txt;

+ keeps storage in data file simple
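
The fraction mapping described above can be sketched as follows. This is a minimal illustration, not the actual preparseucd.py change; the function name reduce_numeric_value is hypothetical, and it assumes the numeric value arrives as the string from the UnicodeData.txt field.

```python
from fractions import Fraction


def reduce_numeric_value(nv: str) -> str:
    """Return a numeric-value string with any fraction in lowest terms.

    Hypothetical helper for illustration; not taken from preparseucd.py.
    """
    if "/" not in nv:
        return nv  # plain integers pass through unchanged
    f = Fraction(nv)  # Fraction parses "n/d" strings and reduces them
    if f.denominator == 1:
        return str(f.numerator)
    return f"{f.numerator}/{f.denominator}"


# The ten Meroitic cursive fractions listed above, 1/12 through 10/12.
meroitic = [f"{n}/12" for n in range(1, 11)]
reduced = [reduce_numeric_value(nv) for nv in meroitic]
```

With this sketch, 2/12 becomes 1/6 and 6/12 becomes 1/2, while 1/12 is already in lowest terms and stays as it is.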

+* PropertyValueAliases.txt changes

+- 10 new Block (blk) values:

+ blk; Ahom ; Ahom

+ blk; Anatolian_Hieroglyphs ; Anatolian_Hieroglyphs

+ blk; Cherokee_Sup ; Cherokee_Supplement

+ blk; CJK_Ext_E ; CJK_Unified_Ideographs_Extension_E

+ blk; Early_Dynastic_Cuneiform ; Early_Dynastic_Cuneiform

+ blk; Hatran ; Hatran

+ blk; Multani ; Multani

+ blk; Old_Hungarian ; Old_Hungarian

+ blk; Sup_Symbols_And_Pictographs ; Supplemental_Symbols_And_Pictographs

+ blk; Sutton_SignWriting ; Sutton_SignWriting

+ -> add to uchar.h

+ use long property names for enum constants

+ -> add to UCharacter.UnicodeBlock IDs

+ Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)

+ replace public static final int \1_ID = \2; \3

+ -> add to UCharacter.UnicodeBlock objects

+ Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)

+ replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

+- 6 new Script (sc) values:

+ sc ; Ahom ; Ahom

+ sc ; Hatr ; Hatran

+ sc ; Hluw ; Anatolian_Hieroglyphs

+ sc ; Hung ; Old_Hungarian

+ sc ; Mult ; Multani

+ sc ; Sgnw ; SignWriting

+ -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript
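
The two Eclipse find/replace patterns for the new Block values can be tried outside the IDE with a short script. This is only a sketch: the sample input line is illustrative and should be replaced with the real enum lines from uchar.h, and the function names are hypothetical. Python's re module happens to accept the same \1 backreference syntax that the notes use.

```python
import re

# Same patterns as in the Eclipse notes above.
ID_FIND = r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)"
ID_REPLACE = r"public static final int \1_ID = \2; \3"

OBJ_FIND = r"UBLOCK_([^ ]+) = [0-9]+, (/.+)"
OBJ_REPLACE = r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2'


def to_id_constant(line: str) -> str:
    """Turn a C enum line into a Java UnicodeBlock ID constant."""
    return re.sub(ID_FIND, ID_REPLACE, line)


def to_block_object(line: str) -> str:
    """Turn a C enum line into a Java UnicodeBlock object declaration."""
    return re.sub(OBJ_FIND, OBJ_REPLACE, line)


# Illustrative input in the shape of a uchar.h enum line (assumed, not copied).
sample = "UBLOCK_AHOM = 253, /*[11700]*/"
id_line = to_id_constant(sample)
obj_line = to_block_object(sample)
```

Lines that do not match the pattern are returned unchanged, so the functions can be mapped over a whole pasted enum block.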

+* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata

+ (not strictly necessary for NOT_ENCODED scripts)

+ ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

+* generate normalization data files

+ cd $ICU_ROOT/dbg

+ bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource

+ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt

+ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt

+ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt

+ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

+* build ICU (make install)

+ so that the tools build can pick up the new definitions from the installed header files.

+ $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

+* build Unicode tools using CMake+make

+~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

+ # Location (--prefix) of where ICU was installed.

+ set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)

+ # Location of the ICU source tree.

+ set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)

+ ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c

+ ~/svn.icutools/trunk/dbg/unicode/c$ make

+* generate core properties data files

+- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR

- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR

- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR

+- rebuild ICU (make install) & tools

+- run genuca again (see step above) so that it picks up the new nfc.nrm

+- rebuild ICU (make install) & tools

+* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to

+ sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)

+- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters

+- Unicode 6.0..8.0: U+2260, U+226E, U+226F

+- nothing new in 8.0, no test file to update
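
The grep step above can also be done with a small script that understands code point ranges. This is a sketch assuming the usual semicolon-separated layout of IdnaMappingTable.txt (code point or range; status; optional mapping; # comment). The sample lines are illustrative rather than copied from the file, and the function name is hypothetical.

```python
def non_ascii_std3_valid(lines):
    """Return (first, last) code point ranges above U+007F whose status
    is disallowed_STD3_valid. Hypothetical helper for illustration."""
    result = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        if first > 0x7F:  # ASCII entries are expected; skip them
            result.append((first, last))
    return result


# Illustrative lines in the assumed file format.
sample_lines = [
    "# IdnaMappingTable sample",
    "0000..002C    ; disallowed_STD3_valid   # <control-0000>..COMMA",
    "00E0..00F6    ; valid                   # LATIN SMALL LETTER A WITH GRAVE..",
    "2260          ; disallowed_STD3_valid   # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid   # NOT LESS-THAN..NOT GREATER-THAN",
]
hits = non_ascii_std3_valid(sample_lines)
```

Any range reported beyond the three characters already known (U+2260, U+226E, U+226F) would indicate that the UTS #46 tests need new cases.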

+* run & fix ICU4C tests

+- bad Cherokee case folding due to difference in fallbacks:

+ UCD case folding falls back to no mapping,

+ ICU runtime case folding falls back to lowercasing;

+ fixed casepropsbuilder.cpp to generate scf mappings to self

+ when there is an slc mapping but no scf

+- Andy handles RBBI & spoof check test failures
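
The Cherokee case folding problem comes down to two different fallback rules, which the following simplified model illustrates. It is not the casepropsbuilder.cpp code; the dictionaries stand in for the slc (simple lowercase) and scf (simple case folding) data, and both function names are hypothetical.

```python
def runtime_fold(c, scf, slc):
    """Model of the ICU runtime rule described above:
    use the scf mapping if present, else fall back to lowercasing."""
    if c in scf:
        return scf[c]
    return slc.get(c, c)


def add_self_folding(slc, scf):
    """Model of the builder fix: wherever a character has an slc mapping
    but no scf mapping, store an explicit scf mapping to itself."""
    fixed = dict(scf)
    for c in slc:
        if c not in fixed:
            fixed[c] = c
    return fixed


# In Unicode 8.0, Cherokee folds to the uppercase letters:
# U+13A0 lowercases to U+AB70, and U+AB70 case-folds to U+13A0.
slc = {0x13A0: 0xAB70}
scf = {0xAB70: 0x13A0}

before = runtime_fold(0x13A0, scf, slc)   # wrong: falls back to lowercase
fixed_scf = add_self_folding(slc, scf)
after = runtime_fold(0x13A0, fixed_scf, slc)  # folds to itself, as in the UCD
```

In this model the unfixed data folds U+13A0 to U+AB70, while the UCD intends U+13A0 to fold to itself; the explicit self-mapping removes the discrepancy without changing the runtime fallback.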

+* collation: CLDR collation root, UCA DUCET

+- UCA DUCET goes into Mark's Unicode tools, see

+ https://sites.google.com/site/unicodetools/home#TOC-UCA

+- CLDR root data files are checked into (CLDR UCA branch)/common/uca/

+- cd (CLDR UCA branch)/common/uca/

+- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt

+ cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt

+- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt

+ cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt

+ (note removing the underscore before "Rules")

+ cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt

+- restore TODO diffs in UCARules.txt

+ meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt

+- update (ICU4C)/source/test/testdata/CollationTest_*.txt

+ and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt

+ from the CLDR root files (..._CLDR_..._SHORT.txt)

+ cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt

+ cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt

+ cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data

+- if CLDR common/uca/unihan-index.txt changes, then update

+ CLDR common/collation/root.xml <collation type="private-unihan">

+ and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt

+- run genuca, see command line above;

+ deal with

+ Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt

+ (add the character to genuca.cpp sampleCharsToScripts[])

+ + look up the script for the new sample characters

+ (e.g., in FractionalUCA.txt)

+ + *add* mappings to sampleCharsToScripts[], do not replace them

+ (in case the script sample characters flip-flop)

+ + insert new scripts in DUCET script order, see the top_byte table

+ at the beginning of FractionalUCA.txt

+- rebuild ICU4C

+* run & fix ICU4C tests, now with new CLDR collation root data

+- run all tests with the collation test data *_SHORT.txt or the full files

+ (the full ones have comments, useful for debugging)

+- note on intltest: if collate/UCAConformanceTest fails, then

+ utility/MultithreadTest/TestCollators will fail as well;

+ fix the conformance test before looking into the multi-thread test

+- fixed bug in CollationWeights::getWeightRanges()

+ exposed by new data and CollationTest::TestRootElements

+* update Java data files

+- refresh just the UCD/UCA-related/derived files, just to be safe

+- see (ICU4C)/source/data/icu4j-readme.txt

+- mkdir /tmp/icu4j

+- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

+ output:

+ ...

+ Unicode .icu files built to ./out/build/icudt56l

+ echo timestamp > uni-core-data

+ mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b

+ mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b

+ echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt

+ LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b

+ mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b"

+ jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/

+ mkdir -p /tmp/icu4j/main/shared/data

+ cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data

+ jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/

+ mkdir -p /tmp/icu4j/main/shared/data

+ cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data

+ make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'

+- copy the big-endian Unicode data files to another location,

+ separate from the other data files,

+ and then refresh ICU4J

+ cd ~/svn.icu/trunk/dbg/data/out/icu4j

+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll

+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr

+ cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

+ cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

+ rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu

+ cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

+ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll

+ cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr

+ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

+* When refreshing all of ICU4J data from ICU4C

+- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

+- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data

+or

+- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

+* update CollationFCD.java

+ + copy & paste the initializers of lcccIndex[] etc. from

+ ICU4C/source/i18n/collationfcd.cpp to

+ ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

+* refresh Java test .txt files

+- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

+ cd $ICU_SRC_DIR/source/data/unidata

+ cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

+ cd ../../test/testdata

+ cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

+ cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

+* run & fix ICU4J tests

+*** LayoutEngine script information

+* ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more,

+ because the layout engine was deprecated in ICU 54.

+ Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java

+ to write lines that we used to add manually.

+* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.

+ This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp

+ in the working directory.

+ (It also generates ScriptRunData.cpp, which is no longer needed.)

+ It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages

+ (a plain text file)

+ which maps ICU versions to the numbers of script/language constants

+ that were added then.

+ (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

+ The generated files have a current copyright date and "@deprecated" statement.

+* Review changes, fix Java tool if necessary, and copy to ICU4C

+ cd ~/svn.icu4j/trunk/src

+ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout

+ cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout

+ cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

+*** API additions

+- send notice to icu-design about new born-@stable API (enum constants etc.)

+*** merge the Unicode update branches back onto the trunk

+- do not merge the icudata.jar and testdata.jar,

+ instead rebuild them from merged & tested ICU4C

+- make sure that changes to Unicode tools & ICU tools are checked in

+ http://www.unicode.org/utility/trac/log/trunk/unicodetools

+ http://bugs.icu-project.org/trac/log/tools/trunk

---------------------------------------------------------------------------- ***
