Chromium Code Reviews

Unified Diff: source/data/unidata/changes.txt

Issue 1621843002: ICU 56 update step 1 (Closed) Base URL: https://chromium.googlesource.com/chromium/deps/icu.git@561
Patch Set: Created 4 years, 11 months ago
Index: source/data/unidata/changes.txt
diff --git a/source/data/unidata/changes.txt b/source/data/unidata/changes.txt
index 23f29bf2e3b88993a78621a62378fdf613a087c6..5febb0745ed01768dab97de15ec55234a8b37cfb 100644
--- a/source/data/unidata/changes.txt
+++ b/source/data/unidata/changes.txt
@@ -1,4 +1,4 @@
-* Copyright (C) 2004-2014, International Business Machines
+* Copyright (C) 2004-2015, International Business Machines
 * Corporation and others. All Rights Reserved.
 *
 * file name: changes.txt
@@ -13,36 +13,338 @@
 ---------------------------------------------------------------------------- ***
-Unicode 8.0 update for ICU ??
+* New ISO 15924 script codes
-* UCA issue from 7.0
+Starting with ICU 55, we do not add UScriptCode constants any more until their scripts
+are encoded in Unicode, or can be assumed to be encoded in the next Unicode version.
+Script enum constant names want to follow the Unicode script property value aliases,
+which are assigned only when the scripts are encoded.
+When we encode scripts early and guess wrong, then we have confusing enum constants
+and have sometimes added aliases.
-- U+1DE9 COMBINING LATIN SMALL LETTER BETA
- sorts with Greek Beta, should sort with Latin B?
- + Ken says:
- No, it was deliberate:
+Exception: Script codes like Latf and Aran that are not subject to separate encoding
+can be added at any time.
- 03B2;GREEK SMALL LETTER BETA;Ll;;;;0392;;0392
- 1D5D;MODIFIER LETTER SMALL BETA;Lm;<super> 03B2;;;;;
- 1DE9;COMBINING LATIN SMALL LETTER BETA;Mn;<sort> 03B2;;;;;
- 1D66;GREEK SUBSCRIPT SMALL LETTER BETA;Ll;<sub> 03B2;;;;;
+Script codes not yet in ICU: http://www.unicode.org/iso15924/codechanges.html
- Note the relationship to U+1D5D.
+Added 2014-11-15, see http://bugs.icu-project.org/trac/ticket/11561
+- Adlm 166 Adlam
+- Aran 161 Arabic (Nastaliq variant)
+- Kitl 505 Khitan large script
+- Kits 288 Khitan small script
+- Marc 332 Marchen
+- Osge 219 Osage
- When the disunified *Latin* beta base letter shows up in Unicode 8.0:
+Aran can be added as USCRIPT_ARABIC_NASTALIQ at any time.
- U+A7B4 LATIN CAPITAL LETTER BETA
- U+A7B5 LATIN SMALL LETTER BETA
+Adlam, Marchen, and Osage are expected to go into Unicode 9;
+we should assign Unicode script property value aliases for them
+soon after Unicode 8 is released, and add them in ICU 56.
- we could re-evaluate what U+1DE9 equates to, for collation,
- but currently there isn’t any Latin beta to serve that function
- in Unicode 7.0.
+Khitan scripts will be encoded later.
-- ICU_ROOT=~/svn.icu/trunk
-- ICU_SRC_DIR=$ICU_ROOT/src
+---------------------------------------------------------------------------- ***
+
+Unicode 8.0 update for ICU 56
+
+* Command-line environment setup
+
+ICU_ROOT=~/svn.icu/trunk
+ICU_SRC_DIR=$ICU_ROOT/src
+ICUDT=icudt56b
+export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
+SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
+UNIDATA=$ICU_SRC_DIR/source/data/unidata
+
+http://www.unicode.org/review/pri297/ -- beta review
+http://www.unicode.org/reports/uax-proposed-updates.html
+http://unicode.org/versions/beta-8.0.0.html
+http://www.unicode.org/versions/Unicode8.0.0/
+http://www.unicode.org/reports/tr44/tr44-15.html
+
+*** ICU Trac
+
+- ticket:11574: Unicode 8
+- C++ branches/markus/uni80 at r37351 from trunk at r37343
+- Java branches/markus/uni80 at r37352 from trunk at r37338
+
+*** CLDR Trac
+
+- cldrbug 8311: UCA 8
+- branches/markus/uni80 at r11518 from trunk at r11517
+
+- cldrbug 8109: Unicode 8.0 script metadata
+- cldrbug 8418: Updated segmentation for Unicode 8.0
+
+*** Unicode version numbers
+- makedata.mak
+- uchar.h
+- com.ibm.icu.util.VersionInfo
+- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
+
+- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
+ so that the makefiles see the new version number.
+
+*** data files & enums & parser code
+
+* file preparation
+
+- download UCD & IDNA files
+- make sure that the Unicode data folder passed into preparseucd.py
+ includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
+- only for manual diffs: remove version suffixes from the file names
+ ~/unidata/uni70/20140403$ ../../desuffixucd.py .
+ (see https://sites.google.com/site/unicodetools/inputdata)
+- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
+- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src
+- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
+
+- also: from http://unicode.org/Public/security/8.0.0/ download new
+ confusables.txt & confusablesWholeScript.txt
+ and copy to $UNIDATA
+ ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA
+ ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA
+
+* initial preparseucd.py changes
+- remove new Unicode scripts from the
+ only-in-ISO-15924 list according to the error message:
+ ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw']
+ from _scripts_only_in_iso15924
+ -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
+ and in com.ibm.icu.dev.test.lang.TestUScript.java
+- property and file name change:
+ IndicMatraCategory -> IndicPositionalCategory
+- UnicodeData.txt unusual numeric values (improper fractions)
+ 109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;;
+ 109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;
+ 109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;;
+ 109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;;
+ 109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;;
+ 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;
+ 109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;;
+ 109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;;
+ 109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;;
+ 109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;;
+ -> change preparseucd.py to map them to proper fractions (e.g., 1/6)
+ which are listed in DerivedNumericValues.txt;
+ keeps storage in data file simple
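
The fraction mapping described above can be sketched as follows. This is a minimal illustration, assuming the numeric-value field is a plain "n/d" string; the helper name reduce_numeric_value is hypothetical and is not taken from preparseucd.py.

```python
from fractions import Fraction


def reduce_numeric_value(nv):
    """Return a UnicodeData.txt numeric value with any fraction in lowest terms.

    Hypothetical helper for illustration, not the actual preparseucd.py code.
    """
    if "/" not in nv:
        return nv  # plain integers such as "12" pass through unchanged
    value = Fraction(nv)  # Fraction("2/12") is normalized to 1/6
    if value.denominator == 1:
        return str(value.numerator)
    return f"{value.numerator}/{value.denominator}"


if __name__ == "__main__":
    # The Meroitic cursive fractions U+109F6..U+109FF listed above.
    for n in range(1, 11):
        print(f"{n}/12 -> {reduce_numeric_value(f'{n}/12')}")
```

Reducing at parse time means the data builder only ever sees the values that DerivedNumericValues.txt lists, which matches the stated goal of keeping storage simple.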
+
+* PropertyValueAliases.txt changes
+- 10 new Block (blk) values:
+ blk; Ahom ; Ahom
+ blk; Anatolian_Hieroglyphs ; Anatolian_Hieroglyphs
+ blk; Cherokee_Sup ; Cherokee_Supplement
+ blk; CJK_Ext_E ; CJK_Unified_Ideographs_Extension_E
+ blk; Early_Dynastic_Cuneiform ; Early_Dynastic_Cuneiform
+ blk; Hatran ; Hatran
+ blk; Multani ; Multani
+ blk; Old_Hungarian ; Old_Hungarian
+ blk; Sup_Symbols_And_Pictographs ; Supplemental_Symbols_And_Pictographs
+ blk; Sutton_SignWriting ; Sutton_SignWriting
+ -> add to uchar.h
+ use long property names for enum constants
+ -> add to UCharacter.UnicodeBlock IDs
+ Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
+ replace public static final int \1_ID = \2; \3
+ -> add to UCharacter.UnicodeBlock objects
+ Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
+ replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
+- 6 new Script (sc) values:
+ sc ; Ahom ; Ahom
+ sc ; Hatr ; Hatran
+ sc ; Hluw ; Anatolian_Hieroglyphs
+ sc ; Hung ; Old_Hungarian
+ sc ; Mult ; Multani
+ sc ; Sgnw ; SignWriting
+ -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript
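
The Eclipse find/replace steps for the new Block values can also be scripted. The sketch below reuses the find pattern quoted above; the function names are hypothetical, and the sample uchar.h line in the usage part is only an assumed illustration of the line format.

```python
import re

# Same pattern as the first Eclipse "find" expression above.
UBLOCK_LINE = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")


def to_block_id(line):
    """Rewrite a uchar.h UBLOCK_ enum line as a UnicodeBlock ..._ID constant."""
    return UBLOCK_LINE.sub(r"public static final int \g<1>_ID = \g<2>; \g<3>", line)


def to_block_object(line):
    """Rewrite a uchar.h UBLOCK_ enum line as a UnicodeBlock object declaration.

    The second Eclipse pattern does not capture the number, so its comment is
    group 2 there; here one shared pattern is used and the comment is group 3.
    """
    return UBLOCK_LINE.sub(
        r'public static final UnicodeBlock \g<1> = new UnicodeBlock("\g<1>", \g<1>_ID); \g<3>',
        line,
    )


if __name__ == "__main__":
    # Assumed sample line; check the real enum value and comment in uchar.h.
    sample = "UBLOCK_AHOM = 253, /*[11700]*/"
    print(to_block_id(sample))
    print(to_block_object(sample))
```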
+
+* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
+ (not strictly necessary for NOT_ENCODED scripts)
+ ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
+
+* generate normalization data files
+ cd $ICU_ROOT/dbg
+ bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
+ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
+ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
+ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
+ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
+
+* build ICU (make install)
+ so that the tools build can pick up the new definitions from the installed header files.
+
+ $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
+
+* build Unicode tools using CMake+make
+
+~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
+
+ # Location (--prefix) of where ICU was installed.
+ set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
+ # Location of the ICU source tree.
+ set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)
+
+ ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
+ ~/svn.icutools/trunk/dbg/unicode/c$ make
+
+* generate core properties data files
+- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR
 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
+- rebuild ICU (make install) & tools
+- run genuca again (see step above) so that it picks up the new nfc.nrm
+- rebuild ICU (make install) & tools
+
+* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
+ sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
+- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
+- Unicode 6.0..8.0: U+2260, U+226E, U+226F
+- nothing new in 8.0, no test file to update
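
The grep step above can be approximated with a short script. This is a sketch that assumes the usual "code point or range ; status ; ... # comment" layout of IdnaMappingTable.txt; the function name is hypothetical and the sample lines are abbreviated illustrations, not copied from the 8.0 file.

```python
def non_ascii_std3_valid(lines):
    """Yield (first, last) code point ranges above U+007F whose status is
    disallowed_STD3_valid."""
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        if last > 0x7F:
            # Clip a range that straddles the ASCII boundary.
            yield max(first, 0x80), last


if __name__ == "__main__":
    sample = [
        "0000..002C    ; disallowed_STD3_valid    # <control-0000>..COMMA",
        "0041          ; mapped ; 0061            # LATIN CAPITAL LETTER A",
        "2260          ; disallowed_STD3_valid    # NOT EQUAL TO",
        "226E..226F    ; disallowed_STD3_valid    # NOT LESS-THAN..NOT GREATER-THAN",
    ]
    for first, last in non_ascii_std3_valid(sample):
        print(f"U+{first:04X}..U+{last:04X}")
```

Running it over the real file and comparing the output with the previous Unicode version shows whether the UTS #46 tests need new cases.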
+
+* run & fix ICU4C tests
+- bad Cherokee case folding due to difference in fallbacks:
+ UCD case folding falls back to no mapping,
+ ICU runtime case folding falls back to lowercasing;
+ fixed casepropsbuilder.cpp to generate scf mappings to self
+ when there is an slc mapping but no scf
+- Andy handles RBBI & spoof check test failures
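
The Cherokee case folding problem noted above comes down to two different fallbacks. The sketch below is a simplified model under stated assumptions, not the casepropsbuilder.cpp code: the mappings are plain dicts, the function names are hypothetical, and the sample pair (U+13A0 with simple lowercase U+AB70, which folds back to U+13A0) reflects the Unicode 8 data as understood here.

```python
def runtime_fold(cp, scf, slc):
    """Model of the runtime fallback: use scf if present, else lowercase, else self."""
    return scf.get(cp, slc.get(cp, cp))


def add_self_foldings(slc, scf):
    """Return scf plus an explicit self-mapping for every code point that has
    an slc mapping but no scf mapping, so the lowercase fallback never applies."""
    fixed = dict(scf)
    for cp in slc:
        fixed.setdefault(cp, cp)
    return fixed


if __name__ == "__main__":
    slc = {0x13A0: 0xAB70}  # uppercase Cherokee letter -> new lowercase letter
    scf = {0xAB70: 0x13A0}  # folding goes to the uppercase letter

    # Without the fix, U+13A0 falls back to lowercasing and the pair never converges.
    print(f"before: U+{runtime_fold(0x13A0, scf, slc):04X}")
    fixed = add_self_foldings(slc, scf)
    print(f"after:  U+{runtime_fold(0x13A0, fixed, slc):04X}")
```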
+
+* collation: CLDR collation root, UCA DUCET
+
+- UCA DUCET goes into Mark's Unicode tools, see
+ https://sites.google.com/site/unicodetools/home#TOC-UCA
+- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
+- cd (CLDR UCA branch)/common/uca/
+- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
+ cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
+- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
+ cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
+ (note removing the underscore before "Rules")
+ cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
+- restore TODO diffs in UCARules.txt
+ meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
+- update (ICU4C)/source/test/testdata/CollationTest_*.txt
+ and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
+ from the CLDR root files (..._CLDR_..._SHORT.txt)
+ cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
+ cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
+ cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
+- if CLDR common/uca/unihan-index.txt changes, then update
+ CLDR common/collation/root.xml <collation type="private-unihan">
+ and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
+- run genuca, see command line above;
+ deal with
+ Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt
+ (add the character to genuca.cpp sampleCharsToScripts[])
+ + look up the script for the new sample characters
+ (e.g., in FractionalUCA.txt)
+ + *add* mappings to sampleCharsToScripts[], do not replace them
+ (in case the script sample characters flip-flop)
+ + insert new scripts in DUCET script order, see the top_byte table
+ at the beginning of FractionalUCA.txt
+- rebuild ICU4C
+
+* run & fix ICU4C tests, now with new CLDR collation root data
+- run all tests with the collation test data *_SHORT.txt or the full files
+ (the full ones have comments, useful for debugging)
+- note on intltest: if collate/UCAConformanceTest fails, then
+ utility/MultithreadTest/TestCollators will fail as well;
+ fix the conformance test before looking into the multi-thread test
+- fixed bug in CollationWeights::getWeightRanges()
+ exposed by new data and CollationTest::TestRootElements
+
+* update Java data files
+- refresh just the UCD/UCA-related/derived files, just to be safe
+- see (ICU4C)/source/data/icu4j-readme.txt
+- mkdir /tmp/icu4j
+- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+ output:
+ ...
+ Unicode .icu files built to ./out/build/icudt56l
+ echo timestamp > uni-core-data
+ mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b
+ mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b
+ echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
+ LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b
+ mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b"
+ jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
+ jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/
+ mkdir -p /tmp/icu4j/main/shared/data
+ cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
+ make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
+- copy the big-endian Unicode data files to another location,
+ separate from the other data files,
+ and then refresh ICU4J
+ cd ~/svn.icu/trunk/dbg/data/out/icu4j
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+ cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
+ cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
+ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
+ cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
+ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
+* When refreshing all of ICU4J data from ICU4C
+- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
+- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
+or
+- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
+
+* update CollationFCD.java
+ + copy & paste the initializers of lcccIndex[] etc. from
+ ICU4C/source/i18n/collationfcd.cpp to
+ ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
+
+* refresh Java test .txt files
+- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd $ICU_SRC_DIR/source/data/unidata
+ cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cd ../../test/testdata
+ cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
+ cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
+
+* run & fix ICU4J tests
+
+*** LayoutEngine script information
+
+* ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more,
+ because the layout engine was deprecated in ICU 54.
+ Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java
+ to write lines that we used to add manually.
+
+* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
+ This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
+ in the working directory.
+
+ (It also generates ScriptRunData.cpp, which is no longer needed.)
+
+ It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
+ (a plain text file)
+ which maps ICU versions to the numbers of script/language constants
+ that were added then.
+ (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
+
+ The generated files have a current copyright date and "@deprecated" statement.
+
+* Review changes, fix Java tool if necessary, and copy to ICU4C
+ cd ~/svn.icu4j/trunk/src
+ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
+ cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
+ cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
+
+*** API additions
+- send notice to icu-design about new born-@stable API (enum constants etc.)
+
+*** merge the Unicode update branches back onto the trunk
+- do not merge the icudata.jar and testdata.jar,
+ instead rebuild them from merged & tested ICU4C
+- make sure that changes to Unicode tools & ICU tools are checked in
+ http://www.unicode.org/utility/trac/log/trunk/unicodetools
+ http://bugs.icu-project.org/trac/log/tools/trunk
 ---------------------------------------------------------------------------- ***