icu46/source/i18n/rbt.h - Issue 5516007: Check in the pristine copy of ICU 4.6...

Unified Diff: icu46/source/i18n/rbt.h

Issue 5516007: Check in the pristine copy of ICU 4.6... (Closed) Base URL: svn://chrome-svn/chrome/trunk/deps/third_party/

Patch Set: Created 10 years ago

Index: icu46/source/i18n/rbt.h

===================================================================

--- icu46/source/i18n/rbt.h (revision 0)

+++ icu46/source/i18n/rbt.h (revision 0)

@@ -0,0 +1,473 @@

+/*

+**********************************************************************

+* Date Name Description

+* 11/17/99 aliu Creation.

+**********************************************************************

+*/

+#ifndef RBT_H

+#define RBT_H

+#include "unicode/utypes.h"

+#if !UCONFIG_NO_TRANSLITERATION

+#include "unicode/translit.h"

+#include "unicode/utypes.h"

+#include "unicode/parseerr.h"

+#include "unicode/udata.h"

+#define U_ICUDATA_TRANSLIT U_ICUDATA_NAME U_TREE_SEPARATOR_STRING "translit"

+U_NAMESPACE_BEGIN

+class TransliterationRuleData;

+/**

+ * <code>RuleBasedTransliterator</code> is a transliterator

+ * that reads a set of rules in order to determine how to perform

+ * translations. Rule sets are stored in resource bundles indexed by

+ * name. Rules within a rule set are separated by semicolons (';').

+ * To include a literal semicolon, prefix it with a backslash ('\').

+ * Whitespace, as defined by <code>Character.isWhitespace()</code>,

+ * is ignored. If the first non-blank character on a line is '#',

+ * the entire line is ignored as a comment.

+ *

+ * Each set of rules consists of two groups, one forward, and one

+ * reverse. This is a convention that is not enforced; rules for one

+ * direction may be omitted, with the result that translations in

+ * that direction will not modify the source text. In addition,

+ * bidirectional forward-reverse rules may be specified for

+ * symmetrical transformations.

+ *

+ * Rule syntax

+ *

+ * Rule statements take one of the following forms:

+ *

+ * <dl>

+ * <dt><code>$alefmadda=\u0622;</code></dt>

+ * <dd>Variable definition. The name on the

+ * left is assigned the text on the right. In this example,

+ * after this statement, instances of the left hand name,

+ * "<code>$alefmadda</code>", will be replaced by

+ * the Unicode character U+0622. Variable names must begin

+ * with a letter and consist only of letters, digits, and

+ * underscores. Case is significant. Duplicate names cause

+ * an exception to be thrown, that is, variables cannot be

+ * redefined. The right hand side may contain well-formed

+ * text of any length, including no text at all ("<code>$empty=;</code>").

+ * The right hand side may contain embedded <code>UnicodeSet</code>

+ * patterns, for example, "<code>$softvowel=[eiyEIY]</code>".</dd>

+ * <dd> </dd>

+ * <dt><code>ai>$alefmadda;</code></dt>

+ * <dd>Forward translation rule. This rule

+ * states that the string on the left will be changed to the

+ * string on the right when performing forward

+ * transliteration.</dd>

+ * <dt> </dt>

+ * <dt><code>ai<$alefmadda;</code></dt>

+ * <dd>Reverse translation rule. This rule

+ * states that the string on the right will be changed to

+ * the string on the left when performing reverse

+ * transliteration.</dd>

+ * </dl>

+ *

+ * <dl>

+ * <dt><code>ai<>$alefmadda;</code></dt>

+ * <dd>Bidirectional translation rule. This

+ * rule states that the string on the left will be changed

+ * to the string on the right when performing forward

+ * transliteration, and vice versa when performing reverse

+ * transliteration.</dd>

+ * </dl>

+ *

+ * Translation rules consist of a match pattern and an output

+ * string. The match pattern consists of literal characters,

+ * optionally preceded by context, and optionally followed by

+ * context. Context characters, like literal pattern characters,

+ * must be matched in the text being transliterated. However, unlike

+ * literal pattern characters, they are not replaced by the output

+ * text. For example, the pattern "<code>abc{def}</code>"

+ * indicates the characters "<code>def</code>" must be

+ * preceded by "<code>abc</code>" for a successful match.

+ * If there is a successful match, "<code>def</code>" will

+ * be replaced, but not "<code>abc</code>". The final '<code>}</code>'

+ * is optional, so "<code>abc{def</code>" is equivalent to

+ * "<code>abc{def}</code>". Another example is "<code>{123}456</code>"

+ * (or "<code>123}456</code>") in which the literal

+ * pattern "<code>123</code>" must be followed by "<code>456</code>".

+ *

+ *

+ * The output string of a forward or reverse rule consists of

+ * characters to replace the literal pattern characters. If the

+ * output string contains the character '<code>|</code>', this is

+ * taken to indicate the location of the cursor after

+ * replacement. The cursor is the point in the text at which the

+ * next replacement, if any, will be applied. The cursor is usually

+ * placed within the replacement text; however, it can actually be

+ * placed into the preceding or following context by using the

+ * special character '<code>@</code>'. Examples:

+ *

+ * <blockquote>

+ * <code>a {foo} z > | @ bar; # foo -> bar, move cursor before a

+ * {foo} xyz > bar @@|; # foo -> bar, cursor between y and z</code>

+ * </blockquote>

+ *

+ * UnicodeSet

+ *

+ * <code>UnicodeSet</code> patterns may appear anywhere that

+ * makes sense. They may appear in variable definitions.

+ * Contrariwise, <code>UnicodeSet</code> patterns may themselves

+ * contain variable references, such as "<code>$a=[a-z];$not_a=[^$a]</code>",

+ * or "<code>$range=a-z;$ll=[$range]</code>".

+ *

+ * <code>UnicodeSet</code> patterns may also be embedded directly

+ * into rule strings. Thus, the following two rules are equivalent:

+ *

+ * <blockquote>

+ * <code>$vowel=[aeiou]; $vowel>'*'; # One way to do this

+ * [aeiou]>'*';                # Another way</code>

+ * </blockquote>

+ *

+ * See {@link UnicodeSet} for more documentation and examples.

+ *

+ * Segments

+ *

+ * Segments of the input string can be matched and copied to the

+ * output string. This makes certain sets of rules simpler and more

+ * general, and makes reordering possible. For example:

+ *

+ * <blockquote>

+ * <code>([a-z]) > $1 $1;           # double lowercase letters

+ * ([:Lu:]) ([:Ll:]) > $2 $1; # reverse order of Lu-Ll pairs</code>

+ * </blockquote>

+ *

+ * The segment of the input string to be copied is delimited by

+ * "<code>(</code>" and "<code>)</code>". Up to

+ * nine segments may be defined. Segments may not overlap. In the

+ * output string, "<code>$1</code>" through "<code>$9</code>"

+ * represent the input string segments, in left-to-right order of

+ * definition.

+ *

+ * Anchors

+ *

+ * Patterns can be anchored to the beginning or the end of the text. This is done with the

+ * special characters '<code>^</code>' and '<code>$</code>'. For example:

+ *

+ * <blockquote>

+ * <code>^ a   > 'BEG_A';   # match 'a' at start of text

+ *   a   > 'A';       # match other instances of 'a'

+ *   z $ > 'END_Z';   # match 'z' at end of text

+ *   z   > 'Z';       # match other instances of 'z'</code>

+ * </blockquote>

+ *

+ * It is also possible to match the beginning or the end of the text using a <code>UnicodeSet</code>.

+ * This is done by including a virtual anchor character '<code>$</code>' at the end of the

+ * set pattern. Although this is usually the match character for the end anchor, the set will

+ * match either the beginning or the end of the text, depending on its placement. For

+ * example:

+ *

+ * <blockquote>

+ * <code>$x = [a-z$];   # match 'a' through 'z' OR anchor

+ * $x 1    > 2;   # match '1' after a-z or at the start

+ *    3 $x > 4;   # match '3' before a-z or at the end</code>

+ * </blockquote>

+ *

+ * Example

+ *

+ * The following example rules illustrate many of the features of

+ * the rule language.

+ *

+ * <table border="0" cellpadding="4">

+ * <tr>

+ * <td valign="top">Rule 1.</td>

+ * <td valign="top" nowrap><code>abc{def}>x|y</code></td>

+ * </tr>

+ * <tr>

+ * <td valign="top">Rule 2.</td>

+ * <td valign="top" nowrap><code>xyz>r</code></td>

+ * </tr>

+ * <tr>

+ * <td valign="top">Rule 3.</td>

+ * <td valign="top" nowrap><code>yz>q</code></td>

+ * </tr>

+ * </table>

+ *

+ * Applying these rules to the string "<code>adefabcdefz</code>"

+ * yields the following results:

+ *

+ * <table border="0" cellpadding="4">

+ * <tr>

+ * <td valign="top" nowrap><code>|adefabcdefz</code></td>

+ * <td valign="top">Initial state, no rules match. Advance

+ * cursor.</td>

+ * </tr>

+ * <tr>

+ * <td valign="top" nowrap><code>a|defabcdefz</code></td>

+ * <td valign="top">Still no match. Rule 1 does not match

+ * because the preceding context is not present.</td>

+ * </tr>

+ * <tr>

+ * <td valign="top" nowrap><code>ad|efabcdefz</code></td>

+ * <td valign="top">Still no match. Keep advancing until

+ * there is a match...</td>

+ * </tr>

+ * <tr>

+ * <td valign="top" nowrap><code>ade|fabcdefz</code></td>

+ * <td valign="top">...</td>

+ * </tr>

+ * <tr>

+ * <td valign="top" nowrap><code>adef|abcdefz</code></td>

+ * <td valign="top">...</td>

+ * </tr>

+ * <tr>

+ * <td valign="top" nowrap><code>adefa|bcdefz</code></td>

+ * <td valign="top">...</td>

+ * </tr>

+ * <tr>

+ * <td valign="top" nowrap><code>adefab|cdefz</code></td>

+ * <td valign="top">...</td>

+ * </tr>

+ * <tr>

+ * <td valign="top" nowrap><code>adefabc|defz</code></td>

+ * <td valign="top">Rule 1 matches; replace "<code>def</code>"

+ * with "<code>xy</code>" and back up the cursor

+ * to before the '<code>y</code>'.</td>

+ * </tr>

+ * <tr>

+ * <td valign="top" nowrap><code>adefabcx|yz</code></td>

+ * <td valign="top">Although "<code>xyz</code>" is

+ * present, rule 2 does not match because the cursor is

+ * before the '<code>y</code>', not before the '<code>x</code>'.

+ * Rule 3 does match. Replace "<code>yz</code>"

+ * with "<code>q</code>".</td>

+ * </tr>

+ * <tr>

+ * <td valign="top" nowrap><code>adefabcxq|</code></td>

+ * <td valign="top">The cursor is at the end;

+ * transliteration is complete.</td>

+ * </tr>

+ * </table>

+ *

+ * The order of rules is significant. If multiple rules may match

+ * at some point, the first matching rule is applied.

+ *

+ * Forward and reverse rules may have an empty output string.

+ * Otherwise, an empty left or right hand side of any statement is a

+ * syntax error.

+ *

+ * Single quotes are used to quote any character other than a

+ * digit or letter. To specify a single quote itself, inside or

+ * outside of quotes, use two single quotes in a row. For example,

+ * the rule "<code>'>'>o''clock</code>" changes the

+ * string "<code>></code>" to the string "<code>o'clock</code>".

+ *

+ *

+ * Notes

+ *

+ * While a RuleBasedTransliterator is being built, it checks that

+ * the rules are added in proper order. For example, if the rule

+ * "a>x" is followed by the rule "ab>y",

+ * then the second rule will throw an exception. The reason is that

+ * the second rule can never be triggered, since the first rule

+ * always matches anything it matches. In other words, the first

+ * rule masks the second rule.

+ *

+ * @author Alan Liu

+ * @internal Use transliterator factory methods instead since this class will be removed in that release.

+ */

+class RuleBasedTransliterator : public Transliterator {

+private:

+ /**

+ * The data object is immutable, so we can freely share it with

+ * other instances of RBT, as long as we do NOT own this object.

+ * TODO: data is no longer immutable. See bugs #1866, 2155

+ */

+ TransliterationRuleData* fData;

+ /**

+ * If true, we own the data object and must delete it.

+ */

+ UBool isDataOwned;

+public:

+ /**

+ * Constructs a new transliterator from the given rules.

+ * @param rules rules, separated by ';'

+ * @param direction either FORWARD or REVERSE.

+ * @exception IllegalArgumentException if rules are malformed.

+ * @internal Use transliterator factory methods instead since this class will be removed in that release.

+ */

+ RuleBasedTransliterator(const UnicodeString& id,

+ const UnicodeString& rules,

+ UTransDirection direction,

+ UnicodeFilter* adoptedFilter,

+ UParseError& parseError,

+ UErrorCode& status);

+ /**

+ * Constructs a new transliterator from the given rules.

+ * @param rules rules, separated by ';'

+ * @param direction either FORWARD or REVERSE.

+ * @exception IllegalArgumentException if rules are malformed.

+ * @internal Use transliterator factory methods instead since this class will be removed in that release.

+ */

+ /*RuleBasedTransliterator(const UnicodeString& id,

+ const UnicodeString& rules,

+ UTransDirection direction,

+ UnicodeFilter* adoptedFilter,

+ UErrorCode& status);*/

+ /**

+ * Convenience constructor with no filter.

+ * @internal Use transliterator factory methods instead since this class will be removed in that release.

+ */

+ /*RuleBasedTransliterator(const UnicodeString& id,

+ const UnicodeString& rules,

+ UTransDirection direction,

+ UErrorCode& status);*/

+ /**

+ * Convenience constructor with no filter and FORWARD direction.

+ * @internal Use transliterator factory methods instead since this class will be removed in that release.

+ */

+ /*RuleBasedTransliterator(const UnicodeString& id,

+ const UnicodeString& rules,

+ UErrorCode& status);*/

+ /**

+ * Convenience constructor with FORWARD direction.

+ * @internal Use transliterator factory methods instead since this class will be removed in that release.

+ */

+ /*RuleBasedTransliterator(const UnicodeString& id,

+ const UnicodeString& rules,

+ UnicodeFilter* adoptedFilter,

+ UErrorCode& status);*/

+private:

+ friend class TransliteratorRegistry; // to access TransliterationRuleData convenience ctor

+ /**

+ * Convenience constructor.

+ * @param id the id for the transliterator.

+ * @param theData the rule data for the transliterator.

+ * @param adoptedFilter the filter for the transliterator

+ */

+ RuleBasedTransliterator(const UnicodeString& id,

+ const TransliterationRuleData* theData,

+ UnicodeFilter* adoptedFilter = 0);

+ friend class Transliterator; // to access the following ctor

+ /**

+ * Internal constructor.

+ * @param id the id for the transliterator.

+ * @param data the rule data for the transliterator.

+ * @param isDataAdopted determines who owns the 'data' object. If true, this object adopts it and the caller must not delete 'data'.

+ */

+ RuleBasedTransliterator(const UnicodeString& id,

+ TransliterationRuleData* data,

+ UBool isDataAdopted);

+public:

+ /**

+ * Copy constructor.

+ * @internal Use transliterator factory methods instead since this class will be removed in that release.

+ */

+ RuleBasedTransliterator(const RuleBasedTransliterator&);

+ virtual ~RuleBasedTransliterator();

+ /**

+ * Implement Transliterator API.

+ * @internal Use transliterator factory methods instead since this class will be removed in that release.

+ */

+ virtual Transliterator* clone(void) const;

+protected:

+ /**

+ * Implements {@link Transliterator#handleTransliterate}.

+ * @internal Use transliterator factory methods instead since this class will be removed in that release.

+ */

+ virtual void handleTransliterate(Replaceable& text, UTransPosition& offsets,

+ UBool isIncremental) const;

+public:

+ /**

+ * Return a representation of this transliterator as source rules.

+ * These rules will produce an equivalent transliterator if used

+ * to construct a new transliterator.

+ * @param result the string to receive the rules. Previous

+ * contents will be deleted.

+ * @param escapeUnprintable if TRUE then convert unprintable

+ * characters to their hex escape representations, \uxxxx or

+ * \Uxxxxxxxx. Unprintable characters are those other than

+ * U+000A, U+0020..U+007E.

+ * @internal Use transliterator factory methods instead since this class will be removed in that release.

+ */

+ virtual UnicodeString& toRules(UnicodeString& result,

+ UBool escapeUnprintable) const;

+protected:

+ /**

+ * Implement Transliterator framework

+ */

+ virtual void handleGetSourceSet(UnicodeSet& result) const;

+public:

+ /**

+ * Override Transliterator framework

+ */

+ virtual UnicodeSet& getTargetSet(UnicodeSet& result) const;

+ /**

+ * Return the class ID for this class. This is useful only for

+ * comparing to a return value from getDynamicClassID(). For example:

+ * <pre>

+ * . Base* polymorphic_pointer = createPolymorphicObject();

+ * . if (polymorphic_pointer->getDynamicClassID() ==

+ * . Derived::getStaticClassID()) ...

+ * </pre>

+ * @return The class ID for all objects of this class.

+ * @internal Use transliterator factory methods instead since this class will be removed in that release.

+ */

+ U_I18N_API static UClassID U_EXPORT2 getStaticClassID(void);

+ /**

+ * Returns a unique class ID polymorphically. This method

+ * is to implement a simple version of RTTI, since not all C++

+ * compilers support genuine RTTI. Polymorphic operator==() and

+ * clone() methods call this method.

+ *

+ * @return The class ID for this object. All objects of a given

+ * class have the same class ID. Objects of other classes have

+ * different class IDs.

+ */

+ virtual UClassID getDynamicClassID(void) const;

+private:

+ void _construct(const UnicodeString& rules,

+ UTransDirection direction,

+ UParseError& parseError,

+ UErrorCode& status);

+};

+U_NAMESPACE_END

+#endif /* #if !UCONFIG_NO_TRANSLITERATION */

+#endif

Property changes on: icu46/source/i18n/rbt.h

___________________________________________________________________

Added: svn:eol-style

+ LF
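
Review note: the cursor walkthrough in the class comment (Rules 1 to 3 applied to "adefabcdefz") can be checked with a small stand-alone simulation. This is only a sketch under simplifying assumptions, not ICU's implementation: `SimpleRule`, `applyRules`, and `exampleRules` are hypothetical names, and rules are limited to literal strings with optional literal context (no UnicodeSet patterns, segments, anchors, or `@` cursor offsets).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified rule: literal strings only.
struct SimpleRule {
    std::string before;  // preceding context, e.g. "abc" in abc{def}
    std::string key;     // text that is matched and replaced
    std::string after;   // following context (may be empty)
    std::string output;  // replacement text
    std::size_t cursor;  // cursor offset inside output (where '|' sits)
};

// Walks the cursor through the text. At each position the first matching
// rule (in list order) is applied; if none matches, the cursor advances by one.
// Note: a rule set whose output re-matches at the same cursor position would
// loop forever in this sketch; the example rules below do not do that.
std::string applyRules(std::string text, const std::vector<SimpleRule>& rules) {
    std::size_t pos = 0;
    while (pos < text.size()) {
        bool matched = false;
        for (const SimpleRule& r : rules) {
            assert(!r.key.empty() && r.cursor <= r.output.size());
            // Checks short-circuit left to right, so every compare() call
            // starts at an offset that is within the string.
            if (pos >= r.before.size() &&
                text.compare(pos - r.before.size(), r.before.size(), r.before) == 0 &&
                text.compare(pos, r.key.size(), r.key) == 0 &&
                text.compare(pos + r.key.size(), r.after.size(), r.after) == 0) {
                text.replace(pos, r.key.size(), r.output);  // context is left untouched
                pos += r.cursor;
                matched = true;
                break;
            }
        }
        if (!matched) {
            ++pos;
        }
    }
    return text;
}

// The three example rules from the class comment.
std::vector<SimpleRule> exampleRules() {
    return {
        {"abc", "def", "", "xy", 1},  // Rule 1: abc{def}>x|y
        {"",    "xyz", "", "r",  1},  // Rule 2: xyz>r
        {"",    "yz",  "", "q",  1},  // Rule 3: yz>q
    };
}
```

Calling `applyRules("adefabcdefz", exampleRules())` follows the same steps as the table in the comment: Rule 1 fires once the cursor reaches the second "def", Rule 2 is skipped because the cursor sits after the 'x', and Rule 3 rewrites "yz".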

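Review note: the "Notes" paragraph says that a rule such as "a>x" masks a later rule "ab>y". A rough way to picture that ordering check is a prefix test on the match keys. The helper below is a hypothetical illustration, not the check ICU performs: `firstMaskedRule` is an invented name, and it assumes plain literal keys with no context, sets, or anchors.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the index of the first rule key that can never fire because an
// earlier key is a prefix of it, or -1 if no key is masked.
// Assumption: keys are literal strings without context or sets.
int firstMaskedRule(const std::vector<std::string>& keys) {
    for (std::size_t later = 1; later < keys.size(); ++later) {
        for (std::size_t earlier = 0; earlier < later; ++earlier) {
            // If the later key starts with the earlier key, the earlier
            // rule always matches first at the same cursor position.
            if (keys[later].compare(0, keys[earlier].size(), keys[earlier]) == 0) {
                return static_cast<int>(later);
            }
        }
    }
    return -1;
}
```

With the keys in the order {"a", "ab"} the second rule is reported as masked; reversing the order to {"ab", "a"} puts the longer key first, and nothing is masked.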