recipes/src/core/strings.md - Issue 12335109: Strings recipes for the Dart Cookbook

Unified Diff: recipes/src/core/strings.md

Issue 12335109: Strings recipes for the Dart Cookbook (Closed) Base URL: https://github.com/dart-lang/cookbook.git@master

Patch Set: Numerous minor changes based on reviewers' comments. Created 7 years, 9 months ago


Index: recipes/src/core/strings.md

diff --git a/recipes/src/core/strings.md b/recipes/src/core/strings.md

new file mode 100644

index 0000000000000000000000000000000000000000..4d2531a43cc81d5bd09ea9c0e9ba956518e2a2ff

--- /dev/null

+++ b/recipes/src/core/strings.md

@@ -0,0 +1,654 @@

+# Strings

+A Dart string represents a sequence of characters encoded in UTF-16. Decoding

+UTF-16 yields Unicode code points. Borrowing terminology from Go, Dart uses

+the term `rune` for an integer representing a Unicode code point.

+The string recipes included in this chapter assume that you have some

+familiarity with Unicode and UTF-16. Here is a brief refresher:

+### What is the Basic Multilingual Plane?

+The Unicode code space is divided into seventeen planes of 65,536 points each.

+The first plane (code points U+0000 to U+FFFF) contains the most

+frequently used characters and is called the Basic Multilingual Plane or BMP.

+### What is a Surrogate Pair?

+The term 'surrogate pair' refers to a means of encoding Unicode characters

+outside the Basic Multilingual Plane.

+In UTF-16, two-byte (16-bit) code sequences are used to store Unicode

+characters. Since two bytes can only contain the 65,536 characters in the 0x0

+to 0xFFFF range, a pair of code units is used to store values in the

+0x10000 to 0x10FFFF range.

+For example, the Unicode character for the musical treble clef (🎼), with

+a value of '\u{1F3BC}', is too large to fit in 16 bits.

+ var clef = '\u{1F3BC}'; // 🎼

+'\u{1F3BC}' is composed of a UTF-16 surrogate pair: [\uD83C, \uDFBC].
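
The arithmetic behind the surrogate pair above can be sketched as follows. This is a minimal illustration written against a current Dart SDK, not part of the patch; `toSurrogatePair` is a hypothetical helper name, and the constants are the standard UTF-16 surrogate bases.

```dart
// Sketch: split a supplementary code point (U+10000..U+10FFFF) into its
// UTF-16 surrogate pair. `toSurrogatePair` is a hypothetical helper.
List<int> toSurrogatePair(int codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw ArgumentError.value(
        codePoint, 'codePoint', 'not a supplementary code point');
  }
  final offset = codePoint - 0x10000; // 20-bit value
  final lead = 0xD800 + (offset >> 10); // upper 10 bits
  final trail = 0xDC00 + (offset & 0x3FF); // lower 10 bits
  return [lead, trail];
}

void main() {
  const clef = '\u{1F3BC}';
  final pair = toSurrogatePair(0x1F3BC);
  print(pair.map((u) => u.toRadixString(16)).toList()); // [d83c, dfbc]
  // The computed pair matches what the string actually stores.
  print(pair[0] == clef.codeUnitAt(0) && pair[1] == clef.codeUnitAt(1)); // true
}
```

Subtracting 0x10000 leaves a 20-bit value; its upper and lower 10 bits are added to 0xD800 and 0xDC00 respectively, which is why both halves fall in ranges that never collide with ordinary BMP characters.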

+### What is the difference between a code point and a code unit?

+Within the Basic Multilingual Plane, the code point for a character is

+numerically the same as the code unit for that character.

+ 'D'.runes.first; // 68

+ 'D'.codeUnits.first; // 68

+For non-BMP characters, each code point is represented by two code units.

+ var clef = '\u{1F3BC}'; // 🎼

+ clef.runes.length; // 1

+ clef.codeUnits.length; // 2

+### What exactly is a character?

+A character is a string contained in the Universal Character Set. Each character

+maps to a single rune value (code point); BMP characters map to 1 code

+unit; non-BMP characters map to 2 code units.

+You can read more about the Universal Character Set at

+http://en.wikipedia.org/wiki/Universal_Character_Set.

+### Do I really have to deal with Unicode?

+Yes, if you want to build robust international applications, you do.

+Besides, the String library makes working with Unicode relatively painless,

+so there's no great overhead in doing things right.

+## Concatenating Strings

+### Problem

+You want to concatenate strings in Dart. You tried using `+`, but

+that resulted in an error.

+### Solution

+Use adjacent string literals:

+ var fact = 'Dart' ' is' ' fun!'; // 'Dart is fun!'

+### Discussion

+Adjacent literals also work over multiple lines:

+ var fact = 'Dart'

+ ' is'

+ ' fun!'; // 'Dart is fun!'

+They also work when using multiline strings:

+ var lunch = '''Peanut

+ butter'''

+ ''' and

+ jelly'''; // 'Peanut\nbutter and\njelly'

+You can concatenate adjacent single line literals with multiline strings:

+ var funnyGuys = 'Dewey ' 'Cheatem'

+ ''' and

+ Howe'''; // 'Dewey Cheatem and\n Howe'

+#### Alternatives to adjacent string literals

+You can also use the `concat()` method on a string to concatenate it to another

floitsch 2013/03/09 00:01:41 I just gave an LGTM to Lasse for changing concat t

+string:

+ var film = filmToWatch();

+ film = film.concat('\n'); // 'The Big Lebowski\n'

+Since `concat()` creates a new string every time it is invoked, a long chain of

+`concat()`s can be expensive. Avoid those. Use a StringBuffer instead (see

+_Incrementally building a string efficiently using a StringBuffer_, below).

+You can use `join()` to combine a sequence of strings:

+ var film = ['The', 'Big', 'Lebowski'].join(' '); // 'The Big Lebowski'

+You can also use string interpolation to concatenate strings (see

+_Interpolating expressions inside strings_, below).
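
This recipe was written against a 2013 SDK. As a hedged aside: in current Dart SDKs, `String` defines the `+` operator and no longer has a `concat()` method, so the alternatives above might look like the sketch below. The function name `buildTitle` is made up for illustration.

```dart
// Sketch for a current Dart SDK. `buildTitle` is a hypothetical helper.
String buildTitle(List<String> words) {
  final joined = words.join(' '); // 'The Big Lebowski'
  // `+` creates a new string, just as the recipe says of concat().
  return joined + '\n';
}

void main() {
  final film = buildTitle(['The', 'Big', 'Lebowski']);
  print(film == 'The Big Lebowski\n'); // true
}
```

The efficiency caveat from the recipe still applies: a long chain of `+` allocates a new string at each step, so a StringBuffer remains the better choice for incremental building.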

+## Interpolating expressions inside strings

+### Problem

+You want to create strings that contain Dart expressions and identifiers.

+### Solution

+You can put the value of an expression inside a string by using ${expression}.

+ var favFood = 'sushi';

+ var whatDoILove = 'I love ${favFood.toUpperCase()}'; // 'I love SUSHI'

+You can skip the {} if the expression is an identifier:

+ var whatDoILove = 'I love $favFood'; // 'I love sushi'

+### Discussion

+An interpolated string, `string ${expression}` is equivalent to the

+concatenation of the strings 'string ' and `expression.toString()`.

+Consider this code:

+ var four = 4;

+ var seasons = 'The $four seasons'; // 'The 4 seasons'

+It is equivalent to the following:

floitsch 2013/03/09 00:01:41 It is not. the concat will make two copies, wherea

+ var seasons = 'The '.concat(4.toString()).concat(' seasons'); // 'The 4 seasons'

+You should consider implementing a `toString()` method for user-defined

+objects. Here's what happens if you don't:

+ class Point {

+ num x, y;

+ Point(this.x, this.y);

+ }

+ var point = new Point(3, 4);

+ print('Point: $point'); // "Point: Instance of 'Point'"

+Probably not what you wanted. Here is the same example with an explicit

+`toString()`:

+ class Point {

+ ...

+ String toString() => 'x: $x, y: $y';

+ }

+ print('Point: $point'); // 'Point: x: 3, y: 4'
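
For reference, here is a complete, runnable version of the `Point` example with the elided parts filled in. It is a sketch assuming a current Dart SDK, where `new` is optional and `@override` is the conventional annotation; the fields are made `final` only as a stylistic choice.

```dart
class Point {
  final num x, y;
  Point(this.x, this.y);

  // Called implicitly whenever a Point is interpolated into a string.
  @override
  String toString() => 'x: $x, y: $y';
}

void main() {
  final point = Point(3, 4);
  print('Point: $point'); // Point: x: 3, y: 4
}
```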

+## Escaping special characters

+### Problem

+You want to put newlines, dollar signs, or other special characters in your strings.

+### Solution

+Prefix special characters with a `\`.

+ print('Wile\nCoyote');

+ // Wile

+ // Coyote

+### Discussion

+Dart designates a few characters as special, and these can be escaped:

+- \n for newline, equivalent to \x0A.

+- \r for carriage return, equivalent to \x0D.

+- \f for form feed, equivalent to \x0C.

+- \b for backspace, equivalent to \x08.

+- \t for tab, equivalent to \x09.

+- \v for vertical tab, equivalent to \x0B.

+If you prefer, you can use `\x` or `\u` notation to indicate the special

+character:

+ print('Wile\x0ACoyote'); // same as print('Wile\nCoyote');

+ print('Wile\u000ACoyote'); // same as print('Wile\nCoyote');

+You can also use `\u{}` notation:

+ print('Wile\u{000A}Coyote'); // same as print('Wile\nCoyote');

+You can also escape the `$` used in string interpolation:

+ var superGenius = 'Wile Coyote';

+ print('$superGenius and Road Runner'); // 'Wile Coyote and Road Runner'

+ print('\$superGenius and Road Runner'); // '$superGenius and Road Runner'

+If you escape a non-special character, the `\` is ignored:

+ print('Wile \E Coyote'); // 'Wile E Coyote'

+## Incrementally building a string efficiently using a StringBuffer

+### Problem

+You want to collect string fragments and combine them in an efficient manner.

+### Solution

+Use a StringBuffer to programmatically generate a string. A StringBuffer

+collects the string fragments, but does not generate a new string until

+`toString()` is called:

+ var sb = new StringBuffer();

+ sb.write('John, ');

+ sb.write('Paul, ');

+ sb.write('George, ');

+ sb.write('and Ringo');

+ var beatles = sb.toString(); // 'John, Paul, George, and Ringo'

+### Discussion

+In addition to `write()`, the StringBuffer class provides methods to write a

+list of strings (`writeAll()`), write a numerical character code

+(`writeCharCode()`), write with an added newline (`writeln()`), and more. Here

+is a simple example that shows the use of these methods:

+ var sb = new StringBuffer();

+ sb.writeln('The Beatles:');

+ sb.writeAll(['John, ', 'Paul, ', 'George, and Ringo']);

+ sb.writeCharCode(33); // charCode for '!'.

+ var beatles = sb.toString(); // 'The Beatles:\nJohn, Paul, George, and Ringo!'

+Since a StringBuffer waits until the call to `toString()` to generate the

+concatenated string, it represents a more efficient way of combining strings

+than `concat()`. See the _Concatenating Strings_ recipe for a description of

+`concat()`.

+## Converting between string characters and numerical codes

+### Problem

+You want to convert string characters into numerical codes and back.

+### Solution

+Use the `runes` getter to access a string's code points:

+ 'Dart'.runes.toList(); // [68, 97, 114, 116]

+ var smileyFace = '\u263A'; // ☺

+ smileyFace.runes.toList(); // [9786]

+The number 9786 represents the code point '\u263A'.

+Use `string.codeUnits` to get a string's UTF-16 code units:

+ 'Dart'.codeUnits.toList(); // [68, 97, 114, 116]

+ smileyFace.codeUnits.toList(); // [9786]

+### Discussion

+Notice that using `runes` and `codeUnits` produces identical results

+in the examples above. That happens because each character in 'Dart' and in

+`smileyFace` fits within 16 bits, resulting in a code unit corresponding

+neatly with a code point.

+Consider an example where a character cannot be represented within 16-bits,

+the Unicode character for a Treble clef ('\u{1F3BC}'). This character consists

+of a surrogate pair: '\uD83C', '\uDFBC'. Getting the numerical value of this

+character using `codeUnits` and `runes` produces the following result:

+ var clef = '\u{1F3BC}'; // 🎼

+ clef.codeUnits.toList(); // [55356, 57276]

+ clef.runes.toList(); // [127932]

+The numbers 55356 and 57276 represent `clef`'s surrogate pair, '\uD83C' and

+'\uDFBC', respectively. The number 127932 represents the code point '\u{1F3BC}'.

+#### Using codeUnitAt() to access individual code units

+To access the 16-bit UTF-16 code unit at a particular index, use

+`codeUnitAt()`:

+ 'Dart'.codeUnitAt(0); // 68

+ smileyFace.codeUnitAt(0); // 9786

+Using `codeUnitAt()` with the non-BMP `clef` character leads to problems:

+ clef.codeUnitAt(0); // 55356

+ clef.codeUnitAt(1); // 57276

+In either call to `clef.codeUnitAt()`, the values returned represent strings

+that are only one half of a UTF-16 surrogate pair. These are not valid UTF-16

+strings.

+#### Converting numerical codes to strings

+You can generate a new string from runes or code units using the factory

+`String.fromCharCodes(charCodes)`:

+ new String.fromCharCodes([68, 97, 114, 116]); // 'Dart'

+ new String.fromCharCodes([73, 32, 9825, 32, 76, 117, 99, 121]);

+ // 'I ♡ Lucy'

+ new String.fromCharCodes([55356, 57276]); // 🎼

+ new String.fromCharCodes([127932]); // 🎼

+You can use the `String.fromCharCode()` factory to convert a single rune or

+code unit to a string:

+ new String.fromCharCode(68); // 'D'

+ new String.fromCharCode(9786); // ☺

+ new String.fromCharCode(127932); // 🎼

+Creating a string with only one half of a surrogate pair is permitted, but not

+recommended.

+## Determining if a string is empty

+### Problem

+You want to know if a string is empty. You tried `if (string) {...}`, but that

+did not work.

+### Solution

+Use `string.isEmpty`:

+ var emptyString = '';

+ emptyString.isEmpty; // true

+A string with a space is not empty:

+ var space = ' ';

+ space.isEmpty; // false

+### Discussion

+Don't use `if (string)` to test the emptiness of a string. In Dart, all

+objects except the boolean true evaluate to false. `if(string)` will always

+be false.

+## Removing leading and trailing whitespace

+### Problem

+You want to remove leading and trailing whitespace from a string.

+### Solution

+Use `string.trim()`:

+ var space = '\n\r\f\t\v'; // We'll use a variety of space characters.

+ var string = '$space X $space';

+ var newString = string.trim(); // 'X'

+The String class has no methods to remove only leading or only trailing

+whitespace. But you can always use regExps.

+Remove only leading whitespace:

+ var newString = string.replaceFirst(new RegExp(r'^\s+'), ''); // 'X $space'

+Remove only trailing whitespace:

+ var newString = string.replaceFirst(new RegExp(r'\s+$'), ''); // '$space X'
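
The statement above reflects the SDK at the time of the patch. As a hedged note, current Dart SDKs provide `trimLeft()` and `trimRight()` on `String`, so the regular expressions are no longer needed for this task. A minimal sketch:

```dart
void main() {
  const space = '\n\r\f\t\v'; // The same whitespace mix as in the recipe.
  final string = '$space X $space';

  // Remove only leading whitespace.
  print(string.trimLeft() == 'X $space'); // true

  // Remove only trailing whitespace.
  print(string.trimRight() == '$space X'); // true

  // Remove both, as in the recipe's Solution.
  print(string.trim() == 'X'); // true
}
```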

+## Calculating the length of a string

+### Problem

+You want to get the length of a string, but are not sure how to

+correctly calculate the length when working with Unicode.

+### Solution

+Use `string.length` to get the number of UTF-16 code units in a string:

+ 'I love music'.length; // 12

+ 'I love music'.runes.length; // 12

+### Discussion

+For characters that fit into 16 bits, the code unit length is the same as the

+rune length:

+ var hearts = '\u2661'; // ♡

+ hearts.length; // 1

+ hearts.runes.length; // 1

+If the string contains any characters outside the Basic Multilingual

+Plane (BMP), the rune length will be less than the code unit length:

+ var clef = '\u{1F3BC}'; // 🎼

+ clef.length; // 2

+ clef.runes.length; // 1

+ var music = 'I $hearts $clef'; // 'I ♡ 🎼 '

+ music.length; // 6

+ music.runes.length; // 5

+Use `length` if you want the number of code units; use `runes.length` if you

+want the number of runes.

floitsch 2013/03/09 00:01:41 You could add, that Twitter uses runes for the len

+## Subscripting a string

+### Problem

+You want to be able to access a character in a string at a particular index.

+### Solution

+Subscript runes:

+ var teacup = '\u{1F375}'; // 🍵

+ teacup.runes.toList()[0]; // 127861

floitsch 2013/03/09 00:01:41 If you want to access it only once, you can also u

+The number 127861 represents the code point for teacup, '\u{1F375}' (🍵 ).

+### Discussion

+Subscripting a string directly can be problematic. This is because the default

+`[]` implementation subscripts along code units. This means that

+for non-BMP characters, subscripting yields invalid UTF-16 characters:

+ 'Dart'[0]; // 'D'

+ var hearts = '\u2661'; // ♡

+ hearts[0]; // '\u2661' (♡)

+ teacup[0]; // Invalid string: only the lead surrogate (code unit 55356).

+ teacup.codeUnits.toList()[0]; // 55356, the same code unit as an int.
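
To get a usable one-character string rather than a bare code point, the rune at an index can be converted back with `String.fromCharCode()`. The sketch below assumes a current Dart SDK; `charAtRune` is a hypothetical helper, not an SDK method.

```dart
// Hypothetical helper: the character at a given rune (code point) index.
// Walks the string from the start, so each call is O(index).
String charAtRune(String s, int index) =>
    String.fromCharCode(s.runes.elementAt(index));

void main() {
  const tea = 'I \u2661 \u{1F375}'; // 'I ♡ 🍵'
  print(charAtRune(tea, 2) == '\u2661'); // true
  print(charAtRune(tea, 4) == '\u{1F375}'); // true
  // Plain subscripting at the same index yields only the lead surrogate.
  print(tea[4] == '\u{1F375}'); // false
}
```

Note that this indexes by code point only. A user-perceived character made of several code points (for example, a base letter plus a combining accent) would still be split.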

+## Processing a string one character at a time

+### Problem

+You want to do something with each individual character in a string.

+### Solution

+To access an individual character, map the string runes:

+ var charList = "Dart".runes.map((rune) => '*${new String.fromCharCode(rune)}*').toList();

+ // ['*D*', '*a*', '*r*', '*t*']

+ var runeList = happy.runes.map((rune) => [rune, new String.fromCharCode(rune)]).toList();

+ // [[73, 'I'], [32, ' '], [97, 'a'], [109, 'm'], [32, ' '], [9786, '☺']]

+If you are sure that the string is in the Basic Multilingual Plane (BMP), you

+can use string.split(''):

+ 'Dart'.split(''); // ['D', 'a', 'r', 't']

+ smileyFace.split('').length; // 1

+Since `split('')` splits at the UTF-16 code unit boundaries,

+invoking it on a non-BMP character yields the string's surrogate pair:

+ var clef = '\u{1F3BC}'; // 🎼 , not in BMP.

+ clef.split('').length; // 2

+The surrogate pair members are not valid UTF-16 strings.

+## Splitting a string into substrings

+### Problem

+You want to split a string into substrings.

+### Solution

+Use the `split()` method with a string or a regExp as an argument.

+ var smileyFace = '\u263A';

+ var happy = 'I am $smileyFace';

+ happy.split(' '); // ['I', 'am', '☺']

+Here is an example of using `split()` with a regExp:

+ var nums = '2/7 3 4/5 3~/5';

+ var numsRegExp = new RegExp(r'(\s|/|~/)');

+ nums.split(numsRegExp); // ['2', '7', '3', '4', '5', '3', '5']

+In the code above, the string `nums` contains various numbers, some of which

+are expressed as fractions or as int-divisions. A regExp is used to split the

+string to extract just the numbers.

+You can perform operations on the matched and unmatched portions of a string

+by using `splitMapJoin()` with a regExp:

+ 'Eats SHOOTS leaves'.splitMapJoin((new RegExp(r'SHOOTS')),

+ onMatch: (m) => '*${m.group(0).toLowerCase()}*',

+ onNonMatch: (n) => n.toUpperCase()); // 'EATS *shoots* LEAVES'

+The regExp matches the middle word ('SHOOTS'). A pair of callbacks are

+registered to transform the matched and unmatched substrings before the

+substrings are joined together again.
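
Under null safety in current Dart SDKs, `Match.group(0)` and `match[0]` are typed as nullable, so the callback above needs a null assertion. A minimal sketch of the same transformation, with `emphasize` as a hypothetical wrapper name:

```dart
// Hypothetical wrapper around String.splitMapJoin().
String emphasize(String input, Pattern pattern) => input.splitMapJoin(
      pattern,
      // Group 0 is the whole match; it is non-null for any actual match.
      onMatch: (m) => '*${m[0]!.toLowerCase()}*',
      onNonMatch: (n) => n.toUpperCase(),
    );

void main() {
  print(emphasize('Eats SHOOTS leaves', RegExp(r'SHOOTS')));
  // EATS *shoots* LEAVES
}
```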

+## Changing string case

+### Problem

+You want to change the case of strings.

+### Solution

+Use `string.toUpperCase()` and `string.toLowerCase()` to convert a string to

+upper-case or lower-case, respectively:

+ var theOneILove = 'I love Lucy';

+ theOneILove.toUpperCase(); // 'I LOVE LUCY'

+ theOneILove.toLowerCase(); // 'i love lucy'

+### Discussion

+Case changes affect the characters of bi-cameral scripts like Greek and Latin (used for writing French):

+ var zeus = '\u0394\u03af\u03b1\u03c2'; // 'Δίας' (Zeus in modern Greek)

+ zeus.toUpperCase(); // 'ΔΊΑΣ'

+ var resume = '\u0052\u00e9\u0073\u0075\u006d\u00e9'; // 'Résumé'

+ resume.toLowerCase(); // 'résumé'

+They do not affect the characters of uni-cameral scripts like Devanagari (used for

+writing many of the languages of India):

+ var chickenKebab = '\u091a\u093f\u0915\u0928 \u0915\u092c\u093e\u092c';

+ // 'चिकन कबाब' (in Devanagari)

+ chickenKebab.toLowerCase(); // 'चिकन कबाब'

+ chickenKebab.toUpperCase(); // 'चिकन कबाब'

+If a character's case does not change when using `toUpperCase()` and

+`toLowerCase()`, it is most likely because the character only has one

+form.

+## Determining whether a string contains another string

+### Problem

+You want to find out if a string is a substring of another string.

+### Solution

+Use `string.contains()`:

+ var string = 'Dart strings are immutable';

+ string.contains('immutable'); // True.

+You can indicate a startIndex as a second argument:

+ string.contains('Dart', 2); // False

+### Discussion

+The String library provides a couple of shortcuts for testing whether a string

+is a substring of another:

+ string.startsWith('Dart'); // True.

+ string.endsWith('e'); // True.

+You can also use `string.indexOf()`, which returns -1 if the substring is

+not found within a string, and its matching index, if it is:

+ string.indexOf('art') != -1; // True, `art` is found in `Dart`

+You can also use a regExp and `hasMatch()`:

+ new RegExp(r'ar[et]').hasMatch(string); // True, 'art' and 'are' match.

+## Finding matches of a regExp pattern in a string

+### Problem

+You want to use regExp to match a pattern in a string, and

+want to be able to access the matches.

+### Solution

+Construct a regular expression using the RegExp class and find matches using

+the `allMatches()` method:

+ var neverEatingThat = 'Not with a fox, not in a box';

+ var regExp = new RegExp(r'[fb]ox');

+ var matches = regExp.allMatches(neverEatingThat);

+ matches.map((match) => match.group(0)).toList(); // ['fox', 'box']

+### Discussion

+You can query the object returned by `allMatches()` to find out the number of

+matches:

+ matches.length; // 2

+To find the first match, use `firstMatch()`:

+ regExp.firstMatch(neverEatingThat).group(0); // 'fox'

+To directly access the matched string, use `stringMatch()`:

+ regExp.stringMatch(neverEatingThat); // 'fox'

+ regExp.stringMatch('I like bagels and lox'); // null
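
In current Dart SDKs, `allMatches()` returns a lazy `Iterable` of matches rather than a `List`, and `firstMatch()` returns `null` when nothing matches. The sketch below shows one way to collect the matched strings under those assumptions; `findAll` is a hypothetical helper name.

```dart
// Hypothetical helper: every substring of `input` matched by `regExp`.
List<String> findAll(RegExp regExp, String input) =>
    regExp.allMatches(input).map((m) => m[0]!).toList();

void main() {
  final regExp = RegExp(r'[fb]ox');
  print(findAll(regExp, 'Not with a fox, not in a box')); // [fox, box]

  // No match: check for null before calling group(0).
  final first = regExp.firstMatch('I like bagels and lox');
  print(first == null); // true
}
```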

+## Substituting strings based on regExp matches

+### Problem

+You want to match substrings within a string and make substitutions based on

+the matches.

+### Solution

+Construct a regular expression using the RegExp class and make replacements

+using the `replaceAll()` method:

+ 'resume'.replaceAll(new RegExp(r'e'), '\u00E9'); // 'résumé'

+If you want to replace just the first match, use `replaceFirst()`:

+ '0.0001'.replaceFirst(new RegExp(r'0+'), ''); // '.0001'

+The RegExp matches the first run of one or more 0s and replaces it with an empty string.

+You can use `replaceAllMapped()` and register a function to modify the

+matches:

+ var heart = '\u2661'; // '♡'

+ var string = 'I like Ike but I $heart Lucy';

+ var regExp = new RegExp(r'[A-Z]\w+');

+ string.replaceAllMapped(regExp, (match) => match.group(0).toUpperCase());

+ // 'I like IKE but I ♡ LUCY'
