OLD | NEW |
| (Empty) |
1 # -*- coding: utf-8 -*- | |
2 # Copyright 2014 Google Inc. All Rights Reserved. | |
3 # | |
4 # Licensed under the Apache License, Version 2.0 (the "License"); | |
5 # you may not use this file except in compliance with the License. | |
6 # You may obtain a copy of the License at | |
7 # | |
8 # http://www.apache.org/licenses/LICENSE-2.0 | |
9 # | |
10 # Unless required by applicable law or agreed to in writing, software | |
11 # distributed under the License is distributed on an "AS IS" BASIS, | |
12 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |
13 # See the License for the specific language governing permissions and | |
14 # limitations under the License. | |
15 """Additional help about filename encoding and interoperability problems.""" | |
16 | |
17 from __future__ import absolute_import | |
18 | |
19 from gslib.help_provider import HelpProvider | |
20 | |
21 _DETAILED_HELP_TEXT = (""" | |
22 <B>OVERVIEW</B> | |
23 To minimize the chance for `filename encoding interoperability problems | |
24 <https://en.wikipedia.org/wiki/Filename#Encoding_indication_interoperability>`_ | |
25 gsutil requires use of the `UTF-8 <https://en.wikipedia.org/wiki/UTF-8>`_ | |
26 character encoding when uploading and downloading files. Because UTF-8 is in | |
27 widespread (and growing) use, for most users nothing needs to be done to use | |
28 UTF-8. Users with files stored in other encodings (such as | |
29 `Latin 1 <https://en.wikipedia.org/wiki/ISO/IEC_8859-1>`_) must convert those | |
30 filenames to UTF-8 before attempting to upload the files. | |
31 | |
32 The most common place where users who have filenames that use some other | |
33 encoding encounter a gsutil error is while uploading files using the recursive | |
34 (-R) option on the gsutil cp, mv, or rsync commands. When this happens you'll | |
35 get an error like this: | |
36 | |
37 CommandException: Invalid Unicode path encountered | |
38 ('dir1/dir2/file_name_with_\\xf6n_bad_chars'). | |
39 gsutil cannot proceed with such files present. | |
40 Please remove or rename this file and try again. | |
41 | |
42 Note that the invalid Unicode characters have been hex-encoded in this error | |
43 message because otherwise trying to print them would result in another | |
44 error. | |
45 | |
46 If you encounter such an error you can either remove the problematic file(s) | |
47 or try to rename them and re-run the command. If you have a modest number of | |
48 such files the simplest thing to do is to think of a different name for the | |
49 file and manually rename the file (using local filesystem tools). If you have | |
50 too many files for that to be practical you can use a tool to convert the old | |
51 character encoding to UTF-8. One such tool is `native2ascii | |
52 <http://docs.oracle.com/javase/7/docs/technotes/tools/solaris/native2ascii.html>`_. | |
53 | |
54 Note also that there's no restriction on the character encoding used in file | |
55 content - it can be UTF-8, a different encoding, or non-character | |
56 data (like audio or video content). The gsutil UTF-8 character encoding | |
57 requirement applies only to filenames. | |
58 | |
59 | |
60 <B>CROSS-PLATFORM ENCODING PROBLEMS OF WHICH TO BE AWARE</B> | |
61 Using UTF-8 for all object names and filenames will ensure that gsutil doesn't | |
62 encounter character encoding errors while operating on the files. | |
63 Unfortunately, it's still possible that files uploaded / downloaded this way | |
64 can have interoperability problems, for a number of reasons unrelated to | |
65 gsutil. For example: | |
66 | |
67 - Windows filenames are case-insensitive, while GCS, Linux and MacOS are | |
68 not. Thus, for example, if you have two filenames on Linux differing only | |
69 in case and upload both to GCS and then subsequently download them to | |
70 Windows, you will end up with just one file whose contents came from the | |
71 last of these files to be written to the filesystem. Moreover, case | |
72 translation is handled by tables that change across OS versions. | |
73 - Mac OS performs character encoding decomposition based on tables stored in | |
74 the OS, and the tables change between Unicode versions. Thus the encoding | |
75 used by an external library may not match that performed by the OS. | |
76 - Windows console support for Unicode is difficult to use correctly. | |
77 | |
78 For a more thorough list of such issues see `this presentation | |
79 <http://www.i18nguy.com/unicode/filename-issues-iuc33.pdf>`_. | |
80 | |
81 These problems mostly arise when sharing data across platforms (e.g., | |
82 uploading data from a Windows machine to GCS, and then downloading from GCS | |
83 to a machine running MacOS). Unfortunately these problems are a consequence | |
84 of the lack of a filename encoding standard, and users need to be aware of the | |
85 kinds of problems that can arise when copying filenames across platforms. | |
86 | |
87 There is one precaution users can exercise to prevent some of these problems: | |
88 When using the Windows console, specify wildcards or folders (using the -R | |
89 option) rather than explicitly named individual files. | |
90 """) | |
91 | |
92 | |
93 class CommandOptions(HelpProvider): | |
94 """Additional help about filename encoding and interoperability problems.""" | |
95 | |
96 # Help specification. See help_provider.py for documentation. | |
97 help_spec = HelpProvider.HelpSpec( | |
98 help_name='encoding', | |
99 help_name_aliases=['encodings', 'utf8', 'utf-8', 'latin1', 'unicode', | |
100 'interoperability'], | |
101 help_type='additional_help', | |
102 help_one_line_summary='Filename encoding and interoperability problems', | |
103 help_text=_DETAILED_HELP_TEXT, | |
104 subcommand_help_text={}, | |
105 ) | |
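
The help text above suggests renaming files whose names are not valid UTF-8, either by hand or with a conversion tool. As an illustration only, here is a minimal Python 3 sketch of that renaming step. It is not part of gsutil and is not the native2ascii tool the help text mentions. The function names `utf8_name` and `convert_names_to_utf8` are hypothetical, the default source encoding of Latin 1 is an assumption, and the sketch assumes a POSIX filesystem that stores filenames as raw bytes.

```python
import os


def utf8_name(name, source_encoding="latin-1"):
    """Return `name` (bytes) re-encoded as UTF-8, or None if it is already valid.

    Caveat: a legacy-encoded name whose bytes happen to form valid UTF-8
    cannot be detected this way and is left untouched.
    """
    try:
        name.decode("utf-8")
        return None
    except UnicodeDecodeError:
        # Assumption: every invalid name was written in `source_encoding`.
        return name.decode(source_encoding).encode("utf-8")


def convert_names_to_utf8(root, source_encoding="latin-1", dry_run=True):
    """Rename entries under `root` whose names are not valid UTF-8.

    Returns a list of (old_path, new_path) byte-string pairs. With
    dry_run=True (the default) nothing is renamed; the list is only a plan.
    """
    renames = []
    # Walk with bytes paths so undecodable names are seen unmodified, and
    # bottom-up so a directory is renamed only after its contents are done.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root), topdown=False):
        for name in filenames + dirnames:
            fixed = utf8_name(name, source_encoding)
            if fixed is None:
                continue
            old = os.path.join(dirpath, name)
            new = os.path.join(dirpath, fixed)
            if os.path.lexists(new):
                continue  # never overwrite an existing entry
            renames.append((old, new))
            if not dry_run:
                os.rename(old, new)
    return renames
```

Calling `convert_names_to_utf8(path)` first and inspecting the returned plan, then repeating with `dry_run=False`, keeps the rename reviewable. Source encodings other than Latin 1 can raise `UnicodeDecodeError` for byte sequences they do not define; the sketch does not handle that case.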