OLD | NEW |
(Empty) | |
| 1 # -*- coding: utf-8 -*- |
| 2 # Copyright 2014 Google Inc. All Rights Reserved. |
| 3 # |
| 4 # Licensed under the Apache License, Version 2.0 (the "License"); |
| 5 # you may not use this file except in compliance with the License. |
| 6 # You may obtain a copy of the License at |
| 7 # |
| 8 # http://www.apache.org/licenses/LICENSE-2.0 |
| 9 # |
| 10 # Unless required by applicable law or agreed to in writing, software |
| 11 # distributed under the License is distributed on an "AS IS" BASIS, |
| 12 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| 13 # See the License for the specific language governing permissions and |
| 14 # limitations under the License. |
| 15 """Additional help about filename encoding and interoperability problems.""" |
| 16 |
| 17 from __future__ import absolute_import |
| 18 |
| 19 from gslib.help_provider import HelpProvider |
| 20 |
| 21 _DETAILED_HELP_TEXT = (""" |
| 22 <B>OVERVIEW</B> |
| 23 To minimize the chance for `filename encoding interoperability problems |
| 24 <https://en.wikipedia.org/wiki/Filename#Encoding_indication_interoperability>`_ |
| 25 gsutil requires use of the `UTF-8 <https://en.wikipedia.org/wiki/UTF-8>`_ |
| 26 character encoding when uploading and downloading files. Because UTF-8 is in |
| 27 widespread (and growing) use, for most users nothing needs to be done to use |
| 28 UTF-8. Users with files stored in other encodings (such as |
| 29 `Latin 1 <https://en.wikipedia.org/wiki/ISO/IEC_8859-1>`_) must convert those |
| 30 filenames to UTF-8 before attempting to upload the files. |
| 31 |
| 32 The most common place where users who have filenames that use some other |
| 33 encoding encounter a gsutil error is while uploading files using the recursive |
| 34 (-R) option on the gsutil cp, mv, or rsync commands. When this happens you'll |
| 35 get an error like this: |
| 36 |
| 37 CommandException: Invalid Unicode path encountered |
| 38 ('dir1/dir2/file_name_with_\\xf6n_bad_chars'). |
| 39 gsutil cannot proceed with such files present. |
| 40 Please remove or rename this file and try again. |
| 41 |
| 42 Note that the invalid Unicode characters have been hex-encoded in this error |
| 43 message because otherwise trying to print them would result in another |
| 44 error. |
| 45 |
| 46 If you encounter such an error, you can either remove the problematic file(s) |
| 47 or rename them and re-run the command. If you have only a few such files, the |
| 48 simplest approach is to rename each one manually (using local filesystem |
| 49 tools). If you have too many files for that to be practical, you can use a |
| 50 tool to convert the old character encoding to UTF-8. One such tool is |
| 51 `native2ascii |
| 52 <http://docs.oracle.com/javase/7/docs/technotes/tools/solaris/native2ascii.html>`_. |
| 53 |
| 54 Note also that there's no restriction on the character encoding used in file |
| 55 content - it can be UTF-8, a different encoding, or non-character |
| 56 data (like audio or video content). The gsutil UTF-8 character encoding |
| 57 requirement applies only to filenames. |
| 58 |
| 59 |
| 60 <B>CROSS-PLATFORM ENCODING PROBLEMS OF WHICH TO BE AWARE</B> |
| 61 Using UTF-8 for all object names and filenames will ensure that gsutil doesn't |
| 62 encounter character encoding errors while operating on the files. |
| 63 Unfortunately, it's still possible that files uploaded / downloaded this way |
| 64 can have interoperability problems, for a number of reasons unrelated to |
| 65 gsutil. For example: |
| 66 |
| 67 - Windows filenames are case-insensitive, while GCS, Linux and MacOS are |
| 68 not. Thus, for example, if you have two filenames on Linux differing only |
| 69 in case and upload both to GCS and then subsequently download them to |
| 70 Windows, you will end up with just one file whose contents came from the |
| 71 last of these files to be written to the filesystem. Moreover, case |
| 72 translation is handled by tables that change across OS versions. |
| 73 - Mac OS performs character encoding decomposition based on tables stored in |
| 74 the OS, and the tables change between Unicode versions. Thus the encoding |
| 75 used by an external library may not match that performed by the OS. |
| 76 - Windows console support for Unicode is difficult to use correctly. |
| 77 |
| 78 For a more thorough list of such issues, see `this presentation |
| 79 <http://www.i18nguy.com/unicode/filename-issues-iuc33.pdf>`_. |
| 80 |
| 81 These problems mostly arise when sharing data across platforms (e.g., |
| 82 uploading data from a Windows machine to GCS, and then downloading from GCS |
| 83 to a machine running MacOS). Unfortunately these problems are a consequence |
| 84 of the lack of a filename encoding standard, and users need to be aware of the |
| 85 kinds of problems that can arise when copying filenames across platforms. |
| 86 |
| 87 There is one precaution users can exercise to prevent some of these problems: |
| 88 When using the Windows console, specify wildcards or folders (using the -R |
| 89 option) rather than explicitly named individual files. |
| 90 """) |
| 91 |
| 92 |
| 93 class CommandOptions(HelpProvider): |
| 94 """Additional help about filename encoding and interoperability problems.""" |
| 95 |
| 96 # Help specification. See help_provider.py for documentation. |
| 97 help_spec = HelpProvider.HelpSpec( |
| 98 help_name='encoding', |
| 99 help_name_aliases=['encodings', 'utf8', 'utf-8', 'latin1', 'unicode', |
| 100 'interoperability'], |
| 101 help_type='additional_help', |
| 102 help_one_line_summary='Filename encoding and interoperability problems', |
| 103 help_text=_DETAILED_HELP_TEXT, |
| 104 subcommand_help_text={}, |
| 105 ) |
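
The help text above describes two checks a user might want to run before uploading: finding filenames that are not valid UTF-8 (and converting them), and spotting names that could collide on a case-insensitive or normalizing filesystem. The following is a minimal sketch of both, not part of gsutil. It assumes Python 3, and every function name in it (`is_utf8`, `find_non_utf8_names`, `reencode_name`, `rename_to_utf8`, `find_name_collisions`) is hypothetical. The `latin-1` default is also an assumption; the real source encoding has to be known by the user. The collision check only approximates what an OS does, since the help text notes that real case and decomposition tables vary across OS versions.

```python
import os
import unicodedata
from collections import defaultdict


def is_utf8(name_bytes):
    """Return True if the raw filename bytes decode cleanly as UTF-8."""
    try:
        name_bytes.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


def find_non_utf8_names(root):
    """Yield raw byte paths under root whose basename is not valid UTF-8."""
    # Passing bytes to os.walk makes it return raw bytes names, so nothing
    # is decoded before we get a chance to inspect it.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root)):
        for name in dirnames + filenames:
            if not is_utf8(name):
                yield os.path.join(dirpath, name)


def reencode_name(name_bytes, source_encoding='latin-1'):
    """Convert a filename from source_encoding (an assumption) to UTF-8."""
    return name_bytes.decode(source_encoding).encode('utf-8')


def rename_to_utf8(root, source_encoding='latin-1'):
    """Rename non-UTF-8 names under root; return a list of (old, new) paths."""
    renamed = []
    # Walk bottom-up so a directory is renamed only after its contents.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root),
                                                topdown=False):
        for name in dirnames + filenames:
            if is_utf8(name):
                continue
            old = os.path.join(dirpath, name)
            new = os.path.join(dirpath, reencode_name(name, source_encoding))
            if os.path.lexists(new):
                continue  # Never overwrite an existing entry.
            os.rename(old, new)
            renamed.append((old, new))
    return renamed


def find_name_collisions(names):
    """Group names that differ only by case or Unicode normalization form."""
    groups = defaultdict(list)
    for name in names:
        # NFC plus casefold is only an approximation of OS behavior.
        key = unicodedata.normalize('NFC', name).casefold()
        groups[key].append(name)
    return [group for group in groups.values() if len(group) > 1]
```

A cautious workflow would be to run `find_non_utf8_names` first and review the list, then call `rename_to_utf8` only once the source encoding is confirmed, because decoding with the wrong encoding produces valid but incorrect UTF-8 names.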