# Copyright 2012 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from gslib.help_provider import HELP_NAME
from gslib.help_provider import HELP_NAME_ALIASES
from gslib.help_provider import HELP_ONE_LINE_SUMMARY
from gslib.help_provider import HelpProvider
from gslib.help_provider import HELP_TEXT
from gslib.help_provider import HelpType
from gslib.help_provider import HELP_TYPE

_detailed_help_text = ("""
<B>OVERVIEW</B>
If you use gsutil in large production tasks (such as uploading or
downloading many GB of data each night), there are a number of things
you can do to help ensure success. Specifically, this section discusses
how to script large production tasks around gsutil's resumable transfer
mechanism.


<B>BACKGROUND ON RESUMABLE TRANSFERS</B>
First, it's helpful to understand gsutil's resumable transfer mechanism,
and how your script needs to be implemented around this mechanism to work
reliably. gsutil uses the resumable transfer support in the boto library
when you attempt to upload or download a file larger than a configurable
threshold (by default, this threshold is 1MB). When a transfer fails
partway through (e.g., because of an intermittent network problem),
boto uses a randomized binary exponential backoff-and-retry strategy:
wait a random period between [0..1] seconds and retry; if that fails,
wait a random period between [0..2] seconds and retry; and if that
fails, wait a random period between [0..4] seconds, and so on, up to a
configurable number of times (the default is 6 times). Thus, the retries
span a randomized period of up to 1+2+4+8+16+32=63 seconds.

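Both the retry count and the resumable threshold are controlled by the boto
configuration file. As a rough sketch (the option names below reflect a
typical gsutil-generated ~/.boto file; treat them as assumptions and check
the comments in your own config for the authoritative names and defaults):

  [Boto]
  # How many times boto retries a request that fails with no progress.
  num_retries = 6

  [GSUtil]
  # File size, in bytes, above which gsutil performs resumable transfers.
  resumable_threshold = 1048576
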
If the transfer fails each of these attempts with no intervening
progress, gsutil gives up on the transfer, but keeps a "tracker" file
for it in a configurable location (the default location is ~/.gsutil/,
in a file named by a combination of the SHA1 hash of the name of the
bucket and object being transferred and the last 16 characters of the
file name). When transfers fail in this fashion, you can rerun gsutil
at some later time (e.g., after the networking problem has been
resolved), and the resumable transfer picks up where it left off.


<B>SCRIPTING DATA TRANSFER TASKS</B>
To script large production data transfer tasks around this mechanism,
you can implement a script that runs periodically, determines which file
transfers have not yet succeeded, and runs gsutil to copy them. Below,
we offer a number of suggestions about how this type of scripting should
be implemented:

1. When resumable transfers fail without any progress 6 times in a row
over the course of up to 63 seconds, it probably won't work to simply
retry the transfer immediately. A more successful strategy would be to
have a cron job that runs every 30 minutes, determines which transfers
need to be run, and runs them. If the network experiences intermittent
problems, the script picks up where it left off and will eventually
succeed (once the network problem has been resolved).

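For example, a crontab entry along the following lines (the script path is
purely illustrative) would run such a script twice an hour:

  */30 * * * * /usr/local/bin/run_pending_gsutil_transfers.sh
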
2. If your business depends on timely data transfer, you should consider
implementing some network monitoring. For example, you can implement
a task that attempts a small download every few minutes and raises an
alert if several attempts in a row fail (tune the probe frequency and
failure threshold to your requirements), so that your IT staff can
investigate problems promptly. As usual with monitoring implementations,
you should experiment with the alerting thresholds to avoid false
positives that train your staff to ignore the alerts.

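A minimal sketch of such a probe (the bucket name, object name, and alert
address are hypothetical; substitute whatever alerting mechanism your
environment provides) might look like:

  #!/bin/bash
  # Probe a small object every 5 minutes; alert after 3 straight failures.
  failures=0
  while true; do
    if gsutil cp gs://your-bucket/probe-object /tmp/probe-object; then
      failures=0
    else
      failures=$((failures + 1))
      if [ "$failures" -ge 3 ]; then
        echo "gsutil probe failing" | mail -s "gsutil alert" ops@example.com
      fi
    fi
    sleep 300
  done
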
3. There are a variety of ways you can determine what files remain to be
transferred. We recommend that you avoid attempting to get a complete
listing of a bucket containing many objects (e.g., tens of thousands
or more). One strategy is to structure your object names in a way that
represents your transfer process, and use gsutil prefix wildcards to
request partial bucket listings. For example, if your periodic process
involves downloading the current day's objects, you could name objects
using a year-month-day-object-ID format and then find today's objects by
using a command like gsutil ls gs://bucket/2011-09-27-*. Note that it
is more efficient to have a non-wildcard prefix like this than to use
something like gsutil ls gs://bucket/*-2011-09-27. The latter command
actually requests a complete bucket listing and then filters in gsutil,
while the former asks Google Storage to return the subset of objects
whose names start with everything up to the *.

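With that naming scheme, a daily script can build the prefix from the
current date; for example (the bucket name is illustrative):

  gsutil ls gs://bucket/$(date +%Y-%m-%d)-*
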
For data uploads, another technique would be to move local files from a "to
be processed" area to a "done" area as your script successfully copies files
to the cloud. You can do this in parallel batches by using a command like:

  gsutil -m cp -R to_upload/subdir_$i gs://bucket/subdir_$i

where i is a shell loop variable. Make sure to check that the shell exit
status ($? in Bourne-compatible shells; $status in csh) is 0 after each
gsutil cp command, to detect if some of the copies failed, and rerun the
affected copies, as sketched below.

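A minimal sketch of that loop (directory and bucket names are illustrative)
might look like:

  #!/bin/bash
  # Upload each pending batch; move a batch to done/ only if its copy
  # succeeded, so failed batches are retried on the next run.
  for i in 1 2 3 4; do
    if gsutil -m cp -R to_upload/subdir_$i gs://bucket/subdir_$i; then
      mv to_upload/subdir_$i done/subdir_$i
    else
      echo "upload of subdir_$i failed; leaving it for the next run" >&2
    fi
  done
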
With this strategy, the file system keeps track of all remaining work to
be done.

4. If you have a very large number of objects in a single bucket
(say hundreds of thousands or more), you should consider tracking your
objects in a database instead of using bucket listings to enumerate
the objects. For example, this database could track the state of your
downloads, so that your periodic download script can determine which
objects still need to be downloaded by querying the database locally
instead of performing a bucket listing.

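As a purely hypothetical sketch of this idea (the database file, table, and
column names are invented for illustration), the script could ask a local
SQLite database for the remaining work:

  sqlite3 transfers.db "SELECT name FROM downloads WHERE state != 'done';"
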
5. Make sure you don't delete partially downloaded files after a transfer
fails: gsutil picks up where it left off (and performs an MD5 check of
the final downloaded content to ensure data integrity), so deleting
partially transferred files will cause you to lose progress and make
more wasteful use of your network. You should also make sure whatever
process is waiting to consume the downloaded data doesn't get pointed
at the partially downloaded files. One way to do this is to download
into a staging directory and then move successfully downloaded files to
a directory where consumer processes will read them.

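For example (directory and bucket names are illustrative), a download run
could write into a staging area and publish files only on success:

  gsutil cp gs://bucket/2011-09-27-* /data/staging/ &&
    mv /data/staging/* /data/ready/
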
6. If you have a fast network connection, you can speed up the transfer of
large numbers of files by using the gsutil -m (multi-threading /
multi-processing) option. Be aware, however, that gsutil doesn't attempt to
keep track of which files were downloaded successfully in cases where some
files failed to download. For example, if you use multi-threaded transfers
to download 100 files and 3 of them fail, it is up to your scripting
process to determine which transfers didn't succeed, and retry them. A
periodic check-and-run approach like the one outlined earlier would handle
this case.

If you use parallel transfers (gsutil -m), you might want to experiment with
the number of threads being used (via the parallel_thread_count setting
in the .boto config file). By default, gsutil uses 24 threads. Depending
on your network speed, available memory, CPU load, and other conditions,
this may or may not be optimal. Try higher and lower thread counts to find
the best value for your environment.
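
For example, to try 10 threads you would set (in the [GSUtil] section of a
typical gsutil-generated .boto file; check your own config for the exact
section layout):

  parallel_thread_count = 10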
| 141 """) | |
| 142 | |
| 143 | |
| 144 class CommandOptions(HelpProvider): | |
| 145 """Additional help about using gsutil for production tasks.""" | |
| 146 | |
| 147 help_spec = { | |
| 148 # Name of command or auxiliary help info for which this help applies. | |
| 149 HELP_NAME : 'prod', | |
| 150 # List of help name aliases. | |
| 151 HELP_NAME_ALIASES : ['production', 'resumable', 'resumable upload', | |
| 152 'resumable transfer', 'resumable download', | |
| 153 'scripts', 'scripting'], | |
| 154 # Type of help: | |
| 155 HELP_TYPE : HelpType.ADDITIONAL_HELP, | |
| 156 # One line summary of this help. | |
| 157 HELP_ONE_LINE_SUMMARY : 'Scripting production data transfers with gsutil', | |
| 158 # The full help text. | |
| 159 HELP_TEXT : _detailed_help_text, | |
| 160 } | |