# -*- coding: utf-8 -*-
# Copyright 2012 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Additional help about using gsutil for production tasks."""

from __future__ import absolute_import

from gslib.help_provider import HelpProvider

_DETAILED_HELP_TEXT = ("""
<B>OVERVIEW</B>
  If you use gsutil in large production tasks (such as uploading or
  downloading many GiBs of data each night), there are a number of things
  you can do to help ensure success. Specifically, this section discusses
  how to script large production tasks around gsutil's resumable transfer
  mechanism.


<B>BACKGROUND ON RESUMABLE TRANSFERS</B>
  First, it's helpful to understand gsutil's resumable transfer mechanism,
  and how your script needs to be implemented around this mechanism to work
  reliably. gsutil uses resumable transfer support when you attempt to upload
  or download a file larger than a configurable threshold (by default, this
  threshold is 2 MiB). When a transfer fails partway through (e.g., because of
  an intermittent network problem), gsutil uses a truncated randomized binary
  exponential backoff-and-retry strategy that by default will retry transfers up
  to 6 times over a 63 second period of time (see "gsutil help retries" for
  details). If the transfer fails each of these attempts with no intervening
  progress, gsutil gives up on the transfer, but keeps a "tracker" file for
  it in a configurable location (the default location is ~/.gsutil/, in a file
  named by a combination of the SHA1 hash of the name of the bucket and object
  being transferred and the last 16 characters of the file name). When transfers
  fail in this fashion, you can rerun gsutil at some later time (e.g., after
  the networking problem has been resolved), and the resumable transfer picks
  up where it left off.
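
  For example, if a large download is interrupted and gsutil has given up,
  rerunning the identical command later (the bucket, object, and local path
  here are illustrative) resumes from the tracker file rather than starting
  over:

    gsutil cp gs://your-bucket/nightly-export.tar /data/nightly-export.tar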


<B>SCRIPTING DATA TRANSFER TASKS</B>
  To script large production data transfer tasks around this mechanism,
  you can implement a script that runs periodically, determines which file
  transfers have not yet succeeded, and runs gsutil to copy them. Below,
  we offer a number of suggestions about how this type of scripting should
  be implemented:

  1. When resumable transfers fail without any progress 6 times in a row
     over the course of up to 63 seconds, it probably won't work to simply
     retry the transfer immediately. A more successful strategy would be to
     have a cron job that runs every 30 minutes, determines which transfers
     need to be run, and runs them. If the network experiences intermittent
     problems, the script picks up where it left off and will eventually
     succeed (once the network problem has been resolved).
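
     For example, a crontab entry like the following (the script path is
     illustrative, not part of gsutil) would run such a script every 30
     minutes:

       */30 * * * * /usr/local/bin/run_transfers.sh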

  2. If your business depends on timely data transfer, you should consider
     implementing some network monitoring. For example, you can implement
     a task that attempts a small download every few minutes and raises an
     alert if several attempts in a row fail (with the check frequency and
     failure threshold tuned to your requirements), so that your IT staff can
     investigate problems promptly. As usual with monitoring implementations,
     you should experiment with the alerting thresholds to avoid false
     positive alerts that cause your staff to start ignoring them.

  3. There are a variety of ways you can determine what files remain to be
     transferred. We recommend that you avoid attempting to get a complete
     listing of a bucket containing many objects (e.g., tens of thousands
     or more). One strategy is to structure your object names in a way that
     represents your transfer process, and use gsutil prefix wildcards to
     request partial bucket listings. For example, if your periodic process
     involves downloading the current day's objects, you could name objects
     using a year-month-day-object-ID format and then find today's objects by
     using a command like gsutil ls "gs://bucket/2011-09-27-*". Note that it
     is more efficient to have a non-wildcard prefix like this than to use
     something like gsutil ls "gs://bucket/*-2011-09-27". The latter command
     actually requests a complete bucket listing and then filters in gsutil,
     while the former asks Google Storage to return the subset of objects
     whose names start with everything up to the "*".
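
     As a concrete sketch (assuming bash and GNU date; the bucket name is
     illustrative), a periodic download script could compute today's prefix
     and list only the matching objects:

       today=$(date +%Y-%m-%d)
       gsutil ls "gs://bucket/${today}-*"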

     For data uploads, another technique would be to move local files from a "to
     be processed" area to a "done" area as your script successfully copies
     files to the cloud. You can do this in parallel batches by using a command
     like:

       gsutil -m cp -r to_upload/subdir_$i gs://bucket/subdir_$i

     where i is a shell loop variable. Make sure to check that the shell exit
     status ($? in bash; $status in csh) is 0 after each gsutil cp command, to
     detect whether some of the copies failed, and rerun the affected copies.

     With this strategy, the file system keeps track of all remaining work to
     be done.
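
     A minimal sketch of this pattern (assuming bash, the illustrative
     directory layout above, and a local "done" area) might look like:

       for i in 1 2 3 4; do
         if gsutil -m cp -r to_upload/subdir_$i gs://bucket/subdir_$i; then
           mv to_upload/subdir_$i done/subdir_$i
         fi
       done

     gsutil exits with a non-zero status if any copy in a batch fails, so
     the move happens only for fully uploaded batches.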

  4. If you have really large numbers of objects in a single bucket
     (say hundreds of thousands or more), you should consider tracking your
     objects in a database instead of using bucket listings to enumerate
     the objects. For example, this database could track the state of your
     downloads, so your periodic download script can determine which objects
     still need to be downloaded by querying the database locally instead
     of performing a bucket listing.
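
     As an illustrative sketch (the database file, table, and column names
     are assumptions, not something gsutil provides), the script could query
     a local SQLite database for the objects still pending:

       sqlite3 transfers.db "SELECT name FROM objects WHERE state <> 'done';"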

  5. Make sure you don't delete partially downloaded files after a transfer
     fails: gsutil picks up where it left off (and performs an MD5 check of
     the final downloaded content to ensure data integrity), so deleting
     partially transferred files will cause you to lose progress and make
     more wasteful use of your network. You should also make sure whatever
     process is waiting to consume the downloaded data doesn't get pointed
     at the partially downloaded files. One way to do this is to download
     into a staging directory and then move successfully downloaded files to
     a directory where consumer processes will read them.
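
     For example (a minimal sketch with illustrative paths; the move runs
     only if every download in the batch succeeded):

       if gsutil -m cp "gs://bucket/2011-09-27-*" /staging/; then
         mv /staging/* /ready/
       fi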

  6. If you have a fast network connection, you can speed up the transfer of
     large numbers of files by using the gsutil -m (multi-threading /
     multi-processing) option. Be aware, however, that gsutil doesn't attempt to
     keep track of which files were downloaded successfully in cases where some
     files failed to download. For example, if you use multi-threaded transfers
     to download 100 files and 3 failed to download, it is up to your scripting
     process to determine which transfers didn't succeed, and retry them. A
     periodic check-and-run approach like the one outlined earlier would
     handle this case.

     If you use parallel transfers (gsutil -m) you might want to experiment with
     the number of threads being used (via the parallel_thread_count setting
     in the .boto config file). By default, gsutil uses 10 threads for Linux
     and 24 threads for other operating systems. Depending on your network
     speed, available memory, CPU load, and other conditions, this may or may
     not be optimal. Try experimenting with higher or lower numbers of threads
     to find the best number of threads for your environment.
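
     For example, to try 16 threads you could set (in the [GSUtil] section
     of your .boto file; the value shown is illustrative):

       [GSUtil]
       parallel_thread_count = 16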


<B>RUNNING GSUTIL ON MULTIPLE MACHINES</B>
  When running gsutil on multiple machines that are all attempting to use the
  same OAuth2 refresh token, it is possible to encounter rate limiting errors
  for the refresh requests (especially if all of these machines are likely to
  start running gsutil at the same time). To account for this, gsutil will
  automatically retry OAuth2 refresh requests with a truncated randomized
  exponential backoff strategy like the one described in the
  "BACKGROUND ON RESUMABLE TRANSFERS" section above. The number of retries
  attempted for OAuth2 refresh requests can be controlled via the
  "oauth2_refresh_retries" variable in the .boto config file.
148 """) | |
149 | |
150 | |
151 class CommandOptions(HelpProvider): | |
152 """Additional help about using gsutil for production tasks.""" | |
153 | |
154 # Help specification. See help_provider.py for documentation. | |
155 help_spec = HelpProvider.HelpSpec( | |
156 help_name='prod', | |
157 help_name_aliases=[ | |
158 'production', 'resumable', 'resumable upload', 'resumable transfer', | |
159 'resumable download', 'scripts', 'scripting'], | |
160 help_type='additional_help', | |
161 help_one_line_summary='Scripting Production Transfers', | |
162 help_text=_DETAILED_HELP_TEXT, | |
163 subcommand_help_text={}, | |
164 ) | |