# -*- coding: utf-8 -*-
# Copyright 2012 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Additional help about using gsutil for production tasks."""

from __future__ import absolute_import

from gslib.help_provider import HelpProvider

_DETAILED_HELP_TEXT = ("""
<B>OVERVIEW</B>
  If you use gsutil in large production tasks (such as uploading or
  downloading many GiBs of data each night), there are a number of things
  you can do to help ensure success. Specifically, this section discusses
  how to script large production tasks around gsutil's resumable transfer
  mechanism.


<B>BACKGROUND ON RESUMABLE TRANSFERS</B>
  First, it's helpful to understand gsutil's resumable transfer mechanism,
  and how your script needs to be implemented around this mechanism to work
  reliably. gsutil uses resumable transfer support when you attempt to
  upload or download a file larger than a configurable threshold (by
  default, this threshold is 2 MiB). When a transfer fails partway through
  (e.g., because of an intermittent network problem), gsutil uses a
  truncated randomized binary exponential backoff-and-retry strategy that
  by default will retry transfers up to 6 times over a 63 second period of
  time (see "gsutil help retries" for details). If the transfer fails each
  of these attempts with no intervening progress, gsutil gives up on the
  transfer, but keeps a "tracker" file for it in a configurable location
  (the default location is ~/.gsutil/, in a file named by a combination of
  the SHA1 hash of the name of the bucket and object being transferred and
  the last 16 characters of the file name). When transfers fail in this
  fashion, you can rerun gsutil at some later time (e.g., after the
  networking problem has been resolved), and the resumable transfer picks
  up where it left off.
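
  For example, if a large download fails after exhausting its retries, you
  can simply rerun the same command later and gsutil resumes from the
  tracker file instead of starting over (the bucket, object, and local
  path below are placeholders):

    gsutil cp gs://your-bucket/big-file.bin /data/big-file.bin
    # ...network outage; gsutil gives up, leaving a tracker file...
    gsutil cp gs://your-bucket/big-file.bin /data/big-file.bin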


<B>SCRIPTING DATA TRANSFER TASKS</B>
  To script large production data transfer tasks around this mechanism,
  you can implement a script that runs periodically, determines which file
  transfers have not yet succeeded, and runs gsutil to copy them. Below,
  we offer a number of suggestions about how this type of scripting should
  be implemented:

  1. When resumable transfers fail without any progress 6 times in a row
     over the course of up to 63 seconds, it probably won't work to simply
     retry the transfer immediately. A more successful strategy would be to
     have a cron job that runs every 30 minutes, determines which transfers
     need to be run, and runs them. If the network experiences intermittent
     problems, the script picks up where it left off and will eventually
     succeed (once the network problem has been resolved).
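
     For example, a crontab entry like the following would invoke such a
     script every 30 minutes (run_transfers.sh is a placeholder for your
     own script that finds and runs the remaining transfers):

       */30 * * * * /usr/local/bin/run_transfers.sh >> /var/log/transfers.log 2>&1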

  2. If your business depends on timely data transfer, you should consider
     implementing some network monitoring. For example, you can implement
     a task that attempts a small download every few minutes (or more or
     less frequently, depending on your requirements) and raises an alert
     if several attempts in a row fail, so that your IT staff can
     investigate problems promptly. As usual with monitoring
     implementations, you should experiment with the alerting thresholds,
     to avoid false-positive alerts that train your staff to ignore them.
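
     A minimal probe might look like the following sketch (the probe
     object, check interval, failure threshold, and alerting command are
     all assumptions to adapt to your environment):

       failures=0
       while sleep 300; do
         # Fetch a small known object; success resets the failure count.
         if gsutil -q cat gs://your-bucket/probe-object > /dev/null; then
           failures=0
         else
           failures=$((failures + 1))
           if [ "$failures" -ge 3 ]; then
             echo "gsutil probe failed $failures times in a row" \
               | mail -s "transfer ALERT" ops@example.com
           fi
         fi
       done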

  3. There are a variety of ways you can determine what files remain to be
     transferred. We recommend that you avoid attempting to get a complete
     listing of a bucket containing many objects (e.g., tens of thousands
     or more). One strategy is to structure your object names in a way
     that represents your transfer process, and use gsutil prefix
     wildcards to request partial bucket listings. For example, if your
     periodic process involves downloading the current day's objects, you
     could name objects using a year-month-day-object-ID format and then
     find today's objects by using a command like
     gsutil ls "gs://bucket/2011-09-27-*". Note that it is more efficient
     to have a non-wildcard prefix like this than to use something like
     gsutil ls "gs://bucket/*-2011-09-27". The latter command actually
     requests a complete bucket listing and then filters in gsutil, while
     the former asks Google Cloud Storage to return only the subset of
     objects whose names start with everything up to the "*".
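
     Building on that naming scheme, a daily download pass might look like
     this (the bucket name and destination directory are placeholders):

       today=$(date +%Y-%m-%d)
       gsutil -m cp "gs://bucket/${today}-*" /data/incoming/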

     For data uploads, another technique would be to move local files from
     a "to be processed" area to a "done" area as your script successfully
     copies files to the cloud. You can do this in parallel batches by
     using a command like:

       gsutil -m cp -r to_upload/subdir_$i gs://bucket/subdir_$i

     where i is a shell loop variable. Make sure to check that the shell
     exit status ($? in bash; $status in csh) is 0 after each gsutil cp
     command, to detect whether some of the copies failed, and rerun the
     affected copies.

     With this strategy, the file system keeps track of all remaining work
     to be done.
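
     A sketch of such a batch loop (the directory layout and batch count
     are placeholders):

       for i in 0 1 2 3; do
         # Move a batch to "done" only if every copy in it succeeded.
         if gsutil -m cp -r "to_upload/subdir_$i" "gs://bucket/subdir_$i"; then
           mv "to_upload/subdir_$i" "done/subdir_$i"
         else
           echo "subdir_$i failed; will be retried on the next run" >&2
         fi
       done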

  4. If you have a really large number of objects in a single bucket (say
     hundreds of thousands or more), you should consider tracking your
     objects in a database instead of using bucket listings to enumerate
     the objects. For example, this database could track the state of your
     downloads, so your periodic download script can determine which
     objects need to be downloaded by querying the database locally
     instead of performing a bucket listing.
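
     One minimal shape for such tracking, sketched with sqlite3 (the
     schema, file names, and state values are assumptions, not a
     prescribed design):

       sqlite3 transfers.db \
         "CREATE TABLE IF NOT EXISTS objects (name TEXT PRIMARY KEY, state TEXT);"
       # Enumerate pending work locally instead of listing the bucket:
       sqlite3 transfers.db \
         "SELECT name FROM objects WHERE state = 'pending';" |
       while read -r name; do
         gsutil cp "gs://bucket/$name" /data/incoming/ &&
           sqlite3 transfers.db \
             "UPDATE objects SET state = 'done' WHERE name = '$name';"
       done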

  5. Make sure you don't delete partially downloaded files after a
     transfer fails: gsutil picks up where it left off (and performs an
     MD5 check of the final downloaded content to ensure data integrity),
     so deleting partially transferred files will cause you to lose
     progress and waste network bandwidth. You should also make sure
     whatever process is waiting to consume the downloaded data doesn't
     get pointed at the partially downloaded files. One way to do this is
     to download into a staging directory and then move successfully
     downloaded files to a directory where consumer processes will read
     them.
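
     For example (the staging and ready directories are placeholders):

       # If any copy fails, nothing moves; partial files stay in staging
       # so the next run can resume them.
       if gsutil -m cp "gs://bucket/2011-09-27-*" /data/staging/; then
         mv /data/staging/* /data/ready/
       fi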

  6. If you have a fast network connection, you can speed up the transfer
     of large numbers of files by using the gsutil -m (multi-threading /
     multi-processing) option. Be aware, however, that gsutil doesn't
     attempt to keep track of which files were downloaded successfully in
     cases where some files failed to download. For example, if you use
     multi-threaded transfers to download 100 files and 3 fail to
     download, it is up to your scripting process to determine which
     transfers didn't succeed, and to retry them. A periodic check-and-run
     approach like the one outlined earlier would handle this case.

     If you use parallel transfers (gsutil -m) you might want to
     experiment with the number of threads being used (via the
     parallel_thread_count setting in the .boto config file). By default,
     gsutil uses 10 threads for Linux and 24 threads for other operating
     systems. Depending on your network speed, available memory, CPU load,
     and other conditions, this may or may not be optimal. Try
     experimenting with higher or lower numbers of threads to find the
     best number of threads for your environment.
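
     For example, to try 16 threads, set the following in the [GSUtil]
     section of your .boto config file (16 is illustrative, not a
     recommendation):

       [GSUtil]
       parallel_thread_count = 16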


<B>RUNNING GSUTIL ON MULTIPLE MACHINES</B>
  When running gsutil on multiple machines that are all attempting to use
  the same OAuth2 refresh token, it is possible to encounter rate limiting
  errors for the refresh requests (especially if all of these machines are
  likely to start running gsutil at the same time). To account for this,
  gsutil will automatically retry OAuth2 refresh requests with a truncated
  randomized exponential backoff strategy like the one described in the
  "BACKGROUND ON RESUMABLE TRANSFERS" section above. The number of retries
  attempted for OAuth2 refresh requests can be controlled via the
  "oauth2_refresh_retries" variable in the .boto config file.
| 148 """) |
| 149 |
| 150 |
| 151 class CommandOptions(HelpProvider): |
| 152 """Additional help about using gsutil for production tasks.""" |
| 153 |
| 154 # Help specification. See help_provider.py for documentation. |
| 155 help_spec = HelpProvider.HelpSpec( |
| 156 help_name='prod', |
| 157 help_name_aliases=[ |
| 158 'production', 'resumable', 'resumable upload', 'resumable transfer', |
| 159 'resumable download', 'scripts', 'scripting'], |
| 160 help_type='additional_help', |
| 161 help_one_line_summary='Scripting Production Transfers', |
| 162 help_text=_DETAILED_HELP_TEXT, |
| 163 subcommand_help_text={}, |
| 164 ) |