Chromium Code Reviews
chromiumcodereview-hr@appspot.gserviceaccount.com (chromiumcodereview-hr) | Please choose your nickname with Settings | Help | Chromium Project | Gerrit Changes | Sign out
(9)

Unified Diff: third_party/gsutil/gslib/addlhelp/prod.py

Issue 1377933002: [catapult] - Copy Telemetry's gsutilz over to third_party. (Closed) Base URL: https://github.com/catapult-project/catapult.git@master
Patch Set: Rename to gsutil. Created 5 years, 3 months ago
Use n/p to move between diff chunks; N/P to move between comments. Draft comments are only viewable by you.
Jump to:
View side-by-side diff with in-line comments
Download patch
« no previous file with comments | « third_party/gsutil/gslib/addlhelp/naming.py ('k') | third_party/gsutil/gslib/addlhelp/projects.py » ('j') | no next file with comments »
Expand Comments ('e') | Collapse Comments ('c') | Show Comments Hide Comments ('s')
Index: third_party/gsutil/gslib/addlhelp/prod.py
diff --git a/third_party/gsutil/gslib/addlhelp/prod.py b/third_party/gsutil/gslib/addlhelp/prod.py
new file mode 100644
index 0000000000000000000000000000000000000000..df090145d03a23b5a4a6f4d472031b5151637c6a
--- /dev/null
+++ b/third_party/gsutil/gslib/addlhelp/prod.py
@@ -0,0 +1,164 @@
+# -*- coding: utf-8 -*-
+# Copyright 2012 Google Inc. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Additional help about using gsutil for production tasks."""
+
+from __future__ import absolute_import
+
+from gslib.help_provider import HelpProvider
+
+_DETAILED_HELP_TEXT = ("""
+<B>OVERVIEW</B>
+ If you use gsutil in large production tasks (such as uploading or
+ downloading many GiBs of data each night), there are a number of things
+ you can do to help ensure success. Specifically, this section discusses
+ how to script large production tasks around gsutil's resumable transfer
+ mechanism.
+
+
+<B>BACKGROUND ON RESUMABLE TRANSFERS</B>
+ First, it's helpful to understand gsutil's resumable transfer mechanism,
+ and how your script needs to be implemented around this mechanism to work
+ reliably. gsutil uses resumable transfer support when you attempt to upload
+ or download a file larger than a configurable threshold (by default, this
+ threshold is 2 MiB). When a transfer fails partway through (e.g., because of
+ an intermittent network problem), gsutil uses a truncated randomized binary
+ exponential backoff-and-retry strategy that by default will retry transfers up
+ to 6 times over a 63 second period of time (see "gsutil help retries" for
+ details). If the transfer fails each of these attempts with no intervening
+ progress, gsutil gives up on the transfer, but keeps a "tracker" file for
+ it in a configurable location (the default location is ~/.gsutil/, in a file
+ named by a combination of the SHA1 hash of the name of the bucket and object
+ being transferred and the last 16 characters of the file name). When transfers
+ fail in this fashion, you can rerun gsutil at some later time (e.g., after
+ the networking problem has been resolved), and the resumable transfer picks
+ up where it left off.
+
+
+<B>SCRIPTING DATA TRANSFER TASKS</B>
+ To script large production data transfer tasks around this mechanism,
+ you can implement a script that runs periodically, determines which file
+ transfers have not yet succeeded, and runs gsutil to copy them. Below,
+ we offer a number of suggestions about how this type of scripting should
+ be implemented:
+
+ 1. When resumable transfers fail without any progress 6 times in a row
+ over the course of up to 63 seconds, it probably won't work to simply
+ retry the transfer immediately. A more successful strategy would be to
+ have a cron job that runs every 30 minutes, determines which transfers
+ need to be run, and runs them. If the network experiences intermittent
+ problems, the script picks up where it left off and will eventually
+ succeed (once the network problem has been resolved).
+
+ 2. If your business depends on timely data transfer, you should consider
+ implementing some network monitoring. For example, you can implement
+ a task that attempts a small download every few minutes and raises an
+ alert if the attempt fails for several attempts in a row (or more or less
+ frequently depending on your requirements), so that your IT staff can
+ investigate problems promptly. As usual with monitoring implementations,
+ you should experiment with the alerting thresholds, to avoid false
+ positive alerts that cause your staff to begin ignoring the alerts.
+
+ 3. There are a variety of ways you can determine what files remain to be
+ transferred. We recommend that you avoid attempting to get a complete
+ listing of a bucket containing many objects (e.g., tens of thousands
+ or more). One strategy is to structure your object names in a way that
+ represents your transfer process, and use gsutil prefix wildcards to
+ request partial bucket listings. For example, if your periodic process
+ involves downloading the current day's objects, you could name objects
+ using a year-month-day-object-ID format and then find today's objects by
+ using a command like gsutil ls "gs://bucket/2011-09-27-*". Note that it
+ is more efficient to have a non-wildcard prefix like this than to use
+ something like gsutil ls "gs://bucket/*-2011-09-27". The latter command
+ actually requests a complete bucket listing and then filters in gsutil,
+ while the former asks Google Storage to return the subset of objects
+ whose names start with everything up to the "*".
+
+ For data uploads, another technique would be to move local files from a "to
+ be processed" area to a "done" area as your script successfully copies
+ files to the cloud. You can do this in parallel batches by using a command
+ like:
+
+ gsutil -m cp -r to_upload/subdir_$i gs://bucket/subdir_$i
+
+ where i is a shell loop variable. Make sure to check the shell $status
+ variable is 0 after each gsutil cp command, to detect if some of the copies
+ failed, and rerun the affected copies.
+
+ With this strategy, the file system keeps track of all remaining work to
+ be done.
+
+ 4. If you have really large numbers of objects in a single bucket
+ (say hundreds of thousands or more), you should consider tracking your
+ objects in a database instead of using bucket listings to enumerate
+ the objects. For example this database could track the state of your
+ downloads, so you can determine what objects need to be downloaded by
+ your periodic download script by querying the database locally instead
+ of performing a bucket listing.
+
+ 5. Make sure you don't delete partially downloaded files after a transfer
+ fails: gsutil picks up where it left off (and performs an MD5 check of
+ the final downloaded content to ensure data integrity), so deleting
+ partially transferred files will cause you to lose progress and make
+ more wasteful use of your network. You should also make sure whatever
+ process is waiting to consume the downloaded data doesn't get pointed
+ at the partially downloaded files. One way to do this is to download
+ into a staging directory and then move successfully downloaded files to
+ a directory where consumer processes will read them.
+
+ 6. If you have a fast network connection, you can speed up the transfer of
+ large numbers of files by using the gsutil -m (multi-threading /
+ multi-processing) option. Be aware, however, that gsutil doesn't attempt to
+ keep track of which files were downloaded successfully in cases where some
+ files failed to download. For example, if you use multi-threaded transfers
+ to download 100 files and 3 failed to download, it is up to your scripting
+ process to determine which transfers didn't succeed, and retry them. A
+ periodic check-and-run approach like outlined earlier would handle this
+ case.
+
+ If you use parallel transfers (gsutil -m) you might want to experiment with
+ the number of threads being used (via the parallel_thread_count setting
+ in the .boto config file). By default, gsutil uses 10 threads for Linux
+ and 24 threads for other operating systems. Depending on your network
+ speed, available memory, CPU load, and other conditions, this may or may
+ not be optimal. Try experimenting with higher or lower numbers of threads
+ to find the best number of threads for your environment.
+
+<B>RUNNING GSUTIL ON MULTIPLE MACHINES</B>
+ When running gsutil on multiple machines that are all attempting to use the
+ same OAuth2 refresh token, it is possible to encounter rate limiting errors
+ for the refresh requests (especially if all of these machines are likely to
+ start running gsutil at the same time). To account for this, gsutil will
+ automatically retry OAuth2 refresh requests with a truncated randomized
+ exponential backoff strategy like that which is described in the
+ "BACKGROUND ON RESUMABLE TRANSFERS" section above. The number of retries
+ attempted for OAuth2 refresh requests can be controlled via the
+ "oauth2_refresh_retries" variable in the .boto config file.
+""")
+
+
+class CommandOptions(HelpProvider):
+ """Additional help about using gsutil for production tasks."""
+
+ # Help specification. See help_provider.py for documentation.
+ help_spec = HelpProvider.HelpSpec(
+ help_name='prod',
+ help_name_aliases=[
+ 'production', 'resumable', 'resumable upload', 'resumable transfer',
+ 'resumable download', 'scripts', 'scripting'],
+ help_type='additional_help',
+ help_one_line_summary='Scripting Production Transfers',
+ help_text=_DETAILED_HELP_TEXT,
+ subcommand_help_text={},
+ )
« no previous file with comments | « third_party/gsutil/gslib/addlhelp/naming.py ('k') | third_party/gsutil/gslib/addlhelp/projects.py » ('j') | no next file with comments »

Powered by Google App Engine
This is Rietveld 408576698