|
|
Created:
7 years, 12 months ago by Ryan Tseng Modified:
7 years, 11 months ago CC:
chromium-reviews Base URL:
svn://svn.chromium.org/chrome/trunk/src Visibility:
Public. |
DescriptionScripts to download files from google storage based on sha1 sums
BUG=153360
Patch Set 1 #Patch Set 2 : newline fixes #
Total comments: 44
Patch Set 3 : #Patch Set 4 : #
Total comments: 19
Patch Set 5 : Partial progress #Patch Set 6 : Changed upload script, fixed to run #
Total comments: 18
Patch Set 7 : Moved files into new folder #
Messages
Total messages: 6 (0 generated)
here are some initial comments https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... File build/download_from_google_storage.py (right): https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:19: import Queue these should be sorted alphabetically https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:20: from optparse import OptionParser insert an empty line before line 20 https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:21: GSUTIL_DEFAULT_PATH = os.path.join(os.path.dirname(os.path.normpath(__file__)), insert an empty line before line 21 https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:33: def call_interactive(self, *args): at indent of 2 spaces, there should be a single empty line before each "def" method definition https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:34: p = subprocess.Popen(('python', self.path) + args, stdout=sys.stdout, please use sys.executable instead of 'python' since the script may have been invoked with a particular python and we want to keep using that. also change line 38 to use sys.executable. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:47: else: since you return from the 'code == 0' branch at line 46, you can drop the else on line 47 and de-indent lines 48-57 by 2 spaces https://codereview.chromium.org/11664024/diff/2001/build/upload_to_google_sto... File build/upload_to_google_storage.py (right): https://codereview.chromium.org/11664024/diff/2001/build/upload_to_google_sto... build/upload_to_google_storage.py:5: """Script to download files from Google Storage.""" insert an empty line before line 5 https://codereview.chromium.org/11664024/diff/2001/build/upload_to_google_sto... build/upload_to_google_storage.py:19: '..', '..', 'third_party', 'gsutil', 'gsutil') similar comments here as what I wrote in download_from_google_storage.py
I only reviewed one of the files. You have some significant changes to make, so I'm going to hold off reviewing the second file until you upload another patch. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... File build/download_from_google_storage.py (right): https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:25: class Gsutil(): I strongly recommend (but will not require) timeout functionality for this class. Google Storage can be flaky and transfers can hang indefinitely. Refer to tools/build/scripts/slave/gitzip.py for an example. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:26: def __init__(self, path): Maybe optionally take a path to a credential file? def __init__(self, gsutil_path, boto_path=None): ... def call(self, *args): env = os.environ.copy() if self.boto_path is not None: env['AWS_CREDENTIAL_FILE'] = self.boto_path p = subprocess.Popen(..., env=env) Refer to GSUtilSetup() in tools/build/scripts/slave/slave_utils.py https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:30: raise IOError('GSUtil not found in %s' % path) I'd prefer for this error to be surfaced the same way that subprocess.Popen does, i.e., with OSError. In fact, it's probably OK to eliminate this error-checking entirely, and let subprocess.Popen do it. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:31: stdout = None This and the next line are no-ops https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:33: def call_interactive(self, *args): I recommend renaming this and the next method to call() and check_call(). That aligns with the convention of the subprocess module, and it's a bit more intuitive. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:34: p = subprocess.Popen(('python', self.path) + args, stdout=sys.stdout, stdin, stdout, stderr args are unnecessary; default behavior is to inherit these from the parent. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:60: def check_sha1(sha1_sum, filename): CamelCase https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:120: print >>sys.stderr, 'Missing target.' This is cryptic; please add a usage message. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:123: input_filename = args[0] What about handling multiple input files? Currently, providing multiple input files will silently ignore all but the first, which would be surprising to me. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:128: # 3. The input is a .sha1 file Please add an option to create a .sha1 file after download. It's not at all clear to me what the contents of a .sha1 file should be; there ought to be a standard programmatic way to create it. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:133: elif check_sha1(input_filename, input_filename): This is the most expensive condition to check, so it should go last, if possible. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:141: options.output = input_filename.replace('.sha1', '') Maybe I'm paranoid, but I prefer: options.output = input_filename[:-5] https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:147: print >>sys.stderr, 'Input %s not recognized.' % input_filename Please include the substance of this error message in your usage message. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:158: gsutil = Gsutil(options.gsutil_path) The way your code is structured, there will be a single Gsutil instance, and multiple threads will invoke methods on that instance simultaneously. That is a very dangerous approach. I strongly recommend you restructure so that there is one Gsutil instance per task. work_queue should contain instances of Gsutil, rather than tuples of (input, output). https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:214: # Have a thread set a flag when all the tasks are done. This is not necessary. You can just join() all the worker threads from the main thread, then exit. ctrl-c will still work. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:221: time.sleep(1) # Do a sleep loop so we can ctrl + c out of this anytime. My previous comment means this clause is unnecessary, but I want to emphasize something here: no sleep() in multi-processing code, never ever ever.
https://codereview.chromium.org/11664024/diff/18001/build/download_from_googl... File build/download_from_google_storage.py (right): https://codereview.chromium.org/11664024/diff/18001/build/download_from_googl... build/download_from_google_storage.py:20: from optparse import OptionParser Actually, why import this one differently from the others? Just "import optparse" like the other stdlibs. https://codereview.chromium.org/11664024/diff/18001/build/download_from_googl... build/download_from_google_storage.py:23: '..', '..', 'third_party', 'gsutil', 'gsutil') Aligning at +10 seems quite arbitrary. https://codereview.chromium.org/11664024/diff/18001/build/download_from_googl... build/download_from_google_storage.py:28: if os.path.exists(path): if not os.path.exists(path): raise OSError('GSUtil not found in %s' % path) self.path = path https://codereview.chromium.org/11664024/diff/18001/build/download_from_googl... build/download_from_google_storage.py:41: p = subprocess.Popen((sys.executable, self.path) + args, env=env) you want subprocess.call() https://codereview.chromium.org/11664024/diff/18001/build/download_from_googl... build/download_from_google_storage.py:52: def _thread_main(): This code is never called? https://codereview.chromium.org/11664024/diff/18001/build/download_from_googl... build/download_from_google_storage.py:83: sha1 = hashlib.sha1() FYI, this doesn't work with files > 1.5 gb or so. See the implementation in swarm_client/ for one that works with larger files. https://codereview.chromium.org/11664024/diff/18001/build/download_from_googl... build/download_from_google_storage.py:138: print >>sys.stderr, 'ERROR: Missing target.' parser.error() and line 142 too https://codereview.chromium.org/11664024/diff/18001/build/download_from_googl... build/download_from_google_storage.py:149: # input_filename is a file? This could mean one of three things: Remove the guess work and require an argument. Guesswork = maintenance nightmare in a few years. https://codereview.chromium.org/11664024/diff/18001/build/download_from_googl... build/download_from_google_storage.py:227: t = threading.Thread(target=_downloader_worker_thread, args=[thread_num, Are you going to start 1000 threads if there are 1000 files to download? https://codereview.chromium.org/11664024/diff/18001/build/download_from_googl... build/download_from_google_storage.py:235: t.join() return 0
Made changes to scripts, including: * Added a timeout to gsutil * More verbose help strings * Target is defined explicitly rather than implicitly * Each worker thread gets their own instance of the gsutil object via clone() * can specify custom .boto file * sha1 and md5 hashing done in chunks rather than the whole thing at once * call() -> check_call(), call_interactive() -> call() * State is not stored in the gsutil object, and stdout/stderr is returned as a result of check_call() * Added timers to show how long everything takes. * Got rid of the sleep loop, main thread just join()s on all the worker threads instead. (ctrl + c doesn't seem to work in all cases like this, but ctrl + z seems okay) Tested by going into chrome-internal/trunk/third_party/adobe/flash, running: find . -name .svn -prune -o -size +1000k -type f -print0 | upload_to_google_storage.py -0 - then running download_from_google_storage.py -d -r . in the same folder. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... File build/download_from_google_storage.py (right): https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:19: import Queue On 2013/01/04 17:42:04, cmp wrote: > these should be sorted alphabetically Done. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:20: from optparse import OptionParser On 2013/01/04 17:42:04, cmp wrote: > insert an empty line before line 20 Done. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:21: GSUTIL_DEFAULT_PATH = os.path.join(os.path.dirname(os.path.normpath(__file__)), On 2013/01/04 17:42:04, cmp wrote: > insert an empty line before line 21 Done. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:25: class Gsutil(): On 2013/01/04 17:54:02, szager1 wrote: > I strongly recommend (but will not require) timeout functionality for this > class. Google Storage can be flaky and transfers can hang indefinitely. Refer > to tools/build/scripts/slave/gitzip.py for an example. Done. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:26: def __init__(self, path): On 2013/01/04 17:54:02, szager1 wrote: > Maybe optionally take a path to a credential file? > > def __init__(self, gsutil_path, boto_path=None): > ... > > def call(self, *args): > env = os.environ.copy() > if self.boto_path is not None: > env['AWS_CREDENTIAL_FILE'] = self.boto_path > p = subprocess.Popen(..., env=env) > > Refer to GSUtilSetup() in tools/build/scripts/slave/slave_utils.py Done. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:31: stdout = None On 2013/01/04 17:54:02, szager1 wrote: > This and the next line are no-ops Done. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:33: def call_interactive(self, *args): On 2013/01/04 17:54:02, szager1 wrote: > I recommend renaming this and the next method to call() and check_call(). That > aligns with the convention of the subprocess module, and it's a bit more > intuitive. Done. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:34: p = subprocess.Popen(('python', self.path) + args, stdout=sys.stdout, On 2013/01/04 17:42:04, cmp wrote: > please use sys.executable instead of 'python' since the script may have been > invoked with a particular python and we want to keep using that. also change > line 38 to use sys.executable. Done. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:47: else: On 2013/01/04 17:42:04, cmp wrote: > since you return from the 'code == 0' branch at line 46, you can drop the else > on line 47 and de-indent lines 48-57 by 2 spaces Done. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:60: def check_sha1(sha1_sum, filename): On 2013/01/04 17:54:02, szager1 wrote: > CamelCase Done. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:120: print >>sys.stderr, 'Missing target.' On 2013/01/04 17:54:02, szager1 wrote: > This is cryptic; please add a usage message. Done. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:123: input_filename = args[0] Added error message and TODO https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:128: # 3. The input is a .sha1 file Its done using the upload_to_google_storage.py script. A sha1 file is just a file that contains a ([a-z0-9]{40}) sha1 sum. It doesn't really make sense to create a .sha1 file after download since the sha1 sum has to be known before anything can be downloaded, so its just as easy to go # echo "thesha1sumhere" > somefile.tar.gz.sha1 https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:133: elif check_sha1(input_filename, input_filename): On 2013/01/04 17:54:02, szager1 wrote: > This is the most expensive condition to check, so it should go last, if > possible. Done. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:141: options.output = input_filename.replace('.sha1', '') On 2013/01/04 17:54:02, szager1 wrote: > Maybe I'm paranoid, but I prefer: > > options.output = input_filename[:-5] Done. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:147: print >>sys.stderr, 'Input %s not recognized.' % input_filename On 2013/01/04 17:54:02, szager1 wrote: > Please include the substance of this error message in your usage message. Done. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:158: gsutil = Gsutil(options.gsutil_path) Done by cloning the gsutil instance for each worker thread. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:214: # Have a thread set a flag when all the tasks are done. On 2013/01/04 17:54:02, szager1 wrote: > This is not necessary. You can just join() all the worker threads from the main > thread, then exit. ctrl-c will still work. Done. https://codereview.chromium.org/11664024/diff/2001/build/download_from_google... build/download_from_google_storage.py:221: time.sleep(1) # Do a sleep loop so we can ctrl + c out of this anytime. On 2013/01/04 17:54:02, szager1 wrote: > My previous comment means this clause is unnecessary, but I want to emphasize > something here: no sleep() in multi-processing code, never ever ever. Done. https://codereview.chromium.org/11664024/diff/2001/build/upload_to_google_sto... File build/upload_to_google_storage.py (right): https://codereview.chromium.org/11664024/diff/2001/build/upload_to_google_sto... build/upload_to_google_storage.py:5: """Script to download files from Google Storage.""" On 2013/01/04 17:42:04, cmp wrote: > insert an empty line before line 5 Done. https://codereview.chromium.org/11664024/diff/18001/build/download_from_googl... File build/download_from_google_storage.py (right): https://codereview.chromium.org/11664024/diff/18001/build/download_from_googl... build/download_from_google_storage.py:20: from optparse import OptionParser On 2013/01/10 02:33:08, Marc-Antoine Ruel wrote: > Actually, why import this one differently from the others? Just "import > optparse" like the other stdlibs. Done. https://codereview.chromium.org/11664024/diff/18001/build/download_from_googl... build/download_from_google_storage.py:23: '..', '..', 'third_party', 'gsutil', 'gsutil') Fixed https://codereview.chromium.org/11664024/diff/18001/build/download_from_googl... build/download_from_google_storage.py:28: if os.path.exists(path): On 2013/01/10 02:33:08, Marc-Antoine Ruel wrote: > if not os.path.exists(path): > raise OSError('GSUtil not found in %s' % path) > self.path = path Done. https://codereview.chromium.org/11664024/diff/18001/build/download_from_googl... build/download_from_google_storage.py:41: p = subprocess.Popen((sys.executable, self.path) + args, env=env) On 2013/01/10 02:33:08, Marc-Antoine Ruel wrote: > you want subprocess.call() Done. https://codereview.chromium.org/11664024/diff/18001/build/download_from_googl... build/download_from_google_storage.py:52: def _thread_main(): On 2013/01/10 02:33:08, Marc-Antoine Ruel wrote: > This code is never called? Done. https://codereview.chromium.org/11664024/diff/18001/build/download_from_googl... build/download_from_google_storage.py:83: sha1 = hashlib.sha1() On 2013/01/10 02:33:08, Marc-Antoine Ruel wrote: > FYI, this doesn't work with files > 1.5 gb or so. See the implementation in > swarm_client/ for one that works with larger files. Done. https://codereview.chromium.org/11664024/diff/18001/build/download_from_googl... build/download_from_google_storage.py:138: print >>sys.stderr, 'ERROR: Missing target.' On 2013/01/10 02:33:08, Marc-Antoine Ruel wrote: > parser.error() and line 142 too Done. https://codereview.chromium.org/11664024/diff/18001/build/download_from_googl... build/download_from_google_storage.py:149: # input_filename is a file? This could mean one of three things: On 2013/01/10 02:33:08, Marc-Antoine Ruel wrote: > Remove the guess work and require an argument. > > Guesswork = maintenance nightmare in a few years. Done. https://codereview.chromium.org/11664024/diff/18001/build/download_from_googl... build/download_from_google_storage.py:227: t = threading.Thread(target=_downloader_worker_thread, args=[thread_num, Nope, it'll start 1 thread by default, or 10 threads if "-t 10" is specified, and do 10 at a time.
https://codereview.chromium.org/11664024/diff/27001/build/download_from_googl... File build/download_from_google_storage.py (right): https://codereview.chromium.org/11664024/diff/27001/build/download_from_googl... build/download_from_google_storage.py:21: # TODO(hinoka): This is currently incorrect. Should find a better default. Why is it incorrect. https://codereview.chromium.org/11664024/diff/27001/build/download_from_googl... build/download_from_google_storage.py:23: os.path.dirname(os.path.normpath(__file__)), You want abspath or realpath, not normpath. https://codereview.chromium.org/11664024/diff/27001/build/download_from_googl... build/download_from_google_storage.py:27: class Gsutil(): object https://codereview.chromium.org/11664024/diff/27001/build/download_from_googl... build/download_from_google_storage.py:28: def __init__(self, path, boto_path=None, timeout=None): Are the default values necessary? https://codereview.chromium.org/11664024/diff/27001/build/download_from_googl... build/download_from_google_storage.py:39: env = os.environ.copy() You can make this in the main thread instead of the worker thread. https://codereview.chromium.org/11664024/diff/27001/build/download_from_googl... build/download_from_google_storage.py:42: thr.status = subprocess.call((sys.executable, self.path) + args, env=env) subprocess2.check_call(timeout=) works well and is unit tested. https://codereview.chromium.org/11664024/diff/27001/build/download_from_googl... build/download_from_google_storage.py:102: input_sha1_sum, output_filename = q.get_nowait() Please move the except: up here and move the rest out of the try/except. https://codereview.chromium.org/11664024/diff/27001/build/download_from_googl... build/download_from_google_storage.py:111: print >>sys.stderr, 'File %s for %s does not exist, skipping.' % ( You are outputing from a thread? I'd recommend against that. https://codereview.chromium.org/11664024/diff/27001/build/download_from_googl... build/download_from_google_storage.py:152: help='The target is a file containing a sha1 sum. ' What's the goal of this flag? Is it necessary/useful? https://codereview.chromium.org/11664024/diff/27001/build/download_from_googl... build/download_from_google_storage.py:162: elif len(args) > 1: s/elif/if/ on all of these https://codereview.chromium.org/11664024/diff/27001/build/download_from_googl... build/download_from_google_storage.py:163: # TODO(hinoka): Multi target support. Not necessary. https://codereview.chromium.org/11664024/diff/27001/build/download_from_googl... build/download_from_google_storage.py:189: raise Exception('Unreachable state.') raise NotImplementedError() https://codereview.chromium.org/11664024/diff/27001/build/download_from_googl... build/download_from_google_storage.py:208: # Check if we have permissions to the Google Storage bucket. Since the argument parsing code is non trivial, which includes finding gsutil, I'd like you to split the actual work into another function. You can name it "download_from_google_storage" https://codereview.chromium.org/11664024/diff/27001/build/download_from_googl... build/download_from_google_storage.py:228: if '.svn' in dirs: And .git? https://codereview.chromium.org/11664024/diff/27001/build/download_from_googl... build/download_from_google_storage.py:237: sha1_match = re.search('([A-Za-z0-9]{40})', f.read(1024)) match(r'^...$', ...) ? https://codereview.chromium.org/11664024/diff/27001/build/download_from_googl... build/download_from_google_storage.py:238: if sha1_match: Silently drop otherwise? https://codereview.chromium.org/11664024/diff/27001/build/download_from_googl... build/download_from_google_storage.py:246: # Start up all the worker threads. Why not start up before the enumeration? https://codereview.chromium.org/11664024/diff/27001/build/upload_to_google_st... File build/upload_to_google_storage.py (right): https://codereview.chromium.org/11664024/diff/27001/build/upload_to_google_st... build/upload_to_google_storage.py:42: class Gsutil(): Can you share code? |