Index: third_party/gsutil/boto/docs/source/emr_tut.rst |
diff --git a/third_party/gsutil/boto/docs/source/emr_tut.rst b/third_party/gsutil/boto/docs/source/emr_tut.rst |
new file mode 100644 |
index 0000000000000000000000000000000000000000..996781ee36387c6fd05fcd96215fab8819a2099d |
--- /dev/null |
+++ b/third_party/gsutil/boto/docs/source/emr_tut.rst |
@@ -0,0 +1,108 @@ |
+.. _emr_tut: |
+ |
+===================================================== |
+An Introduction to boto's Elastic MapReduce interface
+===================================================== |
+ |
+This tutorial focuses on the boto interface to Elastic MapReduce from
+Amazon Web Services. It assumes that you have already
+downloaded and installed boto.
+ |
+Creating a Connection |
+--------------------- |
+The first step in accessing Elastic MapReduce is to create a connection
+to the service. There are two ways to do this in boto. The first is:
+ |
+>>> from boto.emr.connection import EmrConnection |
+>>> conn = EmrConnection('<aws access key>', '<aws secret key>') |
+ |
+At this point the variable conn will point to an EmrConnection object.
+In this example, the AWS access key and AWS secret key are passed explicitly
+to the constructor. Alternatively, you can set the following environment variables:
+ |
+* ``AWS_ACCESS_KEY_ID`` - Your AWS Access Key ID
+* ``AWS_SECRET_ACCESS_KEY`` - Your AWS Secret Access Key
+ |
+and then call the constructor without any arguments, like this: |
+ |
+>>> conn = EmrConnection() |
+ |
+There is also a shortcut function in the boto package called connect_emr |
+that may provide a slightly easier means of creating a connection: |
+ |
+>>> import boto |
+>>> conn = boto.connect_emr() |
+ |
+In either case, conn points to an EmrConnection object which we will use |
+throughout the remainder of this tutorial. |
+ |
+Creating Streaming JobFlow Steps |
+-------------------------------- |
+Once you have created a connection to Elastic MapReduce, you will next
+want to create one or more jobflow steps. There are two types of steps: streaming
+and custom jar, each of which has a corresponding class in the boto Elastic MapReduce implementation.
+ |
+Creating a streaming step that runs the AWS wordcount example, itself written in Python, can be accomplished by: |
+ |
+>>> from boto.emr.step import StreamingStep |
+>>> step = StreamingStep(name='My wordcount example', |
+... mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py', |
+... reducer='aggregate', |
+... input='s3n://elasticmapreduce/samples/wordcount/input', |
+... output='s3n://<my output bucket>/output/wordcount_output') |
+ |
+where <my output bucket> is a bucket you have created in S3. |
+ |
+Note that this statement does not run the step; that happens later, when we create a jobflow.
+ |
+Additional arguments of note for the streaming jobflow step are cache_files, cache_archives, and step_args. The cache_files and cache_archives options enable you to use Hadoop's distributed cache to share files amongst the instances that run the step. The step_args argument lets you pass additional arguments to Hadoop streaming, for example modifications to the Hadoop job configuration.
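+
+For instance, a streaming step that distributes a small file via the cache and tweaks the job configuration might look like the following sketch. The bucket, script, and file names below are hypothetical placeholders, and the cache_files entry uses Hadoop's ``source#local_name`` convention:
+
+>>> # NOTE: bucket, script and file names here are placeholders, not real samples
+>>> step = StreamingStep(name='My wordcount example with extra args',
+...     mapper='s3n://<my bucket>/scripts/wordSplitter.py',
+...     reducer='aggregate',
+...     input='s3n://elasticmapreduce/samples/wordcount/input',
+...     output='s3n://<my output bucket>/output/wordcount_output',
+...     cache_files=['s3n://<my bucket>/data/stopwords.txt#stopwords.txt'],
+...     step_args=['-jobconf', 'mapred.reduce.tasks=2'])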
+ |
+Creating Custom Jar Job Flow Steps |
+---------------------------------- |
+ |
+The second type of jobflow step executes tasks written with a custom jar. Creating a custom jar step for the AWS CloudBurst example can be accomplished by: |
+ |
+>>> from boto.emr.step import JarStep |
+>>> step = JarStep(name='Cloudburst example',
+... jar='s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar', |
+... step_args=['s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br', |
+... 's3n://elasticmapreduce/samples/cloudburst/input/100k.br', |
+... 's3n://<my output bucket>/output/cloudburst_output',
+... 36, 3, 0, 1, 240, 48, 24, 24, 128, 16]) |
+ |
+Note that this statement does not actually run the step; that happens later, when we create a jobflow. Also note that this JarStep does not include a main_class argument, since the jar's MANIFEST.MF has a Main-Class entry.
+ |
+Creating JobFlows |
+----------------- |
+Once you have created one or more jobflow steps, you will next want to create and run a jobflow. Creating a jobflow that executes either of the steps we created above can be accomplished by: |
+ |
+>>> import boto |
+>>> conn = boto.connect_emr() |
+>>> jobid = conn.run_jobflow(name='My jobflow', |
+... log_uri='s3://<my log uri>/jobflow_logs', |
+... steps=[step]) |
+ |
+The method does not block until the jobflow completes; it returns immediately. The status of the jobflow can be determined by:
+ |
+>>> status = conn.describe_jobflow(jobid) |
+>>> status.state |
+u'STARTING' |
+ |
+You can poll this state to wait for a jobflow to complete. Valid jobflow states currently defined in the AWS API are COMPLETED, FAILED, TERMINATED, RUNNING, SHUTTING_DOWN, STARTING, and WAITING.
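+
+For example, a simple way to wait is to poll describe_jobflow until the state becomes terminal. This is just a minimal sketch; a real script would likely also handle API throttling and report progress:
+
+>>> import time
+>>> # keep polling until the jobflow reaches a terminal state (sketch only)
+>>> while conn.describe_jobflow(jobid).state not in (u'COMPLETED', u'FAILED',
+...                                                  u'TERMINATED'):
+...     time.sleep(30)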
+ |
+In some cases you may not have built all of the steps before running the jobflow. Additional steps can be added to an existing jobflow by running:
+ |
+>>> conn.add_jobflow_steps(jobid, [second_step]) |
+ |
+If you wish to add additional steps to a running jobflow, you may want to set the keep_alive parameter to True in run_jobflow so that the jobflow does not automatically terminate when the first step completes.
+ |
+The run_jobflow method has a number of important parameters that are worth investigating. They include parameters to change the number and type of EC2 instances on which the jobflow is executed, set an SSH key for manual debugging, and enable AWS console debugging.
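+
+A sketch of such a call is shown below. The keyword names used here (ec2_keyname, num_instances, master_instance_type, slave_instance_type, enable_debugging, keep_alive) reflect run_jobflow in the boto version this tutorial was written against; check the signature in your installed version before relying on them:
+
+>>> # keyword names may differ between boto versions; verify against your install
+>>> jobid = conn.run_jobflow(name='My bigger jobflow',
+...     log_uri='s3://<my log uri>/jobflow_logs',
+...     ec2_keyname='<my ec2 keypair name>',
+...     num_instances=4,
+...     master_instance_type='m1.small',
+...     slave_instance_type='m1.small',
+...     enable_debugging=True,
+...     keep_alive=True,
+...     steps=[step])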
+ |
+Terminating JobFlows |
+-------------------- |
+By default, a jobflow terminates when all of its steps have finished or failed. However, if you set the keep_alive parameter to True, or simply want to halt the execution of a jobflow early, you can terminate it by:
+ |
+>>> import boto |
+>>> conn = boto.connect_emr() |
+>>> conn.terminate_jobflow('<jobflow id>') |
+ |