Chromium Code Reviews

Unified Diff: third_party/gsutil/boto/docs/source/emr_tut.rst

Issue 12042069: Scripts to download files from google storage based on sha1 sums (Closed) Base URL: https://chromium.googlesource.com/chromium/tools/depot_tools.git@master
Patch Set: Review fixes, updated gsutil Created 7 years, 10 months ago
Index: third_party/gsutil/boto/docs/source/emr_tut.rst
diff --git a/third_party/gsutil/boto/docs/source/emr_tut.rst b/third_party/gsutil/boto/docs/source/emr_tut.rst
new file mode 100644
index 0000000000000000000000000000000000000000..996781ee36387c6fd05fcd96215fab8819a2099d
--- /dev/null
+++ b/third_party/gsutil/boto/docs/source/emr_tut.rst
@@ -0,0 +1,108 @@
+.. _emr_tut:
+
+=====================================================
+An Introduction to boto's Elastic MapReduce interface
+=====================================================
+
+This tutorial focuses on the boto interface to Elastic MapReduce from
+Amazon Web Services. This tutorial assumes that you have already
+downloaded and installed boto.
+
+Creating a Connection
+---------------------
+The first step in accessing Elastic MapReduce is to create a connection
+to the service. There are two ways to do this in boto. The first is:
+
+>>> from boto.emr.connection import EmrConnection
+>>> conn = EmrConnection('<aws access key>', '<aws secret key>')
+
+At this point the variable conn will point to an EmrConnection object.
+In this example, the AWS access key and AWS secret key are passed to the
+constructor explicitly. Alternatively, you can set the environment variables:
+
+* AWS_ACCESS_KEY_ID - Your AWS Access Key ID
+* AWS_SECRET_ACCESS_KEY - Your AWS Secret Access Key
+
+and then call the constructor without any arguments, like this:
+
+>>> conn = EmrConnection()
+
+There is also a shortcut function in the boto package called connect_emr
+that may provide a slightly easier means of creating a connection:
+
+>>> import boto
+>>> conn = boto.connect_emr()
+
+In either case, conn points to an EmrConnection object which we will use
+throughout the remainder of this tutorial.
+
+Creating Streaming JobFlow Steps
+--------------------------------
+Upon creating a connection to Elastic MapReduce, you will next
+want to create one or more jobflow steps. There are two types of steps, streaming
+and custom jar, both of which have a class in the boto Elastic MapReduce implementation.
+
+Creating a streaming step that runs the AWS wordcount example, itself written in Python, can be accomplished by:
+
+>>> from boto.emr.step import StreamingStep
+>>> step = StreamingStep(name='My wordcount example',
+... mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py',
+... reducer='aggregate',
+... input='s3n://elasticmapreduce/samples/wordcount/input',
+... output='s3n://<my output bucket>/output/wordcount_output')
+
+where <my output bucket> is a bucket you have created in S3.
+
+Note that this statement does not run the step; that is accomplished later when we create a jobflow.
+
+Additional arguments of note to the streaming jobflow step are cache_files, cache_archives and step_args. The options cache_files and cache_archives enable you to use Hadoop's distributed cache to share files amongst the instances that run the step. The argument step_args allows one to pass additional arguments to Hadoop streaming, for example modifications to the Hadoop job configuration.
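+
+For example, a streaming step that ships a helper script through the
+distributed cache and passes an extra Hadoop setting via step_args might
+look like the following (the helpers.py script and the mapred.reduce.tasks
+value are purely illustrative):
+
+>>> step = StreamingStep(name='My wordcount example with extras',
+...                      mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py',
+...                      reducer='aggregate',
+...                      cache_files=['s3n://<my bucket>/helpers.py#helpers.py'],
+...                      step_args=['-jobconf', 'mapred.reduce.tasks=4'],
+...                      input='s3n://elasticmapreduce/samples/wordcount/input',
+...                      output='s3n://<my output bucket>/output/wordcount_output')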
+
+Creating Custom Jar Job Flow Steps
+----------------------------------
+
+The second type of jobflow step executes tasks written with a custom jar. Creating a custom jar step for the AWS CloudBurst example can be accomplished by:
+
+>>> from boto.emr.step import JarStep
+>>> step = JarStep(name='Cloudburst example',
+... jar='s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar',
+... step_args=['s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br',
+... 's3n://elasticmapreduce/samples/cloudburst/input/100k.br',
+...                's3n://<my output bucket>/output/cloudburst_output',
+... 36, 3, 0, 1, 240, 48, 24, 24, 128, 16])
+
+Note that this statement does not actually run the step; that is accomplished later when we create a jobflow. Also note that this JarStep does not include a main_class argument since the jar's MANIFEST.MF has a Main-Class entry.
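+
+If a jar's manifest does not declare a Main-Class, the entry point can
+instead be supplied through the main_class argument; for example (the jar
+path and class name below are only placeholders):
+
+>>> step = JarStep(name='My custom jar example',
+...                jar='s3n://<my bucket>/jars/my-job.jar',
+...                main_class='com.example.MyJob',
+...                step_args=['s3n://<my input bucket>/input',
+...                           's3n://<my output bucket>/output'])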
+
+Creating JobFlows
+-----------------
+Once you have created one or more jobflow steps, you will next want to create and run a jobflow. Creating a jobflow that executes either of the steps we created above can be accomplished by:
+
+>>> import boto
+>>> conn = boto.connect_emr()
+>>> jobid = conn.run_jobflow(name='My jobflow',
+... log_uri='s3://<my log uri>/jobflow_logs',
+... steps=[step])
+
+The method does not block until the jobflow completes; it returns immediately. The status of the jobflow can be determined by:
+
+>>> status = conn.describe_jobflow(jobid)
+>>> status.state
+u'STARTING'
+
+One can then use this state to wait for the jobflow to complete. Valid jobflow states currently defined in the AWS API are COMPLETED, FAILED, TERMINATED, RUNNING, SHUTTING_DOWN, STARTING and WAITING.
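+
+For example, a simple way to wait is to poll describe_jobflow until a
+terminal state is reached (the thirty-second polling interval below is
+arbitrary):
+
+>>> import time
+>>> done_states = (u'COMPLETED', u'FAILED', u'TERMINATED')
+>>> while conn.describe_jobflow(jobid).state not in done_states:
+...     time.sleep(30)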
+
+In some cases you may not have built all of the steps prior to running the jobflow. In these cases additional steps can be added to a jobflow by running:
+
+>>> conn.add_jobflow_steps(jobid, [second_step])
+
+If you wish to add additional steps to a running jobflow, you may want to set the keep_alive parameter to True in run_jobflow so that the jobflow does not automatically terminate when the first step completes.
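+
+Such a long-lived jobflow might be started with, for example:
+
+>>> jobid = conn.run_jobflow(name='My long-lived jobflow',
+...                          log_uri='s3://<my log uri>/jobflow_logs',
+...                          keep_alive=True,
+...                          steps=[step])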
+
+The run_jobflow method has a number of important parameters that are worth investigating. They include parameters to change the number and type of EC2 instances on which the jobflow is executed, set an SSH key for manual debugging, and enable AWS console debugging.
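+
+For example, a jobflow that runs on several instances, with an SSH key pair
+for manual debugging and AWS console debugging enabled, might be started
+like this (the instance count, instance types and key pair name below are
+placeholders):
+
+>>> jobid = conn.run_jobflow(name='My larger jobflow',
+...                          log_uri='s3://<my log uri>/jobflow_logs',
+...                          ec2_keyname='<my ec2 key pair name>',
+...                          num_instances=4,
+...                          master_instance_type='m1.small',
+...                          slave_instance_type='m1.small',
+...                          enable_debugging=True,
+...                          steps=[step])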
+
+Terminating JobFlows
+--------------------
+By default, when all the steps of a jobflow have finished or failed, the jobflow terminates. However, if you have set the keep_alive parameter to True, or just want to halt the execution of a jobflow early, you can terminate it by:
+
+>>> import boto
+>>> conn = boto.connect_emr()
+>>> conn.terminate_jobflow('<jobflow id>')
+
