Index: third_party/gsutil/boto/docs/source/emr_tut.rst |
diff --git a/third_party/gsutil/boto/docs/source/emr_tut.rst b/third_party/gsutil/boto/docs/source/emr_tut.rst |
new file mode 100644 |
index 0000000000000000000000000000000000000000..996781ee36387c6fd05fcd96215fab8819a2099d |
--- /dev/null |
+++ b/third_party/gsutil/boto/docs/source/emr_tut.rst |
@@ -0,0 +1,108 @@ |
+.. _emr_tut: |
+ |
+===================================================== |
+An Introduction to boto's Elastic MapReduce interface
+===================================================== |
+ |
+This tutorial focuses on the boto interface to Elastic MapReduce from
+Amazon Web Services. It assumes that you have already
+downloaded and installed boto.
+ |
+Creating a Connection |
+--------------------- |
+The first step in accessing Elastic MapReduce is to create a connection
+to the service. There are two ways to do this in boto. The first is:
+ |
+>>> from boto.emr.connection import EmrConnection |
+>>> conn = EmrConnection('<aws access key>', '<aws secret key>') |
+ |
+At this point the variable conn will point to an EmrConnection object.
+In this example, the AWS access key and AWS secret key are passed explicitly
+to the constructor. Alternatively, you can set the following environment variables:
+ |
+* ``AWS_ACCESS_KEY_ID`` - Your AWS Access Key ID
+* ``AWS_SECRET_ACCESS_KEY`` - Your AWS Secret Access Key
+ |
+and then call the constructor without any arguments, like this: |
+ |
+>>> conn = EmrConnection() |
+ |
+There is also a shortcut function in the boto package called connect_emr |
+that may provide a slightly easier means of creating a connection: |
+ |
+>>> import boto |
+>>> conn = boto.connect_emr() |
+ |
+In either case, conn points to an EmrConnection object which we will use |
+throughout the remainder of this tutorial. |
+ |
+Creating Streaming JobFlow Steps |
+-------------------------------- |
+Once you have created a connection to Elastic MapReduce, you will next
+want to create one or more jobflow steps. There are two types of steps: streaming
+and custom jar, each of which has a corresponding class in the boto Elastic MapReduce implementation.
+ |
+Creating a streaming step that runs the AWS wordcount example, itself written in Python, can be accomplished by: |
+ |
+>>> from boto.emr.step import StreamingStep |
+>>> step = StreamingStep(name='My wordcount example', |
+... mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py', |
+... reducer='aggregate', |
+... input='s3n://elasticmapreduce/samples/wordcount/input', |
+... output='s3n://<my output bucket>/output/wordcount_output') |
+ |
+where <my output bucket> is a bucket you have created in S3. |
+ |
+Note that this statement does not run the step; that happens later, when we create a jobflow.
+ |
+Additional arguments of note for the streaming jobflow step are cache_files, cache_archives, and step_args. The cache_files and cache_archives options enable you to use Hadoop's distributed cache to share files amongst the instances that run the step. The step_args argument lets you pass additional arguments to Hadoop streaming, for example modifications to the Hadoop job configuration.
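+
+For instance, a streaming step that distributes a small file via the cache and tweaks the job configuration might look like the following sketch. The bucket, script, and file names below are hypothetical placeholders, and the cache_files entry uses Hadoop's ``source#local_name`` convention:
+
+>>> # NOTE: bucket, script and file names here are placeholders, not real samples
+>>> step = StreamingStep(name='My wordcount example with extra args',
+...     mapper='s3n://<my bucket>/scripts/wordSplitter.py',
+...     reducer='aggregate',
+...     input='s3n://elasticmapreduce/samples/wordcount/input',
+...     output='s3n://<my output bucket>/output/wordcount_output',
+...     cache_files=['s3n://<my bucket>/data/stopwords.txt#stopwords.txt'],
+...     step_args=['-jobconf', 'mapred.reduce.tasks=2'])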
+ |
+Creating Custom Jar Job Flow Steps |
+---------------------------------- |
+ |
+The second type of jobflow step executes tasks written with a custom jar. Creating a custom jar step for the AWS CloudBurst example can be accomplished by: |
+ |
+>>> from boto.emr.step import JarStep |
+>>> step = JarStep(name='Cloudburst example',
+... jar='s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar', |
+... step_args=['s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br', |
+... 's3n://elasticmapreduce/samples/cloudburst/input/100k.br', |
+... 's3n://<my output bucket>/output/cloudburst_output',
+... 36, 3, 0, 1, 240, 48, 24, 24, 128, 16]) |
+ |
+Note that this statement does not actually run the step; that happens later, when we create a jobflow. Also note that this JarStep does not include a main_class argument, since the jar's MANIFEST.MF has a Main-Class entry.
+ |
+Creating JobFlows |
+----------------- |
+Once you have created one or more jobflow steps, you will next want to create and run a jobflow. Creating a jobflow that executes either of the steps we created above can be accomplished by: |
+ |
+>>> import boto |
+>>> conn = boto.connect_emr() |
+>>> jobid = conn.run_jobflow(name='My jobflow', |
+... log_uri='s3://<my log uri>/jobflow_logs', |
+... steps=[step]) |
+ |
+The method does not block until the jobflow completes; it returns immediately. The status of the jobflow can be determined by:
+ |
+>>> status = conn.describe_jobflow(jobid) |
+>>> status.state |
+u'STARTING' |
+ |
+You can poll this state to wait for a jobflow to complete. Valid jobflow states currently defined in the AWS API are COMPLETED, FAILED, TERMINATED, RUNNING, SHUTTING_DOWN, STARTING, and WAITING.
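+
+For example, a simple way to wait is to poll describe_jobflow until the state becomes terminal. This is just a minimal sketch; a real script would likely also handle API throttling and report progress:
+
+>>> import time
+>>> # keep polling until the jobflow reaches a terminal state (sketch only)
+>>> while conn.describe_jobflow(jobid).state not in (u'COMPLETED', u'FAILED',
+...                                                  u'TERMINATED'):
+...     time.sleep(30)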
+ |
+In some cases you may not have built all of the steps before running the jobflow. Additional steps can be added to an existing jobflow by running:
+ |
+>>> conn.add_jobflow_steps(jobid, [second_step]) |
+ |
+If you wish to add additional steps to a running jobflow, you may want to set the keep_alive parameter to True in run_jobflow so that the jobflow does not automatically terminate when the first step completes.
+ |
+The run_jobflow method has a number of important parameters that are worth investigating. They include parameters to change the number and type of EC2 instances on which the jobflow is executed, set an SSH key for manual debugging, and enable AWS console debugging.
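+
+A sketch of such a call is shown below. The keyword names used here (ec2_keyname, num_instances, master_instance_type, slave_instance_type, enable_debugging, keep_alive) reflect run_jobflow in the boto version this tutorial was written against; check the signature in your installed version before relying on them:
+
+>>> # keyword names may differ between boto versions; verify against your install
+>>> jobid = conn.run_jobflow(name='My bigger jobflow',
+...     log_uri='s3://<my log uri>/jobflow_logs',
+...     ec2_keyname='<my ec2 keypair name>',
+...     num_instances=4,
+...     master_instance_type='m1.small',
+...     slave_instance_type='m1.small',
+...     enable_debugging=True,
+...     keep_alive=True,
+...     steps=[step])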
+ |
+Terminating JobFlows |
+-------------------- |
+By default, a jobflow terminates when all of its steps have finished or failed. However, if you set the keep_alive parameter to True, or simply want to halt the execution of a jobflow early, you can terminate it by:
+ |
+>>> import boto |
+>>> conn = boto.connect_emr() |
+>>> conn.terminate_jobflow('<jobflow id>') |
+ |