.. _emr_tut:

=====================================================
An Introduction to boto's Elastic MapReduce interface
=====================================================

This tutorial focuses on the boto interface to Elastic MapReduce from
Amazon Web Services.  This tutorial assumes that you have already
downloaded and installed boto.

Creating a Connection
---------------------
The first step in accessing Elastic MapReduce is to create a connection
to the service.  There are two ways to do this in boto.  The first is:

>>> from boto.emr.connection import EmrConnection
>>> conn = EmrConnection('<aws access key>', '<aws secret key>')

At this point the variable conn will point to an EmrConnection object.
In this example, the AWS access key and AWS secret key are passed to
the constructor explicitly.  Alternatively, you can set the environment variables:

* AWS_ACCESS_KEY_ID - Your AWS Access Key ID
* AWS_SECRET_ACCESS_KEY - Your AWS Secret Access Key

and then call the constructor without any arguments, like this:

>>> conn = EmrConnection()

There is also a shortcut function in the boto package called connect_emr
that may provide a slightly easier means of creating a connection:

>>> import boto
>>> conn = boto.connect_emr()

In either case, conn will point to an EmrConnection object which we will
use throughout the remainder of this tutorial.

Creating Streaming JobFlow Steps
--------------------------------
Upon creating a connection to Elastic MapReduce you will next
want to create one or more jobflow steps.  There are two types of steps, streaming
and custom jar, both of which have a class in the boto Elastic MapReduce implementation.

Creating a streaming step that runs the AWS wordcount example, itself written in Python, can be accomplished by:

>>> from boto.emr.step import StreamingStep
>>> step = StreamingStep(name='My wordcount example',
...                      mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py',
...                      reducer='aggregate',
...                      input='s3n://elasticmapreduce/samples/wordcount/input',
...                      output='s3n://<my output bucket>/output/wordcount_output')

where <my output bucket> is a bucket you have created in S3.

Note that this statement does not run the step; that is accomplished later when we create a jobflow.

Additional arguments of note to the streaming jobflow step are cache_files, cache_archives and step_args.  The options cache_files and cache_archives enable you to use Hadoop's distributed cache to share files amongst the instances that run the step.  The argument step_args allows one to pass additional arguments to Hadoop streaming, for example modifications to the Hadoop job configuration.
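
For example, a streaming step that ships a supporting file to every instance through the distributed cache and passes an extra configuration option to Hadoop streaming might look like the sketch below.  The bucket, the helper files and the -jobconf override are illustrative placeholders rather than part of the AWS samples, and the exact form of the configuration flag can vary with the Hadoop version:

>>> from boto.emr.step import StreamingStep
>>> step = StreamingStep(name='Wordcount with a cached stopword list',
...                      mapper='s3n://<my bucket>/scripts/my_mapper.py',
...                      reducer='aggregate',
...                      # make stopwords.txt available next to each task as 'stopwords.txt'
...                      cache_files=['s3n://<my bucket>/scripts/stopwords.txt#stopwords.txt'],
...                      # extra arguments handed straight to Hadoop streaming
...                      step_args=['-jobconf', 'mapred.reduce.tasks=4'],
...                      input='s3n://elasticmapreduce/samples/wordcount/input',
...                      output='s3n://<my bucket>/output/wordcount_output')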

Creating Custom Jar Job Flow Steps
----------------------------------

The second type of jobflow step executes tasks written with a custom jar.  Creating a custom jar step for the AWS CloudBurst example can be accomplished by:

>>> from boto.emr.step import JarStep
>>> step = JarStep(name='Cloudburst example',
...                jar='s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar',
...                step_args=['s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br',
...                           's3n://elasticmapreduce/samples/cloudburst/input/100k.br',
...                           's3n://<my output bucket>/output/cloudfront_output',
...                           36, 3, 0, 1, 240, 48, 24, 24, 128, 16])

Note that this statement does not actually run the step; that is accomplished later when we create a jobflow.  Also note that this JarStep does not include a main_class argument since the jar MANIFEST.MF has a Main-Class entry.
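
If your own jar does not declare a Main-Class entry in its manifest, you would name the class to run via the main_class argument instead.  The following is only a sketch; the jar location, class name and step arguments are hypothetical:

>>> from boto.emr.step import JarStep
>>> step = JarStep(name='My custom jar example',
...                jar='s3n://<my bucket>/jars/my-job.jar',
...                # needed here because the jar's manifest has no Main-Class entry
...                main_class='com.example.MyJob',
...                step_args=['s3n://<my bucket>/input',
...                           's3n://<my bucket>/output'])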

Creating JobFlows
-----------------
Once you have created one or more jobflow steps, you will next want to create and run a jobflow.  Creating a jobflow that executes either of the steps we created above can be accomplished by:

>>> import boto
>>> conn = boto.connect_emr()
>>> jobid = conn.run_jobflow(name='My jobflow',
...                          log_uri='s3://<my log uri>/jobflow_logs',
...                          steps=[step])

The method will not block for the completion of the jobflow, but will return immediately.  The status of the jobflow can be determined by:

>>> status = conn.describe_jobflow(jobid)
>>> status.state
u'STARTING'

One can then use this state to block until the jobflow completes, for example by polling as shown below.  Valid jobflow states currently defined in the AWS API are COMPLETED, FAILED, TERMINATED, RUNNING, SHUTTING_DOWN, STARTING and WAITING.
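
A minimal way to wait is to re-query the jobflow until it reaches one of the terminal states.  The sketch below reuses the conn and jobid variables from above; the 30 second polling interval is an arbitrary choice:

>>> import time
>>> while True:
...     state = conn.describe_jobflow(jobid).state
...     if state in (u'COMPLETED', u'FAILED', u'TERMINATED'):
...         break
...     time.sleep(30)   # wait a little before asking the API again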

In some cases you may not have built all of the steps prior to running the jobflow.  In these cases additional steps can be added to a jobflow by running:

>>> conn.add_jobflow_steps(jobid, [second_step])

If you wish to add additional steps to a running jobflow, you may want to set the keep_alive parameter to True in run_jobflow so that the jobflow does not automatically terminate when the first step completes.
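
For example, a long-lived jobflow that starts with a single step and is fed more work later might be created along these lines.  This is only a sketch; second_step is assumed to be another StreamingStep or JarStep that you construct later:

>>> jobid = conn.run_jobflow(name='My long-lived jobflow',
...                          log_uri='s3://<my log uri>/jobflow_logs',
...                          keep_alive=True,
...                          steps=[step])
>>> # ... some time later, once second_step has been built ...
>>> conn.add_jobflow_steps(jobid, [second_step])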

The run_jobflow method has a number of important parameters that are worth investigating.  They include parameters to change the number and type of EC2 instances on which the jobflow is executed, set an SSH key for manual debugging and enable AWS console debugging.
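
As a rough sketch, a jobflow that runs on a larger cluster, keeps an SSH key available for manual debugging and turns on console debugging might be created like this.  The instance types and count are only illustrative, and ec2_keyname must name a key pair you have already created in EC2:

>>> jobid = conn.run_jobflow(name='My bigger jobflow',
...                          log_uri='s3://<my log uri>/jobflow_logs',
...                          ec2_keyname='my-keypair',
...                          num_instances=5,
...                          master_instance_type='m1.small',
...                          slave_instance_type='m1.large',
...                          enable_debugging=True,
...                          steps=[step])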

Terminating JobFlows
--------------------
By default, when all the steps of a jobflow have finished or failed the jobflow terminates.  However, if you set the keep_alive parameter to True, or simply want to halt the execution of a jobflow early, you can terminate a jobflow by:

>>> import boto
>>> conn = boto.connect_emr()
>>> conn.terminate_jobflow('<jobflow id>')