.. _emr_tut:

=====================================================
An Introduction to boto's Elastic MapReduce interface
=====================================================

This tutorial focuses on the boto interface to Elastic MapReduce from
Amazon Web Services.  This tutorial assumes that you have already
downloaded and installed boto.

Creating a Connection
---------------------
The first step in accessing Elastic MapReduce is to create a connection
to the service.  There are two ways to do this in boto.  The first is:

>>> from boto.emr.connection import EmrConnection
>>> conn = EmrConnection('<aws access key>', '<aws secret key>')

At this point the variable conn will point to an EmrConnection object.
In this example, the AWS access key and AWS secret key are passed to
the constructor explicitly.  Alternatively, you can set the environment variables:

* AWS_ACCESS_KEY_ID - Your AWS Access Key ID
* AWS_SECRET_ACCESS_KEY - Your AWS Secret Access Key

and then call the constructor without any arguments, like this:

>>> conn = EmrConnection()

There is also a shortcut function in the boto package called connect_emr
that may provide a slightly easier means of creating a connection:

>>> import boto
>>> conn = boto.connect_emr()

In either case, conn will point to an EmrConnection object which we will
use throughout the remainder of this tutorial.

Creating Streaming JobFlow Steps
--------------------------------
Upon creating a connection to Elastic MapReduce you will next
want to create one or more jobflow steps.  There are two types of steps, streaming
and custom jar, both of which have a class in the boto Elastic MapReduce implementation.

Creating a streaming step that runs the AWS wordcount example, itself written in Python, can be accomplished by:

>>> from boto.emr.step import StreamingStep
>>> step = StreamingStep(name='My wordcount example',
...                      mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py',
...                      reducer='aggregate',
...                      input='s3n://elasticmapreduce/samples/wordcount/input',
...                      output='s3n://<my output bucket>/output/wordcount_output')

where <my output bucket> is a bucket you have created in S3.

Note that this statement does not run the step; that is accomplished later when we create a jobflow.

Additional arguments of note to the streaming jobflow step are cache_files, cache_archives and step_args.  The options cache_files and cache_archives enable you to use Hadoop's distributed cache to share files amongst the instances that run the step.  The argument step_args allows one to pass additional arguments to Hadoop streaming, for example modifications to the Hadoop job configuration.
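
For example, a streaming step that ships a supporting file to every instance through the distributed cache and passes an extra configuration option to Hadoop streaming might look like the sketch below.  The bucket, the helper files and the -jobconf override are illustrative placeholders rather than part of the AWS samples, and the exact form of the configuration flag can vary with the Hadoop version:

>>> from boto.emr.step import StreamingStep
>>> step = StreamingStep(name='Wordcount with a cached stopword list',
...                      mapper='s3n://<my bucket>/scripts/my_mapper.py',
...                      reducer='aggregate',
...                      # make stopwords.txt available next to each task as 'stopwords.txt'
...                      cache_files=['s3n://<my bucket>/scripts/stopwords.txt#stopwords.txt'],
...                      # extra arguments handed straight to Hadoop streaming
...                      step_args=['-jobconf', 'mapred.reduce.tasks=4'],
...                      input='s3n://elasticmapreduce/samples/wordcount/input',
...                      output='s3n://<my bucket>/output/wordcount_output')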

Creating Custom Jar Job Flow Steps
----------------------------------

The second type of jobflow step executes tasks written with a custom jar.  Creating a custom jar step for the AWS CloudBurst example can be accomplished by:

>>> from boto.emr.step import JarStep
>>> step = JarStep(name='Cloudburst example',
...                jar='s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar',
...                step_args=['s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br',
...                           's3n://elasticmapreduce/samples/cloudburst/input/100k.br',
...                           's3n://<my output bucket>/output/cloudfront_output',
...                           36, 3, 0, 1, 240, 48, 24, 24, 128, 16])

Note that this statement does not actually run the step; that is accomplished later when we create a jobflow.  Also note that this JarStep does not include a main_class argument since the jar MANIFEST.MF has a Main-Class entry.
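
If your own jar does not declare a Main-Class entry in its manifest, you would name the class to run via the main_class argument instead.  The following is only a sketch; the jar location, class name and step arguments are hypothetical:

>>> from boto.emr.step import JarStep
>>> step = JarStep(name='My custom jar example',
...                jar='s3n://<my bucket>/jars/my-job.jar',
...                # needed here because the jar's manifest has no Main-Class entry
...                main_class='com.example.MyJob',
...                step_args=['s3n://<my bucket>/input',
...                           's3n://<my bucket>/output'])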

Creating JobFlows
-----------------
Once you have created one or more jobflow steps, you will next want to create and run a jobflow.  Creating a jobflow that executes either of the steps we created above can be accomplished by:

>>> import boto
>>> conn = boto.connect_emr()
>>> jobid = conn.run_jobflow(name='My jobflow',
...                          log_uri='s3://<my log uri>/jobflow_logs',
...                          steps=[step])

The method will not block for the completion of the jobflow, but will return immediately.  The status of the jobflow can be determined by:

>>> status = conn.describe_jobflow(jobid)
>>> status.state
u'STARTING'

One can then use this state to block until the jobflow completes, for example by polling as shown below.  Valid jobflow states currently defined in the AWS API are COMPLETED, FAILED, TERMINATED, RUNNING, SHUTTING_DOWN, STARTING and WAITING.
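
A minimal way to wait is to re-query the jobflow until it reaches one of the terminal states.  The sketch below reuses the conn and jobid variables from above; the 30 second polling interval is an arbitrary choice:

>>> import time
>>> while True:
...     state = conn.describe_jobflow(jobid).state
...     if state in (u'COMPLETED', u'FAILED', u'TERMINATED'):
...         break
...     time.sleep(30)   # wait a little before asking the API again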

In some cases you may not have built all of the steps prior to running the jobflow.  In these cases additional steps can be added to a jobflow by running:

>>> conn.add_jobflow_steps(jobid, [second_step])

If you wish to add additional steps to a running jobflow, you may want to set the keep_alive parameter to True in run_jobflow so that the jobflow does not automatically terminate when the first step completes.
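
For example, a long-lived jobflow that starts with a single step and is fed more work later might be created along these lines.  This is only a sketch; second_step is assumed to be another StreamingStep or JarStep that you construct later:

>>> jobid = conn.run_jobflow(name='My long-lived jobflow',
...                          log_uri='s3://<my log uri>/jobflow_logs',
...                          keep_alive=True,
...                          steps=[step])
>>> # ... some time later, once second_step has been built ...
>>> conn.add_jobflow_steps(jobid, [second_step])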

The run_jobflow method has a number of important parameters that are worth investigating.  They include parameters to change the number and type of EC2 instances on which the jobflow is executed, set an SSH key for manual debugging and enable AWS console debugging.
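
As a rough sketch, a jobflow that runs on a larger cluster, keeps an SSH key available for manual debugging and turns on console debugging might be created like this.  The instance types and count are only illustrative, and ec2_keyname must name a key pair you have already created in EC2:

>>> jobid = conn.run_jobflow(name='My bigger jobflow',
...                          log_uri='s3://<my log uri>/jobflow_logs',
...                          ec2_keyname='my-keypair',
...                          num_instances=5,
...                          master_instance_type='m1.small',
...                          slave_instance_type='m1.large',
...                          enable_debugging=True,
...                          steps=[step])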

Terminating JobFlows
--------------------
By default, when all the steps of a jobflow have finished or failed the jobflow terminates.  However, if you set the keep_alive parameter to True, or simply want to halt the execution of a jobflow early, you can terminate a jobflow by:

>>> import boto
>>> conn = boto.connect_emr()
>>> conn.terminate_jobflow('<jobflow id>')