Following 'Infrastructure as Code' principles, this is a complete, working project that deploys an EMR cluster and runs a Hive script on it.
It is based on the 'Analyze Big Data with Hadoop' project from the AWS 'Learn to Build' section.
Introduction
It's important to describe your infrastructure as code, and Terraform can help us with that.
Authentication
Don't forget to create a variables.tf file in your project root directory, where you should set 3 variables:

variable "region" {
  default = "us-east-2"
}

variable "access_key" {
  default = "JFSKLGD8...UFDJKGJS"
}

variable "secret_key" {
  default = "sdfs8d9fgEG33VE...343rVFDV3vdfevr"
}
I'd like to share this experience because I faced a couple of non-trivial issues along the way.

1. First, we configure the AWS provider with these variables:

provider "aws" {
  access_key = "${var.access_key}"
  secret_key = "${var.secret_key}"
  region     = "${var.region}"
}
2. Here, we should create an S3 bucket and an EC2 key pair. Both are quite simple and straightforward steps, which are described
in s3_bucket.tf and key_pair.tf respectively.
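As a rough sketch, the two resources might look like this (the bucket name, key name, and public key path are illustrative, not the exact ones from the project):

```hcl
# S3 bucket for EMR logs; name must be globally unique (placeholder here).
resource "aws_s3_bucket" "logs" {
  bucket        = "my-emr-logs-bucket"
  force_destroy = true
}

# EC2 key pair for SSH access to the cluster nodes.
resource "aws_key_pair" "emr" {
  key_name   = "emr-key"
  public_key = "${file("~/.ssh/id_rsa.pub")}"
}
```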
3. Creating an EMR cluster via the console takes 5-7 clicks, choosing a couple of options and leaving the rest at their
defaults. It looks as easy as apple pie, but in fact a lot of actions are happening behind the scenes, so we have to take care
of the roles and policies for EMR and its EC2 instances ourselves. For each of them, we have to create 2 data objects
(aws_iam_policy and aws_iam_policy_document) and 2 resources
(aws_iam_role_policy_attachment and aws_iam_role). These roles live in the roles.tf module.
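To make this concrete, here is a minimal sketch of this pattern for the EMR service role (resource names are illustrative; the EC2 role follows the same shape with ec2.amazonaws.com as the principal and the AmazonElasticMapReduceforEC2Role managed policy):

```hcl
# Trust policy allowing the EMR service to assume the role.
data "aws_iam_policy_document" "emr_assume_role" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["elasticmapreduce.amazonaws.com"]
    }
  }
}

# AWS-managed default policy for the EMR service role.
data "aws_iam_policy" "emr_service_policy" {
  arn = "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole"
}

resource "aws_iam_role" "emr_service_role" {
  name               = "emr-service-role"
  assume_role_policy = "${data.aws_iam_policy_document.emr_assume_role.json}"
}

# Attach the managed policy to the role.
resource "aws_iam_role_policy_attachment" "emr_service_attach" {
  role       = "${aws_iam_role.emr_service_role.name}"
  policy_arn = "${data.aws_iam_policy.emr_service_policy.arn}"
}
```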
4. Another important section is about network and security (vpc.tf). Here, we're creating 6 resources:
- an aws_vpc;
- an aws_subnet and an aws_internet_gateway in this VPC;
- an aws_route_table in this VPC, with a route via the created internet gateway;
- an aws_main_route_table_association, which connects our VPC and route table;
- an aws_security_group in our VPC, which depends on the created subnet.
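The six resources above could be sketched roughly as follows (CIDR blocks and names are illustrative assumptions, not the project's exact values):

```hcl
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
}

resource "aws_subnet" "main" {
  vpc_id     = "${aws_vpc.main.id}"
  cidr_block = "10.0.0.0/24"
}

resource "aws_internet_gateway" "gw" {
  vpc_id = "${aws_vpc.main.id}"
}

# Route table with a default route out through the internet gateway.
resource "aws_route_table" "r" {
  vpc_id = "${aws_vpc.main.id}"

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = "${aws_internet_gateway.gw.id}"
  }
}

# Make this route table the main one for the VPC.
resource "aws_main_route_table_association" "a" {
  vpc_id         = "${aws_vpc.main.id}"
  route_table_id = "${aws_route_table.r.id}"
}

resource "aws_security_group" "emr" {
  vpc_id     = "${aws_vpc.main.id}"
  depends_on = ["aws_subnet.main"]

  # Allow all egress as a starting point; EMR adds its own
  # rules to the managed security groups it is given.
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```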
5. We also need an aws_iam_instance_profile, which is kept at the end of the emr_cluster.tf module.
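The instance profile simply wraps the EC2 role from step 3, roughly like this (the role reference name is an assumption):

```hcl
# Instance profile that EMR passes to its EC2 instances.
resource "aws_iam_instance_profile" "emr_ec2" {
  name = "emr-ec2-instance-profile"
  role = "${aws_iam_role.emr_ec2_role.name}" # EC2 role from step 3
}
```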
6. Finally, we can create the EMR cluster itself (emr_cluster.tf). We should describe here all the required properties, such as: name,
release_label, applications, service_role (from step 3), log_uri (from step 2),
ec2_attributes (from steps 2, 4, 5), and one or more instance groups. I also added a 'step' section there, where I put the Hive
script to execute.
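Putting the pieces together, the cluster resource could be sketched like this. The names, instance type, release label, and the sample Hive script location are illustrative assumptions rather than the project's exact values; the step invokes Hive via EMR's command-runner.jar:

```hcl
resource "aws_emr_cluster" "cluster" {
  name          = "analyze-big-data"
  release_label = "emr-5.17.0"
  applications  = ["Hadoop", "Hive"]
  service_role  = "${aws_iam_role.emr_service_role.arn}"    # step 3
  log_uri       = "s3://${aws_s3_bucket.logs.bucket}/logs/" # step 2

  ec2_attributes {
    key_name                          = "${aws_key_pair.emr.key_name}"              # step 2
    subnet_id                         = "${aws_subnet.main.id}"                     # step 4
    emr_managed_master_security_group = "${aws_security_group.emr.id}"              # step 4
    emr_managed_slave_security_group  = "${aws_security_group.emr.id}"              # step 4
    instance_profile                  = "${aws_iam_instance_profile.emr_ec2.arn}"   # step 5
  }

  instance_group {
    instance_role  = "MASTER"
    instance_type  = "m4.large"
    instance_count = 1
  }

  # Run a Hive script as a cluster step (script location is a placeholder).
  step {
    name              = "Run Hive script"
    action_on_failure = "TERMINATE_CLUSTER"

    hadoop_jar_step {
      jar  = "command-runner.jar"
      args = ["hive-script", "--run-hive-script", "--args", "-f", "s3://my-bucket/scripts/my_script.q"]
    }
  }
}
```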
I would really appreciate any comments or suggestions about how this setup could be simplified.
Points of Interest
It's not obvious how many resources are actually created behind the scenes when you click the button to create an EMR cluster in the AWS
Console, but knowing this is useful for understanding what is really happening underneath.
License
This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
Article Copyright 2018 by YegorDovganich. Last Updated 10 Oct 2018.