[AWS] Elastic MapReduce (EMR)

EMR is a tool for large-scale, parallel, distributed data processing; its key benefit is on-demand billing.


Features

EMR can be used for huge-scale log analysis, indexing, machine learning, and other large-scale applications.

EMR is based on the Apache Hadoop framework and runs as an AWS-managed cluster of EC2 instances (it can also run on Amazon Elastic Kubernetes Service (EKS)). It is well suited to short-term or ad-hoc computing.

  • It integrates open-source big data tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto.
  • For long-term computing tasks, reserved instances can be used.

Components

The main component is a cluster, which is a collection of nodes (EC2 instances).

  • The master node manages the cluster, distributes workloads, and monitors health.
  • Core nodes run tasks and store data in HDFS. A cluster can have zero or more core nodes (a single-node cluster has only the master node).
  • Task nodes are optional and only execute tasks; they do not store data in HDFS. If a task node fails, a core node restarts the task on another task node. Because they hold no HDFS data, task nodes are a good fit for Spot Instances.
  • S3 or HDFS is used for data storage for the cluster.
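
The node roles above can be sketched as the instance-group list that boto3's run_job_flow call accepts under Instances["InstanceGroups"]. This is only a minimal sketch: the helper name build_instance_groups, the instance type, and the node counts are assumptions chosen for illustration, and the code only builds the configuration without contacting AWS.

```python
def build_instance_groups(core_count=2, task_count=2, instance_type="m5.xlarge"):
    """Build an InstanceGroups list (hypothetical helper, illustrative values)."""
    # Exactly one master node manages the cluster.
    groups = [{
        "Name": "Master",
        "InstanceRole": "MASTER",
        "Market": "ON_DEMAND",
        "InstanceType": instance_type,
        "InstanceCount": 1,
    }]
    # Core nodes run tasks and hold HDFS data, so keep them on-demand.
    if core_count > 0:
        groups.append({
            "Name": "Core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        })
    # Task nodes hold no HDFS data, so losing a Spot instance loses no data.
    if task_count > 0:
        groups.append({
            "Name": "Task",
            "InstanceRole": "TASK",
            "Market": "SPOT",
            "InstanceType": instance_type,
            "InstanceCount": task_count,
        })
    return groups


if __name__ == "__main__":
    for group in build_instance_groups():
        print(group["InstanceRole"], group["Market"], group["InstanceCount"])
```

Putting only the task group on the Spot market mirrors the note above: core nodes carry HDFS data, so an interruption there is more costly than on a task node.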

Logging

Log files on the master node can be archived to S3, where they are copied at five-minute intervals. This can only be configured when the cluster is created; it cannot be turned on afterwards. After the cluster terminates, the log files remain available in S3.
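
Since the S3 log location has to be supplied at creation time, it belongs in the cluster request itself. The sketch below shows roughly where it goes, using the LogUri parameter of boto3's run_job_flow. The helper name build_cluster_request, the bucket name, the "emr-logs/" prefix, the release label, and the instance settings are all assumptions for illustration; the role names are the EMR defaults and may differ in your account. Nothing is sent to AWS here.

```python
def build_cluster_request(name, log_bucket, release_label="emr-6.15.0"):
    """Build keyword arguments for run_job_flow (hypothetical helper)."""
    return {
        "Name": name,
        # Log archiving to S3 must be set here, at cluster creation.
        "LogUri": f"s3://{log_bucket}/emr-logs/",
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # Terminate the cluster once its steps finish (ad-hoc usage).
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    request = build_cluster_request("log-analysis", "my-example-bucket")
    print(request["LogUri"])
    # To actually launch (requires boto3 and AWS credentials):
    #   import boto3
    #   boto3.client("emr").run_job_flow(**request)
```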
