[AWS] Elastic MapReduce (EMR)

EMR is a large-scale, parallel, distributed data processing service; its key benefit is on-demand billing.

Features

EMR can be used for large-scale workloads such as log analysis, indexing, and machine learning.

EMR is based on the Apache Hadoop framework and runs as an AWS-managed cluster of EC2 instances, which suits short-term or ad-hoc computing. It can also run on other platforms such as Amazon Elastic Kubernetes Service (EKS).

It integrates open-source big data tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto.
For long-running workloads, Reserved Instances can be used to reduce cost.

Components

The main component is a cluster, which is a collection of nodes (EC2 instances).

The master node manages the cluster, distributes workloads, and monitors health.
Core nodes run tasks and store data in HDFS. A cluster can have zero or more core nodes; a single-node cluster has only the master node.
Task nodes are optional and only execute tasks; they do not store data in HDFS. If a task node fails, its tasks are rescheduled on another node and no HDFS data is lost, which makes task nodes a good fit for Spot Instances.
S3 or HDFS is used for data storage for the cluster.
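
The node roles above map onto the instance groups of a cluster creation request. The following is a minimal sketch, assuming the boto3 SDK and its EMR `run_job_flow` call. The helper names `build_cluster_request` and `launch`, the instance types, the release label, and the IAM role names are illustrative assumptions, not values from this post.

```python
def build_cluster_request(name, core_count=2, task_count=0):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    All concrete values (instance types, release label, role names)
    are assumptions chosen for illustration.
    """
    # Exactly one master node manages the cluster.
    groups = [{
        "Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
        "InstanceType": "m5.xlarge", "InstanceCount": 1,
    }]
    # Core nodes run tasks and hold HDFS data, so keep them on-demand.
    if core_count > 0:
        groups.append({
            "Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
            "InstanceType": "m5.xlarge", "InstanceCount": core_count,
        })
    # Task nodes hold no HDFS data, so Spot interruptions lose no data.
    if task_count > 0:
        groups.append({
            "Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
            "InstanceType": "m5.xlarge", "InstanceCount": task_count,
        })
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": groups,
            # Terminate once all steps finish (short-term, ad-hoc use).
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(request):
    """Submit the request; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so building a request needs no SDK
    return boto3.client("emr").run_job_flow(**request)["JobFlowId"]
```

Building the request separately from submitting it keeps the cluster layout easy to inspect before anything is billed: `build_cluster_request("nightly-logs", core_count=2, task_count=4)` yields one master group, one core group, and one Spot task group.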

Logging

Log files on the master node can be archived to S3 at five-minute intervals, but this can be configured only when the cluster is created. After the cluster terminates, the archived log files remain available in S3.
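
Because log archiving is fixed at creation time, the S3 log location belongs in the creation request itself. The sketch below assumes boto3's `run_job_flow` call, which accepts a `LogUri` parameter. The helper name `with_s3_logging` and the bucket and prefix in the usage note are hypothetical.

```python
def with_s3_logging(request, bucket, prefix="emr-logs"):
    """Return a copy of a run_job_flow request with S3 log archiving set.

    `request` is a dict of keyword arguments for boto3's
    emr.run_job_flow; the original dict is left unmodified.
    """
    updated = dict(request)  # shallow copy; only a top-level key is added
    updated["LogUri"] = f"s3://{bucket}/{prefix.strip('/')}/"
    return updated
```

For example, `with_s3_logging({"Name": "demo"}, "my-bucket")` returns a request whose `LogUri` is `s3://my-bucket/emr-logs/`.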

Published by P. L.
