Amazon EMR is a managed service for large-scale, parallel, distributed data processing; on-demand billing is its key benefit.
EMR can be used for large-scale log analysis, indexing, machine learning, and other big data workloads.
EMR is based on the Apache Hadoop framework and runs as an AWS-managed cluster of EC2 instances (a managed-cluster model comparable to Amazon Elastic Kubernetes Service (EKS) for Kubernetes), well suited to short-term or ad-hoc computing.
- It integrates open source big data tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto.
- For long-term computing tasks, reserved instances can be used.
The main component is a cluster, which is a collection of nodes (EC2 instances).
- The master node manages the cluster, distributes workloads, and monitors health.
- Core nodes run tasks and store data in HDFS; a cluster can have zero or more of them (a single-node cluster consists of the master node only).
- Task nodes are optional and only execute tasks; they do not store data in HDFS. If a task node fails, its tasks are rescheduled on another node. -> Task nodes are a good fit for spot instances.
- S3 or HDFS is used for data storage for the cluster.
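The node roles above can be sketched as the request passed to boto3's `emr` client and its `run_job_flow` call. This is a minimal sketch under assumptions: the bucket name, instance types, counts, and bid price are placeholders, and the default IAM roles are assumed to exist in the account.

```python
def build_cluster_request(log_bucket: str) -> dict:
    """Assemble a run_job_flow request with a master node, core nodes,
    and spot task nodes. Bucket name and sizes are placeholder assumptions."""
    return {
        "Name": "ad-hoc-analysis",
        "ReleaseLabel": "emr-6.15.0",
        "LogUri": f"s3://{log_bucket}/emr-logs/",  # master-node logs pushed to S3
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                # Core nodes run tasks AND store HDFS data.
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
                # Task nodes hold no HDFS data, so spot interruption is safe.
                {"Name": "task", "InstanceRole": "TASK",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2,
                 "Market": "SPOT", "BidPrice": "0.10"},
            ],
            # Terminate the cluster when all steps finish (ad-hoc use).
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

if __name__ == "__main__":
    request = build_cluster_request("my-emr-bucket")
    # To actually launch the cluster (requires AWS credentials):
    # import boto3
    # emr = boto3.client("emr", region_name="us-east-1")
    # response = emr.run_job_flow(**request)
    print(request["Instances"]["InstanceGroups"][2]["Market"])
```

Keeping the task group on the spot market while master and core stay on-demand reflects the note above: losing a task node only reschedules work, whereas losing a core node would lose HDFS blocks.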
Saving the master node's log files to S3 (pushed at five-minute intervals) can only be configured when the cluster is first created. After the cluster terminates, the log files remain available in S3.
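Because the logs outlive the cluster, they are retrieved from S3 by cluster ID after termination. EMR archives each cluster's logs under a prefix of the form `<LogUri>/<cluster-id>/`; the helper below builds that prefix (the bucket name and cluster ID shown are placeholder assumptions):

```python
def cluster_log_prefix(log_uri: str, cluster_id: str) -> str:
    """Return the S3 prefix EMR archives a cluster's logs under:
    <LogUri>/<cluster-id>/ (containing subfolders such as node/ and steps/)."""
    return f"{log_uri.rstrip('/')}/{cluster_id}/"

if __name__ == "__main__":
    prefix = cluster_log_prefix("s3://my-emr-bucket/emr-logs/", "j-1ABCD2EFGH3IJ")
    print(prefix)
    # Browse the archived logs after termination, e.g.:
    # aws s3 ls s3://my-emr-bucket/emr-logs/j-1ABCD2EFGH3IJ/steps/
```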