What is AWS EMR?

WHITEPAPER

The Economic Benefits of Migrating Apache Spark & Hadoop to Amazon EMR

Open-source Apache projects like Hadoop and Spark have accelerated the collection of insights by automating the storage, processing, and access of big data. IDC studied nine companies that leveraged Amazon Elastic MapReduce (EMR) to run big data frameworks at scale and found that, on average, they lowered total cost of ownership by 57% and experienced a 99% reduction in unplanned downtime.

In this whitepaper, IDC explores the business impact of Amazon EMR with impressive results, including an average 342% five-year ROI. Download the whitepaper to learn more about how customers use Amazon EMR to generate value by:

  • Deploying a flexible, elastic, and scalable cloud environment to reduce physical infrastructure costs.
  • Driving higher IT staff productivity among teams that need to manage and support these environments.
  • Providing stronger availability for big data environments, which enables better productivity among end users.


AWS EMR

AWS EMR is a big data platform for processing vast amounts of data using open-source tools such as Apache Spark, Apache Hive, and Apache HBase. It is fully integrated with the AWS big data ecosystem, in particular with Amazon S3 buckets for data storage. It is one of the most widely used big data services on AWS, thanks to its simple and reliable cluster-based model.
A cluster is a collection of Amazon Elastic Compute Cloud (Amazon EC2) instances, and every instance is called a node.

NODE TYPES

The node types in Amazon EMR are as follows:
• Master node: A node that manages the cluster by running software components to coordinate the distribution of data and tasks among other nodes for processing. Every cluster has a master node, and it's possible to create a single-node cluster with only the master node.

• Core node: A node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster. Multi-node clusters have at least one core node.

• Task node: A node with software components that only runs tasks and does not store data in HDFS. Task nodes are optional.
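As an illustration, here is a minimal sketch of creating such a cluster with the boto3 Python SDK. The cluster name, region, EMR release, instance types and counts, and the S3 log bucket are placeholder assumptions, not prescriptions:

    import boto3

    emr = boto3.client("emr", region_name="eu-west-1")

    # One master node, two core nodes, and two optional task nodes.
    response = emr.run_job_flow(
        Name="demo-spark-cluster",
        ReleaseLabel="emr-6.15.0",                 # assumed EMR release
        Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
        Instances={
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
                {"Name": "Tasks", "InstanceRole": "TASK",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        LogUri="s3://my-emr-logs/",                # placeholder bucket
        JobFlowRole="EMR_EC2_DefaultRole",         # default EMR roles
        ServiceRole="EMR_DefaultRole",
    )
    print("Cluster ID:", response["JobFlowId"])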

DATA REPLY BEST PRACTICE

Data Reply, an AWS Premier Consulting Partner, has developed strong expertise in implementing big data platforms on AWS. Over many projects we have gained deep experience with AWS EMR and learned how to use it in ways that guarantee reliability and cost savings.

Governance

AWS EMR governance is possible through a centralized dashboard that lets customers manage clusters (create, delete, scale, configure, and so on), giving users a clear view at all times of each cluster's cost and capacity. Moreover, by using EMR together with AWS Glue, it is possible to create a centralized Data Catalog where you can consume the metadata associated with the data and tables used by EMR.
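As a sketch of how that integration is wired up, the cluster configurations below (passed as the Configurations parameter of run_job_flow, continuing the earlier Python example) point Hive and Spark on EMR at the Glue Data Catalog. The classification names and factory class are the documented values; everything around them is an assumption:

    # Cluster configurations that make Hive and Spark on EMR use the
    # AWS Glue Data Catalog as their metastore.
    glue_catalog_configurations = [
        {
            "Classification": "hive-site",
            "Properties": {
                "hive.metastore.client.factory.class":
                    "com.amazonaws.glue.catalog.metastore."
                    "AWSGlueDataCatalogHiveClientFactory",
            },
        },
        {
            "Classification": "spark-hive-site",
            "Properties": {
                "hive.metastore.client.factory.class":
                    "com.amazonaws.glue.catalog.metastore."
                    "AWSGlueDataCatalogHiveClientFactory",
            },
        },
    ]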

Logging

AWS EMR is fully integrated with Amazon CloudWatch. Thanks to this integration, we can collect logs and metrics related to EMR and use them to continuously monitor the pipelines.
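As a minimal monitoring sketch, the IsIdle metric that EMR publishes to CloudWatch can be read with boto3 to spot clusters that are running but doing no work; the cluster ID below is a placeholder:

    import boto3
    from datetime import datetime, timedelta

    cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

    # IsIdle is 1 when the cluster is running but has no active work.
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ElasticMapReduce",
        MetricName="IsIdle",
        Dimensions=[{"Name": "JobFlowId", "Value": "j-XXXXXXXXXXXXX"}],  # placeholder
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average"],
    )
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Average"])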

Costs

One advantage of AWS EMR is the ability to use Spot Instances. Spot Instances are spare Amazon EC2 capacity whose price is determined by supply and demand; using them can cost up to 80% less than On-Demand Instances. Not every workload can run on Spot Instances; in those cases we use On-Demand machines, which can be shared among several small jobs or teams.
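A sketch of what this looks like in practice, reusing the instance-group syntax from the cluster example above (instance type and count are assumptions): the task group is placed on the Spot market while master and core nodes stay On-Demand, so a Spot interruption costs compute capacity but never HDFS data:

    # Optional task nodes on Spot capacity; master and core stay On-Demand.
    spot_task_group = {
        "Name": "SpotTasks",
        "InstanceRole": "TASK",
        "InstanceType": "m5.xlarge",   # assumed instance type
        "InstanceCount": 4,
        "Market": "SPOT",
        # With no BidPrice set, EMR caps the Spot price at the On-Demand rate.
    }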

DATA REPLY MIGRATION APPROACH

At Data Reply we provide AWS migration expertise, built across different industrial sectors over many years of projects. We have distilled that experience into our Migration Approach, which consists of four modules that can be combined and selected depending on the customer's maturity level:

REQUIREMENTS & BUSINESS USE CASE

Understand key business challenges and goals in order to identify gaps and opportunities, and plan the current and future state.

TECHNICAL WORKSHOP

During the workshop phase, we perform a technical and opportunity assessment and plan technical deep-dive sessions in order to identify migration success criteria and the business and IT data lake outcomes.

PILOT

The scope of the Pilot phase is to build a simple pilot of the target solution, giving customers a concrete way to test it. We define the target architecture and component-level mapping according to the requirements collected in the previous phases, and execute incremental data migration and automation. After the UAT step, the pilot is ready to go live!

IMPLEMENTATION

Finally, the phase where we implement the final solution, split into waves that guarantee a continuous release of the solution. We define the full migration strategy and schedule, and the application code migration with a Dual Target approach. Then we start the implementation waves, including bulk import/export and the validation and audit of the solution. After the Test & UAT phases, we are ready for a successful GO LIVE!