Apache Spark is an open-source distributed processing engine used for big data workloads. It's a good fit for batch processing, streaming, graph databases and machine learning thanks to in-memory caching and optimized execution for fast performance, according to Amazon.
Amazon Elastic MapReduce (EMR) supports Spark version 1.3.1 and uses Hadoop YARN as the cluster manager. Running Spark on top of EMR was already possible, but the integrated support should make using the engine more straightforward. IT staff can, for example, create a cluster from the AWS Management Console. Spark applications developed using Scala, Python, Java, and SQL can all run on EMR.
It has been a good week for proponents of Spark, with the launch of a new release, IBM getting behind it in a big way and Amazon now adding Spark on top of EMR.
Amazon and IBM will go head to head later this month, when IBM also starts offering a Spark service. The company said on Monday that the service will allow developers to build and run their own machine learning algorithms. IBM also said it has devoted 3,500 researchers and developers to help with Spark upkeep and further development.
Amazon's pricing is based on the cost of the underlying EC2 instances and a separate charge for the processing service.
Running Spark on EMR on a basic c3.xlarge instance costs US$0.263 per hour on-demand, while using the more capable c3.8xlarge instance costs US$1.95 per hour. There are also more expensive instances with lots of memory or storage to choose from (so-called memory-optimized and storage-optimized instances). The per-instance price then has to be multiplied by the number of nodes used.
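To make the arithmetic concrete, here is a minimal sketch in Python of how a cluster's on-demand bill could be estimated from the figures above. It assumes the quoted hourly prices are all-in per-node rates (EC2 plus the EMR charge) and that usage is counted in whole hours; the function name `cluster_cost` and the rate table are illustrative, not part of any AWS API.

```python
from decimal import Decimal

# Per-node hourly on-demand rates quoted in the article, in US dollars.
# Assumption: each figure already combines the EC2 and EMR charges.
HOURLY_RATE_USD = {
    "c3.xlarge": Decimal("0.263"),
    "c3.8xlarge": Decimal("1.95"),
}


def cluster_cost(instance_type: str, nodes: int, hours: int) -> Decimal:
    """Estimate cost as per-node hourly rate x number of nodes x hours."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    if hours < 0:
        raise ValueError("hours cannot be negative")
    return HOURLY_RATE_USD[instance_type] * nodes * hours


if __name__ == "__main__":
    # Hypothetical example: ten c3.xlarge nodes running for a full day.
    print(cluster_cost("c3.xlarge", nodes=10, hours=24))
```

Decimal is used instead of floating point so that fractions of a cent in the hourly rates do not pick up rounding noise when multiplied across many nodes.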