Here I discuss how to set up a standalone Spark cluster on AWS using EC2.
Let's assume we are setting up a 3-node standalone cluster. The IP address of each node is, say:
192.168.10.1 (m4.xlarge – $0.239 per Hour)
192.168.10.2 (m4.large – $0.12 per Hour)
192.168.10.3 (m4.large – $0.12 per Hour)
Each node has a 100 GB EBS volume
Servers Info
Launch 3 servers with CentOS 6.5 or 7 in AWS EC2 (US West – Oregon region)
Configure a hostname on each node and make sure every node is reachable from the others by its hostname
We will make one node the master node and the other two worker nodes
The master node has one security group and the worker nodes have another; configure firewall rules in both security groups so the Spark cluster runs smoothly
The master node requires the following ports to be opened to the public:
8080 – Spark master web UI
7077 – Spark master port; worker nodes register here and driver programs connect here to submit jobs (see the driver sketch below)
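Once these ports are open, a quick way to confirm the wiring is a minimal driver program that points at the master on port 7077. This is only a sketch; the hostname spark-master and the Spark 1.x Java API are assumptions for illustration, not part of the setup above.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ClusterSmokeTest {
    public static void main(String[] args) {
        // The driver contacts the standalone master on port 7077;
        // "spark-master" is a placeholder for the hostname configured above.
        SparkConf conf = new SparkConf()
                .setAppName("ClusterSmokeTest")
                .setMaster("spark://spark-master:7077");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // A trivial distributed count; if this returns, the workers have
        // registered with the master and the driver can reach them.
        long n = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5)).count();
        System.out.println("Count from cluster: " + n);

        sc.stop();
    }
}
```

While it runs, the job should also appear in the master web UI on port 8080.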
Cluster Deployment
Configurations on each node
Install Java 7 on each server
Place a compiled version of Spark on each node in the cluster
Install the Jetty server on the master node
Install Maven on the master node to compile the Spark Java code (a sample job is sketched after this list)
Use Spark's cluster launch scripts (e.g., sbin/start-all.sh run from the master) to bring up the Spark cluster
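To illustrate the Maven step, below is a sketch of the kind of Spark Java job that would be compiled into a jar on the master node and submitted to the cluster. The class name, input/output paths, and the Spark 1.x RDD API (written against Java 7, hence the anonymous classes) are assumptions added for illustration.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        // No master URL is hardcoded; it is supplied at submit time,
        // so the same jar runs locally or against the standalone cluster.
        SparkConf conf = new SparkConf().setAppName("WordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile(args[0]);

        // Split each line into words (anonymous classes, since the
        // cluster runs Java 7 and lambdas need Java 8).
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String line) {
                return Arrays.asList(line.split(" "));
            }
        });

        // Pair each word with 1 and sum the counts per word.
        JavaPairRDD<String, Integer> counts = words
            .mapToPair(new PairFunction<String, String, Integer>() {
                public Tuple2<String, Integer> call(String word) {
                    return new Tuple2<String, Integer>(word, 1);
                }
            })
            .reduceByKey(new Function2<Integer, Integer, Integer>() {
                public Integer call(Integer a, Integer b) {
                    return a + b;
                }
            });

        counts.saveAsTextFile(args[1]);
        sc.stop();
    }
}
```

Packaged with Maven into a jar, this would be launched with something like spark-submit --class WordCount --master spark://<master-host>:7077 wordcount.jar <input-path> <output-path>.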
Cassandra Setup
As part of the pilot project we can use the same servers for a multi-node Cassandra cluster and scale out appropriately based on performance
Install Cassandra on all nodes
Configure the Cassandra instance on the master node to act as the seed node; the other two nodes are normal nodes
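To verify that Spark can read from the Cassandra ring, a small driver using the DataStax spark-cassandra-connector can be submitted to the cluster. The connector dependency and the keyspace/table names below are assumptions added for illustration; only the seed-node address comes from the setup above.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import com.datastax.spark.connector.japi.CassandraRow;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

public class CassandraReadCheck {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("CassandraReadCheck")
                // Point the connector at the seed node (the master node above).
                .set("spark.cassandra.connection.host", "192.168.10.1");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // "my_keyspace" and "my_table" are placeholders; any existing
        // keyspace and table in the ring will do for this check.
        JavaRDD<CassandraRow> rows = javaFunctions(sc)
                .cassandraTable("my_keyspace", "my_table");
        System.out.println("Rows read from Cassandra: " + rows.count());

        sc.stop();
    }
}
```

The connector jar would need to be on the classpath at submit time, for example bundled into the Maven-built jar or passed via spark-submit's --jars option.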