Installing Spark Standalone to a Cluster
# Source
http://spark.apache.org/docs/latest/spark-standalone.html
□ Place pre-built versions of Spark (All Servers)
- To install Spark Standalone mode, you simply place a compiled version of Spark on each node on the cluster.
- Deploy the files to the install location ( /platform/package/spark )
> cd /platform/package
> tar -zxvf /platform/temp/spark/spark-2.1.0-bin-hadoop2.6.tgz
> ln -s spark-2.1.0-bin-hadoop2.6 spark
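- The same archive must be present on every node. A minimal way to push it out from the master is sketched below; the worker hostnames migaloo02~migaloo05 are taken from the cluster used later in this note, and it assumes you can SSH to each worker (password-less access is set up in the next step), so adjust to your environment.
> for host in migaloo02 migaloo03 migaloo04 migaloo05; do
>   ssh $host "mkdir -p /platform/temp/spark"
>   scp /platform/temp/spark/spark-2.1.0-bin-hadoop2.6.tgz $host:/platform/temp/spark/
>   ssh $host "cd /platform/package && tar -zxf /platform/temp/spark/spark-2.1.0-bin-hadoop2.6.tgz && ln -s spark-2.1.0-bin-hadoop2.6 spark"
> done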
□ SSH key setup (All Servers)
- Generate SSH keys so that the master can log in to each worker automatically, allowing the cluster nodes to communicate (see the example at the end of this section).
- Note, the master machine accesses each of the worker machines via ssh.
By default, ssh is run in parallel and requires password-less (using a private key) access to be setup.
If you do not have a password-less setup,
you can set the environment variable SPARK_SSH_FOREGROUND and serially provide a password for each worker.
- Reference: http://bigdata-architect.tistory.com/8
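- Example: generate a key on the master and copy it to each worker so that password-less login works. A minimal sketch; the user name migaloo and the worker hostnames migaloo02~migaloo05 are assumed from this cluster.
> ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa      # create a key pair without a passphrase
> for host in migaloo02 migaloo03 migaloo04 migaloo05; do
>   ssh-copy-id migaloo@$host                   # append the public key to the worker's authorized_keys
> done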
□ SPARK CONFIG (All Servers)
- sudo vim /etc/profile
> SPARK_HOME="/platform/package/spark"
> export SPARK_HOME
- source /etc/profile
- echo $SPARK_HOME
□ Cluster Launch Scripts (Master Server)
- Use the launch scripts on the master server to start and stop the Spark Standalone cluster.
- Create the list of slave nodes (conf/slaves); an example is shown just below.
> cd /platform/package/spark/conf
> cp slaves.template slaves
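- Example conf/slaves contents, one worker hostname per line (hostnames assumed from the Worker WEB UI list further down):
> migaloo02
> migaloo03
> migaloo04
> migaloo05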
- Start / stop the cluster with sbin/start-all.sh and sbin/stop-all.sh.
> $SPARK_HOME/sbin/start-all.sh
> $SPARK_HOME/sbin/stop-all.sh
- To launch a Spark standalone cluster with the launch scripts,
you should create a file called conf/slaves in your Spark directory,
which must contain the hostnames of all the machines where you intend to start Spark workers, one per line.
If conf/slaves does not exist, the launch scripts defaults to a single machine (localhost), which is useful for testing.
- Once you’ve set up this file, you can launch or stop your cluster with the following shell scripts,
based on Hadoop’s deploy scripts, and available in SPARK_HOME/sbin:
sbin/start-master.sh - Starts a master instance on the machine the script is executed on.
sbin/start-slaves.sh - Starts a slave instance on each machine specified in the conf/slaves file.
sbin/start-slave.sh - Starts a slave instance on the machine the script is executed on.
sbin/start-all.sh - Starts both a master and a number of slaves as described above.
sbin/stop-master.sh - Stops the master that was started via the sbin/start-master.sh script.
sbin/stop-slaves.sh - Stops all slave instances on the machines specified in the conf/slaves file.
sbin/stop-all.sh - Stops both the master and the slaves as described above.
Note that these scripts must be executed on the machine you want to run the Spark master on,
not your local machine.
- You can optionally configure the cluster further by setting environment variables in conf/spark-env.sh.
Create this file by starting with the conf/spark-env.sh.template,
and copy it to all your worker machines for the settings to take effect. The full list of available settings is in the official documentation linked above; an example follows.
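- Example conf/spark-env.sh for a low-memory cluster such as this one. The values below are illustrative assumptions, not from the original note; SPARK_MASTER_HOST, SPARK_WORKER_CORES and SPARK_WORKER_MEMORY are standard Spark 2.x settings.
> cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
> vim $SPARK_HOME/conf/spark-env.sh
> SPARK_MASTER_HOST=migaloo01      # hostname the master binds to
> SPARK_WORKER_CORES=4             # cores each worker offers to applications
> SPARK_WORKER_MEMORY=512m         # keep well below the Pi's 1 GB of RAM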
□ Verifying that the Spark Cluster is running
- An interactive Spark shell
$SPARK_HOME/bin/spark-shell --master spark://migaloo01:7077
- Master WEB UI
http://migaloo01:8080 , http://192.168.10.101:8080
- Worker WEB UI
http://migaloo02:8081 , http://192.168.10.102:8081
http://migaloo03:8081 , http://192.168.10.103:8081
http://migaloo04:8081 , http://192.168.10.104:8081
http://migaloo05:8081 , http://192.168.10.105:8081
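- To confirm that the workers actually execute tasks (a quick check, not from the original note), run a small job through the shell; it should then show up under Completed Applications in the Master WEB UI:
> echo "sc.parallelize(1 to 1000).count()" | $SPARK_HOME/bin/spark-shell --master spark://migaloo01:7077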
□ Launching Spark Applications on YARN
- spark shell
> $SPARK_HOME/bin/spark-shell # runs fine
> $SPARK_HOME/bin/spark-shell --master spark://migaloo01:7077 # dies while running
> $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client # does not start
> $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode cluster # not supported (cluster deploy mode is not available for spark-shell)
- spark application
> hdfs dfs -ls /
> hdfs dfs -mkdir /platform
> hdfs dfs -mkdir /platform/temp/
> hdfs dfs -put /platform/temp/UserPurchaseHistory.csv /platform/temp
> hdfs dfs -ls /platform/temp/
# Not shown in the Hadoop Resource Manager, but actually processed
> $SPARK_HOME/bin/spark-submit --class org.migaloo.spark.test.ScalaApp --master spark://migaloo01:7077 /platform/temp/org.migaloo.spark.test-0.0.1-SNAPSHOT.jar
> $SPARK_HOME/bin/spark-submit --class org.migaloo.spark.test.ScalaApp /platform/temp/org.migaloo.spark.test-0.0.1-SNAPSHOT.jar
> $SPARK_HOME/bin/spark-submit --class org.migaloo.spark.test.ScalaApp --master yarn --deploy-mode client /platform/temp/org.migaloo.spark.test-0.0.1-SNAPSHOT.jar
# Shown as ACCEPTED in the Hadoop Resource Manager, but not actually processed
> $SPARK_HOME/bin/spark-submit --class org.migaloo.spark.test.ScalaApp --master yarn --deploy-mode cluster --supervise /platform/temp/org.migaloo.spark.test-0.0.1-SNAPSHOT.jar
> $SPARK_HOME/bin/spark-submit --class org.migaloo.spark.test.ScalaApp --master yarn --deploy-mode cluster /platform/temp/org.migaloo.spark.test-0.0.1-SNAPSHOT.jar
□ Troubleshooting memory errors
- Running the Spark shell on a Raspberry Pi with 1 GB of RAM can fail with a memory error.
- It occurs when there is not enough memory left to allocate.
- Allowing memory overcommit works around it.
- sudo sysctl vm.overcommit_memory=1
- Just run it from a regular terminal.
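- To keep the setting across reboots (standard sysctl practice, not in the original note), persist it in /etc/sysctl.conf:
> echo "vm.overcommit_memory=1" | sudo tee -a /etc/sysctl.conf
> sudo sysctl -p                  # reload settings from /etc/sysctl.conf
- Example of the error when the shell runs out of memory on a 1 GB node: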
migaloo@migaloo01:/platform/package/spark$ ./bin/spark-shell --master spark://migaloo01:7077
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Java HotSpot(TM) Client VM warning: You have loaded library /platform/package/hadoop-2.6.5/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
17/01/22 10:52:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/01/22 10:52:45 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/platform/package/spark/jars/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/platform/package/spark-2.1.0-bin-hadoop2.6/jars/datanucleus-core-3.2.10.jar."
17/01/22 10:52:45 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/platform/package/spark-2.1.0-bin-hadoop2.6/jars/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/platform/package/spark/jars/datanucleus-rdbms-3.2.9.jar."
17/01/22 10:52:45 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/platform/package/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/platform/package/spark-2.1.0-bin-hadoop2.6/jars/datanucleus-api-jdo-3.2.6.jar."
Java HotSpot(TM) Client VM warning: INFO: os::commit_memory(0x555c2000, 129228800, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 129228800 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /platform/package/spark-2.1.0-bin-hadoop2.6/hs_err_pid2559.log