Setting Up the Python Development Environment


□ Install the Anaconda Distribution

  - https://www.continuum.io/downloads

  - Anaconda 4.3.0 For Windows  

  - Python 3.6 version 32-BIT INSTALLER (348M)

  - The 64-BIT version has a known issue, so install the 32-BIT version for now

  - During installation, be sure to select Just Me for the Install Type
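
  - After installation, a quick sanity check from an Anaconda Prompt (or any shell where Anaconda is on the PATH) is to print the installed versions:

    > conda --version

    > python --version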

  

□ Install the PyCharm IDE

  - https://www.jetbrains.com/pycharm/download//#section=windows

  - Community: Lightweight IDE for Python & Scientific development (free)

  - After installation, the following options are recommended when choosing the keyboard shortcuts and IDE theme:

    > Keymap Scheme : Visual Studio

    > IDE theme : Darcula

    > Editor colors and fonts : Monokai


  - When creating a new project, select the python.exe from the Anaconda3 installation you just completed as the Interpreter.



□ PyCharm IDE Configuration

  - Configure fonts and other preferences under File > Settings




Installing Spark Standalone to a Cluster



# Source

http://spark.apache.org/docs/latest/spark-standalone.html


□ Place pre-built versions of Spark (All Servers)

  - To install Spark Standalone mode, you simply place a compiled version of Spark on each node on the cluster. 

  - Deploy the files to the install location ( /platform/package/spark )

    > cd /platform/package

    > tar -zxvf /platform/temp/spark/spark-2.1.0-bin-hadoop2.6.tgz

    > ln -s spark-2.1.0-bin-hadoop2.6 spark  
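
  - Optionally, confirm the symlink and the Spark version before any further configuration (a simple sanity check):

    > ls -l /platform/package/spark

    > /platform/package/spark/bin/spark-submit --version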


□ SSH Key Setup (All Servers)

  - Generate SSH keys so the master can connect to the workers automatically, enabling password-less communication across the cluster.

  - Note, the master machine accesses each of the worker machines via ssh. 

    By default, ssh is run in parallel and requires password-less (using a private key) access to be setup. 

    If you do not have a password-less setup, 

    you can set the environment variable SPARK_SSH_FOREGROUND and serially provide a password for each worker.

  - Reference: http://bigdata-architect.tistory.com/8
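
  - A typical password-less setup from the master to the workers looks like the following sketch (the user and hostnames are the ones used elsewhere in this post; adjust to your environment):

    > ssh-keygen -t rsa                 # on the master; accept the defaults, leave the passphrase empty

    > ssh-copy-id migaloo@migaloo02     # repeat for migaloo03 ~ migaloo05

    > ssh migaloo02                     # should log in without a password prompt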


□ SPARK CONFIG (All Servers)

  - sudo vim /etc/profile

    > SPARK_HOME="/platform/package/spark"

    > export SPARK_HOME

  - source /etc/profile

  - echo $SPARK_HOME


□ Cluster Launch Scripts (Master Server)

  - Use the launch scripts on the master server to start and stop the Spark Standalone Cluster.

  - Create the list of slave nodes (conf/slaves); an example file is shown below.

    > cd /platform/package/spark/conf

    > cp slaves.template slaves 
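
  - For reference, a conf/slaves file matching the worker hosts used later in this post would list one hostname per line (adjust to your own environment):

    > migaloo02
    > migaloo03
    > migaloo04
    > migaloo05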



  - Start and stop the cluster with sbin/start-all.sh and sbin/stop-all.sh.

    > $SPARK_HOME/sbin/start-all.sh

    > $SPARK_HOME/sbin/stop-all.sh    

  - To launch a Spark standalone cluster with the launch scripts, 

    you should create a file called conf/slaves in your Spark directory, 

    which must contain the hostnames of all the machines where you intend to start Spark workers, one per line. 

    If conf/slaves does not exist, the launch scripts defaults to a single machine (localhost), which is useful for testing. 

  - Once you’ve set up this file, you can launch or stop your cluster with the following shell scripts, 

    based on Hadoop’s deploy scripts, and available in SPARK_HOME/sbin:


sbin/start-master.sh - Starts a master instance on the machine the script is executed on.

sbin/start-slaves.sh - Starts a slave instance on each machine specified in the conf/slaves file.

sbin/start-slave.sh - Starts a slave instance on the machine the script is executed on.

sbin/start-all.sh - Starts both a master and a number of slaves as described above.

sbin/stop-master.sh - Stops the master that was started via the sbin/start-master.sh script.

sbin/stop-slaves.sh - Stops all slave instances on the machines specified in the conf/slaves file.

sbin/stop-all.sh - Stops both the master and the slaves as described above.

   

      Note that these scripts must be executed on the machine you want to run the Spark master on, not your local machine.


  - You can optionally configure the cluster further by setting environment variables in conf/spark-env.sh. 

     Create this file by starting with the conf/spark-env.sh.template, 

     and copy it to all your worker machines for the settings to take effect. The full list of available settings is in the Spark documentation linked above.
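
  - As an illustrative sketch only, create conf/spark-env.sh from the template and add something like the following (the hostname matches this post; the core/memory values are assumptions for small nodes such as the 1 GB Raspberry Pis mentioned later, not tested recommendations):

    > cp spark-env.sh.template spark-env.sh

    > SPARK_MASTER_HOST=migaloo01

    > SPARK_WORKER_CORES=2

    > SPARK_WORKER_MEMORY=512m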



     

□ Verify the Spark Cluster Is Running

  - an interactive Spark shell 

    $SPARK_HOME/bin/spark-shell --master spark://migaloo01:7077

  - Master WEB UI 

    http://migaloo01:8080 , http://192.168.10.101:8080 

  - Worker WEB UI 

    http://migaloo02:8081 , http://192.168.10.102:8081 

    http://migaloo03:8081 , http://192.168.10.103:8081 

    http://migaloo04:8081 , http://192.168.10.104:8081 

    http://migaloo05:8081 , http://192.168.10.105:8081        
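
  - As an additional check (assuming the JDK's jps tool is on the PATH), list the Java processes on each node; the master should show a Master process and each worker a Worker process:

    > jps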




    

□ Launching Spark Applications on YARN

  - Spark shell

    > $SPARK_HOME/bin/spark-shell                                       # works

    > $SPARK_HOME/bin/spark-shell --master spark://migaloo01:7077       # dies while running

    > $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client    # does not run

    > $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode cluster   # not supported

  - Spark application

    > hdfs dfs -ls /

    > hdfs dfs -mkdir /platform    

    > hdfs dfs -mkdir /platform/temp/

    > hdfs dfs -put /platform/temp/UserPurchaseHistory.csv /platform/temp

    > hdfs dfs -ls /platform/temp/

    # Not shown in the Hadoop Resource Manager, but the job is actually processed

    > $SPARK_HOME/bin/spark-submit --class org.migaloo.spark.test.ScalaApp --master spark://migaloo01:7077 /platform/temp/org.migaloo.spark.test-0.0.1-SNAPSHOT.jar

    > $SPARK_HOME/bin/spark-submit --class org.migaloo.spark.test.ScalaApp /platform/temp/org.migaloo.spark.test-0.0.1-SNAPSHOT.jar

    > $SPARK_HOME/bin/spark-submit --class org.migaloo.spark.test.ScalaApp --master yarn --deploy-mode client /platform/temp/org.migaloo.spark.test-0.0.1-SNAPSHOT.jar



    # Shows as ACCEPTED in the Hadoop Resource Manager, but is not actually processed

    > $SPARK_HOME/bin/spark-submit --class org.migaloo.spark.test.ScalaApp --master yarn --deploy-mode cluster --supervise  /platform/temp/org.migaloo.spark.test-0.0.1-SNAPSHOT.jar

    > $SPARK_HOME/bin/spark-submit --class org.migaloo.spark.test.ScalaApp --master yarn --deploy-mode cluster /platform/temp/org.migaloo.spark.test-0.0.1-SNAPSHOT.jar




□ Handling Memory Errors

  - Running the Spark shell on a Raspberry Pi with 1 GB of memory may fail with a memory error.

  - It occurs when there is not enough memory left to allocate.

  - Allowing memory overcommit resolves it.

  - sudo sysctl vm.overcommit_memory=1

  - Run it from a regular shell prompt. (A persistent variant is sketched below; the log at the end of this section shows the original error.)
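
  - To make the setting survive a reboot (an optional extra step, not part of the original instructions), it can also be appended to /etc/sysctl.conf:

    > echo "vm.overcommit_memory=1" | sudo tee -a /etc/sysctl.conf

    > sudo sysctl -p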


migaloo@migaloo01:/platform/package/spark$ ./bin/spark-shell --master spark://migaloo01:7077

Setting default log level to "WARN".

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

Java HotSpot(TM) Client VM warning: You have loaded library /platform/package/hadoop-2.6.5/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.

It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.

17/01/22 10:52:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

17/01/22 10:52:45 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/platform/package/spark/jars/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/platform/package/spark-2.1.0-bin-hadoop2.6/jars/datanucleus-core-3.2.10.jar."

17/01/22 10:52:45 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/platform/package/spark-2.1.0-bin-hadoop2.6/jars/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/platform/package/spark/jars/datanucleus-rdbms-3.2.9.jar."

17/01/22 10:52:45 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/platform/package/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/platform/package/spark-2.1.0-bin-hadoop2.6/jars/datanucleus-api-jdo-3.2.6.jar."

Java HotSpot(TM) Client VM warning: INFO: os::commit_memory(0x555c2000, 129228800, 0) failed; error='메모리를 할당할 수 없습니다' (errno=12)

#

# There is insufficient memory for the Java Runtime Environment to continue.

# Native memory allocation (mmap) failed to map 129228800 bytes for committing reserved memory.

# An error report file with more information is saved as:

# /platform/package/spark-2.1.0-bin-hadoop2.6/hs_err_pid2559.log






Spark Standalone: Differences between client and cluster deploy modes


# Source

http://stackoverflow.com/questions/37027732/spark-standalone-differences-between-client-and-cluster-deploy-modes

http://spark.apache.org/docs/latest/submitting-applications.html



Client:


Driver runs on a dedicated server (Master node) inside a dedicated process. This means it has all available resources at its disposal to execute work.

Driver opens up a dedicated Netty HTTP server and distributes the JAR files specified to all Worker nodes (big advantage).

Because the Master node has dedicated resources of its own, you don't need to "spend" worker resources for the Driver program.

If the driver process dies, you need an external monitoring system to restart it.
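
For comparison with the cluster-mode example at the end of this post, a client-mode submission of the same SparkPi example would look roughly like this (a sketch using the same placeholder master URL and jar path as the Spark docs):

# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode client \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000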


Cluster:


Driver runs on one of the cluster's Worker nodes. The worker is chosen by the Master leader.

Driver runs as a dedicated, standalone process inside the Worker.

The driver program takes up at least 1 core and a dedicated amount of memory from one of the workers (this can be configured).

The driver program can be monitored from the Master node using the --supervise flag and restarted if it dies.



# Run on a Spark standalone cluster in cluster deploy mode with supervise
# (to target YARN use --master yarn; for client mode use --deploy-mode client)
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000


