########################################################################################################################
##### Spark deploy instructions - First step of A2 #####
# These are instructions for deploying Apache Spark.
# They include some hacks relevant to the SNIC cloud that we wouldn't use in a production system.
# For the lab, a Spark cluster has already been deployed -- this will be an experiment!
# 1. Check that the Spark and HDFS clusters are operating by opening these links in your browser (on campus):
#      http://130.238.29.245:8080
#      http://130.238.29.245:50070
# 2. Create a virtual machine (or use an existing one); this instance will be used to connect to the Spark cluster as a client.
#    Your virtual machine should use the 'ssc.small' flavor.
#    Use Ubuntu 18.04 LTS as the source image.
# 3. Add it to the 'spark-cluster-client' security group for it to work correctly with Spark.
#    (The machines in the Spark cluster need to be able to connect to your VM.)
# 4. Add a floating IP to the VM.
# 5. Configure the ~/.ssh/config on your local laptop/desktop/lab computer as shown below.
#    This is for the university lab machines, other Unix-like systems, and WSL (Windows Subsystem for Linux); you may have
#    to modify the instructions if you are using some other system.
#    Replace 130.238.x.y and ~/.ssh/id_rsa with your floating IP and key path as appropriate.
Host 130.238.x.y
    User ubuntu
    IdentityFile ~/.ssh/id_rsa
    LocalForward 8888 localhost:8888
    LocalForward 4040 localhost:4040
    LocalForward 4041 localhost:4041
    LocalForward 4042 localhost:4042
    LocalForward 4043 localhost:4043
# This will open SSH tunnels (for the given ports) between your local computer and the VM. This will enable you to access e.g. a Jupyter
# notebook via your browser at localhost:8888. Note: the first port listed is the port on your local machine; localhost:XXXX refers to the remote machine.
# Note: having set up the config file, connect with SSH like this (without any additional parameters):
# ssh 130.238.x.y
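# If you prefer not to edit ~/.ssh/config, roughly the same tunnels can be opened with a one-off command
# (a sketch only -- substitute your own floating IP and key path):
#   ssh -i ~/.ssh/id_rsa -L 8888:localhost:8888 -L 4040:localhost:4040 -L 4041:localhost:4041 -L 4042:localhost:4042 -L 4043:localhost:4043 ubuntu@130.238.x.y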
#####################
### IIRC these instructions are for Ubuntu 16.04
# Hack to fix an issue with the Ubuntu package mirror
sudo sed -ie 's/nova.clouds.archive.ubuntu.com/se.archive.ubuntu.com/' /etc/apt/sources.list
## For this example, we'll install the Spark worker and master on the same virtual machine. Normally we'd put the master on its own machine.
# update apt repo metadata
sudo apt update
# install java
sudo apt-get install -y openjdk-8-jdk
# Manually define a hostname for all the hosts in the ldsa project. This will make networking easier with Spark:
# NOTE! If you have added entries to /etc/hosts yourself, you need to remove those.
for i in {1..255}; do echo "192.168.1.$i host-192-168-1-$i-ldsa" | sudo tee -a /etc/hosts; done
for i in {1..255}; do echo "192.168.2.$i host-192-168-2-$i-ldsa" | sudo tee -a /etc/hosts; done
# set the hostname according to the scheme above:
sudo hostname host-$(hostname -I | awk '{print $1}' | sed 's/\./-/g')-ldsa ; hostname
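# Quick sanity check (a sketch, assuming the /etc/hosts entries above were added): the new hostname should resolve to the VM's own IP.
getent hosts "$(hostname)"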
# ... the next steps depend on what we want this node to be: a master, a worker, or a machine running a Python notebook...
# For the lab, we'll just create a Python notebook, and use the existing Spark cluster.
########################################################################################################################
##### Install HDFS (e.g. to connect to cluster) -- !!!SKIP THIS FOR THE LAB SESSION!!! (Proceed to 'Install the Python Notebook...') #####
# This section describes how to install Hadoop to allow you to add your own files to the Hadoop Cluster.
# We assume we've already installed Java above (OpenJDK).
# HDFS needs to know where Java is... we use an environment variable for this.
# Let's put it in the .bashrc file so we don't need to set it each time:
echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre" >> ~/.bashrc
# Source the file so it's available now:
source ~/.bashrc
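# Optional check (a sketch): JAVA_HOME should now point at a directory containing bin/java.
echo "$JAVA_HOME"
"$JAVA_HOME/bin/java" -version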
# Download hadoop:
cd ~
wget http://apache.mirrors.spacedump.net/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
tar -xvf hadoop-2.7.7.tar.gz
# Now we can use HDFS to copy files in and out of the storage cluster for use with Spark:
# The namenode host can be set in a config file, or as here, directly on the command line.
~/hadoop-2.7.7/bin/hdfs dfs -Dfs.defaultFS="hdfs://192.168.1.153:9000" -ls /
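# For example, to upload and fetch a file (a sketch only -- '/your_team_folder' and 'mydata.csv' are placeholder names,
# use your own team's folder and file):
#   ~/hadoop-2.7.7/bin/hdfs dfs -Dfs.defaultFS="hdfs://192.168.1.153:9000" -put ~/mydata.csv /your_team_folder/
#   ~/hadoop-2.7.7/bin/hdfs dfs -Dfs.defaultFS="hdfs://192.168.1.153:9000" -ls /your_team_folder/
#   ~/hadoop-2.7.7/bin/hdfs dfs -Dfs.defaultFS="hdfs://192.168.1.153:9000" -get /your_team_folder/mydata.csv ~/copy-of-mydata.csv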
# Notes:
# 1. Please don't attempt to connect your namenodes/datanodes to the existing cluster!
# 2. This is a shared resource, use it sensibly. Please keep an eye on disk space.
# 3. There is a folder for your team. Don't interfere with data from the other teams.
########################################################################################################################
##### Start Spark Master/Worker -- !!!SKIP THIS FOR THE LAB SESSION!!! (Proceed to 'Install the Python Notebook...') #####
cd ~
# Download Spark (the build for Hadoop 2.7):
wget https://archive.apache.org/dist/spark/spark-2.4.2/spark-2.4.2-bin-hadoop2.7.tgz
tar -zxvf spark-2.4.2-bin-hadoop2.7.tgz
# Set SPARK_HOME so the Spark scripts and libraries can be found:
echo "export SPARK_HOME=~/spark-2.4.2-bin-hadoop2.7" >> ~/.bashrc
source ~/.bashrc
cd ~/spark-2.4.2-bin-hadoop2.7/
# Let's have a look at some of the Spark directories:
ls -l
# start the master on the current machine:
~/spark-2.4.2-bin-hadoop2.7/sbin/start-master.sh
# -or-
# start the worker on the current machine (and tell it where the master is listening -- the IP address of our master node from above):
~/spark-2.4.2-bin-hadoop2.7/sbin/start-slave.sh spark://192.168.X.Y:7077
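# Check which ports are now listening (e.g. 7077 for the master and 8080 for its web UI):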
netstat -tna
# Is it running? jps should list a Master and/or Worker Java process:
jps
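# If something looks wrong, the standalone daemons write logs under the Spark directory (a sketch, assuming the default log location):
#   tail -n 50 ~/spark-2.4.2-bin-hadoop2.7/logs/*.out
# You can also query the master's web UI from the VM itself:
#   curl -s http://localhost:8080 | grep -o "<title>[^<]*</title>"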
# After this, you will need to set up security groups for the firewall configuration. Have a look at how it's done for the shared cluster.
########################################################################################################################
##### Install the Python Notebook -- RESUME HERE FOR THE LAB #####
# Environment variable so the workers know which Python to use -- we need to use the same version of Python everywhere:
echo "export PYSPARK_PYTHON=python3" >> ~/.bashrc
source ~/.bashrc
# install git
sudo apt-get install -y git
# install python dependencies, start notebook
# install the Python package manager 'pip' -- here we install it directly from the Ubuntu packages
sudo apt-get install -y python3-pip
# this is a very old version of pip:
python3 -m pip --version
# upgrade it (into your user site-packages)
python3 -m pip install --user --upgrade pip
# install jupyter (installing via pip seems to be broken)
sudo apt install -y jupyter-notebook
# install pyspark (the same version as the cluster) and some other useful dependencies
python3 -m pip install pyspark==2.4.2 --user
python3 -m pip install pandas --user
python3 -m pip install matplotlib --user
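# Quick check (a sketch): the installed pyspark version should match the cluster (2.4.2).
python3 -c "import pyspark; print(pyspark.__version__)"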
# clone the examples from the lectures, so we have a copy to experiment with
git clone https://github.com/benblamey/jupyters-public.git
# start the notebook!
jupyter notebook
# Follow the instructions you see:
#
# Copy/paste this URL into your browser when you connect for the first time,
# to login with a token:
# http://localhost:8888/?token=8af4be03b08713c66d8d093a7d684108c69c86f5b63dd
# Now you can run the examples from the lectures in your own notebook.
# Start with ldsa-2019/Lecture1_Example1_ArraySquareandSum.ipynb
# You'll need to change the host names for the Spark master and the namenode to:
# 192.168.1.153
# When you start your application, you'll see it running in the Spark master web GUI (link at the top).
# If you hover over the link to your application, you'll see the port number for the Web GUI for your application.
# It will be 4040,4041,...
# You can open the GUI in your web browser like this (e.g.):
# http://localhost:4040
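# (Optional, a sketch: because port 4040 is forwarded by the SSH config above, you can also query the application
# UI's REST endpoint from your local machine while your app is running.)
#   curl -s http://localhost:4040/api/v1/applications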
########################################################################################################################
##### Creating your own notebook that deploys spark jobs to the cluster #####
# When working on your own notebooks, save them in your own git repository (the one you created in A1 -- clone it on the VM) and
# make sure to commit and push changes often (for backup purposes).
# You need to share the Spark cluster with the other students:
# 1. Start your application with dynamic allocation enabled, an executor idle timeout of no more than 30 seconds, and a cap on CPU cores:
#from pyspark.sql import SparkSession
#spark_session = SparkSession\
#        .builder\
#        .master("spark://192.168.1.153:7077")\
#        .appName("blameyben_lecture1_simple_example")\
#        .config("spark.dynamicAllocation.enabled", True)\
#        .config("spark.shuffle.service.enabled", True)\
#        .config("spark.dynamicAllocation.executorIdleTimeout", "30s")\
#        .config("spark.executor.cores", 4)\
#        .getOrCreate()
# 2. Put your name in the name of your application.
# 3. Kill your application when you have finished with it (e.g. call spark_session.stop() at the end of your notebook).
# 4. Don't interfere with any of the virtual machines in the cluster. Don't attempt to add your own machine to the cluster.
# 5. Run one app at a time.
# 6. When the lab session is not running, you can use more resources, but keep an eye on whether other people are using the system.