Install Apache Spark with IPython Notebook Configuration on Ubuntu 15
After reading a few useful posts and spending some time debugging, I successfully installed Apache Spark on my Ubuntu 15 machine. I also added PySpark support to IPython Notebook for development purposes.
Many thanks to the authors of these articles:
How-to: Use IPython Notebook with Apache Spark, written by Uri Laserson.
Configuring IPython Notebook Support for PySpark, written by John Ramey.
Spark Tutorial (Part I): Setting Up Spark and IPython Notebook within 10 minutes, written by Yi Zhang.
Introduction to IPython Configuration
Install Python and IPython Notebook
You need to install Python and pip first (you can find installation documents through a Google search). I also highly recommend using Virtualenv or Anaconda for Python package control.
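For example, a minimal Virtualenv setup might look like the following (the environment name spark-env is just an example I picked):
$pip install virtualenv
$virtualenv spark-env
$source spark-env/bin/activate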
Then, with your environment active, type the following in the terminal:
$pip install "ipython[notebook]"
If you want to install everything for IPython, including Notebook, type the following in the terminal:
$pip install "ipython[all]"
You can test your installation by typing the following command in the terminal (make sure you are in the right environment):
$ipython notebook
Your default browser should open a window with the IPython Notebook interface.
Install Spark and PySpark
You can download Spark from here. You can choose either the source code package or a pre-built package; both can be installed and configured for IPython Notebook.
A pre-built package can be extracted and used directly. If you download the source package, you need to build it after extracting:
$cd your_spark_source_folder
$sbt/sbt assembly
This process may take a while :)
You may need to install sbt first. For Linux users, please refer to this article: Installing sbt on Linux.
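For reference, one common approach on Ubuntu at the time of writing is to add sbt's Debian repository and install it with apt (the repository URL may change over time, so follow the linked article if this fails):
$echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
$sudo apt-get update
$sudo apt-get install sbt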
You can test your installation by doing the following:
$cd your_spark_folder
$./bin/pyspark
You need a Java JDK installed on your machine, with JAVA_HOME set; otherwise PySpark will throw an error, such as:
[Screenshot: PySpark startup error when JAVA_HOME is not set correctly]
Make sure your JAVA_HOME points to the home folder of your desired JDK. For example, I want to use Oracle JDK 8, so I type
sudo update-alternatives --config java
in the terminal to find the preferred JDK path (JDK 8 in my case), then add
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
to the .bashrc file.
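Remember to reload the file so the change takes effect in your current shell:
$source ~/.bashrc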
You should see the following in your terminal. Type sc and you should see a SparkContext object in the output:
[Screenshot: the PySpark shell running in the terminal, with sc showing a SparkContext]
Configure PySpark and IPython Notebook
Based on my experience, PySpark has good support for Python 2.7, but not for Python 3. I recommend using Python 2.7 for this step.
Configuring IPython Notebook Support for PySpark, written by John Ramey, gives a very good description of the steps you need to follow. However, remember to do the following:
In .bashrc and 00-pyspark-setup.py, make sure you use your own Spark folder name. Check your_spark_folder/python/lib for the py4j version.
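For reference, a minimal 00-pyspark-setup.py might look like the sketch below, following John Ramey's approach. The py4j version (0.8.2.1 here) is an assumption; replace it with whatever you find in your own your_spark_folder/python/lib:
import os
import sys

# Point SPARK_HOME at your own Spark folder (set it in .bashrc)
spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

# Add PySpark and py4j to the Python path (check your py4j version!)
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# Launch the PySpark shell, which creates the sc SparkContext
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))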
After you run ipython notebook --profile=pyspark, the Jupyter web interface will open in your default browser, on the port set by c.NotebookApp.port in your profile's ipython_notebook_config.py file.
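If you want to pin the notebook to a specific port, a minimal sketch of that setting in ~/.ipython/profile_pyspark/ipython_notebook_config.py (the port number 8880 is just an example):
c = get_config()
# Serve the notebook on a fixed port
c.NotebookApp.port = 8880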
In the Jupyter interface, you can upload an existing IPython Notebook to test your configuration by clicking the Upload button, or create a new IPython Notebook by clicking the New button and selecting Python 2 or Python 3 from the drop-down menu.
In the IPython Notebook interface, create a new cell and type sc. When you run this cell, you should see a SparkContext object, like the following:
[Screenshot: sc evaluated in an IPython Notebook cell, showing a SparkContext object]
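To double-check that the context actually works, you can run a tiny job in a new cell. A minimal sketch (the input range is arbitrary; summing 0 through 99 gives 4950):
# Distribute a small list across Spark and sum it
rdd = sc.parallelize(range(100))
rdd.sum()  # should output 4950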
Now, if your IPython Notebook shows the message
NameError: name 'sc' is not defined
it means your IPython Notebook is not using the PySpark profile. You can try typing ipython --profile=pyspark in the terminal to confirm that the PySpark profile loads correctly, then try ipython notebook --profile=pyspark again. PySpark should be available now.
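If the profile itself does not exist yet, create it first with IPython's standard profile command:
$ipython profile create pyspark
This generates ~/.ipython/profile_pyspark/, and the 00-pyspark-setup.py script belongs in its startup/ subfolder.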