Install Apache Spark with IPython Notebook Configuration on Ubuntu 15
After reading a few useful posts and spending some time debugging, I successfully installed Apache Spark on my Ubuntu 15 machine. I also added PySpark to IPython Notebook for development purposes.
Many thanks to the authors of these articles:
How-to: Use IPython Notebook with Apache Spark, written by Uri Laserson.
Configuring IPython Notebook Support for PySpark, written by John Ramey.
Spark Tutorial (Part I): Setting Up Spark and IPython Notebook within 10 minutes, written by Yi Zhang.
Install Python and IPython Notebook
You need to install Python and pip first (you can find installation documents through Google Search). Then, in a terminal, type the following:
$pip install "ipython[notebook]"
If you want to install everything for IPython, including Notebook, type the following in a terminal:
$pip install "ipython[all]"
You can test your installation by typing the following command in a terminal (make sure you are in the right environment):
$ipython notebook
Your default browser should open a window with the IPython Notebook interface.
Install Spark and PySpark
You can download Spark from here. You can install either the source package or a pre-built package; both can be installed and configured for IPython Notebook.
A pre-built package can be extracted and used directly. If you download the source package, you need to build it after extracting:
$cd your_spark_source_folder
$sbt/sbt assembly
This process may take a while :)
You may need to install sbt first. For Linux users, please refer to this article: Installing sbt on Linux.
You can test your installation by doing the following:
$cd your_spark_folder
$./bin/pyspark
You need a Java JDK installed on your machine and JAVA_HOME set; otherwise PySpark will throw an error such as:
Make sure your JAVA_HOME points to the home folder of your desired JDK. For example, I want to use Oracle JDK 8, so I type
sudo update-alternatives --config java in a terminal to find the preferred JDK path (JDK 8 in my case), then add
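The line to add is an export of JAVA_HOME in your shell profile. Here is a minimal sketch, assuming update-alternatives reported /usr/lib/jvm/java-8-oracle/jre/bin/java as the Oracle JDK 8 binary (your path may differ):

```shell
# Path to the java binary reported by update-alternatives
# (/usr/lib/jvm/java-8-oracle is an example; substitute your own path)
JAVA_BIN=/usr/lib/jvm/java-8-oracle/jre/bin/java

# Strip the trailing jre/bin/java to get the JDK home folder
JAVA_HOME="${JAVA_BIN%/jre/bin/java}"

# Print the export line; append it to your ~/.bashrc, then open a new
# terminal (or run `source ~/.bashrc`) so PySpark can find the JDK
echo "export JAVA_HOME=$JAVA_HOME"
```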
You should see the following in your terminal. Type
sc and you should see output like the following:
Configure PySpark and IPython Notebook
Based on my experience, PySpark has good support for Python 2.7, but not for Python 3. I recommend using Python 2.7 for this step.
Configuring IPython Notebook Support for PySpark, written by John Ramey, gives a very good description of the steps you need to follow. However, remember to do the following:
In 00-pyspark-setup.py, make sure you use your own Spark folder name. Check
your_spark_folder/python/lib for the py4j version.
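For instance, you can find the bundled py4j version by listing that folder (a sketch; your_spark_folder is a placeholder for your actual Spark install path, and the version shown is only an example):

```shell
# The py4j version number appears in the zip file name under python/lib;
# 00-pyspark-setup.py must add that exact zip to sys.path.
# (your_spark_folder is a placeholder for your Spark install path)
SPARK_HOME=your_spark_folder
ls "$SPARK_HOME/python/lib" 2>/dev/null || true
# e.g. py4j-0.8.2.1-src.zip  pyspark.zip
```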
After you run
ipython notebook --profile=pyspark, the Jupyter web interface will open in your default browser, using the port set by
c.NotebookApp.port in your profile configuration file.
In the Jupyter interface, you can upload an existing IPython Notebook to test your configuration by clicking the
Upload button, or create a new IPython Notebook by clicking the
New button and then selecting
Python 2 or
Python 3 in the drop-down menu.
In the IPython Notebook interface, create a new cell and type
sc. When you run this cell, you should see the
SparkContext object, the same as the following:
Now, if your IPython Notebook shows the message
NameError: name 'sc' is not defined, it means your IPython Notebook is not using the PySpark profile. You can try typing
ipython --profile=pyspark in a terminal to make PySpark the default IPython profile, then try
ipython notebook --profile=pyspark again. PySpark should be available now.