Setting up

Here is what we'll do:

Set up Python (install Miniconda, the minimal Anaconda installer, and the necessary packages)
Set up Spark2
Configure Spark, findspark and pySpark

Note: These instructions have been tested only on Ubuntu and OS X. Please raise an issue if you hit any snags on your Windows system.

Set up Python

Install Anaconda

From the Anaconda docs:

Conda is an open source package management system and environment management system for installing multiple versions of software packages and their dependencies and switching easily between them. It works on Linux, OS X and Windows, and was created for Python programs but can package and distribute any software.

Overview

Using Anaconda consists of the following:

Install miniconda on your computer
Create a new conda environment
Each time you wish to work, activate your conda environment

Installation

Download the version of miniconda that matches your system. Make sure you download the version for Python 3.x (3.6 is the latest at the time of writing).

	Linux	Mac	Windows
64-bit	64-bit (bash installer)	64-bit (bash installer)	64-bit (exe installer)
32-bit	32-bit (bash installer)		32-bit (exe installer)

Install miniconda on your machine. Detailed instructions:

Linux: http://conda.pydata.org/docs/install/quick.html#linux-miniconda-install
Mac: http://conda.pydata.org/docs/install/quick.html#os-x-miniconda-install
Windows: http://conda.pydata.org/docs/install/quick.html#windows-miniconda-install

Install necessary packages

Set up the bdap environment.

git clone https://github.com/soumendra/learn-spark-python.git
cd learn-spark-python

To clone the repo with the commands above, you should already have a GitHub account and have git installed on your system. If not, you can download the repository here, extract the folder, cd into it, and then continue.

If you are on Windows, rename meta_windows_patch.yml to meta.yml.

Create the bdap environment. Running this command creates a new conda environment provisioned with all the libraries listed in bdap.yml.

conda env create -f bdap.yml

Verify that the bdap environment was created in your environments:

conda info --envs
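If you would rather check for the environment from a script than read the listing by eye, something like the following may help. It is only a sketch: it assumes `conda` is on your PATH and relies on the JSON form of the listing (`conda env list --json`), and the helper names `env_names` and `has_env` are made up for this example.

```python
import json
import os
import shutil
import subprocess


def env_names(conda_json_text):
    """Return environment names parsed from `conda env list --json` output."""
    env_paths = json.loads(conda_json_text).get("envs", [])
    # Each entry is a filesystem path; its last component is the environment
    # name (for the base install it is the install directory name).
    return [os.path.basename(os.path.normpath(path)) for path in env_paths]


def has_env(conda_json_text, name="bdap"):
    """True if an environment called `name` appears in the listing."""
    return name in env_names(conda_json_text)


if __name__ == "__main__":
    conda = shutil.which("conda")
    if conda is None:
        print("conda was not found on PATH")
    else:
        result = subprocess.run(
            [conda, "env", "list", "--json"], capture_output=True, text=True
        )
        if result.returncode != 0:
            print("conda env list failed:", result.stderr.strip())
        else:
            print("bdap present:", has_env(result.stdout))
```

The parsing is kept separate from the `conda` call so that it can be tried on a saved listing as well.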

Clean up downloaded libraries (remove tarballs, zip files, etc.):

conda clean -tp

Uninstalling

To uninstall the environment:

conda env remove -n bdap

Using Anaconda and Jupyter Notebooks

Now that you have created the environment, you need to activate it before you can use it. This must be done each time you begin a new working session, i.e., each time you open a new terminal window.

Activate the bdap environment:

OS X and Linux

$ source activate bdap

Windows

Depending on your shell, run either:

$ source activate bdap

or

$ activate bdap

That's it. Now all of the bdap libraries are available to you. You can start a Jupyter Notebook with:

jupyter notebook
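Once the notebook is open, a cell like the one below can confirm that the kernel is really running inside bdap. This is a minimal sketch: it assumes the notebook server was started from the activated terminal, so that the `CONDA_DEFAULT_ENV` variable conda sets on activation is inherited. The helper name `active_conda_env` is invented for this example.

```python
import os
import sys


def active_conda_env(env=None):
    """Name of the active conda environment, or None if none is active."""
    env = os.environ if env is None else env
    # conda exports CONDA_DEFAULT_ENV when an environment is activated.
    return env.get("CONDA_DEFAULT_ENV") or None


print("active conda env:", active_conda_env())
# sys.prefix shows which Python installation the kernel is using.
print("interpreter prefix:", sys.prefix)
```

If the first line does not print bdap, the notebook was probably started from a terminal where the environment had not been activated.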

To exit the environment when you have completed your work session, simply close the terminal window.

Jupyter Notebook Basics

Get started with IPython (ignore the sections on installation)
You should understand the following concepts:
- Managing Cells
  - Cell Types
  - Moving Cells around
  - Executing and Creating New Cells
- Handling Notebooks
  - Creating new notebooks (with different kernels)
  - Exporting notebooks to various formats
- IPython Kernels
- Keyboard Shortcuts

IPython Homepage: https://ipython.org/

Just enough Markdown for Jupyter Notebooks

What is markdown?
GitHub Flavored Markdown (used by IPython)

Here are some examples of markdown in action (edit this cell to see the raw text)

Styling Text
- italicised
- bold
- italicised and bold
- styles mixed together
Unordered lists
- Nested Unordered Lists
1. Ordered Lists
2. Nested Lists
3. More Nesting
   * More nested lists
Writing Code Blocks

x = 0
x = 2 + 2
hist(rnorm(1000))

Set up Spark

Install Java

Check whether the Java SDK (JDK) is installed on your system. If it is not, follow the instructions below (Ubuntu only; Mac users can run brew install java, and Windows users can download and run the installer).

sudo apt-get update
sudo apt-get -y upgrade

sudo apt-get install -y software-properties-common curl
sudo apt-get install default-jre default-jdk
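To check from Python whether a `java` executable is reachable (before or after the install), a small sketch like this could be used. The function name `java_version_banner` is made up for this example; the only assumption is that `java -version` prints its banner to stderr, which is how the standard launcher behaves.

```python
import shutil
import subprocess


def java_version_banner(java_cmd="java"):
    """Return the first line printed by `java -version`, or None if java is not on PATH."""
    path = shutil.which(java_cmd)
    if path is None:
        return None
    result = subprocess.run([path, "-version"], capture_output=True, text=True)
    # The version banner normally goes to stderr, so look there first.
    banner = (result.stderr or result.stdout).strip()
    return banner.splitlines()[0] if banner else ""


banner = java_version_banner()
print("java not found on PATH" if banner is None else banner)
```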

Java Alternative (optional)

Please note that Java was already installed in the previous section, so you don't need to go through this section to install Java.
This is for adventurous students who wish to install the JDK/JRE from Oracle (if you don't know what I am talking about, you don't need to do it).

sudo apt-get install python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update

sudo apt-get install oracle-java8-installer

If the Oracle version is the only one you have installed, then there is only one Java installation on your system and nothing to choose. Otherwise, if you have multiple Java installations on your system, you can select one of them with the following:

sudo update-alternatives --config java
sudo update-alternatives --config javac

And here is how you can set up your JAVA_HOME:

Note the install path reported by the command above (sudo update-alternatives --config java).
Add JAVA_HOME="YOUR_PATH" to /etc/environment, then reload the file and check the value:

source /etc/environment
echo $JAVA_HOME
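A quick way to sanity-check the result from Python is sketched below. It assumes that a usable JAVA_HOME is a directory containing `bin/java` (or `bin/java.exe` on Windows); the function name `java_home_problem` is hypothetical and not part of any library.

```python
import os


def java_home_problem(env=None):
    """Describe what is wrong with JAVA_HOME, or return None if it looks fine."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return "JAVA_HOME is not set"
    if not os.path.isdir(java_home):
        return f"JAVA_HOME points to a missing directory: {java_home}"
    # Assumption: a valid JAVA_HOME has the launcher under its bin directory.
    bin_dir = os.path.join(java_home, "bin")
    candidates = [os.path.join(bin_dir, name) for name in ("java", "java.exe")]
    if not any(os.path.isfile(candidate) for candidate in candidates):
        return f"no java executable under {bin_dir}"
    return None


problem = java_home_problem()
print("JAVA_HOME looks fine" if problem is None else problem)
```

Run it from a new terminal, since a shell opened before the change will not see the new value.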

Configuring Spark, findspark and pySpark

source activate bdap

After executing that in the terminal, follow the instructions in these slides.
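The slides are not reproduced here. As a rough idea of what a working setup should allow, the sketch below uses findspark to locate Spark and then starts a local session. It assumes findspark is installed in bdap and that Spark can be found either through the SPARK_HOME variable or through findspark's own search; the names `resolve_spark_home` and `run_smoke_test` are invented for this example.

```python
import os


def resolve_spark_home(env=None):
    """Return SPARK_HOME from the environment, or None if it is unset or empty."""
    env = os.environ if env is None else env
    return env.get("SPARK_HOME") or None


def run_smoke_test(app_name="bdap-smoke-test"):
    """Start a local Spark session, count a tiny range, and stop the session."""
    # Imported lazily so that defining the function does not require Spark.
    import findspark

    # Passing None lets findspark search for a Spark installation itself.
    findspark.init(spark_home=resolve_spark_home())

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName(app_name).getOrCreate()
    try:
        # spark.range(5) holds the rows 0..4, so the count is 5.
        return spark.range(5).count()
    finally:
        spark.stop()
```

In a notebook started from the bdap environment, calling `run_smoke_test()` should return 5 once Spark is configured. An error raised by findspark typically means that it could not locate a Spark installation.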

Git and GitHub (optional)

Sign up for an account at https://github.com/ (remember your GitHub username and password)
Set up GitHub (configure your local git installation for GitHub)

git config --global user.name "YourName"
git config --global user.email "YourGitHubEmailId"

Learn the basics of git
GitHub Workflow using web-ui
(Optional) Forking Repos and GitHub Fork Workflow
(Optional) Set up SSH authentication
