Here is what we'll do:
- Set up Python (install Anaconda, or rather miniconda, and the necessary packages)
- Set up Spark2
- Configure Spark, findspark and pySpark
Note: The instructions have been tested only on Ubuntu and OS X. Please raise an issue if you hit any snags on your Windows system.
From the Anaconda docs:
Conda is an open source package management system and environment management system for installing multiple versions of software packages and their dependencies and switching easily between them. It works on Linux, OS X and Windows, and was created for Python programs but can package and distribute any software.
Using Anaconda consists of the following:
- Install miniconda on your computer
- Create a new conda environment
- Each time you wish to work, activate your conda environment
Download the version of miniconda that matches your system. Make sure you download the version for Python 3.x (3.6 is the latest at the time of writing).
 | Linux | Mac | Windows |
---|---|---|---|
64-bit | 64-bit (bash installer) | 64-bit (bash installer) | 64-bit (exe installer) |
32-bit | 32-bit (bash installer) | | 32-bit (exe installer) |
Install miniconda on your machine. Detailed instructions:
- Linux: http://conda.pydata.org/docs/install/quick.html#linux-miniconda-install
- Mac: http://conda.pydata.org/docs/install/quick.html#os-x-miniconda-install
- Windows: http://conda.pydata.org/docs/install/quick.html#windows-miniconda-install
Set up the bdap environment.
git clone https://github.com/soumendra/learn-spark-python.git
cd learn-spark-python
You should already have a GitHub account, and should have installed git on your system, to be able to clone the repo with the commands above. If not, you can download the repository here, extract the folder, cd into it, and then continue.
If you are on Windows, rename meta_windows_patch.yml to meta.yml.
Create the bdap environment. Running the command below creates a new conda environment provisioned with all the libraries listed in bdap.yml.
conda env create -f bdap.yml
Verify that the bdap environment was created in your environments:
conda info --envs
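The check above can also be scripted. The following is a minimal sketch and not part of the original repository: `env_listed` is a hypothetical helper name, and it assumes that `conda info --envs` prints one environment per line with the environment name in the first column.

```shell
# Hypothetical helper: succeeds if the environment named in $1 appears in the
# first column of `conda info --envs`-style output read from stdin.
env_listed() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit (found ? 0 : 1) }'
}

# Only query conda if it is actually on the PATH.
if command -v conda >/dev/null 2>&1; then
    if conda info --envs | env_listed bdap; then
        echo "bdap environment found"
    else
        echo "bdap environment missing; re-run: conda env create -f bdap.yml"
    fi
fi
```

Matching on the first column only avoids false positives from paths that merely contain the string bdap.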
Clean up downloaded libraries (remove tarballs, zip files, etc.):
conda clean -tp
To uninstall the environment:
conda env remove -n bdap
Now that you have created an environment, in order to use it, you will need to activate the environment. This must be done each time you begin a new working session, i.e., open a new terminal window.
Activate the bdap environment. Depending on your shell, run either (bash on Linux and OS X):
$ source activate bdap
or (Windows command prompt):
$ activate bdap
That's it. Now all of the bdap libraries are available to you. You can start a Jupyter Notebook with:
jupyter notebook
To exit the environment when you have completed your work session, simply close the terminal window.
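To confirm which environment a given terminal is using, you can inspect the `CONDA_DEFAULT_ENV` variable that conda sets on activation. This is a small sketch under that assumption; `active_env` is a hypothetical helper name, not something the repository provides.

```shell
# Print the name of the active conda environment, or "none" if the
# CONDA_DEFAULT_ENV variable is unset or empty.
active_env() {
    printf '%s\n' "${CONDA_DEFAULT_ENV:-none}"
}

if [ "$(active_env)" = "bdap" ]; then
    echo "bdap is active"
else
    echo "bdap is not active (current: $(active_env))"
fi
```

Running this before `jupyter notebook` is a quick way to catch a forgotten activation in a fresh terminal window.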
- Get started with IPython (ignore the sections on installation)
- You should understand the following concepts:
  - Managing Cells
    - Cell Types
    - Moving Cells around
    - Executing and Creating New Cells
  - Handling Notebooks
    - Creating new notebooks (with different kernels)
    - Exporting notebooks to various formats
  - IPython Kernels
  - Keyboard Shortcuts
IPython Homepage: https://ipython.org/
- What is markdown?
- GitHub Flavored Markdown (used by IPython)
Here are some examples of markdown in action (edit this cell to see the raw text)
- Styling Text
- italicised
- bold
- italicised and bold
- styles mixed together
- Unordered lists
- Nested Unordered Lists
- Ordered Lists
- Nested Lists
- More Nesting
- More nested lists
- Writing Code Blocks
x = 0
x = 2 + 2
hist(rnorm(1000))
Check whether the Java SDK is installed on your system. If it is not, follow the instructions below (for Ubuntu only; Mac users can run brew install java, and Windows users can download and run the installer).
sudo apt-get update
sudo apt-get -y upgrade
sudo apt-get install -y software-properties-common curl
sudo apt-get install default-jre default-jdk
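Whether Java is already present, or was installed correctly by the commands above, can be checked from the shell. The sketch below uses a hypothetical helper, `check_tool`, which only reports whether a command is on the PATH; it does not verify the Java version.

```shell
# Hypothetical helper: report whether the command named in $1 is on the PATH.
# Returns 0 if it is found and 1 otherwise.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found"
        return 1
    fi
}

# java is the runtime (JRE); javac is the compiler that ships with the JDK.
check_tool java || true
check_tool javac || true
```

If `java` is found, `java -version` prints the installed version.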
- Please note that Java was installed in the last section, so you don't actually need to go through this section to install Java.
- This is for adventurous students who wish to install the JDK/JRE from Oracle (if you don't know what I am talking about, you don't need to do it).
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
If the Oracle version is the only Java you have installed, there is nothing to choose. If you have multiple Java installations on your system, you can select one of them with the following:
sudo update-alternatives --config java
sudo update-alternatives --config javac
And here is how you can set up your JAVA_HOME:
- Note the install path reported by the command above (sudo update-alternatives --config java).
- Add JAVA_HOME="YOUR_PATH" to /etc/environment, then reload the file and check the value:
source /etc/environment
echo $JAVA_HOME
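The path that `update-alternatives` reports usually points at the java binary itself, while JAVA_HOME should name the installation directory. As a rough sketch, the hypothetical helper `java_home_from` below strips the trailing `/bin/java` and, if present, a trailing `/jre`. The example path is an assumption for illustration; use the one printed on your machine.

```shell
# Hypothetical helper: turn the path of a java binary into a JAVA_HOME value
# by dropping the trailing /bin/java and, if present, a trailing /jre.
java_home_from() {
    printf '%s\n' "$1" | sed -e 's|/bin/java$||' -e 's|/jre$||'
}

# Example path only; substitute the path shown by update-alternatives.
java_home_from /usr/lib/jvm/java-8-oracle/jre/bin/java
# prints /usr/lib/jvm/java-8-oracle
```

The printed value is what would replace YOUR_PATH in /etc/environment.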
source activate bdap
After executing that in the terminal, follow the instructions in these slides.
- Sign up for an account at https://github.com/ (remember your GitHub username and password)
- Set up GitHub (configure your local git installation for GitHub)
git config --global user.name "YourName"
git config --global user.email "YourGitHubEmailId"
- Learn the basics of git
- GitHub Workflow using web-ui
- (Optional) Forking Repos and GitHub Fork Workflow
- (Optional) Set up SSH authentication
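After running the git config commands in the list above, you can confirm that the values were stored. This is a minimal sketch; `git_setting` is a hypothetical helper name, and it reads only the global configuration, not per-repository settings.

```shell
# Hypothetical helper: print one global git setting, or "<unset>" if it has
# no value (git config exits non-zero when the key is missing).
git_setting() {
    git config --global --get "$1" 2>/dev/null || echo "<unset>"
}

if command -v git >/dev/null 2>&1; then
    echo "user.name:  $(git_setting user.name)"
    echo "user.email: $(git_setting user.email)"
else
    echo "git not found; install it before continuing"
fi
```

Both lines should show the name and email you entered; a value of `<unset>` means the corresponding git config command still needs to be run.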