This README contains instructions on how to install PySpark on your local Windows machine and run jobs on it.
You need to have Python on your local machine. If you don't, go to https://www.python.org/downloads/windows/ and install the latest version. After the installation is complete, close the Command Prompt if it was already open, reopen it, and check that you can successfully run
python --version
command. After this, install Jupyter Notebook, pandas, numpy, pyspark and findspark (you can easily find guides on the web) and anything else you need.
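For example, assuming pip is available on your PATH, all of these packages can usually be installed with a single command (adjust the list to what you actually need):
pip install jupyter pandas numpy pyspark findspark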
If you already have the Visual C++ runtime (Msvcr100.dll), skip this step. If you don't, follow this guide https://www.computer-setup.ru/msvcr100-dll-chto-eto-za-oshibka-kak-ispravit?ysclid=l84uyxs1qq310355183 to install Msvcr100.dll.
You also need to register on Kaggle (for example, using your Google account) to download datasets.
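As an optional alternative to downloading datasets through the browser, you can use the official kaggle command-line tool. It is installed with pip and expects an API token (kaggle.json, created on your Kaggle account page) in %USERPROFILE%\.kaggle; the dataset name below is only a placeholder:
pip install kaggle
kaggle datasets download -d some-owner/some-dataset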
You also need to have Microsoft SQL Server (you can find how to download it on the internet).
If you want to have the same setup as me, you can download Spark_Demo_Course from this repository and go directly to the "Last step" section. Otherwise, you need to go through the whole installation guide.
First of all, you need to install a Java JDK with 7 <= version <= 11. I strongly recommend installing Java JDK 8 to avoid version compatibility problems. At the moment you can't download the Java JDK from the Oracle archive from Belarus; try using a VPN or download it from external sources.
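After installing the JDK, open a new Command Prompt and check that the expected version is picked up:
java -version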
Go to the page https://spark.apache.org/downloads.html. Select the latest stable release of Spark. Choose a package type: select a version that is pre-built for the latest version of Hadoop, such as "Pre-built for Hadoop 3.3". If you want the same versions as me, choose Spark 3.1.3 and Hadoop 2.7. After you choose the package type, you will see a "Download Spark" link under it; click it and you will be redirected to the next page, where you need to click a link like the one below.
Go to the page https://hadoop.apache.org/release/2.7.0.html. Select the version of Hadoop that you chose while installing Spark (if you remember, you chose a pre-built version of Hadoop). Choose the download as shown below.
Windows users also need to install winutils.exe to work with Spark. Go to this page https://github.com/steveloughran/winutils. You can find winutils.exe in hadoop-<version>\bin. If you don't find your version there, look for it in other sources.
Now you need to unpack Hadoop, Java and Spark, each into its own folder. After you unpack the packages, put winutils.exe into both hadoop-2.7.0\bin and spark-3.1.3-bin-hadoop2.7\bin.
This is the last step. Press Win+R, type sysdm.cpl and click OK. Go to the Advanced tab and click Environment Variables. You need to create 3 variables in the System variables section (the lower pane); a command-line alternative using setx is sketched after the lists below:
- JAVA_HOME=...\Java\jdk-8;
- HADOOP_HOME=...\hadoop-2.7.0;
- SPARK_HOME=...\spark-3.1.3-bin-hadoop2.7.
Then, still in the System variables section, add the following 3 entries to the Path variable:
- %HADOOP_HOME%\bin;
- %SPARK_HOME%\bin;
- %JAVA_HOME%\bin.
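If you prefer to set the variables from the command line instead of the GUI, the setx commands look roughly like this (the paths below are placeholders, use your actual install locations; note that setx writes user-level variables and only affects newly opened command prompts):
setx JAVA_HOME "C:\Java\jdk-8"
setx HADOOP_HOME "C:\hadoop-2.7.0"
setx SPARK_HOME "C:\spark-3.1.3-bin-hadoop2.7"
You still need to add %HADOOP_HOME%\bin, %SPARK_HOME%\bin and %JAVA_HOME%\bin to Path as described above.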
Now you can open a new Command Prompt and type
pyspark
You should see something like the output below.
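Once the pyspark shell starts, you can also verify the setup from a Jupyter notebook or a plain Python script using findspark. A minimal sketch, assuming pyspark and findspark are installed and the environment variables above are set:
import findspark
findspark.init()  # uses SPARK_HOME to locate the Spark installation

from pyspark.sql import SparkSession

# start a local Spark session and run a tiny job to confirm everything works
spark = SparkSession.builder.appName("install-check").getOrCreate()
df = spark.createDataFrame([(1, "spark"), (2, "works")], ["id", "word"])
df.show()
spark.stop()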