Friday, June 19, 2020

PySpark Setup in IntelliJ IDEA

Introduction


This article demonstrates how to set up PySpark in IntelliJ IDEA.

Download and Set Up IntelliJ IDEA for PySpark


Step 1: Download IntelliJ IDEA

You can download and install IntelliJ IDEA from the location below.
URL: https://www.jetbrains.com/idea/download/#section=windows


Step 2: Prerequisites for PySpark

Make sure the following software is installed and configured properly:
1. Java
2. Python
3. Spark
4. Hadoop home for winutils.exe
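The prerequisites can be checked quickly from Python before opening the IDE. The sketch below is only an illustration: it assumes the installations are exposed through the JAVA_HOME, SPARK_HOME and HADOOP_HOME environment variables and that winutils.exe sits in the bin folder under HADOOP_HOME, which is the usual layout on Windows. The function name check_prerequisites is made up for this example.

```python
import os
import sys


def check_prerequisites(env=None):
    """Report which prerequisites can be located.

    Assumption: each installation is exposed through an environment
    variable (JAVA_HOME, SPARK_HOME, HADOOP_HOME). A missing or invalid
    entry is reported as None.
    """
    env = os.environ if env is None else env
    # Python is trivially present because this script is running on it.
    report = {"Python": sys.version.split()[0]}

    for var in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME"):
        path = env.get(var)
        report[var] = path if path and os.path.isdir(path) else None

    # winutils.exe is expected under <HADOOP_HOME>/bin on Windows.
    hadoop_home = report["HADOOP_HOME"]
    winutils = os.path.join(hadoop_home, "bin", "winutils.exe") if hadoop_home else None
    report["winutils.exe"] = winutils if winutils and os.path.isfile(winutils) else None
    return report


if __name__ == "__main__":
    for name, value in check_prerequisites().items():
        print(f"{name:14} {value or 'NOT FOUND'}")
```

Any entry printed as NOT FOUND points to a variable that still needs to be set, either system-wide or in the IntelliJ IDEA run configuration.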



Step 3: Install the Python plugin in IntelliJ IDEA




Step 4: Create a new Python project in IntelliJ IDEA






Step 5: Set up the required PySpark files in Project Structure
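The article does not list the exact files, so the following is an assumption: a common arrangement is to make the python folder under SPARK_HOME and the py4j source zip under python/lib visible to the project. As a rough script-level equivalent of that Project Structure setting, the sketch below adds the same locations to sys.path. The helper names pyspark_paths and add_pyspark_to_path are hypothetical.

```python
import glob
import os
import sys


def pyspark_paths(spark_home):
    """Return the folders/zips that usually need to be importable.

    Assumption: a standard Spark distribution layout, i.e.
    <spark_home>/python and <spark_home>/python/lib/py4j-*-src.zip.
    """
    python_dir = os.path.join(spark_home, "python")
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def add_pyspark_to_path(spark_home):
    """Prepend the PySpark locations to sys.path if not already present."""
    for path in pyspark_paths(spark_home):
        if path not in sys.path:
            sys.path.insert(0, path)


if __name__ == "__main__":
    spark_home = os.environ.get("SPARK_HOME")
    if spark_home:
        add_pyspark_to_path(spark_home)
        print("\n".join(pyspark_paths(spark_home)))
    else:
        print("SPARK_HOME is not set")
```

Configuring the same locations in Project Structure is preferable for day-to-day work, because the IDE then also resolves pyspark imports for code completion.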




Step 6: Create a sample PySpark program

from pyspark.sql import SparkSession

appName = "AddColumnUsingUDF"
# Create (or reuse) the Spark session
spark = SparkSession.builder.appName(appName).getOrCreate()

filePath = 'C:/tools/data/sampleEmpSalaryData.csv'
# Read the CSV file, treating the first line as the header
sampleSalaryDF = spark.read.format('csv').options(header='true').load(filePath)

# The DataFrame has many columns, so select only a few for a more readable display.
# The column names are passed as a Python list here; you may also pass them directly to select().
colList = ['Emp ID', 'First Name', 'Last Name', 'Date of Birth', 'Salary', 'Last % Hike']
sampleSalaryDF = sampleSalaryDF.select(*colList)
sampleSalaryDF.show(5)
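If the sample data file is not at hand, a small stand-in CSV with the same six column names is enough to try the program. The sketch below uses only the standard library. The rows are made-up placeholders and not the eforexcel.com data, and the helper name write_placeholder_csv is hypothetical.

```python
import csv
import os

# Column names taken from colList in the program above.
COLUMNS = ['Emp ID', 'First Name', 'Last Name', 'Date of Birth', 'Salary', 'Last % Hike']

# Made-up placeholder rows, NOT the real sample data.
PLACEHOLDER_ROWS = [
    ['1', 'First1', 'Last1', '1/1/1990', '50000', '5%'],
    ['2', 'First2', 'Last2', '2/2/1991', '60000', '6%'],
    ['3', 'First3', 'Last3', '3/3/1992', '70000', '7%'],
]


def write_placeholder_csv(path):
    """Write a header line plus the placeholder rows; return the row count."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.writer(handle)
        writer.writerow(COLUMNS)
        writer.writerows(PLACEHOLDER_ROWS)
    return len(PLACEHOLDER_ROWS)
```

Calling write_placeholder_csv('C:/tools/data/sampleEmpSalaryData.csv') once would create the file at the path used by filePath. The real file has more columns, but the select step only needs these six.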



 

Step 7: Run the PySpark program

The program should run without any issues if you configured the environment properties and PySpark files as specified above.




Step 8: Results window

If the program runs without any issues, you will see results like those below.




This simple program reads a CSV file and was tested in IntelliJ IDEA.

Sample data: courtesy of eforexcel.com


Copyright - There is no copyright on the code. You can copy, change and distribute it freely. Just mentioning this site should be fair.
(C) November 2020, manivelcode