How to Parallelize and Distribute Collection in PySpark

What is PySpark?

PySpark is the Python API for Apache Spark, released by the Apache Spark community to support Python on Spark. Spark is a popular open source framework for fast, distributed data processing and supports several languages, including Scala, Python, Java, and R. Using PySpark, you can work with RDDs directly from Python.

Create SparkContext

class pyspark.SparkContext(master=None, appName=None, sparkHome=None, pyFiles=None, environment=None, batchSize=0, serializer=PickleSerializer(), conf=None, gateway=None, jsc=None, profiler_cls=<class 'pyspark.profiler.BasicProfiler'>):

Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs and broadcast variables on that cluster.
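A SparkContext is commonly configured through a SparkConf object. The snippet below is a minimal sketch; the app name "parallelize-demo" and the local[*] master URL are illustrative values, not required settings. In this notebook a bare SparkContext() is used instead, which picks up the defaults.

from pyspark import SparkConf, SparkContext

# build a configuration; app name and master URL are illustrative values
conf = SparkConf().setAppName("parallelize-demo").setMaster("local[*]")

# create the SparkContext from that configuration
sc = SparkContext(conf=conf)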

class pyspark.sql.SparkSession(sparkContext, jsparkSession=None): The entry point to programming Spark with the Dataset and DataFrame API.

A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files. To create a SparkSession, use the builder pattern shown below.
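A minimal sketch of the builder pattern (the master URL and app name here are illustrative placeholders):

# build (or reuse) a SparkSession via the builder pattern
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("parallelize-demo") \
    .getOrCreate()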

Import Modules

In [1]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

Create entry points to Spark

In [2]:
# create entry points to Spark
try:
    # stop the SparkContext if one is already running
    sc.stop()
except NameError:
    # sc is not defined yet, so there is nothing to stop
    pass
finally:
    # create a SparkContext and wrap it in a SparkSession
    sc = SparkContext()
    spark = SparkSession(sparkContext=sc)

Create an RDD from a list of integers.

RDD (Resilient Distributed Dataset): the dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.

parallelize(c, numSlices=None): Distribute a local Python collection to form an RDD.

collect(): Returns all the elements of the dataset to the driver program as a list.

In [3]:
x = [1, 2, 3, 4]

# create an RDD
# parallelize() is a method of SparkContext used to create an RDD from a local collection
rdd = sc.parallelize(x)

display(
    rdd,
    rdd.collect()
)
ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:262
[1, 2, 3, 4]
In [4]:
print(type(rdd))
<class 'pyspark.rdd.RDD'>

Another parallelize example

In [5]:
import numpy as np
In [6]:
rdd1 = sc.parallelize(np.arange(0, 30, 2))

display(
    rdd1.collect()
)
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

Distribute an RDD across partitions

glom(): Return an RDD created by coalescing all elements within each partition into a list.

In [7]:
# create an RDD with 5 partitions
# np.arange(0, 30, 2) gives the even numbers from 0 up to (but not including) 30
rdd2 = sc.parallelize(np.arange(0, 30, 2), 5)

display(
    rdd2.collect(),
    rdd2.glom().collect(),
)
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
[[0, 2, 4], [6, 8, 10], [12, 14, 16], [18, 20, 22], [24, 26, 28]]

We can see the elements split evenly across five partitions.
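As a quick check (a minimal sketch using rdd2 from above), getNumPartitions() reports the partition count without collecting the data:

# confirm the number of partitions (prints 5 here)
print(rdd2.getNumPartitions())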

More RDD partitioning examples

In [10]:
# create an RDD with 2 partitions
rdd3 = sc.parallelize(np.arange(0, 30, 5), 2)
rdd3.glom().collect()
Out[10]:
[[0, 5, 10], [15, 20, 25]]
In [11]:
rdd4 = sc.parallelize(np.arange(0, 30), 2)
rdd4.glom().collect()
Out[11]:
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
 [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]]
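If numSlices is omitted, parallelize() falls back to the context's default parallelism. The sketch below is illustrative (rdd5 is just a name chosen here); the exact partition count depends on your master configuration, for example the number of local cores:

# default number of partitions used when numSlices is not given
print(sc.defaultParallelism)

# parallelize without specifying the number of slices
rdd5 = sc.parallelize(np.arange(0, 30))
print(rdd5.getNumPartitions())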
