Role of OneHotEncoder and Pipelines in PySpark ML Feature - Part 2

Part 1 - What is StringIndexer?

StringIndexer was already covered in Part 1 of this series (link).

What is OneHotEncoder?

class pyspark.ml.feature.OneHotEncoder(inputCols=None, outputCols=None, handleInvalid='error', dropLast=True, inputCol=None, outputCol=None) - One-hot encoding is a technique for converting a categorical attribute into a binary vector.

A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index.

For example, with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0].

By default the last category is not included (configurable via dropLast), because keeping it would make the vector entries always sum to one and therefore be linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
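The mapping described above can be sketched in plain Python. This is only an illustration of the documented behaviour: the helper name one_hot and its parameters are hypothetical and are not part of PySpark.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index.

    Hypothetical helper for illustration only; it mimics the
    dropLast behaviour described above, not PySpark's implementation.
    """
    index = int(index)
    if not 0 <= index < num_categories:
        raise ValueError(f"index {index} is outside [0, {num_categories})")
    # With drop_last the vector has one slot fewer than there are categories.
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    # The last category has no slot when drop_last is set, so it stays all zeros.
    if index < size:
        vector[index] = 1.0
    return vector


print(one_hot(2, 5))                   # [0.0, 0.0, 1.0, 0.0]
print(one_hot(4, 5))                   # [0.0, 0.0, 0.0, 0.0]
print(one_hot(4, 5, drop_last=False))  # [0.0, 0.0, 0.0, 0.0, 1.0]
```

With drop_last=False every category gets its own slot, which is the form most people picture when they hear "one-hot".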

Let us see an example

Create SparkSession

In [2]:
#import SparkSession
from pyspark.sql import SparkSession

SparkSession is the entry point to Spark for working with RDDs, DataFrames, and Datasets. To create a SparkSession in Python, use the builder attribute and call its getOrCreate() method.

If a SparkSession already exists, getOrCreate() returns it; otherwise it creates a new one.

In [3]:
spark = SparkSession.builder.appName('xvspark').getOrCreate()

Create dataframe by declaring the schema

In [4]:
from pyspark.sql.types import StructType

The StructType class defines the structure of the DataFrame.

In [5]:
#create the structure of schema
schema = StructType().add("id","integer").add("name","string").add("qualification","string").add("age", "integer").add("gender", "string")
In [6]:
#create data
data = [
    (1,'John',"B.A.", 20, "Male"),
    (2,'Martha',"B.Com.", 20, "Female"),
    (3,'Mona',"B.Com.", 21, "Female"),
    (4,'Harish',"B.Sc.", 22, "Male"),
    (5,'Jonny',"B.A.", 22, "Male"),
    (6,'Maria',"B.A.", 23, "Female"),
    (7,'Monalisa',"B.A.", 21, "Female")
]
In [7]:
#create dataframe
df = spark.createDataFrame(data, schema=schema)
In [8]:
#columns of dataframe
df.columns
Out[8]:
['id', 'name', 'qualification', 'age', 'gender']
In [9]:
df.show()
+---+--------+-------------+---+------+
| id|    name|qualification|age|gender|
+---+--------+-------------+---+------+
|  1|    John|         B.A.| 20|  Male|
|  2|  Martha|       B.Com.| 20|Female|
|  3|    Mona|       B.Com.| 21|Female|
|  4|  Harish|        B.Sc.| 22|  Male|
|  5|   Jonny|         B.A.| 22|  Male|
|  6|   Maria|         B.A.| 23|Female|
|  7|Monalisa|         B.A.| 21|Female|
+---+--------+-------------+---+------+

Apply OneHotEncoder to qualification and gender column

We cannot apply OneHotEncoder to string columns directly. We first need to convert the string columns to numeric indices with StringIndexer; after that we can apply OneHotEncoder.

Apply StringIndexer to qualification column

In [10]:
#import required libraries
from pyspark.ml.feature import StringIndexer
In [11]:
qualification_indexer = StringIndexer(inputCol="qualification", outputCol="qualificationIndex")

#fit the indexer on df, then use the fitted model to add the qualificationIndex column
df1 = qualification_indexer.fit(df).transform(df)
df1.show()
+---+--------+-------------+---+------+------------------+
| id|    name|qualification|age|gender|qualificationIndex|
+---+--------+-------------+---+------+------------------+
|  1|    John|         B.A.| 20|  Male|               0.0|
|  2|  Martha|       B.Com.| 20|Female|               1.0|
|  3|    Mona|       B.Com.| 21|Female|               1.0|
|  4|  Harish|        B.Sc.| 22|  Male|               2.0|
|  5|   Jonny|         B.A.| 22|  Male|               0.0|
|  6|   Maria|         B.A.| 23|Female|               0.0|
|  7|Monalisa|         B.A.| 21|Female|               0.0|
+---+--------+-------------+---+------+------------------+

"B.A." gets index 0 because it is the most frequent label, then "B.Com." gets index 1 and "B.Sc." gets index 2.
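The frequency-based ordering can be reproduced with a few lines of plain Python. This is a rough sketch, not StringIndexer itself: the helper frequency_index is hypothetical, and breaking ties alphabetically is an assumption based on the documented Spark 3.0+ behaviour.

```python
from collections import Counter


def frequency_index(labels):
    """Map each label to an index, most frequent label first.

    Hypothetical helper for illustration; it is not part of PySpark.
    """
    counts = Counter(labels)
    # Sort by descending count; ties fall back to alphabetical order (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(position) for position, label in enumerate(ordered)}


qualifications = ["B.A.", "B.Com.", "B.Com.", "B.Sc.", "B.A.", "B.A.", "B.A."]
mapping = frequency_index(qualifications)
print(mapping)  # {'B.A.': 0.0, 'B.Com.': 1.0, 'B.Sc.': 2.0}
```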

Apply StringIndexer to gender column

In [12]:
gender_indexer = StringIndexer(inputCol="gender", outputCol="genderIndex")

#fit the indexer on df, then use the fitted model to add the genderIndex column
df2 = gender_indexer.fit(df).transform(df)
df2.show()
+---+--------+-------------+---+------+-----------+
| id|    name|qualification|age|gender|genderIndex|
+---+--------+-------------+---+------+-----------+
|  1|    John|         B.A.| 20|  Male|        1.0|
|  2|  Martha|       B.Com.| 20|Female|        0.0|
|  3|    Mona|       B.Com.| 21|Female|        0.0|
|  4|  Harish|        B.Sc.| 22|  Male|        1.0|
|  5|   Jonny|         B.A.| 22|  Male|        1.0|
|  6|   Maria|         B.A.| 23|Female|        0.0|
|  7|Monalisa|         B.A.| 21|Female|        0.0|
+---+--------+-------------+---+------+-----------+

Apply OneHotEncoder to qualificationIndex column

In [13]:
from pyspark.ml.feature import OneHotEncoder
In [14]:
#onehotencoder to qualificationIndex
onehotencoder_qualification_vector = OneHotEncoder(inputCol="qualificationIndex", outputCol="qualification_vec")
df11 = onehotencoder_qualification_vector.fit(df1).transform(df1)
In [15]:
df11.show()
+---+--------+-------------+---+------+------------------+-----------------+
| id|    name|qualification|age|gender|qualificationIndex|qualification_vec|
+---+--------+-------------+---+------+------------------+-----------------+
|  1|    John|         B.A.| 20|  Male|               0.0|    (2,[0],[1.0])|
|  2|  Martha|       B.Com.| 20|Female|               1.0|    (2,[1],[1.0])|
|  3|    Mona|       B.Com.| 21|Female|               1.0|    (2,[1],[1.0])|
|  4|  Harish|        B.Sc.| 22|  Male|               2.0|        (2,[],[])|
|  5|   Jonny|         B.A.| 22|  Male|               0.0|    (2,[0],[1.0])|
|  6|   Maria|         B.A.| 23|Female|               0.0|    (2,[0],[1.0])|
|  7|Monalisa|         B.A.| 21|Female|               0.0|    (2,[0],[1.0])|
+---+--------+-------------+---+------+------------------+-----------------+
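The qualification_vec column is printed in sparse notation: (size, [positions of non-zero entries], [their values]). The sketch below expands that notation into an ordinary list so the rows are easier to read. The helper to_dense is hypothetical and written in plain Python for illustration.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for position, value in zip(indices, values):
        dense[position] = value
    return dense


print(to_dense(2, [0], [1.0]))  # B.A.   -> [1.0, 0.0]
print(to_dense(2, [1], [1.0]))  # B.Com. -> [0.0, 1.0]
print(to_dense(2, [], []))      # B.Sc.  -> [0.0, 0.0]
```

"B.Sc." is the last category, so with dropLast=True it has no slot of its own and is represented by the all-zero vector.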

In [16]:
#onehotencoder to genderIndex
onehotencoder_gender_vector = OneHotEncoder(inputCol="genderIndex", outputCol="gender_vec")
df12 = onehotencoder_gender_vector.fit(df2).transform(df2)
In [17]:
df12.show()
+---+--------+-------------+---+------+-----------+-------------+
| id|    name|qualification|age|gender|genderIndex|   gender_vec|
+---+--------+-------------+---+------+-----------+-------------+
|  1|    John|         B.A.| 20|  Male|        1.0|    (1,[],[])|
|  2|  Martha|       B.Com.| 20|Female|        0.0|(1,[0],[1.0])|
|  3|    Mona|       B.Com.| 21|Female|        0.0|(1,[0],[1.0])|
|  4|  Harish|        B.Sc.| 22|  Male|        1.0|    (1,[],[])|
|  5|   Jonny|         B.A.| 22|  Male|        1.0|    (1,[],[])|
|  6|   Maria|         B.A.| 23|Female|        0.0|(1,[0],[1.0])|
|  7|Monalisa|         B.A.| 21|Female|        0.0|(1,[0],[1.0])|
+---+--------+-------------+---+------+-----------+-------------+

One hot encoding of a numeric column

Direct one-hot-encoding without indexing

In [18]:
onehotencoder_age_vector = OneHotEncoder(inputCol="age", outputCol="age_vec")
df13 = onehotencoder_age_vector.fit(df2).transform(df2)
In [19]:
df13.show()
+---+--------+-------------+---+------+-----------+---------------+
| id|    name|qualification|age|gender|genderIndex|        age_vec|
+---+--------+-------------+---+------+-----------+---------------+
|  1|    John|         B.A.| 20|  Male|        1.0|(23,[20],[1.0])|
|  2|  Martha|       B.Com.| 20|Female|        0.0|(23,[20],[1.0])|
|  3|    Mona|       B.Com.| 21|Female|        0.0|(23,[21],[1.0])|
|  4|  Harish|        B.Sc.| 22|  Male|        1.0|(23,[22],[1.0])|
|  5|   Jonny|         B.A.| 22|  Male|        1.0|(23,[22],[1.0])|
|  6|   Maria|         B.A.| 23|Female|        0.0|     (23,[],[])|
|  7|Monalisa|         B.A.| 21|Female|        0.0|(23,[21],[1.0])|
+---+--------+-------------+---+------+-----------+---------------+
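The vector size of 23 follows from how the encoder reads its input: each raw age is taken as a category index, so the categories are assumed to run from 0 up to the largest age. The arithmetic can be checked in plain Python (a sketch of the reasoning, not of PySpark internals):

```python
ages = [20, 20, 21, 22, 22, 23, 21]

# Raw values are read as indices, so categories 0..max(ages) are assumed to exist.
raw_categories = max(ages) + 1
raw_vector_size = raw_categories - 1          # dropLast=True removes one slot

# After indexing, only the distinct ages count as categories.
indexed_categories = len(set(ages))
indexed_vector_size = indexed_categories - 1  # dropLast=True removes one slot

print(raw_vector_size)      # 23
print(indexed_vector_size)  # 3
```

Most of the 23 slots can never be set for this data, which is why indexing the column first gives a much more compact vector.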

One-hot encoding after indexing

In [22]:
age_indexer = StringIndexer(inputCol="age", outputCol="ageIndex")
df13 = age_indexer.fit(df).transform(df)


onehotencoder_age_vector = OneHotEncoder(inputCol="ageIndex", outputCol="age_vec")
df14 = onehotencoder_age_vector.fit(df13).transform(df13)

df14.show()
+---+--------+-------------+---+------+--------+-------------+
| id|    name|qualification|age|gender|ageIndex|      age_vec|
+---+--------+-------------+---+------+--------+-------------+
|  1|    John|         B.A.| 20|  Male|     0.0|(3,[0],[1.0])|
|  2|  Martha|       B.Com.| 20|Female|     0.0|(3,[0],[1.0])|
|  3|    Mona|       B.Com.| 21|Female|     1.0|(3,[1],[1.0])|
|  4|  Harish|        B.Sc.| 22|  Male|     2.0|(3,[2],[1.0])|
|  5|   Jonny|         B.A.| 22|  Male|     2.0|(3,[2],[1.0])|
|  6|   Maria|         B.A.| 23|Female|     3.0|    (3,[],[])|
|  7|Monalisa|         B.A.| 21|Female|     1.0|(3,[1],[1.0])|
+---+--------+-------------+---+------+--------+-------------+

Using Pipeline

In [23]:
#import module
from pyspark.ml import Pipeline

Reload Data

In [24]:
schema = StructType().add("id","integer").add("name","string").add("qualification","string").add("age", "integer").add("gender", "string")
df = spark.createDataFrame(data, schema=schema)

Create Pipeline and pass all stages

In [25]:
qualification_indexer = StringIndexer(inputCol="qualification", outputCol="qualificationIndex")
gender_indexer = StringIndexer(inputCol="gender", outputCol="genderIndex")
age_indexer = StringIndexer(inputCol="age", outputCol="ageIndex")

onehotencoder_qualification_vector = OneHotEncoder(inputCol="qualificationIndex", outputCol="qualification_vec")
onehotencoder_gender_vector = OneHotEncoder(inputCol="genderIndex", outputCol="gender_vec")
onehotencoder_age_vector = OneHotEncoder(inputCol="ageIndex", outputCol="age_vec")


pipeline = Pipeline(stages=[qualification_indexer, 
                            gender_indexer, 
                            age_indexer,
                            onehotencoder_qualification_vector, 
                            onehotencoder_gender_vector,
                            onehotencoder_age_vector
                           ])
In [27]:
df_transformed = pipeline.fit(df).transform(df)
df_transformed.show()
+---+--------+-------------+---+------+------------------+-----------+--------+-----------------+-------------+-------------+
| id|    name|qualification|age|gender|qualificationIndex|genderIndex|ageIndex|qualification_vec|   gender_vec|      age_vec|
+---+--------+-------------+---+------+------------------+-----------+--------+-----------------+-------------+-------------+
|  1|    John|         B.A.| 20|  Male|               0.0|        1.0|     0.0|    (2,[0],[1.0])|    (1,[],[])|(3,[0],[1.0])|
|  2|  Martha|       B.Com.| 20|Female|               1.0|        0.0|     0.0|    (2,[1],[1.0])|(1,[0],[1.0])|(3,[0],[1.0])|
|  3|    Mona|       B.Com.| 21|Female|               1.0|        0.0|     1.0|    (2,[1],[1.0])|(1,[0],[1.0])|(3,[1],[1.0])|
|  4|  Harish|        B.Sc.| 22|  Male|               2.0|        1.0|     2.0|        (2,[],[])|    (1,[],[])|(3,[2],[1.0])|
|  5|   Jonny|         B.A.| 22|  Male|               0.0|        1.0|     2.0|    (2,[0],[1.0])|    (1,[],[])|(3,[2],[1.0])|
|  6|   Maria|         B.A.| 23|Female|               0.0|        0.0|     3.0|    (2,[0],[1.0])|(1,[0],[1.0])|    (3,[],[])|
|  7|Monalisa|         B.A.| 21|Female|               0.0|        0.0|     1.0|    (2,[0],[1.0])|(1,[0],[1.0])|(3,[1],[1.0])|
+---+--------+-------------+---+------+------------------+-----------+--------+-----------------+-------------+-------------+
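Conceptually, fitting a pipeline means fitting each stage in order on the output of the stages before it, which is why the encoders can rely on index columns that do not exist in the original DataFrame. The toy classes below sketch that idea in plain Python. All class names are hypothetical, the toy indexer orders labels alphabetically rather than by frequency, and the real Pipeline.fit returns a separate fitted model object rather than itself.

```python
class ToyIndexer:
    """Toy estimator: learns a label -> index mapping for one column."""

    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows):
        labels = sorted({row[self.input_col] for row in rows})  # alphabetical order
        self.mapping = {label: position for position, label in enumerate(labels)}
        return self

    def transform(self, rows):
        return [{**row, self.output_col: self.mapping[row[self.input_col]]} for row in rows]


class ToyOneHot:
    """Toy estimator: one-hot encodes an index column, dropping the last category."""

    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows):
        # Largest index equals (number of categories - 1), the size after dropping one.
        self.size = max(row[self.input_col] for row in rows)
        return self

    def transform(self, rows):
        encoded = []
        for row in rows:
            vector = [0.0] * self.size
            if row[self.input_col] < self.size:
                vector[row[self.input_col]] = 1.0
            encoded.append({**row, self.output_col: vector})
        return encoded


class ToyPipeline:
    """Runs the stages in order, feeding each stage the previous stage's output."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)  # later stages see earlier outputs
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


rows = [{"gender": "Male"}, {"gender": "Female"}, {"gender": "Female"}]
pipeline = ToyPipeline([ToyIndexer("gender", "genderIndex"),
                        ToyOneHot("genderIndex", "gender_vec")])
result = pipeline.fit(rows).transform(rows)
print([row["gender_vec"] for row in result])  # [[0.0], [1.0], [1.0]]
```

If the stages were listed in the opposite order, ToyOneHot.fit would fail because the genderIndex column would not exist yet; the same ordering constraint applies to the stages list passed to Pipeline.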

You can convert the result to a pandas DataFrame

In [29]:
df_transformed.toPandas()
Out[29]:
   id      name  qualification  age  gender  qualificationIndex  genderIndex  ageIndex  qualification_vec  gender_vec          age_vec
0   1      John           B.A.   20    Male                 0.0          1.0       0.0         (1.0, 0.0)       (0.0)  (1.0, 0.0, 0.0)
1   2    Martha         B.Com.   20  Female                 1.0          0.0       0.0         (0.0, 1.0)       (1.0)  (1.0, 0.0, 0.0)
2   3      Mona         B.Com.   21  Female                 1.0          0.0       1.0         (0.0, 1.0)       (1.0)  (0.0, 1.0, 0.0)
3   4    Harish          B.Sc.   22    Male                 2.0          1.0       2.0         (0.0, 0.0)       (0.0)  (0.0, 0.0, 1.0)
4   5     Jonny           B.A.   22    Male                 0.0          1.0       2.0         (1.0, 0.0)       (0.0)  (0.0, 0.0, 1.0)
5   6     Maria           B.A.   23  Female                 0.0          0.0       3.0         (1.0, 0.0)       (1.0)  (0.0, 0.0, 0.0)
6   7  Monalisa           B.A.   21  Female                 0.0          0.0       1.0         (1.0, 0.0)       (1.0)  (0.0, 1.0, 0.0)