Role of StringIndexer and Pipelines in PySpark ML Feature

What is PySpark ML?

PySpark ML provides DataFrame-based machine learning APIs that let users quickly assemble and configure practical machine learning pipelines.

What is StringIndexer?

class pyspark.ml.feature.StringIndexer(inputCol=None, outputCol=None, inputCols=None, outputCols=None, handleInvalid='error', stringOrderType='frequencyDesc') - StringIndexer encodes a string column of labels to a column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels).

By default, labels are ordered by frequency, so the most frequent label gets index 0. The ordering behavior is controlled by the stringOrderType parameter, whose default value is 'frequencyDesc'. In case of equal frequency under frequencyDesc/frequencyAsc, the strings are further sorted alphabetically.

Four ordering options are supported:

1. "frequencyDesc": descending order by label frequency (most frequent label assigned 0; the default)
2. "frequencyAsc": ascending order by label frequency (least frequent label assigned 0)
3. "alphabetDesc": descending alphabetical order
4. "alphabetAsc": ascending alphabetical order

Let us see an example.

Create SparkSession

In [1]:
#import SparkSession
from pyspark.sql import SparkSession

SparkSession is the entry point to Spark for working with RDDs, DataFrames, and Datasets. To create a SparkSession in Python, we use the builder attribute and call the getOrCreate() method.

If a SparkSession already exists, getOrCreate() returns it; otherwise it creates a new one.

In [2]:
spark = SparkSession.builder.appName('xvspark').getOrCreate()

Create a DataFrame by declaring the schema

In [3]:
from pyspark.sql.types import *

We use the StructType class to define the structure of the DataFrame.

In [4]:
#create the structure of schema
schema = (StructType()
    .add("id", "integer")
    .add("name", "string")
    .add("qualification", "string")
    .add("age", "integer")
    .add("gender", "string"))
In [5]:
#create data
data = [
    (1,'John',"B.A.", 20, "Male"),
    (2,'Martha',"B.Com.", 20, "Female"),
    (3,'Mona',"B.Com.", 21, "Female"),
    (4,'Harish',"B.Sc.", 22, "Male"),
    (5,'Jonny',"B.A.", 22, "Male"),
    (6,'Maria',"B.A.", 23, "Female"),
    (7,'Monalisa',"B.A.", 21, "Female")
]
In [6]:
#create dataframe
df = spark.createDataFrame(data, schema=schema)
In [7]:
#columns of dataframe
df.columns
Out[7]:
['id', 'name', 'qualification', 'age', 'gender']
In [8]:
df.show()
+---+--------+-------------+---+------+
| id|    name|qualification|age|gender|
+---+--------+-------------+---+------+
|  1|    John|         B.A.| 20|  Male|
|  2|  Martha|       B.Com.| 20|Female|
|  3|    Mona|       B.Com.| 21|Female|
|  4|  Harish|        B.Sc.| 22|  Male|
|  5|   Jonny|         B.A.| 22|  Male|
|  6|   Maria|         B.A.| 23|Female|
|  7|Monalisa|         B.A.| 21|Female|
+---+--------+-------------+---+------+

Apply StringIndexer to a string column

In [9]:
#import required libraries
from pyspark.ml.feature import StringIndexer

qualification is a string column with three distinct labels. We apply StringIndexer with qualification as the input column and qualificationIndex as the output column.

Apply StringIndexer to qualification column

In [10]:
qualification_indexer = StringIndexer(inputCol="qualification", outputCol="qualificationIndex")

#fit() learns the label-to-index mapping; transform() appends the index column
df1 = qualification_indexer.fit(df).transform(df)
df1.show()
+---+--------+-------------+---+------+------------------+
| id|    name|qualification|age|gender|qualificationIndex|
+---+--------+-------------+---+------+------------------+
|  1|    John|         B.A.| 20|  Male|               0.0|
|  2|  Martha|       B.Com.| 20|Female|               1.0|
|  3|    Mona|       B.Com.| 21|Female|               1.0|
|  4|  Harish|        B.Sc.| 22|  Male|               2.0|
|  5|   Jonny|         B.A.| 22|  Male|               0.0|
|  6|   Maria|         B.A.| 23|Female|               0.0|
|  7|Monalisa|         B.A.| 21|Female|               0.0|
+---+--------+-------------+---+------+------------------+

"B.A." gets index 0 because it is the most frequent, then "B.Com" gets index 1 and "B.Sc." gets index 2.

Apply StringIndexer to gender column

In [11]:
gender_indexer = StringIndexer(inputCol="gender", outputCol="genderIndex")

#fit() learns the label-to-index mapping; transform() appends the index column
df2 = gender_indexer.fit(df).transform(df)
df2.show()
+---+--------+-------------+---+------+-----------+
| id|    name|qualification|age|gender|genderIndex|
+---+--------+-------------+---+------+-----------+
|  1|    John|         B.A.| 20|  Male|        1.0|
|  2|  Martha|       B.Com.| 20|Female|        0.0|
|  3|    Mona|       B.Com.| 21|Female|        0.0|
|  4|  Harish|        B.Sc.| 22|  Male|        1.0|
|  5|   Jonny|         B.A.| 22|  Male|        1.0|
|  6|   Maria|         B.A.| 23|Female|        0.0|
|  7|Monalisa|         B.A.| 21|Female|        0.0|
+---+--------+-------------+---+------+-----------+
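
To map indices back to the original strings, MLlib provides IndexToString, which by default reads the label metadata that StringIndexer attaches to its output column. A minimal sketch using the DataFrame produced above:

from pyspark.ml.feature import IndexToString

#recover the original gender labels from the index column's metadata
converter = IndexToString(inputCol="genderIndex", outputCol="originalGender")
converter.transform(df2).select("gender", "genderIndex", "originalGender").show()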

Pipeline

What are ML Pipelines and how do they work?

In machine learning, it is common to run a sequence of algorithms to process and learn from data. E.g., a simple text document processing workflow might include several stages:

1. Split each document’s text into words.
2. Convert each document’s words into a numerical feature vector.
3. Learn a prediction model using the feature vectors and labels.

MLlib represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) to be run in a specific order. A sketch of this text workflow follows the next paragraph; afterwards we apply the same idea to our two indexers.

A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer’s transform() method is called on the DataFrame.
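
As an illustration of the text workflow described above (a sketch only, not run in this notebook; the input is assumed to have "text" and "label" columns, and the parameters are arbitrary):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

#stage 1 (Transformer): split each document's text into words
tokenizer = Tokenizer(inputCol="text", outputCol="words")
#stage 2 (Transformer): hash the words into a numerical feature vector
hashing_tf = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
#stage 3 (Estimator): learn a model from the feature vectors and labels
lr = LogisticRegression(maxIter=10, regParam=0.001)

text_pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
#text_pipeline.fit(training) would return a fitted PipelineModel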

In [12]:
#import module
from pyspark.ml import Pipeline

Reload Data

In [13]:
schema = (StructType()
    .add("id", "integer")
    .add("name", "string")
    .add("qualification", "string")
    .add("age", "integer")
    .add("gender", "string"))
df = spark.createDataFrame(data, schema=schema)

Create Pipeline

In [14]:
qualification_indexer = StringIndexer(inputCol="qualification", outputCol="qualificationIndex")
gender_indexer = StringIndexer(inputCol="gender", outputCol="genderIndex")

pipeline = Pipeline(stages=[qualification_indexer, gender_indexer])
In [15]:
#fit() returns a PipelineModel; transform() runs all fitted stages
pipeline_model = pipeline.fit(df)
df3 = pipeline_model.transform(df)
df3.show()
+---+--------+-------------+---+------+------------------+-----------+
| id|    name|qualification|age|gender|qualificationIndex|genderIndex|
+---+--------+-------------+---+------+------------------+-----------+
|  1|    John|         B.A.| 20|  Male|               0.0|        1.0|
|  2|  Martha|       B.Com.| 20|Female|               1.0|        0.0|
|  3|    Mona|       B.Com.| 21|Female|               1.0|        0.0|
|  4|  Harish|        B.Sc.| 22|  Male|               2.0|        1.0|
|  5|   Jonny|         B.A.| 22|  Male|               0.0|        1.0|
|  6|   Maria|         B.A.| 23|Female|               0.0|        0.0|
|  7|Monalisa|         B.A.| 21|Female|               0.0|        0.0|
+---+--------+-------------+---+------+------------------+-----------+
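
A fitted PipelineModel can also be saved and reloaded, so the learned index mappings can be reused later. A minimal sketch (the path is a placeholder):

from pyspark.ml import PipelineModel

#persist the fitted pipeline and load it back
pipeline_model.write().overwrite().save("/tmp/indexer_pipeline")
reloaded = PipelineModel.load("/tmp/indexer_pipeline")
reloaded.transform(df).show()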
