Logistic Regression in PySpark (ML Feature) with Breast Cancer Data Set

By Nutan | Nov 19, 2020

What is PySpark ML?

PySpark ML provides DataFrame-based machine learning APIs that let users quickly assemble and configure practical machine learning pipelines.

Logistic Regression

Logistic regression is a popular machine learning classification algorithm for predicting a categorical response.

In statistics, the logistic model is used to model the probability of a certain class or event, such as yes/no, pass/fail, win/lose, alive/dead, or healthy/sick.

The binary response variable is coded as 1 (yes, pass, win, alive, etc.) or 0 (no, fail, lose, dead, etc.).
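
Concretely, logistic regression passes a linear combination of the input features through the sigmoid function, which maps any real number to a probability between 0 and 1. Here is a minimal sketch in plain Python (the intercept and weight below are made up purely for illustration):

import math

def sigmoid(z):
    #squash any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

#hypothetical model: intercept -1.5, a single feature with weight 0.8
probability = sigmoid(-1.5 + 0.8 * 2.0)   #feature value 2.0
prediction = 1 if probability >= 0.5 else 0
print(probability, prediction)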

Data Set Information:

This is one of three domains provided by the Oncology Institute that has repeatedly appeared in the machine learning literature.

The data set is available from the UCI Machine Learning Repository, if you want to download it.

This data set includes 201 instances of one class and 85 instances of another class. The instances are described by 9 attributes, some of which are linear and some are nominal.

Attribute Information:

  1. Class: no-recurrence-events, recurrence-events
  2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
  3. menopause: lt40, ge40, premeno.
  4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59.
  5. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39.
  6. node-caps: yes, no.
  7. deg-malig: 1, 2, 3.
  8. breast: left, right.
  9. breast-quad: left-up, left-low, right-up, right-low, central.
  10. irradiat: yes, no.

Create SparkSession

In [1]:
#import SparkSession
from pyspark.sql import SparkSession

SparkSession is the entry point to Spark for working with RDDs, DataFrames, and Datasets. To create a SparkSession in Python, we use the builder attribute and then call the getOrCreate() method.

If a SparkSession already exists, getOrCreate() returns it; otherwise, it creates a new one.

In [2]:
spark = SparkSession.builder.appName('regression').getOrCreate()

Load data

In [3]:
#read the dataset
df = spark.read.csv('input/breast-cancer.csv', inferSchema=True, header=True)
In [4]:
#view five records
df.show(5)
+--------------------+-----+---------+----------+---------+---------+---------+------+-----------+--------+
|               class|  age|menopause|tumor-size|inv-nodes|node-caps|deg-malig|breast|breast-quad|irradiat|
+--------------------+-----+---------+----------+---------+---------+---------+------+-----------+--------+
|no-recurrence-events|30-39|  premeno|     30-34|      0-2|       no|        3|  left|   left_low|      no|
|no-recurrence-events|40-49|  premeno|     20-24|      0-2|       no|        2| right|   right_up|      no|
|no-recurrence-events|40-49|  premeno|     20-24|      0-2|       no|        2|  left|   left_low|      no|
|no-recurrence-events|60-69|     ge40|     15-19|      0-2|       no|        2| right|    left_up|      no|
|no-recurrence-events|40-49|  premeno|       0-4|      0-2|       no|        2| right|  right_low|      no|
+--------------------+-----+---------+----------+---------+---------+---------+------+-----------+--------+
only showing top 5 rows

In [5]:
#print dataframe columns and count
print(df.columns)
print(df.count())
['class', 'age', 'menopause', 'tumor-size', 'inv-nodes', 'node-caps', 'deg-malig', 'breast', 'breast-quad', 'irradiat']
286
In [6]:
df.printSchema()
root
 |-- class: string (nullable = true)
 |-- age: string (nullable = true)
 |-- menopause: string (nullable = true)
 |-- tumor-size: string (nullable = true)
 |-- inv-nodes: string (nullable = true)
 |-- node-caps: string (nullable = true)
 |-- deg-malig: integer (nullable = true)
 |-- breast: string (nullable = true)
 |-- breast-quad: string (nullable = true)
 |-- irradiat: string (nullable = true)

Missing records

In [7]:
from pyspark.sql.functions import isnan, when, count, col

Check missing values for a single column

In [8]:
df.filter(df['age'].isNull()).show()
+-----+---+---------+----------+---------+---------+---------+------+-----------+--------+
|class|age|menopause|tumor-size|inv-nodes|node-caps|deg-malig|breast|breast-quad|irradiat|
+-----+---+---------+----------+---------+---------+---------+------+-----------+--------+
+-----+---+---------+----------+---------+---------+---------+------+-----------+--------+

We can see there are no null values in the age column.

Check missing values for all columns

In [9]:
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
+-----+---+---------+----------+---------+---------+---------+------+-----------+--------+
|class|age|menopause|tumor-size|inv-nodes|node-caps|deg-malig|breast|breast-quad|irradiat|
+-----+---+---------+----------+---------+---------+---------+------+-----------+--------+
|    0|  0|        0|         0|        0|        0|        0|     0|          0|       0|
+-----+---+---------+----------+---------+---------+---------+------+-----------+--------+

We can see there are no NaN values in any column.
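
Keep in mind that isnan() only flags NaN entries in floating-point columns, while SQL nulls need isNull(). A more thorough per-column check (a sketch that guards on the column type) might look like this:

from pyspark.sql.functions import col, count, isnan, when

#isnan() applies to floating-point columns; isNull() catches nulls of any type
df.select([
    count(when(isnan(c) | col(c).isNull(), c)).alias(c) if t in ('double', 'float')
    else count(when(col(c).isNull(), c)).alias(c)
    for c, t in df.dtypes
]).show()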

Drop null records

In this dataset there are no null values. In case you have any, you can drop them with df.na.drop().

It returns the dataset after dropping the missing values, so you can chain show() after calling drop().

In [10]:
df.na.drop().show(5)
+--------------------+-----+---------+----------+---------+---------+---------+------+-----------+--------+
|               class|  age|menopause|tumor-size|inv-nodes|node-caps|deg-malig|breast|breast-quad|irradiat|
+--------------------+-----+---------+----------+---------+---------+---------+------+-----------+--------+
|no-recurrence-events|30-39|  premeno|     30-34|      0-2|       no|        3|  left|   left_low|      no|
|no-recurrence-events|40-49|  premeno|     20-24|      0-2|       no|        2| right|   right_up|      no|
|no-recurrence-events|40-49|  premeno|     20-24|      0-2|       no|        2|  left|   left_low|      no|
|no-recurrence-events|60-69|     ge40|     15-19|      0-2|       no|        2| right|    left_up|      no|
|no-recurrence-events|40-49|  premeno|       0-4|      0-2|       no|        2| right|  right_low|      no|
+--------------------+-----+---------+----------+---------+---------+---------+------+-----------+--------+
only showing top 5 rows

As there were no missing values, the number of records remains the same.

In [11]:
print(df.count())
286
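
df.na.drop() also takes optional parameters that control which rows get dropped. A few hedged examples (the column names come from this data set):

#drop rows where any column is null (the default, how='any')
df.na.drop(how='any').show(5)

#drop rows only when all columns are null
df.na.drop(how='all').show(5)

#drop rows with nulls in specific columns only
df.na.drop(subset=['age', 'tumor-size']).show(5)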

Convert categorical data to numerical

We are going to use StringIndexer and OneHotEncoder from pyspark.ml.feature to convert the string columns to numeric.

If you want to know more about them, you can go through my previous articles:

Role of StringIndexer and Pipelines in PySpark ML Feature - Part 1

Role of OneHotEncoder and Pipelines in PySpark ML Feature - Part 2

In [12]:
#import libraries
from pyspark.ml.feature import StringIndexer, OneHotEncoder

Change class column into numeric value

Let us check the 'class' column:

In [13]:
df.groupBy('class').count().show()
+--------------------+-----+
|               class|count|
+--------------------+-----+
|no-recurrence-events|  201|
|   recurrence-events|   85|
+--------------------+-----+

class is the target column; it has two distinct values, 'no-recurrence-events' and 'recurrence-events', which we are going to convert to numeric.

Since the converted column will contain only the two values 0 and 1, we will not one-hot encode it.

In [14]:
class_indexer = StringIndexer(inputCol="class", outputCol="label")

#Fit and transform the dataframe
df = class_indexer.fit(df).transform(df)
In [15]:
df.show(5)
+--------------------+-----+---------+----------+---------+---------+---------+------+-----------+--------+-----+
|               class|  age|menopause|tumor-size|inv-nodes|node-caps|deg-malig|breast|breast-quad|irradiat|label|
+--------------------+-----+---------+----------+---------+---------+---------+------+-----------+--------+-----+
|no-recurrence-events|30-39|  premeno|     30-34|      0-2|       no|        3|  left|   left_low|      no|  0.0|
|no-recurrence-events|40-49|  premeno|     20-24|      0-2|       no|        2| right|   right_up|      no|  0.0|
|no-recurrence-events|40-49|  premeno|     20-24|      0-2|       no|        2|  left|   left_low|      no|  0.0|
|no-recurrence-events|60-69|     ge40|     15-19|      0-2|       no|        2| right|    left_up|      no|  0.0|
|no-recurrence-events|40-49|  premeno|       0-4|      0-2|       no|        2| right|  right_low|      no|  0.0|
+--------------------+-----+---------+----------+---------+---------+---------+------+-----------+--------+-----+
only showing top 5 rows

Now you can see a new column named 'label'. You can print just the 'class' and 'label' columns to see the transformation.

In [16]:
df.select(['class', 'label']).show(5)
+--------------------+-----+
|               class|label|
+--------------------+-----+
|no-recurrence-events|  0.0|
|no-recurrence-events|  0.0|
|no-recurrence-events|  0.0|
|no-recurrence-events|  0.0|
|no-recurrence-events|  0.0|
+--------------------+-----+
only showing top 5 rows

Create a function to transform string columns to numeric

We have a lot of string columns, so we will create a function that converts a given string column to numeric.

We will name each StringIndexer output column after its input column plus the suffix '-index'. For example:

'age' -> 'age-index'.

That output column is then passed to OneHotEncoder, whose output column similarly gets the suffix '-vector'. So we finally get, for example,

'age' -> 'age-index' -> 'age-vector'.
In [17]:
def transformColumnsToNumeric(df, inputCol):
    
    #apply StringIndexer to inputCol to index the string categories
    inputCol_indexer = StringIndexer(inputCol = inputCol, outputCol = inputCol + "-index").fit(df)
    df = inputCol_indexer.transform(df)
    
    #one-hot encode the indexed column into a sparse vector
    onehotencoder_vector = OneHotEncoder(inputCol = inputCol + "-index", outputCol = inputCol + "-vector")
    df = onehotencoder_vector.fit(df).transform(df)
    
    return df
In [18]:
df = transformColumnsToNumeric(df, "age")
df = transformColumnsToNumeric(df, "menopause")
df = transformColumnsToNumeric(df, "tumor-size")
df = transformColumnsToNumeric(df, "inv-nodes")
df = transformColumnsToNumeric(df, "node-caps")
df = transformColumnsToNumeric(df, "breast")
df = transformColumnsToNumeric(df, "breast-quad")
df = transformColumnsToNumeric(df, "irradiat")
df.show(5)
+--------------------+-----+---------+----------+---------+---------+---------+------+-----------+--------+-----+---------+-------------+---------------+----------------+----------------+-----------------+---------------+----------------+---------------+----------------+------------+-------------+-----------------+------------------+--------------+---------------+
|               class|  age|menopause|tumor-size|inv-nodes|node-caps|deg-malig|breast|breast-quad|irradiat|label|age-index|   age-vector|menopause-index|menopause-vector|tumor-size-index|tumor-size-vector|inv-nodes-index|inv-nodes-vector|node-caps-index|node-caps-vector|breast-index|breast-vector|breast-quad-index|breast-quad-vector|irradiat-index|irradiat-vector|
+--------------------+-----+---------+----------+---------+---------+---------+------+-----------+--------+-----+---------+-------------+---------------+----------------+----------------+-----------------+---------------+----------------+---------------+----------------+------------+-------------+-----------------+------------------+--------------+---------------+
|no-recurrence-events|30-39|  premeno|     30-34|      0-2|       no|        3|  left|   left_low|      no|  0.0|      3.0|(5,[3],[1.0])|            0.0|   (2,[0],[1.0])|             0.0|   (10,[0],[1.0])|            0.0|   (6,[0],[1.0])|            0.0|   (2,[0],[1.0])|         0.0|(1,[0],[1.0])|              0.0|     (5,[0],[1.0])|           0.0|  (1,[0],[1.0])|
|no-recurrence-events|40-49|  premeno|     20-24|      0-2|       no|        2| right|   right_up|      no|  0.0|      1.0|(5,[1],[1.0])|            0.0|   (2,[0],[1.0])|             2.0|   (10,[2],[1.0])|            0.0|   (6,[0],[1.0])|            0.0|   (2,[0],[1.0])|         1.0|    (1,[],[])|              2.0|     (5,[2],[1.0])|           0.0|  (1,[0],[1.0])|
|no-recurrence-events|40-49|  premeno|     20-24|      0-2|       no|        2|  left|   left_low|      no|  0.0|      1.0|(5,[1],[1.0])|            0.0|   (2,[0],[1.0])|             2.0|   (10,[2],[1.0])|            0.0|   (6,[0],[1.0])|            0.0|   (2,[0],[1.0])|         0.0|(1,[0],[1.0])|              0.0|     (5,[0],[1.0])|           0.0|  (1,[0],[1.0])|
|no-recurrence-events|60-69|     ge40|     15-19|      0-2|       no|        2| right|    left_up|      no|  0.0|      2.0|(5,[2],[1.0])|            1.0|   (2,[1],[1.0])|             3.0|   (10,[3],[1.0])|            0.0|   (6,[0],[1.0])|            0.0|   (2,[0],[1.0])|         1.0|    (1,[],[])|              1.0|     (5,[1],[1.0])|           0.0|  (1,[0],[1.0])|
|no-recurrence-events|40-49|  premeno|       0-4|      0-2|       no|        2| right|  right_low|      no|  0.0|      1.0|(5,[1],[1.0])|            0.0|   (2,[0],[1.0])|             7.0|   (10,[7],[1.0])|            0.0|   (6,[0],[1.0])|            0.0|   (2,[0],[1.0])|         1.0|    (1,[],[])|              3.0|     (5,[3],[1.0])|           0.0|  (1,[0],[1.0])|
+--------------------+-----+---------+----------+---------+---------+---------+------+-----------+--------+-----+---------+-------------+---------------+----------------+----------------+-----------------+---------------+----------------+---------------+----------------+------------+-------------+-----------------+------------------+--------------+---------------+
only showing top 5 rows

You could put all of these stages in a Pipeline, but I wanted to keep each step as simple as possible; a sketch follows below.
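
For reference, here is a minimal sketch of the same transformations expressed as a single Pipeline (the names categorical_cols, indexers, encoders, and pipeline are my own, not part of this notebook):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

categorical_cols = ['age', 'menopause', 'tumor-size', 'inv-nodes',
                    'node-caps', 'breast', 'breast-quad', 'irradiat']

#one indexer and one encoder per categorical column
indexers = [StringIndexer(inputCol=c, outputCol=c + "-index") for c in categorical_cols]
encoders = [OneHotEncoder(inputCol=c + "-index", outputCol=c + "-vector") for c in categorical_cols]

pipeline = Pipeline(stages=indexers + encoders)
#df = pipeline.fit(df).transform(df)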

Feature transformer - VectorAssembler

If you are new to VectorAssembler, you may read my article

Feature Transformer VectorAssembler in PySpark ML Feature - Part 3

In [20]:
from pyspark.ml.feature import VectorAssembler

Let us list the columns

In [21]:
df.columns
Out[21]:
['class',
 'age',
 'menopause',
 'tumor-size',
 'inv-nodes',
 'node-caps',
 'deg-malig',
 'breast',
 'breast-quad',
 'irradiat',
 'label',
 'age-index',
 'age-vector',
 'menopause-index',
 'menopause-vector',
 'tumor-size-index',
 'tumor-size-vector',
 'inv-nodes-index',
 'inv-nodes-vector',
 'node-caps-index',
 'node-caps-vector',
 'breast-index',
 'breast-vector',
 'breast-quad-index',
 'breast-quad-vector',
 'irradiat-index',
 'irradiat-vector']

We select as inputCols only the columns that we need to feed to our Spark ML model. Let us define a VectorAssembler:

In [22]:
inputCols=[
        'deg-malig',
        'age-vector',
        'menopause-vector',
        'tumor-size-vector',
        'inv-nodes-vector',
        'node-caps-vector',
        'breast-vector',
        'breast-quad-vector',
        'irradiat-vector']
In [23]:
df_va = VectorAssembler(inputCols = inputCols, outputCol="features")

Now we can transform the dataset with our VectorAssembler. It will add the output column 'features', as we specified in outputCol.

In [24]:
df = df_va.transform(df)

Let us check the input and output columns:

In [25]:
df.select(inputCols + ["features"] ).show(5)
+---------+-------------+----------------+-----------------+----------------+----------------+-------------+------------------+---------------+--------------------+
|deg-malig|   age-vector|menopause-vector|tumor-size-vector|inv-nodes-vector|node-caps-vector|breast-vector|breast-quad-vector|irradiat-vector|            features|
+---------+-------------+----------------+-----------------+----------------+----------------+-------------+------------------+---------------+--------------------+
|        3|(5,[3],[1.0])|   (2,[0],[1.0])|   (10,[0],[1.0])|   (6,[0],[1.0])|   (2,[0],[1.0])|(1,[0],[1.0])|     (5,[0],[1.0])|  (1,[0],[1.0])|(33,[0,4,6,8,18,2...|
|        2|(5,[1],[1.0])|   (2,[0],[1.0])|   (10,[2],[1.0])|   (6,[0],[1.0])|   (2,[0],[1.0])|    (1,[],[])|     (5,[2],[1.0])|  (1,[0],[1.0])|(33,[0,2,6,10,18,...|
|        2|(5,[1],[1.0])|   (2,[0],[1.0])|   (10,[2],[1.0])|   (6,[0],[1.0])|   (2,[0],[1.0])|(1,[0],[1.0])|     (5,[0],[1.0])|  (1,[0],[1.0])|(33,[0,2,6,10,18,...|
|        2|(5,[2],[1.0])|   (2,[1],[1.0])|   (10,[3],[1.0])|   (6,[0],[1.0])|   (2,[0],[1.0])|    (1,[],[])|     (5,[1],[1.0])|  (1,[0],[1.0])|(33,[0,3,7,11,18,...|
|        2|(5,[1],[1.0])|   (2,[0],[1.0])|   (10,[7],[1.0])|   (6,[0],[1.0])|   (2,[0],[1.0])|    (1,[],[])|     (5,[3],[1.0])|  (1,[0],[1.0])|(33,[0,2,6,15,18,...|
+---------+-------------+----------------+-----------------+----------------+----------------+-------------+------------------+---------------+--------------------+
only showing top 5 rows

We need only the 'features' and 'label' columns for our model: the data from the other columns has been merged into the 'features' column, and 'label' is our target. Let us list and see a few records.

Pass truncate=False to show() to avoid truncating the output.

In [26]:
df.select(['features','label']).show(10,False)
+--------------------------------------------------------------------+-----+
|features                                                            |label|
+--------------------------------------------------------------------+-----+
|(33,[0,4,6,8,18,24,26,27,32],[3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]) |0.0  |
|(33,[0,2,6,10,18,24,29,32],[2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])       |0.0  |
|(33,[0,2,6,10,18,24,26,27,32],[2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|0.0  |
|(33,[0,3,7,11,18,24,28,32],[2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])       |0.0  |
|(33,[0,2,6,15,18,24,30,32],[2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])       |0.0  |
|(33,[0,3,7,11,18,24,26,27,32],[2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|0.0  |
|(33,[0,1,6,9,18,24,26,27,32],[2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]) |0.0  |
|(33,[0,3,7,10,18,24,26,27,32],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|0.0  |
|(33,[0,2,6,16,18,24,26,27,32],[2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|0.0  |
|(33,[0,2,6,10,18,24,28,32],[2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])       |0.0  |
+--------------------------------------------------------------------+-----+
only showing top 10 rows

So, finally, we create a new dataset with just these two columns.

Let us view the new dataframe and the counts per label.

In [27]:
df_transformed = df.select(['features','label'])
df_transformed.show(5)
+--------------------+-----+
|            features|label|
+--------------------+-----+
|(33,[0,4,6,8,18,2...|  0.0|
|(33,[0,2,6,10,18,...|  0.0|
|(33,[0,2,6,10,18,...|  0.0|
|(33,[0,3,7,11,18,...|  0.0|
|(33,[0,2,6,15,18,...|  0.0|
+--------------------+-----+
only showing top 5 rows

In [28]:
df_transformed.groupBy('label').count().show()
+-----+-----+
|label|count|
+-----+-----+
|  0.0|  201|
|  1.0|   85|
+-----+-----+

Split data into train and test

In [29]:
#split the data 
train_df, test_df = df_transformed.randomSplit([0.75,0.25])

We have split the data in a 75/25 ratio.
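
Note that randomSplit() draws a random split on each run; to make it reproducible, you can pass a seed (the value 42 below is arbitrary):

train_df, test_df = df_transformed.randomSplit([0.75, 0.25], seed=42)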

In [30]:
train_df.count()
Out[30]:
203
In [31]:
test_df.count()
Out[31]:
83
In [32]:
train_df.groupBy('label').count().show()
+-----+-----+
|label|count|
+-----+-----+
|  0.0|  143|
|  1.0|   60|
+-----+-----+

In [33]:
test_df.groupBy('label').count().show()
+-----+-----+
|label|count|
+-----+-----+
|  0.0|   58|
|  1.0|   25|
+-----+-----+

Create model and train

We are going to use the LogisticRegression model, as we have a binary classification problem with only two possible label values: 0 and 1.

In [34]:
from pyspark.ml.classification import LogisticRegression
In [35]:
model = LogisticRegression(labelCol='label')
model
Out[35]:
LogisticRegression_42b9c284b7bf
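
By default, LogisticRegression reads its input from a column named 'features', which matches our VectorAssembler output. As a hedged sketch, the estimator could also be configured with some commonly tuned hyperparameters (the values below are illustrative, not the ones used in this notebook):

model = LogisticRegression(
    featuresCol='features',   #default; matches the VectorAssembler output column
    labelCol='label',
    maxIter=100,              #maximum number of optimizer iterations
    regParam=0.01,            #regularization strength
    elasticNetParam=0.0       #0.0 = pure L2 (ridge), 1.0 = pure L1 (lasso)
)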

Now, it is time to train our model:

In [36]:
trained_model = model.fit(train_df)

Evaluate the model

Let us get some predictions with our trained model. We will use the training data first.

In [37]:
train_predictions = trained_model.evaluate(train_df).predictions
train_predictions.show(5)
+--------------------+-----+--------------------+--------------------+----------+
|            features|label|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|(33,[0,1,6,8,18,2...|  0.0|[0.62431001631163...|[0.65119815873492...|       0.0|
|(33,[0,1,6,8,18,2...|  1.0|[-0.3250981441762...|[0.41943379599165...|       1.0|
|(33,[0,1,6,8,18,2...|  0.0|[2.48623618717003...|[0.92317127501159...|       0.0|
|(33,[0,1,6,8,18,2...|  1.0|[0.78774874575127...|[0.68734773835533...|       0.0|
|(33,[0,1,6,8,19,2...|  0.0|[-0.2770333873837...|[0.43118123021946...|       1.0|
+--------------------+-----+--------------------+--------------------+----------+
only showing top 5 rows

As you can see, our model made a few mistakes; for example, in the 4th and 5th records the labels and predictions don't match.
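
As an aside, trained_model.transform(train_df) yields the same prediction columns; evaluate() additionally returns a summary object that, in recent Spark versions, exposes metrics directly:

#equivalent way to get the prediction columns
train_predictions = trained_model.transform(train_df)

#the evaluation summary exposes metrics such as accuracy and AUC
summary = trained_model.evaluate(train_df)
print(summary.accuracy, summary.areaUnderROC)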

Using multiple filters, we can count the entries of a confusion matrix. Let us first count the 0 and 1 labels:

In [38]:
train_df_count_1 = train_df.filter(train_df['label'] == 1).count()
train_df_count_0 = train_df.filter(train_df['label'] == 0).count()
train_df_count_1, train_df_count_0
Out[38]:
(60, 143)

Correct predictions for the positive class (true positives). Note that the ratio printed as "Accuracy" below is really the recall (true-positive rate) for label 1, since it divides the true positives by the number of actual positives:

In [39]:
cp = train_predictions.filter(
    train_predictions['label'] == 1).filter(
    train_predictions['prediction'] == 1).select(
    ['label','prediction','probability'])

print("Correct predictions: ", cp.count())

accuracy = (cp.count()) /  train_df_count_1
print(f"Accuracy: {accuracy}\n")


cp.show(5,False)
Correct predictions:  27
Accuracy: 0.45

+-----+----------+------------------------------------------+
|label|prediction|probability                               |
+-----+----------+------------------------------------------+
|1.0  |1.0       |[0.4194337959916568,0.5805662040083431]   |
|1.0  |1.0       |[0.2846297583189108,0.7153702416810892]   |
|1.0  |1.0       |[0.26444739242362697,0.7355526075763731]  |
|1.0  |1.0       |[2.2495127232620732E-54,1.0]              |
|1.0  |1.0       |[1.2848308887878324E-9,0.9999999987151691]|
+-----+----------+------------------------------------------+
only showing top 5 rows

False positives (label 0 predicted as 1):

In [50]:
fp = train_predictions.filter(
    train_predictions['label'] == 0).filter(
    train_predictions['prediction'] == 1).select(
    ['label','prediction','probability'])

print("False positive: ", fp.count())

fp.show(5,False)
False positive:  8
+-----+----------+----------------------------------------+
|label|prediction|probability                             |
+-----+----------+----------------------------------------+
|0.0  |1.0       |[0.4311812302194633,0.5688187697805366] |
|0.0  |1.0       |[0.4327038561598543,0.5672961438401457] |
|0.0  |1.0       |[0.3618244073419616,0.6381755926580385] |
|0.0  |1.0       |[0.4407035684365973,0.5592964315634027] |
|0.0  |1.0       |[0.34937919764112246,0.6506208023588776]|
+-----+----------+----------------------------------------+
only showing top 5 rows

False negatives (label 1 predicted as 0):

In [51]:
fn = train_predictions.filter(
    train_predictions['label'] == 1).filter(
    train_predictions['prediction'] == 0).select(
    ['label','prediction','probability'])

print("False negative: ", fn.count())

fn.show(5,False)
False negative:  33
+-----+----------+----------------------------------------+
|label|prediction|probability                             |
+-----+----------+----------------------------------------+
|1.0  |0.0       |[0.687347738355331,0.31265226164466897] |
|1.0  |0.0       |[0.6021960755759111,0.3978039244240888] |
|1.0  |0.0       |[0.6021960755759111,0.3978039244240888] |
|1.0  |0.0       |[0.6917036321325896,0.3082963678674105] |
|1.0  |0.0       |[0.7863190184311537,0.21368098156884627]|
+-----+----------+----------------------------------------+
only showing top 5 rows

Predict test data

In [43]:
test_predictions = trained_model.evaluate(test_df).predictions
test_predictions.show(5, False)
+-------------------------------------------------------------+-----+----------------------------------------+----------------------------------------+----------+
|features                                                     |label|rawPrediction                           |probability                             |prediction|
+-------------------------------------------------------------+-----+----------------------------------------+----------------------------------------+----------+
|(33,[0,1,6,9,18,24,28,32],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]) |1.0  |[2.651981706440889,-2.651981706440889]  |[0.9341330269140776,0.06586697308592242]|0.0       |
|(33,[0,1,6,9,19,24,28],[2.0,1.0,1.0,1.0,1.0,1.0,1.0])        |0.0  |[0.7680940005109493,-0.7680940005109493]|[0.6831084436250248,0.3168915563749752] |0.0       |
|(33,[0,1,6,9,19,25,26,27],[2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]) |0.0  |[-0.2294611813953613,0.2294611813953613]|[0.44288508827087586,0.5571149117291241]|1.0       |
|(33,[0,1,6,11,18,24,27,32],[2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|0.0  |[1.8714533774539035,-1.8714533774539035]|[0.8666263557976371,0.13337364420236295]|0.0       |
|(33,[0,1,6,12,18,24,28,32],[2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|0.0  |[109.61465683616704,-109.61465683616704]|[1.0,2.4829009824698132E-48]            |0.0       |
+-------------------------------------------------------------+-----+----------------------------------------+----------------------------------------+----------+
only showing top 5 rows

Using the same filters, we can count these quantities on the test set. Let us first count the 0 and 1 labels:

In [44]:
test_df_count_1 = test_df.filter(test_df['label'] == 1).count()
test_df_count_0 = test_df.filter(test_df['label'] == 0).count()
test_df_count_1, test_df_count_0
Out[44]:
(25, 58)

Correct predictions for the positive class (true positives); as before, the printed "Accuracy" is really the recall for label 1:

In [52]:
cp = test_predictions.filter(
    test_predictions['label'] == 1).filter(
    test_predictions['prediction'] == 1).select(
    ['label','prediction','probability'])

print("Correct predictions: ", cp.count())

accuracy = (cp.count()) /  test_df_count_1
print(f"Accuracy: {accuracy}\n")


cp.show(5,False)
Correct predictions:  8
Accuracy: 0.32

+-----+----------+----------------------------------------+
|label|prediction|probability                             |
+-----+----------+----------------------------------------+
|1.0  |1.0       |[0.4744813212766959,0.5255186787233042] |
|1.0  |1.0       |[0.4265782884237564,0.5734217115762436] |
|1.0  |1.0       |[0.3013797841578714,0.6986202158421286] |
|1.0  |1.0       |[0.22766040693175582,0.7723395930682442]|
|1.0  |1.0       |[0.3704657803798026,0.6295342196201974] |
+-----+----------+----------------------------------------+
only showing top 5 rows

False positives (label 0 predicted as 1):

In [53]:
fp = test_predictions.filter(
    test_predictions['label'] == 0).filter(
    test_predictions['prediction'] == 1).select(
    ['label','prediction','probability'])

print("False positive: ", fp.count())

fp.show(5,False)
False positive:  10
+-----+----------+----------------------------------------+
|label|prediction|probability                             |
+-----+----------+----------------------------------------+
|0.0  |1.0       |[0.44288508827087586,0.5571149117291241]|
|0.0  |1.0       |[7.257664043789407E-55,1.0]             |
|0.0  |1.0       |[7.457967610236931E-55,1.0]             |
|0.0  |1.0       |[0.47825658756012884,0.5217434124398712]|
|0.0  |1.0       |[0.26079650858208653,0.7392034914179135]|
+-----+----------+----------------------------------------+
only showing top 5 rows

False negatives (label 1 predicted as 0):

In [54]:
fn = test_predictions.filter(
    test_predictions['label'] == 1).filter(
    test_predictions['prediction'] == 0).select(
    ['label','prediction','probability'])

print("False negative: ", fn.count())

fn.show(5,False)
False negative:  17
+-----+----------+----------------------------------------+
|label|prediction|probability                             |
+-----+----------+----------------------------------------+
|1.0  |0.0       |[0.9341330269140776,0.06586697308592242]|
|1.0  |0.0       |[0.5382947168966318,0.4617052831033681] |
|1.0  |0.0       |[0.7952431322320255,0.20475686776797455]|
|1.0  |0.0       |[0.8082534871399831,0.19174651286001684]|
|1.0  |0.0       |[0.5264577147149769,0.473542285285023]  |
+-----+----------+----------------------------------------+
only showing top 5 rows
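
To go beyond hand-counted confusion-matrix entries, PySpark ML also ships evaluator classes. A brief sketch of how they could be applied to our test predictions:

from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

#area under the ROC curve (uses the rawPrediction column by default)
auc = BinaryClassificationEvaluator(labelCol='label').evaluate(test_predictions)

#overall accuracy, computed from the prediction column
acc = MulticlassClassificationEvaluator(
    labelCol='label', predictionCol='prediction', metricName='accuracy'
).evaluate(test_predictions)

print("Test AUC:", auc, "Test accuracy:", acc)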


PySpark

  1. How to Parallelize and Distribute Collection in PySpark
  2. Role of StringIndexer and Pipelines in PySpark ML Feature - Part 1
  3. Role of OneHotEncoder and Pipelines in PySpark ML Feature - Part 2
  4. Feature Transformer VectorAssembler in PySpark ML Feature - Part 3
  5. Logistic Regression in PySpark (ML Feature) with Breast Cancer Data Set
