
NullPointerException while compute accuracy with ComputeModelStatistics #736

Open
ttpro1995 opened this issue Nov 13, 2019 · 7 comments

@ttpro1995 commented Nov 13, 2019

Version

com.microsoft.ml.spark:mmlspark_2.11:jar:0.18.1
spark= 2.4.3
scala=2.11.12

Data (CSV with header): https://gist.github.com/ttpro1995/69051647a256af912803c9a16040f43a

Download the data, save it as a CSV file, and put it in the folder /data/public/HIGGS/higgs.test.predictioncsv

val data = spark.read.option("header","true").option("inferSchema", "true").csv("/data/public/HIGGS/higgs.test.predictioncsv")

Schema

root
 |-- label: double (nullable = true)
 |-- prediction: double (nullable = true)

Code

import com.microsoft.ml.spark.train.ComputeModelStatistics
val metricsCompute = new ComputeModelStatistics()
  .setLabelCol("label")
  .setScoresCol("prediction")
  .setEvaluationMetric("accuracy")

val result_metrics = metricsCompute.transform(data)

Exception

java.lang.NullPointerException
  at org.apache.spark.sql.Column.<init>(Column.scala:135)
  at org.apache.spark.sql.Column$.apply(Column.scala:38)
  at org.apache.spark.sql.functions$.col(functions.scala:90)
  at com.microsoft.ml.spark.train.ComputeModelStatistics.selectAndCastToDF(ComputeModelStatistics.scala:256)
  at com.microsoft.ml.spark.train.ComputeModelStatistics.selectAndCastToRDD(ComputeModelStatistics.scala:265)
  at com.microsoft.ml.spark.train.ComputeModelStatistics.predictionAndLabels$lzycompute$1(ComputeModelStatistics.scala:99)
  at com.microsoft.ml.spark.train.ComputeModelStatistics.predictionAndLabels$1(ComputeModelStatistics.scala:95)
  at com.microsoft.ml.spark.train.ComputeModelStatistics.transform(ComputeModelStatistics.scala:124)
  ... 47 elided
@ttpro1995 changed the title from "NullPointerException while compute AUC with ComputeModelStatistics" to "NullPointerException while compute accuracy with ComputeModelStatistics" on Nov 13, 2019

@imatiach-msft (Contributor) commented Nov 13, 2019

@ttpro1995 thank you for the detailed repro. I will take a look at this when I get a chance. Can you please post how you read the "data" variable, just so I'm not missing any of the parts that reproduce the issue? My first guess is that there might be some missing values in the dataset.

@ttpro1995 (Author) commented Nov 13, 2019

Code that works in Python scikit-learn (so the data is not broken):

from sklearn.metrics import accuracy_score
import pandas as pd
data = pd.read_csv("/data/public/HIGGS/higgs.test.predictioncsv/part-00000-2cf6c2c8-173c-49c1-ae78-9dfe697f3a0f-c000.csv")
data.head()
accuracy_score(data["label"], data["prediction"])

@ttpro1995 (Author) commented Nov 14, 2019

Another failing case on Spark Scala, using different data:

Notebook on zepl.com
https://www.zepl.com/viewer/notebooks/bm90ZTovL2hhaGF0dHByby9iMzBiYjJiNzZkMGI0YWVlYTQwMjMwY2JiMzNkYWE0MC9ub3RlLmpzb24

Export to Zeppelin (0.8.x):
https://gist.github.com/ttpro1995/8e6227039b2593041267180a1d81efa7

Data used in the notebook:

binary.train

binary.test

@imatiach-msft (Contributor) commented Nov 14, 2019

@ttpro1995 oh sorry, I just saw the problem: you are setting the scores column but not the scored labels column, which is required if you are trying to compute all metrics:
https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/train/ComputeModelStatistics.scala#L99
I think the error message needs to be improved.
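
A minimal sketch of the fix this points at (not code from the thread; it assumes the "prediction" column in the repro holds hard class labels, and whether setScoresCol is also needed for the accuracy metric is not confirmed here):

import com.microsoft.ml.spark.train.ComputeModelStatistics

val metricsCompute = new ComputeModelStatistics()
  .setLabelCol("label")               // true label column in the CSV
  .setScoredLabelsCol("prediction")   // predicted class labels, previously passed to setScoresCol
  .setEvaluationMetric("accuracy")

val result_metrics = metricsCompute.transform(data)
result_metrics.show()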

@ttpro1995 (Author) commented Nov 14, 2019

I don't know what each setter does, i.e. what setLabelCol, setScoredLabelsCol, setScoresCol ... mean. These names are similar and confusing, especially setScoredLabelsCol and setScoresCol.

> @ttpro1995 oh sorry, I just saw the problem: you are setting the scores column but not the scored labels column, which is required if you are trying to compute all metrics:
> https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/train/ComputeModelStatistics.scala#L99

What should setScoredLabelsCol be set to?


I suggest there should be more detail in the documentation, including the meaning of each parameter and its default value (I had to create an object in code and then call the getter to find out the default value).
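
As a side note on finding defaults from code: a quick sketch (assuming the standard Spark ML Params API, which this transformer inherits; not an official doc pointer) is to print explainParams():

import com.microsoft.ml.spark.train.ComputeModelStatistics

// Prints every parameter with its description and its current or default value.
println(new ComputeModelStatistics().explainParams())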

@imatiach-msft (Contributor) commented Nov 15, 2019

@ttpro1995 I think the column naming convention is similar, though maybe not identical, across most ML platforms, at least scikit-learn, Apache Spark and ML.NET. The labelCol is the true (original) label from the dataset, the ScoredLabelsCol is the label assigned by the classifier, and the ScoresCol is the raw prediction, for example the distance from the separating hyperplane for a support vector machine classifier. There is also usually a probability column, which is not used in ComputeModelStatistics, although you could pass it as the input to the ScoresCol too (it is used for the AUC metric).
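
To make the mapping concrete, here is an illustrative sketch (hand-built rows and arbitrary column names, not taken from this issue; it assumes a SparkSession named spark is in scope, as in the repro above):

import com.microsoft.ml.spark.train.ComputeModelStatistics
import spark.implicits._  // needed for .toDF on a local Seq

val scoredData = Seq(
  // (true label, label assigned by the classifier, raw score / probability of class 1)
  (1.0, 1.0, 0.92),
  (0.0, 1.0, 0.61),
  (0.0, 0.0, 0.08)
).toDF("label", "scoredLabel", "score")

val metrics = new ComputeModelStatistics()       // evaluation metric left at its default
  .setLabelCol("label")                          // true label from the dataset
  .setScoredLabelsCol("scoredLabel")             // label the classifier predicted
  .setScoresCol("score")                         // raw score, used e.g. for the AUC metric
  .transform(scoredData)

metrics.show()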

@imatiach-msft (Contributor) commented Nov 15, 2019

@ttpro1995 "I suggest there should be more detail in documentation . Including: meaning of each parameter, default value (I have to new a object in code, then use get function to know the default value),"
totally agree, the documentation could definitely be improved here. thanks for pointing that out.

@akshaya-a added the "documentation", "help wanted", and "good first issue" labels and removed the "awaiting response" and "high priority" labels on Mar 7, 2020