
NullPointerException while compute accuracy with ComputeModelStatistics #736

Open
ttpro1995 opened this issue Nov 13, 2019 · 7 comments

@ttpro1995 commented Nov 13, 2019

Version

com.microsoft.ml.spark:mmlspark_2.11:jar:0.18.1
spark= 2.4.3
scala=2.11.12

Data (CSV with header): https://gist.github.com/ttpro1995/69051647a256af912803c9a16040f43a

Download the data, save it as a CSV file, and put it in the folder /data/public/HIGGS/higgs.test.predictioncsv

val data = spark.read.option("header","true").option("inferSchema", "true").csv("/data/public/HIGGS/higgs.test.predictioncsv")

Schema

root
 |-- label: double (nullable = true)
 |-- prediction: double (nullable = true)

Code

import com.microsoft.ml.spark.train.ComputeModelStatistics
val metricsCompute = new ComputeModelStatistics()
  .setLabelCol("label")
  .setScoresCol("prediction")
  .setEvaluationMetric("accuracy")

val result_metrics = metricsCompute.transform(data)

Exception

java.lang.NullPointerException
  at org.apache.spark.sql.Column.<init>(Column.scala:135)
  at org.apache.spark.sql.Column$.apply(Column.scala:38)
  at org.apache.spark.sql.functions$.col(functions.scala:90)
  at com.microsoft.ml.spark.train.ComputeModelStatistics.selectAndCastToDF(ComputeModelStatistics.scala:256)
  at com.microsoft.ml.spark.train.ComputeModelStatistics.selectAndCastToRDD(ComputeModelStatistics.scala:265)
  at com.microsoft.ml.spark.train.ComputeModelStatistics.predictionAndLabels$lzycompute$1(ComputeModelStatistics.scala:99)
  at com.microsoft.ml.spark.train.ComputeModelStatistics.predictionAndLabels$1(ComputeModelStatistics.scala:95)
  at com.microsoft.ml.spark.train.ComputeModelStatistics.transform(ComputeModelStatistics.scala:124)
  ... 47 elided
@ttpro1995 changed the title from "NullPointerException while compute AUC with ComputeModelStatistics" to "NullPointerException while compute accuracy with ComputeModelStatistics" on Nov 13, 2019

@imatiach-msft (Contributor) commented Nov 13, 2019

@ttpro1995 thank you for the detailed repro. I will take a look at this when I get a chance. Can you please post how you read the "data" variable, just so I'm not missing any of the parts that reproduce the issue? My first guess is that there might be some missing values in the dataset.

@ttpro1995 (Author) commented Nov 13, 2019

Code that works in Python scikit-learn (so the data is not broken):

from sklearn.metrics import accuracy_score
import pandas as pd
data = pd.read_csv("/data/public/HIGGS/higgs.test.predictioncsv/part-00000-2cf6c2c8-173c-49c1-ae78-9dfe697f3a0f-c000.csv")
data.head()
accuracy_score(data["label"], data["prediction"])

@ttpro1995 (Author) commented Nov 14, 2019

Another failing case on Spark Scala, using different data:

Notebook on zepl.com
https://www.zepl.com/viewer/notebooks/bm90ZTovL2hhaGF0dHByby9iMzBiYjJiNzZkMGI0YWVlYTQwMjMwY2JiMzNkYWE0MC9ub3RlLmpzb24

Export to Zeppelin (0.8.x):
https://gist.github.com/ttpro1995/8e6227039b2593041267180a1d81efa7

Data used in the notebook:

binary.train

binary.test

@imatiach-msft (Contributor) commented Nov 14, 2019

@ttpro1995 oh sorry, I just saw the problem: you are setting the scores column but not the scored labels column, which is required if you are trying to compute all metrics:
https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/train/ComputeModelStatistics.scala#L99
I think the error message needs to be improved.
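
A minimal sketch of the fix this points at (not code from the thread; it assumes the "prediction" column in the repro holds hard class labels, and whether setScoresCol is also needed for the accuracy metric is not confirmed here):

import com.microsoft.ml.spark.train.ComputeModelStatistics

val metricsCompute = new ComputeModelStatistics()
  .setLabelCol("label")               // true label column in the CSV
  .setScoredLabelsCol("prediction")   // predicted class labels, previously passed to setScoresCol
  .setEvaluationMetric("accuracy")

val result_metrics = metricsCompute.transform(data)
result_metrics.show()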

@ttpro1995 (Author) commented Nov 14, 2019

I don't know what each setter does, i.e. what setLabelCol, setScoredLabelsCol, setScoresCol ... mean. These names are similar and confusing, especially setScoredLabelsCol and setScoresCol.

> @ttpro1995 oh sorry, I just saw the problem: you are setting the scores column but not the scored labels column, which is required if you are trying to compute all metrics:
> https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/train/ComputeModelStatistics.scala#L99

What should setScoredLabelsCol be set to?


I suggest there should be more detail in the documentation, including the meaning of each parameter and its default value (I had to create an object in code and then call the getter to find out the default value).
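
As a side note on finding defaults from code: a quick sketch (assuming the standard Spark ML Params API, which this transformer inherits; not an official doc pointer) is to print explainParams():

import com.microsoft.ml.spark.train.ComputeModelStatistics

// Prints every parameter with its description and its current or default value.
println(new ComputeModelStatistics().explainParams())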

@imatiach-msft (Contributor) commented Nov 15, 2019

@ttpro1995 I think the column naming convention is similar, though maybe not identical, across most ML platforms, at least scikit-learn, Apache Spark and ML.NET. The labelCol is the true (original) label from the dataset, the ScoredLabelsCol is the label assigned by the classifier, and the ScoresCol is the raw prediction, for example the distance from the separating hyperplane for a support vector machine classifier. There is also usually a probability column, which is not used in ComputeModelStatistics, although you could pass it as the input to the ScoresCol too (it is used for the AUC metric).
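
To make the mapping concrete, here is an illustrative sketch (hand-built rows and arbitrary column names, not taken from this issue; it assumes a SparkSession named spark is in scope, as in the repro above):

import com.microsoft.ml.spark.train.ComputeModelStatistics
import spark.implicits._  // needed for .toDF on a local Seq

val scoredData = Seq(
  // (true label, label assigned by the classifier, raw score / probability of class 1)
  (1.0, 1.0, 0.92),
  (0.0, 1.0, 0.61),
  (0.0, 0.0, 0.08)
).toDF("label", "scoredLabel", "score")

val metrics = new ComputeModelStatistics()       // evaluation metric left at its default
  .setLabelCol("label")                          // true label from the dataset
  .setScoredLabelsCol("scoredLabel")             // label the classifier predicted
  .setScoresCol("score")                         // raw score, used e.g. for the AUC metric
  .transform(scoredData)

metrics.show()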

@imatiach-msft (Contributor) commented Nov 15, 2019

@ttpro1995 "I suggest there should be more detail in documentation . Including: meaning of each parameter, default value (I have to new a object in code, then use get function to know the default value),"
totally agree, the documentation could definitely be improved here. thanks for pointing that out.

@akshaya-a added the "documentation", "help wanted", and "good first issue" labels and removed the "awaiting response" and "high priority" labels on Mar 7, 2020