mAP is wrong if all scores are equal (=not providing a score) #46

Open
@voegtlel

Description

(Copying this bug report from the main COCO metrics repo: cocodataset/cocoapi#678)

Hi there,

Describe the bug

Our detector does not output scores, so we set them all to 1, which gives wrong results with the COCO metrics. We know the metrics are written under the assumption that scores exist, but I believe the docs should clearly state that the mAP is not correct if the scores are not set.

To Reproduce

More details and an analysis of the cause follow:

Example with source code

import faster_coco_eval

# Replace pycocotools with faster_coco_eval
faster_coco_eval.init_as_pycocotools()

import json
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

if __name__ == "__main__":
    # GT
    gt = {
        "categories": [
            {"id": 1, "name": "a"},
        ],
        "annotations": [
            {"image_id": 1, "bbox": [0, 0, 10, 10], "category_id": 1, "id": 1, "iscrowd": 0, "area": 100, "segmentation": []},
            {"image_id": 1, "bbox": [20, 20, 30, 30], "category_id": 1, "id": 3, "iscrowd": 0, "area": 100, "segmentation": []},
            {"image_id": 1, "bbox": [30, 30, 40, 40], "category_id": 1, "id": 4, "iscrowd": 0, "area": 100, "segmentation": []},
        ],
        "images": [
            {"id": 1, "file_name": "image.jpg"},
        ],
    }
    with open("gt.json", "w") as f:
        json.dump(gt, f, indent=2)

    # Pred 1
    pred = [
        {"image_id": 1, "bbox": [0, 0, 10, 10], "category_id": 1, "score": 1, "id": 1, "segmentation": []},
        {"image_id": 1, "bbox": [10, 10, 20, 20], "category_id": 1, "score": 1, "id": 2, "segmentation": []},
        {"image_id": 1, "bbox": [20, 20, 30, 30], "category_id": 1, "score": 1, "id": 3, "segmentation": []},
    ]
    with open("pred1.json", "w") as f:
        json.dump(pred, f, indent=2)

    # Pred 2
    pred = [
        {"image_id": 1, "bbox": [0, 0, 10, 10], "category_id": 1, "score": 1, "id": 1, "segmentation": []},
        {"image_id": 1, "bbox": [20, 20, 30, 30], "category_id": 1, "score": 1, "id": 2, "segmentation": []},  # Swapped this box with the next
        {"image_id": 1, "bbox": [10, 10, 20, 20], "category_id": 1, "score": 1, "id": 3, "segmentation": []},
    ]
    with open("pred2.json", "w") as f:
        json.dump(pred, f, indent=2)

    coco = COCO("gt.json")

    pred = coco.loadRes("pred1.json")
    eval = COCOeval(coco, pred, 'bbox')
    eval.evaluate()
    eval.accumulate()
    eval.summarize()

    pred = coco.loadRes("pred2.json")
    eval = COCOeval(coco, pred, 'bbox')
    eval.evaluate()
    eval.accumulate()
    eval.summarize()

Output of the example source code

Output will be:

 [...]
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.663
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.663
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.663
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.663
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.333
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.667
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.667
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.667
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
 [...]
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.554
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.554
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.554
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.554
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.333
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.667
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.667
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.667
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000

The cause: to compute the AP, a discrete precision-recall curve is built prediction by prediction, sorted by score. But since the score is the same for all predictions, they should be considered all at once: there is no score threshold that includes one prediction but excludes another, so the result should be independent of their order.
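
As a quick standalone illustration of this (independent of the COCO evaluation code; the TP/FP pattern mirrors pred1.json and pred2.json above):

import numpy as np

# Three predictions with equal scores, two of which are true positives,
# against three GT boxes. The per-prediction cumulative precision/recall
# depends on the arbitrary order in which the tied predictions are visited.
num_gt = 3
for order_name, matches in [("TP, FP, TP", [1, 0, 1]),
                            ("TP, TP, FP", [1, 1, 0])]:
    matches = np.array(matches)
    tp = np.cumsum(matches)
    fp = np.cumsum(1 - matches)
    precision = tp / (tp + fp)
    recall = tp / num_gt
    print(order_name, "precision:", precision, "recall:", recall)

# TP, FP, TP -> precision [1.0, 0.5, 0.667], recall [0.333, 0.333, 0.667]
# TP, TP, FP -> precision [1.0, 1.0, 0.667], recall [0.333, 0.667, 0.667]
# With equal scores no threshold can separate the predictions, so the only
# consistent operating point is (precision, recall) = (2/3, 2/3).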

Thus, the resulting PR curves for the two prediction sets are different, and neither is correct:

[PR curve plots for pred1.json and pred2.json at IoU=0.50, showing two different curves for the same set of boxes]

Reference code for plotting

import matplotlib.pyplot as plt
import numpy as np


def plot_pr_curves(eval_results, cats, output_dir="."):
    """
    Function to plot Precision-Recall curves based on the accumulated results from COCOeval.
    """
    # Extract the necessary evaluation parameters
    params = eval_results['params']
    precision = eval_results['precision']
    #recall = eval_results['recall']
    iouThrs = params.iouThrs  # IoU thresholds
    catIds = params.catIds    # Category IDs
    areaRngLbl = params.areaRngLbl  # Labels for area ranges
    recThrs = np.array(params.recThrs)  # Recall thresholds
    maxDets = params.maxDets  # Max detections

    k = 0  # category = a
    a = 0  # area range = all
    m = 2  # max detections = 100
    t = 0  # IoU threshold = 0.5

    pr = precision[t, :, k, a, m]

    # Create the plot
    plt.figure()
    plt.plot(recThrs, pr, marker='o', label=f"IoU={iouThrs[t]:.2f}")
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title(f"Precision-Recall Curve\nCategory: {cats[catIds[k]]['name']}, Area: {areaRngLbl[a]}, MaxDets: {maxDets[m]}")
    plt.legend()

    # Create a unique filename based on category, IoU, area, and maxDet
    plt.savefig(f"{output_dir}/PR_Curve_cat{cats[catIds[k]]['name']}_iou{iouThrs[t]:.2f}_area{areaRngLbl[a]}_maxDet{maxDets[m]}.png")
    plt.close()


if __name__ == "__main__":
    ...
    plot_pr_curves(eval.eval, coco.cats, "./")

The cause of this issue lies here: https://github.com/cocodataset/cocoapi/blob/8c9bcc3cf640524c4c20a9c40e89cb6a2f2fa0e9/PythonAPI/pycocotools/cocoeval.py#L378-L379

where tp_sum and fp_sum are computed as cumulative sums. This is wrong when scores are equal: for tied scores the cumulative sum should already contain all of the tied predictions. The counts may only advance where the score changes from one prediction to the next; within a tie group they must all take the same value, or for efficiency be collapsed into a single entry.
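
For illustration, a tie-aware variant of these cumulative sums could look roughly as follows. This is only a sketch, not the actual cocoeval code; it assumes the predictions are already sorted by descending score:

import numpy as np

def tie_aware_cumsums(scores_sorted, is_tp):
    """Sketch: cumulative TP/FP counts where every prediction in a group of
    equal scores receives the counts of the last member of that group, so the
    PR curve only gains a new point where the score actually drops."""
    scores = np.asarray(scores_sorted, dtype=float)
    is_tp = np.asarray(is_tp, dtype=bool)
    tp_sum = np.cumsum(is_tp)
    fp_sum = np.cumsum(~is_tp)

    n = len(scores)
    idx = np.arange(n)
    # True at the last prediction of each group of equal scores
    last_in_group = np.r_[scores[1:] != scores[:-1], True]
    # For every position, the index of the last member of its tie group
    candidates = np.where(last_in_group, idx, n)
    last_idx = np.minimum.accumulate(candidates[::-1])[::-1]
    return tp_sum[last_idx], fp_sum[last_idx]


# All scores equal: both orderings collapse to the same counts (2 TP, 1 FP).
print(tie_aware_cumsums([1.0, 1.0, 1.0], [True, False, True]))
print(tie_aware_cumsums([1.0, 1.0, 1.0], [True, True, False]))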

Expected behavior

If there is no score, the PR curve reduces to a single point (precision, recall) computed over all predictions at once, as there is no score that separates the predictions. Thus, the Average Precision equals the Precision.
Effectively, this fix could be added on top of the current implementation (e.g. as a switch that allows for equal scores) so that the existing code does not need to be modified.
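
Applied to the example above, the proposed behavior would look like this (a sketch, not existing API; two of the three predictions match a GT box exactly and there are three GT boxes):

tp, fp, num_gt = 2, 1, 3        # from pred1.json / pred2.json vs. gt.json

precision = tp / (tp + fp)      # 2/3
recall = tp / num_gt            # 2/3

# Single PR point (recall, precision) = (2/3, 2/3); both pred1.json and
# pred2.json would yield the same result, independent of the order of the
# tied predictions.
print(precision, recall)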

I am not sure whether faster-coco-eval aims to stay equivalent to cocoapi, and whether this is even an option. But cocoapi seems to be stale: lots of open PRs and no changes in four years.

Thanks!
