Luozm +

Different Tasks in Computer Vision

Notes about the definitions of ImageNet challenges and beyond.

原创文章,转载请注明:转自Luozm's Blog

The definitions of the ImageNet (ILSRVC) challenges really confused me. So I decided to figure it out.

Overview

In Computer Vision (CV) area, there are many different tasks: Image Classification, Object Localization, Object Detection, Semantic Segmentation, Instance Segmentation, Image captioning, etc.. Some of them are difficult to distinguish for beginners.

cv

PS: most of the slices in the post are from CS231n1.

1 Image Classification

Image Classification problem is the task of assigning an input image one label from a fixed set of categories. This is one of the core problems in CV that, despite its simplicity, has a large variety of practical applications. Moreover, as we will see later, many other seemingly distinct CV tasks (such as object detection, segmentation) can be reduced to image classification.

classification

For example, in the image below an image classification model takes a single image and assigns probabilities to 4 labels, {cat, dog, hat, mug}. As shown in the image, keep in mind that to a computer an image is represented as one large 3-dimensional array of numbers. In this example, the cat image is 248 pixels wide, 400 pixels tall, and has three color channels Red,Green,Blue (or RGB for short). Therefore, the image consists of 248 x 400 x 3 numbers, or a total of 297,600 numbers. Each number is an integer that ranges from 0 (black) to 255 (white). Our task is to turn this quarter of a million numbers into a single label, such as “cat”.

classification_example

1.1 Classification in ImageNet

The definition of Image Classification in ImageNet is:

For each image, algorithms will produce a list of at most 5 object categories in the descending order of confidence. The quality of a labeling will be evaluated based on the label that best matches the ground truth label for the image. The idea is to allow an algorithm to identify multiple objects in an image and not be penalized if one of the objects identified was in fact present, but not included in the ground truth. For each image, an algorithm will produce 5 labels $ l_j, j=1,…,5 $. The ground truth labels for the image are $ g_k, k=1,…,n $ with n classes of objects labeled. The error of the algorithm for that image would be

where $ d(x,y)=0 $ if $ x=y $ and 1 otherwise. The overall error score for an algorithm is the average error over all test images. Note that for this version of the competition, $n=1$, that is, one ground truth label per image.

1.2 Typical solutions & models

The image classification pipeline: We’ve seen that the task in Image Classification is to take an array of pixels that represents a single image and assign a label to it. Our complete pipeline can be formalized as follows:

Models: There are many models to solve Image classification problem. ic_models

2 Object Localization

In fact, this is the most confusing task when I first look at ImageNet challenges.

This is a sort of intermediate task in between other two ILSRVC tasks, image classification and object detection. In image classification you have to assign a single label to an image corresponding to the “main” object (eventually, the image can contain multiple objects). The classification + localization requires also to localize a single instance of this object, even if the image contains multiple instances of it. This task is also called “single-instance localization”.2

loc

While It’s pretty easy for people to identify subtle differences in photos, computers still have a ways to go. Visually similar items are tough for computers to count. For instance, consider this photo of a family of foxes camouflaged in the wild - where do the foxes end and where does the grass begins?

loc_example

2.1 LOC in ImageNet

The definition of localization in ImageNet is:

In this task, an algorithm will produce 5 class labels $ l_j, j=1,…,5 $ and 5 bounding boxes $ b_j, j=1,…5 $, one for each class label. The ground truth labels for the image are $ g_k, k=1,…,n $ with n classes labels. For each ground truth class label $g_k$, the ground truth bounding boxes are $ z_{km}, m=1,…M_k, $ where $M_k$ is the number of instances of the $k^{th}$ object in the current image. The error of the algorithm for that image would be

where $ f(b_j, z_k)=0 $ if $b_j$ and $z_{mk}$ has over 50% overlap, and $ f(b_j,z_{mk})=1 $ otherwise. In other words, the error will be the same as defined in classification task if the localization is correct(i.e. the predicted bounding box overlaps over 50% with the ground truth bounding box, or in the case of multiple instances of the same class, with any of the ground truth bounding boxes), otherwise the error is 1(maximum).

loc_task

2.2 Typical solutions & models

See more detailed solutions on CS231n(16Winter): lecture 83.

3 Object Detection

Object detection is the process of finding instances of real-world objects such as faces, bicycles, and buildings in images or videos. Object detection algorithms typically use extracted features and learning algorithms to recognize instances of an object category. It is commonly used in applications such as image retrieval, security, surveillance, and automated vehicle parking systems.4

detection_example

3.1 Detection in ImageNet

The definition of detection in ImageNet is:

For each image, algorithms will produce a set of annotations $(c_i, s_i, b_i)$ of class labels $c_i$, confidence scores $s_i$ and bounding boxes $b_i$. This set is expected to contain each instance of each of the 200 object categories. Objects which were not annotated will be penalized, as will be duplicate detections (two annotations for the same object instance). The winner of the detection challenge will be the team which achieves first place accuracy on the most object categories.

3.2 Typical solutions & models

See more on CS231n(17Spring): lecture 115 and Object Localization and Detection6.

4 Segmentation

There are two kinds of segmentation tasks in CV: Semantic Segmentation & Instance Segmentation. The difference between them is on Instance Segmentation 比 Semantic Segmentation 难很多吗?.

semantic_segmentation instance_segmentation

4.1 Typical solutions & models

See more details on Image Segmentation7, Semantic Segmentation8, and really-awesome-semantic-segmentation9. ss_solution

References

© Luozm . All rights reserved. | Top

Life

Theory

Develop