I have gone through a couple of YOLO tutorials, but I am finding it hard to figure out whether the anchor boxes for each cell the image is divided into are predetermined. In one of the guides I went through, the image was divided into 13x13 cells, and it stated that each cell predicts 5 anchor boxes (bigger than the cell itself; this is my first problem, because the guide also says the cell first detects what object is present in it before predicting the boxes).

How can the small cell predict anchor boxes for an object bigger than it? Also, it's said that each cell classifies before predicting its anchor boxes. How can the small cell classify the right object without querying neighboring cells if only a small part of the object falls within the cell?

E.g., say one of the 13x13 cells contains only the white pocket of a T-shirt that a man is wearing. How can that cell correctly classify that a man is present without being linked to its neighboring cells? With a normal CNN trying to localize a single object, I know the bounding box prediction relates to the whole image, so at least I can say the network has an idea of what's going on everywhere in the image before deciding where the box should be.

PS: My current understanding of how YOLO works is that each cell is assigned predetermined anchor boxes with a classifier at the end of each, and the boxes with the highest scores for each class are then selected. I am sure this doesn't add up somewhere.

1 Answer

During training, YOLO learns which anchors to use for each object, so the "assignment" isn't deterministic. Because of this, multiple anchors can detect the same object, and you need to run non-max suppression afterward to keep the "best" one (i.e. the one with the highest confidence).
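To make the non-max suppression step concrete, here is a minimal pure-Python sketch. It is not the code from any YOLO release: the function names, the (x1, y1, x2, y2) box format, and the 0.5 IoU threshold are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest confidence first.

    A box is dropped when it overlaps an already kept (more confident)
    box by more than iou_threshold (an assumed, tunable value).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two anchors fire on the same object, a third on a different one.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

In practice this is usually run separately per class, so that overlapping objects of different classes do not suppress each other.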

Anchors are decided by a k-means procedure that looks at all the bounding boxes in your dataset. The k-means routine picks a set of anchor shapes (widths and heights) that represent your dataset. k=5 for YOLOv2, while YOLOv3 uses 9 anchors (3 at each of its 3 detection scales); the number of anchors differs between YOLO versions.
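As a rough sketch of how such a k-means routine could look, the code below clusters (width, height) pairs and treats the box with the highest IoU as the "closest" one. The function names, the fixed seed, and the toy data are assumptions for illustration, not the routine from any particular YOLO codebase.

```python
import random


def wh_iou(box, anchor):
    """IoU of two (width, height) pairs, as if both boxes shared a corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes_wh, k, iterations=100, seed=0):
    """Cluster (width, height) pairs into k anchor shapes."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes_wh, k)  # start from k boxes of the dataset
    for _ in range(iterations):
        # Assign every box to the anchor it overlaps most.
        clusters = [[] for _ in range(k)]
        for box in boxes_wh:
            best = max(range(k), key=lambda i: wh_iou(box, anchors[i]))
            clusters[best].append(box)
        # Move each anchor to the mean shape of its cluster.
        new_anchors = []
        for i, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_anchors.append((mean_w, mean_h))
            else:
                new_anchors.append(anchors[i])  # keep an empty cluster's anchor
        if new_anchors == anchors:
            break
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])


# Toy dataset: three small boxes and three tall, large ones (made-up values).
boxes_wh = [(10, 10), (12, 11), (11, 12), (100, 200), (110, 190), (105, 210)]
anchors = kmeans_anchors(boxes_wh, k=2)
```

On this toy data the two anchors should settle on one small, roughly square shape and one tall, large shape, which is the sense in which the anchors "represent" the dataset.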

It's useful to have anchors that represent your dataset because YOLO learns how to make small adjustments to the anchor boxes to create an accurate bounding box for your object. YOLO can learn small adjustments better/easier than large ones.
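The "small adjustments" can be illustrated with the box parameterization described in the YOLOv2 and YOLOv3 papers, where the network outputs four offsets per anchor. The sketch below assumes that parameterization, works in grid-cell units, and uses a hypothetical function name; the answer above does not spell out these details.

```python
import math


def decode_box(t, cell, anchor):
    """Turn raw network outputs into a box, in grid-cell units.

    t      : (tx, ty, tw, th) raw predictions for one anchor
    cell   : (cx, cy) column and row of the predicting cell
    anchor : (pw, ph) anchor width and height
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = anchor

    def sigmoid(v):
        return 1.0 / (1.0 + math.exp(-v))

    bx = cx + sigmoid(tx)   # center stays inside the predicting cell
    by = cy + sigmoid(ty)
    bw = pw * math.exp(tw)  # width and height rescale the anchor
    bh = ph * math.exp(th)
    return bx, by, bw, bh


# All-zero outputs give back the anchor itself, centered in cell (6, 6).
box = decode_box((0.0, 0.0, 0.0, 0.0), cell=(6, 6), anchor=(3.0, 5.0))
```

Only the box center is tied to the cell. The width and height come from scaling the anchor, so a cell can output a box several cells wide even though the cell itself is small.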
