Figure 4. Overall architecture of PET for corn ear kernel localization.
Caption
First, a convolutional neural network (CNN) backbone extracts image features. A transformer encoder with progressive rectangle window attention then encodes contextual information into these features. Next, a quadtree splitter takes sparse query points and the encoded features as input and outputs a point-query quadtree. A transformer decoder then decodes these point queries in parallel, with attention computed within a local window. Finally, the decoded point queries are passed through a prediction head to obtain kernel predictions: each query is classified as “kernel” or “no kernel”, with an associated probability and a predicted location.
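The quadtree splitting step can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the (y, x) point convention, and the `is_dense` callable are hypothetical. In the model described above, the split decision would come from the encoded features; here a hand-written rule stands in for it.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]  # (y, x) in pixel coordinates (assumed convention)


def sparse_query_points(height: int, width: int, stride: int) -> List[Point]:
    """Place one query point at the center of every stride x stride cell."""
    half = stride / 2
    return [
        (y + half, x + half)
        for y in range(0, height, stride)
        for x in range(0, width, stride)
    ]


def split_point_queries(
    points: List[Point],
    stride: int,
    is_dense: Callable[[float, float], bool],
) -> List[Point]:
    """Build the leaves of a one-level point-query quadtree.

    A sparse point in a dense region is replaced by four child points,
    one per quadrant of its cell; other sparse points are kept as-is.
    `is_dense` is a hypothetical stand-in for a learned split decision.
    """
    quarter = stride / 4
    queries: List[Point] = []
    for y, x in points:
        if is_dense(y, x):
            for dy in (-quarter, quarter):
                for dx in (-quarter, quarter):
                    queries.append((y + dy, x + dx))
        else:
            queries.append((y, x))
    return queries


if __name__ == "__main__":
    # Toy 16 x 16 image with stride 8: four sparse points.
    sparse = sparse_query_points(16, 16, 8)
    # Assumed rule for illustration: the left half of the image is dense.
    queries = split_point_queries(sparse, 8, lambda y, x: x < 8)
    print(len(sparse), len(queries))
```

In this toy setup, the two sparse points in the left half are each replaced by four child points, while the two in the right half are kept, so densely packed regions receive more queries than sparse ones.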
Credit
The authors
Usage Restrictions
Credit must be given to the creator.
License
CC BY