Network overview (IMAGE)
Caption
The input image is first split into overlapping patches. The patches then pass through a token reduction block and the main transformer, which learn features that carry global information. To abstract global information, a context token (blue vector) is added to the input sequence before the main transformer. The encoded features are then processed by the TAM and the regression-token module (RTM). The small decoder after the TAM is not shown for simplicity.
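The first and third steps of the caption can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the 6x6 toy image, the patch size of 4, and the stride of 2 are all assumptions chosen for illustration, and the context token is a fixed vector of zeros here, whereas in a trained network it would normally be a learned parameter. The token reduction block, the main transformer, the TAM, and the RTM are not modeled.

```python
def split_overlapping_patches(image, patch, stride):
    """Flatten every patch x patch window, taken every `stride` pixels.

    Patches overlap whenever stride < patch (an assumed setting).
    """
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            window = [image[top + r][left + c]
                      for r in range(patch) for c in range(patch)]
            tokens.append(window)
    return tokens


def prepend_context_token(tokens, fill=0.0):
    """Place one extra vector in front of the patch tokens.

    A constant vector stands in for the context token (hypothetical choice).
    """
    dim = len(tokens[0])
    return [[fill] * dim] + tokens


# Toy single-channel 6x6 image with pixel values 0..35.
image = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
tokens = split_overlapping_patches(image, patch=4, stride=2)
sequence = prepend_context_token(tokens)
```

With these assumed settings, neighbouring patches share two pixel columns or rows, and the sequence handed to the transformer is one element longer than the list of patch tokens.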
Credit
Beijing Zhongke Journal Publishing Co. Ltd.
Usage Restrictions
Credit must be given to the creator.
License
CC BY