Illustration of two types of model architectures for VLP (IMAGE)
Caption
In the single-stream architecture, the text and visual features are concatenated and fed into a single transformer block, as shown in Fig. 1(a). In the dual-stream architecture, the text and visual features are not concatenated; instead, they are sent independently to two separate transformer blocks, as shown in Fig. 1(b).
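The difference between the two data flows can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the models shown in the figure: `self_attention` is a hypothetical toy stand-in for a transformer block (single head, identity projections, no feed-forward layer), and the function names and feature values are invented for the example.

```python
import math


def self_attention(tokens):
    """Toy stand-in for a transformer block (assumption: single-head
    self-attention with identity query/key/value projections)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Scaled dot-product score of this token against every token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        # Numerically stable softmax over the scores.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Output is the attention-weighted average of all tokens.
        outputs.append([
            sum((w / total) * value[i] for w, value in zip(weights, tokens))
            for i in range(dim)
        ])
    return outputs


def single_stream(text_feats, visual_feats):
    """Concatenate both modalities, then apply one shared block."""
    return self_attention(text_feats + visual_feats)


def dual_stream(text_feats, visual_feats):
    """Apply a separate block to each modality, with no concatenation."""
    return self_attention(text_feats), self_attention(visual_feats)


# Hypothetical 2-dimensional features: 2 text tokens, 3 visual tokens.
text = [[1.0, 0.0], [0.0, 1.0]]
visual = [[1.0, 1.0], [0.5, 0.5], [0.0, 2.0]]

fused = single_stream(text, visual)
text_out, visual_out = dual_stream(text, visual)
print(len(fused), len(text_out), len(visual_out))
```

In the single-stream sketch every text token can attend to every visual token, so the modalities interact inside the block. In the dual-stream sketch each modality only attends to itself, so any cross-modal interaction would have to be added separately.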
Credit
Beijing Zhongke Journal Publishing Co. Ltd.
Usage Restrictions
Credit must be given to the creator.
License
CC BY