General architecture of the proposed framework (IMAGE)
Caption
hv( ) and ht( ) represent the image encoder and text encoder, respectively, which extract features from images and texts. Each has a corresponding momentum encoder, which is employed to provide rich negative samples. The extracted features are then fed into GPO aggregators to obtain the holistic embeddings. gv( ) and gt( ) denote the GPO aggregators for the image and text modalities, respectively, and each has a corresponding momentum aggregator. To learn adequate alignment relationships between the modalities, the aggregated features are aligned in the alignment module, which contains three objectives: image-text contrastive learning (ITC), intra-modal separability (IMS), and local mutual information maximization (LMIM). Finally, a multimodal fusion encoder hf( ) is placed at the end of the model to explore the interaction information between the modalities. Details of the image encoder, text encoder, and fusion encoder are shown on the right side of the figure.
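As a rough illustration of two ideas named in the caption, the sketch below shows a momentum update and a symmetric image-text contrastive loss. It is not the authors' implementation. The caption gives no formulas, so the exponential-moving-average rule, the coefficient `m`, the temperature, the use of in-batch negatives only, and all function names are assumptions made for this example.

```python
import math

def momentum_update(online, momentum, m=0.995):
    # Assumed EMA rule: momentum <- m * momentum + (1 - m) * online.
    # Parameters are plain lists of floats here for simplicity.
    return [m * p_m + (1.0 - m) * p_o for p_o, p_m in zip(online, momentum)]

def normalize(v):
    # Scale a vector to unit L2 norm so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    # Mean cross-entropy where row i's correct class is column i.
    total = 0.0
    for i, row in enumerate(logits):
        mx = max(row)  # subtract the max for numerical stability
        log_z = mx + math.log(sum(math.exp(x - mx) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def itc_loss(img, txt, temperature=0.07):
    # Symmetric contrastive loss over a batch of paired embeddings.
    # img[i] and txt[i] are treated as the matching pair; every other
    # item in the batch acts as a negative (an assumption of this sketch).
    img = [normalize(v) for v in img]
    txt = [normalize(v) for v in txt]
    i2t = [[sum(a * b for a, b in zip(u, v)) / temperature for v in txt]
           for u in img]
    t2i = [list(col) for col in zip(*i2t)]  # transpose for text-to-image
    return 0.5 * (cross_entropy_diag(i2t) + cross_entropy_diag(t2i))

# Toy usage: correctly paired embeddings give a lower loss than mismatched ones.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned = itc_loss(image_emb, text_emb)
mismatched = itc_loss(image_emb, text_emb[::-1])
```

In this toy setup the loss only rewards pairs whose embeddings point in the same direction. The IMS and LMIM objectives and the GPO aggregators are not covered by the sketch.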
Credit
Beijing Zhongke Journal Publishing Co. Ltd.
Usage Restrictions
Credit must be given to the creator.
License
CC BY