1. "Transformer and RNN" 2. "Vision-language representation learner" 3. "Diverse description generation framework" 4. "RWKV-CLIP model"