Description: Based on the self-aligned characteristics between components of different modalities, a well-designed joint image-text model
Status:
Date: