Description: Leverage text-based instruction manuals and user guides to curate a multimodal dataset. Fuse text and visual features as input to the co-attention transformer layers.
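As a rough illustration of the fusion step, the sketch below shows one co-attention pass in which text features attend over visual features and vice versa. It is a minimal sketch under stated assumptions, not the method described here: the function names (`attend`, `co_attention`), the toy feature values, and the omission of learned query/key/value projections, multiple heads, and residual layers are all simplifications made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key/value rows."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def co_attention(text_feats, visual_feats):
    """One co-attention step (hypothetical helper, no learned projections).

    Text rows act as queries over visual keys/values, and visual rows act as
    queries over text keys/values.
    """
    text_out = attend(text_feats, visual_feats, visual_feats)
    visual_out = attend(visual_feats, text_feats, text_feats)
    return text_out, visual_out


# Toy inputs (assumed values): 3 text tokens and 2 image regions, feature dim 2.
text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual_feats = [[0.5, 0.5], [1.0, -1.0]]

text_out, visual_out = co_attention(text_feats, visual_feats)
print(len(text_out), len(visual_out))
```

Each output row keeps the sequence length of its own modality but is built as a weighted average of the other modality's features, which is the sense in which the two streams are fused.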