Description: A multimodal Transformer that maps images, structures, and language to 5 distinct tasks. The encoders and an auto-regressive Transformer are trained jointly. The language input is a command or a question, paired with an answer.
Status:
Date: