Description: Vision-only approach that takes screenshots and a region of interest (the “focus”) as input; the model is composed of a vision encoder and a language decoder.
Status:
Date: