Image Captioning
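The image captioning task aims to produce an accurate textual description of an input image's content using natural language generation. It combines techniques from computer vision and natural language processing, and typically uses an encoder-decoder framework that converts the image into an intermediate representation and then decodes that representation into descriptive text. The main evaluation metrics include BLEU and CIDEr, and commonly used datasets include nocaps and COCO. Image captioning has important applications in helping visually impaired people understand images, in automated content tagging, and in intelligent image search.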
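To make the BLEU metric mentioned above concrete, here is a minimal sketch of a sentence-level BLEU score: clipped n-gram precision combined with a brevity penalty. The function names (`ngrams`, `modified_precision`, `bleu`) are illustrative, and the sketch assumes whitespace tokenization and no smoothing. It is not the official evaluation code used by any of the benchmarks listed here, which typically compute corpus-level scores with their own tokenizers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU without smoothing (assumption: uniform weights)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        # Without smoothing, any zero precision drives the geometric mean to 0.
        return 0.0
    c = len(candidate)
    # Pick the reference length closest to the candidate length.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty * math.exp(log_mean)


if __name__ == "__main__":
    caption = "a dog runs on the grass".split()
    refs = ["a dog runs on the grass".split(),
            "a dog is running across a lawn".split()]
    print(bleu(caption, refs))
```

Clipping is what prevents a degenerate caption such as "the the the the" from scoring well against a reference that contains "the" only once. CIDEr follows a different design, weighting n-grams by TF-IDF across the reference set, and is not covered by this sketch.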
VizWiz 2020 test-dev
nocaps in-domain
VinVL (Microsoft Cognitive Services + MSR)
COCO Captions
mPLUG
nocaps near-domain
GIT2, Single Model
nocaps out-of-domain
PaLI
nocaps entire
MS COCO
ExpansionNet v2
VizWiz 2020 test
nocaps-XD entire
GIT
nocaps-val-in-domain
nocaps-val-overall
nocaps-XD in-domain
GIT2
nocaps-XD near-domain
GIT2
nocaps-XD out-of-domain
GIT2
TextCaps 2020
nocaps-val-near-domain
nocaps-val-out-domain
SCICAP
CNN+LSTM (Vision only, First sentence)
Flickr30k Captions test
Unified VLP
WHOOPS!
nocaps val
Prismer
Object HalBench
COCO Captions test
From Captions to Visual Concepts and Back
Conceptual Captions
ClipCap (MLP + GPT2 tuning)
FlickrStyle10K
CapDec
Localized Narratives
AIC-ICC
BanglaLekhaImageCaptions
CNN + 1D CNN
ChEBI-20
GIT-Mol
IU X-Ray
MS-COCO
NeuSyRE
MSCOCO
CapDec
Peir Gross
BiomedGPT
foundation-multimodal-models/DetailCaps-4870