Image Captioning
Image captioning generates a natural-language description of the content of an input image. The task combines computer vision and natural language processing and typically uses an encoder-decoder framework: an encoder maps the image to an intermediate representation, and a decoder generates the descriptive text from that representation. The primary evaluation metrics include BLEU and CIDEr, and common datasets include nocaps and COCO. Applications include helping visually impaired people understand images, automated content tagging, and intelligent image search.
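To make the BLEU metric mentioned above more concrete, here is a minimal sketch of a sentence-level, unsmoothed BLEU score in plain Python. It assumes whitespace-tokenized captions and uniform weights over 1- to 4-grams; the function names (`ngrams`, `modified_precision`, `sentence_bleu`) and the example captions are hypothetical and chosen only for illustration. Leaderboard numbers are normally produced with established evaluation toolkits, so scores from this sketch should not be compared against them directly.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most as
    many times as it occurs in the single reference where it is most frequent."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights (assumption)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        # Without smoothing, a single zero precision makes the score zero.
        return 0.0
    # Brevity penalty uses the reference length closest to the candidate length
    # (ties broken in favor of the shorter reference).
    c = len(candidate)
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


# Hypothetical captions, for illustration only.
references = [
    "a dog runs on the grass".split(),
    "a dog is running across a lawn".split(),
]
candidate = "a dog runs on the lawn".split()
print(round(sentence_bleu(candidate, references), 3))
```

The clipping step is what keeps a degenerate caption such as "the the the the" from scoring well, and the brevity penalty discourages captions that are much shorter than the references. CIDEr, the other metric named above, weights n-grams by TF-IDF across the reference set and is not covered by this sketch.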
Benchmarks (dataset: listed model, where one is given)
VizWiz 2020 test-dev
nocaps in-domain: VinVL (Microsoft Cognitive Services + MSR)
COCO Captions: mPLUG
nocaps near-domain: GIT2, Single Model
nocaps out-of-domain: PaLI
nocaps entire
MS COCO: ExpansionNet v2
VizWiz 2020 test
nocaps-XD entire: GIT
nocaps-val-in-domain
nocaps-val-overall
nocaps-XD in-domain: GIT2
nocaps-XD near-domain: GIT2
nocaps-XD out-of-domain: GIT2
TextCaps 2020
nocaps-val-near-domain
nocaps-val-out-domain
SCICAP: CNN+LSTM (Vision only, First sentence)
Flickr30k Captions test: Unified VLP
WHOOPS!
nocaps val: Prismer
Object HalBench
COCO Captions test: From Captions to Visual Concepts and Back
Conceptual Captions: ClipCap (MLP + GPT2 tuning)
FlickrStyle10K: CapDec
Localized Narratives
AIC-ICC
BanglaLekhaImageCaptions: CNN + 1D CNN
ChEBI-20: GIT-Mol
IU X-Ray
MS-COCO: NeuSyRE
MSCOCO: CapDec
Peir Gross: BiomedGPT
foundation-multimodal-models/DetailCaps-4870