The MSR-VTT-QA dataset is a benchmark for the task of Visual Question Answering (VQA) on the MSR-VTT (Microsoft Research Video to Text) dataset. The MSR-VTT-QA benchmark is used to evaluate models on their ability to answer questions based on these videos. It's part of the tasks that this dataset is used for, along with Video Retrieval, Video Captioning, Zero-Shot Video Question Answering, Zero-Shot Video Retrieval, and Text-to-Video Generation.
The ConceptARC dataset is a benchmark for evaluating understanding and generalization in the Abstraction and Reasoning Corpus (ARC) domain. It was developed by Arseny Moskvichev, Victor Vikram Odouard, and Melanie Mitchell. The ability to form and abstract concepts is key to human intelligence, but such abilities remain lacking in state-of-the-art AI systems. There has been substantial research on conceptual abstraction in AI, particularly using idealized domains such as Raven's Progressive Matrices and Bo
MathVista is a consolidated Mathematical reasoning benchmark within Visual contexts. It consists of three newly created datasets, IQTest, FunctionQA, and PaperQA, which address the missing visual domains and are tailored to evaluate logical reasoning on puzzle test figures, algebraic reasoning over functional plots, and scientific reasoning with academic paper figures, respectively. It also incorporates 9 MathQA datasets and 19 VQA datasets from the literature, which significantly enrich the diversity and
EgoSchema is very long-form video question-answering dataset, and benchmark to evaluate long video understanding capabilities of modern vision and language systems. Derived from Ego4D, EgoSchema consists of over 5000 human curated multiple choice question answer pairs, spanning over 250 hours of real video data, covering a very broad range of natural human activity and behavior.
The current state-of-the-art on VTAB-1k(Specialized<4>) is SPT-Deep(ViT-B/16_MoCo_v3_pretrained_ImageNet-1K). See a full comparison of 10 papers with code.