Systematically evaluating the factuality of large language models with the FACTS Benchmark Suite.
The FACTS Benchmark Suite is a framework for assessing the factual accuracy and grounding of large language models (LLMs). It includes the FACTS Grounding benchmark, which evaluates LLMs’ ability to generate factually accurate long-form responses grounded in provided context documents of up to 32,000 tokens. (deepmind.google)
Several related benchmarks have been introduced to further evaluate the factuality of LLMs:
- RelationalFactQA: This benchmark focuses on evaluating LLMs’ ability to retrieve and generate structured, multi-record tabular outputs from parametric knowledge. It highlights the challenges LLMs face in relational fact retrieval, especially as output dimensionality increases. (arxiv.org)
- ReFACT: Designed for detecting scientific confabulation, ReFACT provides a dataset of 1,001 expert-annotated question-answer pairs across diverse scientific domains. Each instance includes both a correct and a non-factual answer, annotated with precise error spans and types, enabling multi-stage evaluation of confabulation detection, error localization, and correction. (arxiv.org)
- VeriFact: This framework enhances long-form factuality evaluation by refining fact extraction and incorporating reference facts. It introduces FactRBench, a benchmark that evaluates both precision and recall in long-form model responses, addressing the limitations of prior work that primarily focused on precision. (arxiv.org)
- FACTORY: A large-scale, human-verified prompt set developed using a model-in-the-loop approach, FACTORY includes challenging prompts that are fact-seeking, answerable, and unambiguous. Human evaluations on state-of-the-art language models using FACTORY reveal that approximately 40% of the claims made in the responses are not factual, compared to only 10% for other datasets. (arxiv.org)
- Factcheck-Bench: A holistic, end-to-end framework for annotating and evaluating the factuality of LLM-generated responses. Its multi-stage annotation scheme yields detailed labels for fact-checking and correcting not only the final prediction but also the intermediate steps that a fact-checking system might need to take. (aclanthology.org)
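The difference between precision and recall noted in the VeriFact item can be made concrete with a small sketch. It assumes a response has already been split into claims, each judged supported or not, and that a set of reference facts has been checked for coverage. The function and field names are hypothetical, the verdicts are invented toy data, and the F1 combination is an illustrative addition rather than a metric taken from FactRBench.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FactualityScore:
    precision: float  # share of extracted claims judged supported
    recall: float     # share of reference facts covered by the response
    f1: float         # harmonic mean, added here for illustration only


def score_response(
    claim_supported: Sequence[bool],
    reference_covered: Sequence[bool],
) -> FactualityScore:
    """Compute claim-level precision and reference-level recall.

    claim_supported[i]: hypothetical verdict on the i-th extracted claim.
    reference_covered[j]: hypothetical verdict on whether the response
    covers the j-th reference fact.
    """
    precision = (
        sum(claim_supported) / len(claim_supported) if claim_supported else 0.0
    )
    recall = (
        sum(reference_covered) / len(reference_covered)
        if reference_covered
        else 0.0
    )
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return FactualityScore(precision, recall, f1)


# Toy verdicts, invented for illustration (not benchmark data).
score = score_response(
    claim_supported=[True, True, True, False],
    reference_covered=[True, False, False, False, True],
)
print(score)
```

In this toy case the response is mostly accurate (precision 0.75) but covers only two of five reference facts (recall 0.4), which is the kind of gap a precision-only evaluation would miss.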
Together, these benchmarks support ongoing efforts to improve the factual accuracy and reliability of large language models across domains.
