Timezone: »
We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison, integrated with the Dynabench platform. Our platform evaluates NLP models directly instead of relying on self-reported metrics or predictions on a single dataset. Under this paradigm, models are submitted to be evaluated in the cloud, circumventing the issues of reproducibility, accessibility, and backwards compatibility that often hinder benchmarking in NLP. This allows users to interact with uploaded models in real time to assess their quality, and permits the collection of additional metrics such as memory use, throughput, and robustness, which -- despite their importance to practitioners -- have traditionally been absent from leaderboards. On each task, models are ranked according to the Dynascore, a novel utility-based aggregation of these statistics, which users can customize to better reflect their preferences, placing more/less weight on a particular axis of evaluation or dataset. As state-of-the-art NLP models push the limits of traditional benchmarks, Dynaboard offers a standardized solution for a more diverse and comprehensive evaluation of model quality.
Author Information
Zhiyi Ma (Facebook)
Kawin Ethayarajh (Stanford University)
Tristan Thrush (Facebook)
Somya Jain (Facebook)
Ledell Wu (Facebook, Inc.)
Robin Jia (Facebook)
Christopher Potts (Stanford University)
Adina Williams (Facebook AI Research)
Douwe Kiela (Facebook AI Research)
More from the Same Authors
-
2021 : ReaSCAN: Compositional Reasoning in Language Grounding »
Zhengxuan Wu · Elisa Kreiss · Desmond Ong · Christopher Potts -
2021 Spotlight: Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval »
Omar Khattab · Christopher Potts · Matei Zaharia -
2022 : Perturbation Augmentation for Fairer NLP »
Rebecca Qian · Candace Ross · Jude Fernandes · Eric Michael Smith · Douwe Kiela · Adina Williams -
2022 Poster: CEBaB: Estimating the Causal Effects of Real-World Concepts on NLP Model Behavior »
Eldar D Abraham · Karel D'Oosterlinck · Amir Feder · Yair Gat · Atticus Geiger · Christopher Potts · Roi Reichart · Zhengxuan Wu -
2021 : Facebook - Data Centric Infrastructure »
Douwe Kiela -
2021 : Intuitive Image Descriptions are Context-Sensitive »
Shayan Hooshmand · Elisa Kreiss · Christopher Potts -
2021 Poster: Causal Abstractions of Neural Networks »
Atticus Geiger · Hanson Lu · Thomas Icard · Christopher Potts -
2021 Poster: True Few-Shot Learning with Language Models »
Ethan Perez · Douwe Kiela · Kyunghyun Cho -
2021 : Invited talk - Douwe Kiela »
Douwe Kiela -
2021 Poster: Decrypting Cryptic Crosswords: Semantically Complex Wordplay Puzzles as a Target for NLP »
Josh Rozner · Christopher Potts · Kyle Mahowald -
2021 Poster: Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval »
Omar Khattab · Christopher Potts · Matei Zaharia -
2021 : Introduction Competion Day 1 »
Douwe Kiela -
2021 Poster: Human-Adversarial Visual Question Answering »
Sasha Sheng · Amanpreet Singh · Vedanuj Goswami · Jose Magana · Tristan Thrush · Wojciech Galuba · Devi Parikh · Douwe Kiela -
2020 : Q & A and Panel Session with Dan Weld, Kristen Grauman, Scott Yih, Emma Brunskill, and Alex Ratner »
Kristen Grauman · Wen-tau Yih · Alexander Ratner · Emma Brunskill · Douwe Kiela · Daniel S. Weld -
2020 Workshop: HAMLETS: Human And Model in the Loop Evaluation and Training Strategies »
Divyansh Kaushik · Bhargavi Paranjape · Forough Arabshahi · Yanai Elazar · Yixin Nie · Max Bartolo · Polina Kirichenko · Pontus Lars Erik Saito Stenetorp · Mohit Bansal · Zachary Lipton · Douwe Kiela -
2020 : Opening Remarks »
Divyansh Kaushik · Bhargavi Paranjape · Douwe Kiela -
2020 : The Hateful Memes Challenge: Live award ceremony and winner presentations »
Douwe Kiela -
2020 : The Hateful Memes Challenge: Competition Overview »
Douwe Kiela -
2020 Poster: The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes »
Douwe Kiela · Hamed Firooz · Aravind Mohan · Vedanuj Goswami · Amanpreet Singh · Pratik Ringshia · Davide Testuggine -
2020 Poster: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks »
Patrick Lewis · Ethan Perez · Aleksandra Piktus · Fabio Petroni · Vladimir Karpukhin · Naman Goyal · Heinrich Küttler · Mike Lewis · Wen-tau Yih · Tim Rocktäschel · Sebastian Riedel · Douwe Kiela -
2020 Poster: Learning Optimal Representations with the Decodable Information Bottleneck »
Yann Dubois · Douwe Kiela · David Schwab · Ramakrishna Vedantam -
2020 Spotlight: Learning Optimal Representations with the Decodable Information Bottleneck »
Yann Dubois · Douwe Kiela · David Schwab · Ramakrishna Vedantam -
2019 : Audrey Durand, Douwe Kiela, Kamalika Chaudhuri moderated by Yann Dauphin »
Audrey Durand · Kamalika Chaudhuri · Yann Dauphin · Orhan Firat · Dilan Gorur · Douwe Kiela -
2019 : Douwe Kiela - Benchmarking Progress in AI: A New Benchmark for Natural Language Understanding »
Douwe Kiela -
2019 Workshop: Emergent Communication: Towards Natural Language »
Abhinav Gupta · Michael Noukhovitch · Cinjon Resnick · Natasha Jaques · Angelos Filos · Marie Ossenkopf · Angeliki Lazaridou · Jakob Foerster · Ryan Lowe · Douwe Kiela · Kyunghyun Cho -
2019 Poster: Hyperbolic Graph Neural Networks »
Qi Liu · Maximilian Nickel · Douwe Kiela -
2018 Workshop: Emergent Communication Workshop »
Jakob Foerster · Angeliki Lazaridou · Ryan Lowe · Igor Mordatch · Douwe Kiela · Kyunghyun Cho -
2018 : Panel Discussion »
Antonio Torralba · Douwe Kiela · Barbara Landau · Angeliki Lazaridou · Joyce Chai · Christopher Manning · Stevan Harnad · Roozbeh Mottaghi -
2018 : Douwe Kiela - Learning Multimodal Embeddings »
Douwe Kiela -
2017 Workshop: Emergent Communication Workshop »
Jakob Foerster · Igor Mordatch · Angeliki Lazaridou · Kyunghyun Cho · Douwe Kiela · Pieter Abbeel -
2017 Poster: Poincaré Embeddings for Learning Hierarchical Representations »
Maximilian Nickel · Douwe Kiela -
2017 Spotlight: Poincaré Embeddings for Learning Hierarchical Representations »
Maximilian Nickel · Douwe Kiela