Researchers find ‘inconsistent’ benchmarking across 3,867 AI research papers

11/08/2020

The metrics used to benchmark AI and machine learning models often inadequately reflect those models' true performance. That's according to a preprint study from researchers at the Institute for Artificial Intelligence and Decision Support in Vienna, who analyzed tens of thousands of model performance results collected from the open source web-based platform Papers with Code. They claim that alternative, more appropriate metrics are rarely used in benchmarking and that the reporting of metrics is inconsistent and unspecific, leading to ambiguities.

Benchmarking is an important driver of progress in AI research. A task (or tasks) and the metrics associated with it (or them) can be perceived as an abstraction of a problem the scientific community aims to solve. Benchmark data sets are conceptualized as fixed, representative samples for tasks to be solved by a model. But while benchmarks covering a range of tasks, including machine translation, object detection, and question answering, have been established, the coauthors of the paper claim that some of the metrics used to score them, like accuracy (i.e., the ratio of correctly predicted samples to the total number of samples), emphasize certain aspects of performance at the expense of others.

In their analysis, the researchers looked at 32,209 benchmark results across 2,298 data sets from 3,867 papers published between 2000 and June 2020. They found the studies used a total of 187 distinct top-level metrics, and that the most frequently used metric was “accuracy,” which appeared in 38% of the benchmark data sets. The second and third most commonly reported metrics were “precision,” the fraction of relevant instances among retrieved instances, and “F-measure,” the harmonic mean of precision and recall (the fraction of the total relevant instances actually retrieved). Within the subset of papers covering natural language processing, the three most commonly reported metrics were BLEU score (for tasks like summarization and text generation), the ROUGE metrics (video captioning and summarization), and METEOR (question answering).
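
To make the definitions concrete, here is a minimal, purely illustrative Python sketch (the counts are invented, not drawn from the study) showing how precision, recall, and F-measure follow from raw retrieval counts:

    # Illustrative sketch (not from the study): the three most common metrics,
    # computed from raw counts for a single positive class.
    true_positives = 70    # relevant items the model retrieved
    false_positives = 30   # items the model retrieved that were not relevant
    false_negatives = 20   # relevant items the model missed

    precision = true_positives / (true_positives + false_positives)   # 70 / 100 = 0.70
    recall = true_positives / (true_positives + false_negatives)      # 70 / 90  ~ 0.78
    f_measure = 2 * precision * recall / (precision + recall)         # harmonic mean ~ 0.74

    print(precision, recall, f_measure)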

For more than three quarters (77.2%) of the analyzed benchmark data sets, only a single performance metric was reported, according to the researchers. A further 14.4% of the benchmark data sets had two top-level metrics, and 6% had three.

The researchers also note irregularities in the reporting of metrics, like the referencing of “area under the curve” simply as “AUC.” Area under the curve summarizes classification performance, but it means different things depending on which curve is used: precision plotted against recall (PR-AUC), or recall (the true-positive rate) plotted against the false-positive rate (ROC-AUC). Similarly, several papers referred to a natural language processing benchmark, ROUGE, without specifying which variant was used. ROUGE has precision- and recall-tailored subvariants, and while the recall subvariant is more common, the omission could lead to ambiguities when comparing results between papers, the researchers argue.
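
As an illustration of why the distinction matters (this sketch uses synthetic data and scikit-learn, and is not taken from the paper), the two AUC variants can tell very different stories on an imbalanced problem:

    # Illustrative sketch (not from the study): the two common readings of "AUC"
    # can diverge sharply on imbalanced data, so naming the variant matters.
    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score

    rng = np.random.default_rng(0)
    y_true = (rng.random(10_000) < 0.05).astype(int)   # ~5% positives
    scores = rng.normal(size=10_000) + 0.8 * y_true    # a weak, noisy classifier

    print("ROC-AUC:", roc_auc_score(y_true, scores))             # can look respectable
    print("PR-AUC :", average_precision_score(y_true, scores))   # much lower on rare positives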

Inconsistencies aside, many of the benchmarks used in the papers surveyed are problematic, the researchers say. Accuracy, which is often used to evaluate binary and multiclass classifier models, doesn’t yield informative results when dealing with unbalanced corpora exhibiting large differences in the number of instances per class. If a classifier predicts the majority class in all cases, then accuracy is equal to the proportion of the majority class among the total cases. For example, if a given “class A” makes up 95% of all instances, a classifier that predicts “class A” all the time will have an accuracy of 95%.
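
The effect is easy to reproduce. The following sketch (hypothetical data, scikit-learn assumed) shows the 95% figure falling straight out of the class balance:

    # Illustrative sketch (not from the study): a classifier that always predicts
    # the majority class scores 95% accuracy on a 95/5 split while learning nothing.
    import numpy as np
    from sklearn.metrics import accuracy_score

    y_true = np.array([0] * 950 + [1] * 50)   # "class A" (label 0) is 95% of instances
    y_pred = np.zeros_like(y_true)            # always predict the majority class

    print(accuracy_score(y_true, y_pred))     # 0.95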

Precision and recall also have limitations in that they focus only on instances predicted as positive by a classifier, or on true positives (accurate predictions of the positive class). Both ignore a model's capacity to accurately predict negative cases. As for F-scores, they sometimes weight precision more heavily than recall, producing misleading results for classifiers biased toward predicting the majority class. Beyond that, they capture performance on only one class, as illustrated below.
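
A small, invented example (not from the paper) makes the blind spot visible: adding ten thousand correctly rejected negatives changes none of the three scores.

    # Illustrative sketch (not from the study): precision, recall, and F1 never
    # look at true negatives, so the scores are identical with 10 or 10,010 of them.
    from sklearn.metrics import precision_score, recall_score, f1_score

    def report(label, y_true, y_pred):
        print(label,
              "precision=%.3f" % precision_score(y_true, y_pred),
              "recall=%.3f" % recall_score(y_true, y_pred),
              "F1=%.3f" % f1_score(y_true, y_pred))

    # Base case: TP=70, FN=10, FP=5, TN=10
    y_true = [1] * 70 + [1] * 10 + [0] * 5 + [0] * 10
    y_pred = [1] * 70 + [0] * 10 + [1] * 5 + [0] * 10

    # Same predictions on the positive side, plus 10,000 extra true negatives
    y_true_big = y_true + [0] * 10_000
    y_pred_big = y_pred + [0] * 10_000

    report("few true negatives: ", y_true, y_pred)
    report("many true negatives:", y_true_big, y_pred_big)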

In the natural language processing domain, the researchers highlight issues with benchmarks like BLEU and ROUGE. BLEU doesn't consider recall and correlates poorly with human judgments of machine translation quality, while ROUGE doesn't adequately cover tasks that rely on extensive paraphrasing, such as abstractive summarization and the extractive summarization of transcripts with many different speakers, like meeting transcripts.
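
A toy example (assuming NLTK's BLEU implementation; the sentences are invented, not from the study) shows how the metric's reliance on surface n-gram overlap can diverge from a human's judgment of adequacy:

    # Illustrative sketch (not from the study): BLEU rewards n-gram overlap, so a
    # reasonable paraphrase scores near zero against a single reference.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = [["the", "cat", "sat", "on", "the", "mat"]]
    exact = ["the", "cat", "sat", "on", "the", "mat"]
    paraphrase = ["a", "feline", "was", "sitting", "on", "the", "rug"]

    smooth = SmoothingFunction().method1
    print(sentence_bleu(reference, exact, smoothing_function=smooth))       # ~1.0
    print(sentence_bleu(reference, paraphrase, smoothing_function=smooth))  # near 0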

The researchers found that better-suited alternatives, such as the Matthews correlation coefficient and the Fowlkes-Mallows index, which address some of the shortcomings of the accuracy and F-score metrics, weren't used in any of the papers they analyzed. In fact, in 83.1% of the benchmark data sets where the top-level metric “accuracy” was reported, it was the only top-level metric, and F-measure was the only metric in 60.9% of the data sets. A similar pattern held for the natural language processing metrics: METEOR, which has been shown to correlate strongly with human judgment across tasks, was used only 13 times, and GLEU, which aims to assess how well generated text conforms to “normal” language usage, appeared only three times.
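
For comparison, a small sketch (synthetic data; scikit-learn provides MCC, and the Fowlkes-Mallows index is computed here by hand as the geometric mean of precision and recall) shows both alternatives flagging a near-useless classifier that accuracy flatters:

    # Illustrative sketch (not from the study): on a 95/5 imbalanced dataset, a
    # near-constant classifier keeps 95% accuracy while MCC and the
    # Fowlkes-Mallows index stay low.
    import numpy as np
    from sklearn.metrics import accuracy_score, matthews_corrcoef, precision_score, recall_score

    # Ground truth: 950 negatives, 50 positives
    y_true = np.array([0] * 950 + [1] * 50)

    # Classifier that almost always predicts the majority class:
    # it flags only 10 samples as positive, and only 5 of those are correct.
    y_pred = np.zeros_like(y_true)
    y_pred[945:950] = 1   # 5 false positives
    y_pred[950:955] = 1   # 5 true positives (the other 45 positives are missed)

    precision = precision_score(y_true, y_pred)   # 0.5
    recall = recall_score(y_true, y_pred)         # 0.1

    print("accuracy       :", accuracy_score(y_true, y_pred))      # 0.95
    print("MCC            :", matthews_corrcoef(y_true, y_pred))   # ~0.21
    print("Fowlkes-Mallows:", np.sqrt(precision * recall))          # ~0.22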

The researchers concede that their decision to analyze preprints, as opposed to papers accepted to scientific journals, could skew the results of their study. However, they stand behind their conclusion that the majority of metrics currently used to evaluate AI benchmark tasks have properties that can result in an inadequate reflection of a classifier's performance, especially when used with imbalanced data sets. “While alternative metrics that address problematic properties have been proposed, they are currently rarely applied as performance metrics in benchmarking tasks, where a small set of historically established metrics is used instead. NLP-specific tasks pose additional challenges for metrics design due to language and task-specific complexities,” the researchers wrote.

A growing number of academics are calling for a focus on scientific advancement in AI rather than better performance on benchmarks. In a June interview, Denny Britz, a former resident on the Google Brain team, said he believed that chasing state-of-the-art results is bad practice because there are too many confounding variables and because it favors large, well-funded labs like DeepMind and OpenAI. Separately, Zachary Lipton (an assistant professor at Carnegie Mellon University) and Jacob Steinhardt (a member of the statistics faculty at the University of California, Berkeley) proposed in a recent meta-analysis that AI researchers home in on the how and why of an approach, as opposed to raw performance, and conduct more error analyses, ablation studies, and robustness checks in the course of their research.
