AI study finds dramatic breakdown in LLM reasoning

Even the best large language models (LLMs) fail miserably when it comes to simple logical questions. This is the conclusion of researchers from the Jülich Supercomputing Centre (JSC), the School of Electrical and Electronic Engineering at the University of Bristol, and the LAION AI Lab. In their paper, “Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models” (preprint available at https://arxiv.org/abs/2406.02061), the scientists document a “dramatic degradation of functional and reasoning abilities” in the state-of-the-art LLMs tested and suggest that, even if language models have the latent capacity to perform basic reasoning, they cannot access it in a robust and consistent manner. The study’s authors (Marianna Nezhurina, Lucia Cipolina-Kun, Mehdi Cherti, and Jenia Jitsev) call on “the scientific and technological community to stimulate an urgent reassessment of the claimed capabilities of the current generation of LLMs.” They also call for the development of standardized benchmarks that uncover weaknesses in language models’ basic reasoning abilities, since current tests have apparently failed to reveal this serious deficiency.

The essence of correct reasoning

The “common sense task,” called the “AIW problem” in the paper, is actually simple: “Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?” The values of N and M (always natural numbers) and the order in which the siblings are mentioned are varied. The researchers thus used various combinations of numbers and prompt types to get a precise overview of how the different models behave under systematic variations of the AIW problem. Regardless of the variation, the structure of the problem remains the same, and the correct answer is always obtained by counting Alice herself together with her sisters (M + 1), logic that most elementary school children can already follow. The language models, on the other hand, were able to solve the AIW problem only sporadically, if at all.
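
To make the variation scheme concrete, here is a minimal sketch in Python. It is not the authors’ actual evaluation harness; the template wording and the value ranges are illustrative assumptions, but the ground-truth rule is the M + 1 logic described above.

```python
# Minimal sketch (not the authors' evaluation harness): enumerating
# AIW-style prompt variations together with their ground-truth answers.
import itertools

# Two orderings of the sibling statement; the exact wording is an
# illustrative assumption, not the paper's verbatim prompt set.
TEMPLATES = [
    "Alice has {n} brothers and she also has {m} sisters. "
    "How many sisters does Alice's brother have?",
    "Alice has {m} sisters and she also has {n} brothers. "
    "How many sisters does Alice's brother have?",
]

def aiw_variations(n_values, m_values):
    """Yield (prompt, correct_answer) pairs for all combinations of N and M."""
    for template, n, m in itertools.product(TEMPLATES, n_values, m_values):
        prompt = template.format(n=n, m=m)
        # Alice's brother has all of Alice's sisters plus Alice herself: M + 1.
        yield prompt, m + 1

if __name__ == "__main__":
    for prompt, answer in aiw_variations([2, 3, 4], [1, 2, 3]):
        print(f"{prompt}  ->  expected answer: {answer}")
```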

One example: confronted with the simplest version of the question, a tested LLM gave an answer that may sound plausible but is still wrong (of course, Alice’s brothers have two sisters). The other language AIs tested also have problems – big problems, depending on the question. Sometimes they get tangled up in absurd reasoning, repeatedly arriving at incorrect results and declaring them “correct”. So it’s not just the wrong results that are the problem, but also the fact that the AIs use pseudo-sensible arguments to support them. Even the researchers’ interventions encouraging them to critically examine their answers don’t help. As the researchers put it: “[…] The models also express strong overconfidence in their wrong solutions, while providing often absurd ‘reasoning’-type explanations […] to justify and support the validity of their clearly failed answers, making them plausible.”

More than half of the answers are wrong

Overall, the LLMs had an average correct-answer rate well below 50%, with larger models generally performing much better than smaller ones (e.g., GPT-4o showing a correct-answer rate slightly above 60%), again highlighting the benefits of larger scale. However, even the larger models do not perform well enough to be credited with robust basic reasoning. In particular, the very large fluctuations observed across even small variations of the AIW problem clearly indicate that the models are not capable of robust basic reasoning: they are thrown off even by minor changes to the problem that should not matter for arriving at a correct solution. A more difficult version of the problem (the “AIW+ problem”) ultimately pushed all the models past the limits of their reasoning abilities. According to the researchers, many of the tested models also score very highly on standardized benchmarks designed to test a range of abilities, including reasoning, while failing on the very simple AIW problem. In their paper, the scientists therefore suggest that these benchmarks do not adequately reflect the models’ deficits in basic reasoning, which also calls into question the use of current standardized benchmarks for model comparison.
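
To make the notion of fluctuation across variations concrete, here is a brief sketch using purely hypothetical numbers, not figures from the paper, of how a per-variation correct-answer rate and its spread could be summarized.

```python
# Minimal sketch with hypothetical numbers (not results from the paper):
# summarizing an average correct-answer rate and its fluctuation across
# AIW prompt variations.
from statistics import mean, pstdev

# Fraction of trials answered correctly, one entry per prompt variation
# (variation names and values are purely illustrative).
correct_rate_by_variation = {
    "variation_1": 0.60,
    "variation_2": 0.05,
    "variation_3": 0.35,
    "variation_4": 0.00,
}

rates = list(correct_rate_by_variation.values())
print(f"average correct-answer rate: {mean(rates):.2f}")
print(f"fluctuation across variations (std. dev.): {pstdev(rates):.2f}")
```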

Language models on the test bench

The study has not yet been peer-reviewed, but its findings are already making waves. What are the real capabilities of LLMs? What does the failure of LLMs on elementary-level tasks mean? Co-author Jenia Jitsev (JSC) says: “We are overwhelmed with discussions and questions following our study.” The findings raise many questions and make further studies of the competence of language models essential. Jenia Jitsev: “Our study provides extremely important new insights into the real capabilities of language models to draw correct conclusions by following appropriate basic reasoning. Further complementary research is needed to understand how and why the basic reasoning of current models fails on such simple problems.”
