Large language models (LLMs) can perform abstract reasoning tasks, but they are susceptible to many of the same errors that humans make. Andrew Lampinen, Ishita Dasgupta, and colleagues tested state-of-the-art LLMs and humans on three types of reasoning tasks: natural language inference, assessing the logical validity of syllogisms, and the Wason selection task. The authors found that LLMs were subject to content effects similar to those seen in humans: both humans and LLMs were more likely to mistakenly label an invalid argument as valid when its semantic content was sensible and believable.

LLMs also performed about as poorly as humans on the Wason selection task, in which a participant is shown four cards with letters or numbers on them (e.g., "D," "F," "3," and "7") and asked which cards must be turned over to test whether a rule such as "if a card has a 'D' on one side, then it has a '3' on the other side" holds. Humans often choose to turn over cards that offer no information about the validity of the rule, in effect testing the rule's converse. In this example, humans tend to choose the card labeled "3," even though the rule does not imply that a card with a "3" has a "D" on the back. LLMs make this and other errors, and their overall error rate is similar to that of humans. For both humans and LLMs, performance on the Wason task improves when rules about arbitrary letters and numbers are replaced with socially relevant ones, such as rules relating people's ages to whether they drink alcohol or soda.

According to the authors, LLMs trained on human data appear to inherit some human weaknesses in reasoning and, like humans, may require formal training to improve their logical reasoning performance.
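To make the card logic concrete, the short sketch below (not part of the study; it simply encodes the example above) checks which of the four cards could, once turned over, reveal a counterexample to the rule "if a card has a 'D' on one side, then it has a '3' on the other." Only the "D" and "7" cards can, which is why turning over the "3" card is uninformative.

```python
# Minimal sketch of the Wason selection task example above.
# Assumes each card has a letter on one face and a number on the other.

def could_falsify(visible_face: str) -> bool:
    """Return True if some hidden face would make this card a
    counterexample to the rule 'if D on one side, then 3 on the other'."""
    if visible_face.isalpha():
        # A letter card can only break the rule if it shows 'D'
        # and hides a number other than '3'.
        return visible_face == "D"
    # A number card can only break the rule if it is not '3'
    # and hides a 'D'; a card showing '3' can never be a counterexample.
    return visible_face != "3"

cards = ["D", "F", "3", "7"]
print([c for c in cards if could_falsify(c)])  # prints ['D', '7'], the informative cards
```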