15 Jun, 17:16
An international team of researchers posed a seemingly simple challenge to leading language models: the classic Stroop test, which psychologists have used for nearly a hundred years to measure the ability to concentrate. The results were discouraging: the longer the task, the more helpless the AI became, to the point of near-total failure. The study was published in the journal PNAS Nexus.
The essence of the Stroop test is simple: a subject is shown words denoting colors but written in ink of a different color, and is asked to name the ink color while ignoring the word itself. For example, the word "red" typed in blue font requires the answer "blue." The human brain handles this consistently even with long lists — it knows how to suppress the automatic response and maintain focus.
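The procedure can be illustrated with a minimal Python sketch. It is not the researchers' actual protocol: the four-color palette and the names `make_stroop_list` and `accuracy` are assumptions made for illustration. It builds a list of mismatched word/ink pairs and scores a set of answers against the ink colors.

```python
import random

# Hypothetical palette; the study's actual stimuli are not given in the article.
COLORS = ["red", "blue", "green", "yellow"]

def make_stroop_list(n, rng):
    """Build n incongruent trials as (word, ink) pairs where word != ink."""
    trials = []
    for _ in range(n):
        word = rng.choice(COLORS)
        ink = rng.choice([c for c in COLORS if c != word])
        trials.append((word, ink))
    return trials

def accuracy(trials, answers):
    """Fraction of answers that name the ink color rather than the word."""
    if not trials:
        return 0.0
    correct = sum(
        1 for (_, ink), answer in zip(trials, answers)
        if answer.strip().lower() == ink
    )
    return correct / len(trials)

rng = random.Random(0)
trials = make_stroop_list(5, rng)

# A respondent who follows the instruction names the ink color every time.
print(accuracy(trials, [ink for _, ink in trials]))    # 1.0
# A respondent who falls back on reading the words gets every item wrong.
print(accuracy(trials, [word for word, _ in trials]))  # 0.0
```

Because every pair is mismatched, simply reading the words scores zero, which is why the test separates following the instruction from the automatic response.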
The team, led by Suketu Patel, gave the test to several flagship models: GPT-4o, Claude 3.5 Sonnet, GPT-5, Claude Opus 4.1, and Gemini 2.5. On short lists of five words, all of the systems performed confidently. As the lists grew longer, however, accuracy collapsed: GPT-4o answered 91% of items correctly with five words, 57% with ten, and only 15% with forty. Claude 3.5 Sonnet held up longer than the others, maintaining an acceptable level up to twenty words, but then plummeted to 24%.
The study's authors attribute this effect to the models "losing sight" of the instruction as they work through the task and falling back on what they have learned most firmly: simply reading the words. In the researchers' view, this is what fundamentally distinguishes AI from humans, who can sustain voluntary attention over extended periods.