CHINA / SOCIETY
AI models score poorly when tackling gaokao questions
Published: Jun 20, 2024 12:07 AM
AI Photo:VCG

China's gaokao is widely acknowledged as one of the most challenging college entrance exams in the world. When artificial intelligence (AI) models were used to answer this year's exam questions, the highest score was only 303 out of a possible 420, and every model failed the math section.

Of the seven AI models tested, Alibaba Cloud's Tongyi Qianwen 2-72B took the top spot with a score of 303, followed by OpenAI's GPT-4o with 296, while InternLM2 from Shanghai AI Lab ranked third. Mistral, developed by a French startup, ranked last.

The evaluations were conducted on OpenCompass, an open-source LLM evaluation platform developed by Shanghai AI Lab. 

The results showed that the large models generally performed well on the Chinese and English papers, but all of them failed mathematics. The highest math score was only 75 points, achieved by InternLM2, followed by GPT-4o with 73 points.

The highest score in Chinese was achieved by Tongyi Qianwen, and in English by GPT-4o. There is still a lot of room for improvement in mathematics for the large models, according to Shanghai AI Lab.

Teachers involved in grading the papers said the essays written by the large models read more like answers to question-and-answer tasks. They lacked techniques that human candidates typically use, such as supporting arguments with examples, citing references and quoting famous sayings.

Most models cannot comprehend language concepts such as metaphors and metaphorical expressions, and they still struggle to fully grasp some of the implicit meanings in language, Shanghai AI Lab said.

Regarding mathematics, the teachers said the large models are relatively weak at answering subjective questions. In some cases, a model reached the correct answer even though its working contained errors. The exam results suggest that large models have a strong ability to memorize formulas but are unable to apply them flexibly in the problem-solving process.

Math requires complex reasoning, which is a common challenge for large models and a key capability for their reliable implementation in various industrial scenarios, industry observers said.

In English, the overall performance was good, but some models scored lower on the English essay because they exceeded the word limit, whereas human candidates often lose points for falling short of the word count, the lab said.

Gaokao refers to the annual national college entrance exam, which is regarded as one of the most important exams for Chinese students. The evaluation was based on the national new curriculum standard paper, testing Chinese, mathematics, and English, including both objective and subjective questions.

The papers were scored anonymously by at least three teachers with experience in grading gaokao papers. Before grading, the teachers were not told that all of the answers had been generated by AI models, according to Shanghai AI Lab.

Global Times