image: A TTCT-inspired dataset was constructed to evaluate LLMs under varied prompts and role-play settings. GPT-4 served as the evaluator to score model outputs.
Credit: Beijing Zhongke Journal Publishing Co. Ltd.
In recent years, the realm of artificial intelligence (AI) has witnessed a meteoric rise in the development and sophistication of large language models (LLMs). LLMs have advanced significantly in addressing a variety of conventional natural language processing tasks, such as reasoning and natural language understanding, and have demonstrated significant value in widespread applications. From transforming rudimentary text into compelling narratives, unlocking a new realm of storytelling, to solving complex algorithmic problems, these models have shown a semblance of what could be interpreted as creativity. The practical manifestations of this creativity have penetrated various sectors: scientific research, where they assist in idea generation; education, where they provide personalized learning experiences; and the entertainment industry, where they create music and art. In many of their applications, LLMs appear to generate original text and aid tasks involving imagination and creativity, suggesting that they may indeed possess elements of creativity.
Among the broad capabilities demonstrated by LLMs, creativity is a key reason they are considered powerful. However, behind these impressive abilities lies a significant question that warrants careful examination: do these models actually possess real creativity, or is their apparent intelligence merely an illusion, a complex imitation of human thinking produced by their training paradigm? This question touches on the very nature of LLM intelligence, which may not be easily explained. Since LLMs have shown considerable creativity, understanding the extent and characteristics of this creativity is essential. Deeper insight into the creativity of LLMs can guide efforts to further improve their performance and enhance people's understanding of the nature of that creativity, which in turn informs the everyday use and application of these models, underscoring the need for an effective method to measure and assess their creativity. Specifically, creative abilities are critical in the following application scenarios. First, LLMs can inspire humans on creative tasks and provide novel ideas, especially in research idea generation, though it has also been suggested that using LLMs can homogenize creative output. Second, humor generation with LLMs offers significant value in both creative and practical applications: by simulating human-like humor, LLMs can assist in content creation for entertainment, marketing, and social media. Finally, LLMs can serve as powerful co-creators in creative writing by generating narrative ideas, suggesting plot developments, or even drafting sections of text that human writers then refine.
Creativity, as a term, traditionally refers to the natural ability to think innovatively, to make unconventional connections, and to devise solutions that are both novel and effective. Assessing the creativity of LLMs is fraught with challenges. First, creative questions have no reference answers. When people ask an LLM a question such as "What is the speed of light in vacuum in meters per second?", the answer can be formally vetted, given the objective nature of the topic. However, for a prompt such as "What would be the implications if animals could talk?", there is no definitive answer; the responses are open-ended and divergent, making it challenging to judge the correctness of the output. Second, since creativity encompasses various aspects, including originality and flexibility, diverse tasks and criteria must be designed to measure these qualities effectively in LLMs. Third, differences between LLMs and humans may lead to irrelevant responses or serious logical flaws, which must be assessed as well. Finally, evaluating creativity requires a delicate balance between accuracy and efficiency, rendering traditional human-based evaluation methods less practical. Addressing these challenges is therefore imperative for a robust and sound assessment of creativity in LLMs.
Recognizing the need for a comprehensive assessment of LLM creativity, the researchers behind a paper published in Machine Intelligence Research designed an efficient framework to automatically assess the creativity of LLMs by adapting and modifying the Torrance tests of creative thinking (TTCT), a widely recognized tool in psychometric research for assessing human creativity. To enhance the credibility of the results and reduce randomness, seven verbal tasks (tasks that use verbal stimuli) were selected. The researchers employed GPT-4, a state-of-the-art LLM, to expand the question set for each task, thereby constructing the testing dataset. To ensure a thorough and objective evaluation that captures creativity's various manifestations, they combined diverse tasks with a comprehensive test protocol incorporating four criteria for measuring creativity: fluency, flexibility, originality, and elaboration. The LLMs under test answered questions from the constructed dataset, yielding many question-answer pairs, and GPT-4 served as the evaluator of each answer, as it can effectively assess the openness of responses and identify their shortcomings and errors. With proper prompt engineering, GPT-4 can efficiently and effectively evaluate results over the entire dataset, allowing the researchers to balance efficiency and accuracy in their assessment method.
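To make the evaluator step concrete, here is a minimal Python sketch of GPT-4-based scoring on the four criteria, assuming the official OpenAI SDK; the prompt wording and the 1-5 scale are illustrative assumptions, not the authors' exact protocol.

```python
# Minimal sketch of GPT-4-as-evaluator scoring (illustrative, not the
# paper's exact prompts). Assumes the official OpenAI Python SDK and an
# OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()

CRITERIA = ["fluency", "flexibility", "originality", "elaboration"]

def score_answer(question: str, answer: str) -> dict:
    """Ask GPT-4 to rate one answer on the four TTCT-style criteria.

    The 1-5 scale and prompt wording are assumptions for illustration;
    the paper's rubric may differ.
    """
    prompt = (
        "You are evaluating the creativity of an answer.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Rate the answer from 1 to 5 on each of these criteria: "
        f"{', '.join(CRITERIA)}. "
        'Reply with JSON only, e.g. {"fluency": 3, ...}.'
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scoring reduces randomness
    )
    return json.loads(response.choices[0].message.content)

scores = score_answer(
    "What would be the implications if animals could talk?",
    "Courts would need animal witnesses; zoos would become debate halls.",
)
print(scores)
```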
The researchers selected six popular LLMs as test subjects, each with a different architecture and parameter scale. In addition to the overall testing, they conducted exploratory experiments investigating how the creativity exhibited by LLMs changes under different types of prompts and under the different roles the LLMs are asked to play. They then designed a collaboration mechanism for LLMs to explore the impact of multi-LLM collaboration on creativity. Finally, they administered psychological tests related to personality traits, including emotional intelligence (EI), empathy, the Big Five Inventory (BFI), and self-efficacy, because psychological research has shown that human creativity is correlated with these traits; the researchers then verified the consistency between LLMs and humans in this regard.
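The collaboration mechanism is described here only at a high level. One plausible form, sketched below under stated assumptions, is a round-robin exchange in which each model sees its peers' latest ideas before answering; `ask_model` is a hypothetical stand-in for any chat-completion call, and the two-round scheme is an assumption, not the paper's design.

```python
# Hypothetical round-robin collaboration among several LLMs.
# ask_model(name, prompt) is a placeholder for a real chat API call;
# the two-round scheme is an assumption, not the paper's mechanism.
from typing import Callable

def collaborate(
    question: str,
    models: list[str],
    ask_model: Callable[[str, str], str],
    rounds: int = 2,
) -> dict[str, str]:
    """Each round, every model answers while seeing peers' latest ideas."""
    answers: dict[str, str] = {}
    for _ in range(rounds):
        for name in models:
            peer_ideas = "\n".join(
                f"- {m}: {a}" for m, a in answers.items() if m != name
            )
            prompt = (
                f"Question: {question}\n"
                f"Ideas from other assistants so far:\n{peer_ideas or '(none)'}\n"
                "Give your own creative answer; build on or diverge "
                "from the ideas above."
            )
            answers[name] = ask_model(name, prompt)
    return answers
```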
The experiments and analysis yielded several conclusions. First, there are significant differences in creative performance among different models, even among those of the same parameter scale; this variation exists primarily between different types of models and mainly reflects differences in model architecture, training parameter settings, alignment strategies, and training datasets. Second, models generally excel on the elaboration metric but tend to be less adept at demonstrating originality. Third, the type of prompt and the specific role-play request given to the model also play a significant role in shaping its creative output: instructive prompts and chain-of-thought prompts significantly increase the level of creativity, and having an LLM play different roles leads to notable differences, with the role of a scientist demonstrating the highest creativity; many roles even show a decrease relative to the default scenario, though originality generally improves. Fourth, collaboration among multiple LLMs can enhance the level of creativity, with the most notable improvement in originality. Finally, the psychological scale results revealed consistency between LLMs and humans in terms of factors associated with creativity, such as emotional intelligence (EI), empathy, and self-efficacy.
Section 2 reviews related work in three areas: creativity assessment in psychological research, psychological findings on creativity and personality, and the assessment of creativity in LLMs.
In Section 3, the researchers design the overall framework for evaluating the creativity of LLMs. First, they constructed a dataset of 700 questions across seven tasks, derived and modified from the TTCT psychological scale, and expanded the question set via GPT-4. They tested six models against four criteria using this dataset. They then conducted a series of experiments on the creativity of LLMs when given different types of prompts and when assigned different roles. Finally, they used GPT-4 as the evaluator to obtain the performance results of the LLMs and verified the consistency of the LLM-based evaluation with human judgments.
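The GPT-4-based expansion step can be pictured with a short sketch. The `expand_task` helper below is hypothetical, and the prompt wording and per-task question count are illustrative assumptions rather than the authors' exact procedure.

```python
# Illustrative sketch of expanding seed TTCT-style questions with GPT-4.
# The prompt wording is an assumption; the paper's procedure may differ.
from openai import OpenAI

client = OpenAI()

def expand_task(task_name: str, seed_questions: list[str], n: int) -> list[str]:
    """Ask GPT-4 to generate n new questions in the style of the seeds."""
    seeds = "\n".join(f"- {q}" for q in seed_questions)
    prompt = (
        f"The task '{task_name}' tests divergent thinking with verbal stimuli.\n"
        f"Example questions:\n{seeds}\n"
        f"Write {n} new questions of the same kind, one per line."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # higher temperature encourages varied questions
    )
    return [
        line.strip("- ").strip()
        for line in response.choices[0].message.content.splitlines()
        if line.strip()
    ]

questions = expand_task(
    "Unusual Uses",  # hypothetical task name for illustration
    ["List as many unusual uses for a brick as you can."],
    n=100,  # 7 tasks x 100 questions would give the 700-question dataset
)
```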
Section 4 presents the evaluation and results. The researchers conducted a statistical analysis of the creativity scores of the six popular LLMs across the seven tasks, totaling 700 questions, and drew out conclusions hidden in the results along several dimensions: they compared differences in creativity levels between models and compared performance variations across criteria within the same model. They then experimented with many types of prompts to see whether changes in prompts affect the models' levels of creativity. Since LLMs can play user-specified roles, the researchers selected six typical human identities to explore the impact of role-playing on creativity. Finally, they used several psychological scales to test the LLMs, investigating the correlation between the personality traits of the LLMs and their creativity.
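The per-model, per-criterion comparison described above amounts to aggregating scored answers. A minimal sketch with pandas, using made-up placeholder rows rather than the paper's actual scores, might look like this.

```python
# Minimal aggregation sketch: average creativity scores per model and
# criterion. The rows below are made-up placeholders, not paper results.
import pandas as pd

records = [
    {"model": "model_a", "task": "Unusual Uses", "criterion": "originality", "score": 3},
    {"model": "model_a", "task": "Unusual Uses", "criterion": "elaboration", "score": 4},
    {"model": "model_b", "task": "Unusual Uses", "criterion": "originality", "score": 2},
    {"model": "model_b", "task": "Unusual Uses", "criterion": "elaboration", "score": 5},
]
df = pd.DataFrame(records)

# Mean score per model x criterion, pivoted for side-by-side comparison.
summary = df.pivot_table(index="model", columns="criterion",
                         values="score", aggfunc="mean")
print(summary)
```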
Section 5 presents conclusions and discussion. The researchers argue that the creativity exhibited by LLMs is only an outcome-oriented interpretation: whether AI models possess true creativity from a human cognitive perspective remains an open question in artificial intelligence, and the creativity LLMs express is likely an imitation of human creativity acquired through large-scale learning. Understanding the creativity of LLMs is also beneficial for uncovering the inner workings of the model "black box" and for a deeper understanding of the nature of intelligence and cognition. Although analyzing the nature of creativity is difficult, the analysis and evaluation of LLM creative performance is fundamental to studying the kernel of creativity.
See the article:
Assessing and Understanding Creativity in Large Language Models
http://doi.org/10.1007/s11633-025-1546-4
Journal: Machine Intelligence Research
Article Title: Assessing and Understanding Creativity in Large Language Models
Article Publication Date: 28-Apr-2025