News Release

Computer Software Grades Essays Just As Well As People, Profs Announce

Peer-Reviewed Publication

University of Colorado at Boulder

New computer software can grade the content of essay exams just as well as people and could be a major boon in assessing student performance, researchers at the University of Colorado at Boulder and New Mexico State University announced today.

"From sixth graders to first-year medical students we get consistently good results," said Thomas K. Landauer, a CU-Boulder psychology professor who has worked on the technology behind the program for 10 years. "It's ready."

The computer software, called Intelligent Essay Assessor, uses mathematical analysis to measure the quality of knowledge expressed in essays. It is the only automatic method for scoring the knowledge content of essays that has been extensively tested and published in peer-reviewed journals.

The system was developed by Landauer; Darrell Laham, a CU-Boulder doctoral student; and Peter W. Foltz, an assistant professor of psychology at NMSU. They will discuss the system Thursday, April 16, during the annual meeting of the American Educational Research Association in San Diego.

"We are continually surprised at how well it works," said Landauer, who started on the project as director of cognitive science research at Bellcore.

The grading system has important implications for assessing student writing and helping students improve their writing, Foltz said. In one of his undergraduate psychology classes at NMSU last fall, Foltz tested a version of the program.

"Students submitted essays to a web page and received immediate feedback about the estimated grade for their essays, and suggestions about what was missing," Foltz said. "Students could revise their essays and resubmit them as many times as they wanted. The students' essays all improved with each revision."

Foltz also gave students the choice of having their essays graded by a human or by the computer. "They all chose to have the computer do the grading," he said.

Educators laud essay exams because they provide a better assessment of students' knowledge than other types of tests. A huge drawback is that the tests are time-consuming and difficult to grade fairly and accurately, particularly for large classes or nationally administered exams.

But computer-based evaluations of student writing are becoming increasingly feasible because of the growing numbers of students who write using computers. The researchers have applied for a patent on their software.

The new system requires a computer with about 20 times the memory of an ordinary PC to do the statistical analysis that it needs to "understand" essays. It uses Latent Semantic Analysis, a new type of artificial intelligence that is much like a neural network. "In a sense, it tries to mimic the function of the human brain," Laham said.

First, the software program is "fed" information about a topic in the form of 50,000 to 10 million words from on-line textbooks or other sources. It learns from the text and then assigns a mathematical degree of similarity, or "distance," between the meaning of each word and that of every other word. This allows students to use different words that mean the same thing and receive the same score. For example, they could use "physician" instead of "doctor."
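The idea of learning word similarity from text can be sketched in a few lines of Python. This is not Latent Semantic Analysis itself, which works on far larger corpora and applies further statistical analysis; it is a simplified stand-in that compares words by the other words they appear alongside. The toy corpus, the stopword list and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the real system reads 50,000 to 10 million words.
DOCS = [
    "the doctor examined the patient in the hospital",
    "the physician examined the patient in the clinic",
    "a doctor prescribed medicine for the patient",
    "a physician prescribed medicine for the illness",
    "the volcano erupted and lava covered the island",
    "lava from the volcano destroyed the village",
]
STOPWORDS = {"the", "a", "in", "for", "and", "from"}


def context_vectors(docs):
    """Map each word to counts of the other words that share a document with it."""
    vectors = {}
    for doc in docs:
        words = [w for w in doc.lower().split() if w not in STOPWORDS]
        for i, word in enumerate(words):
            context = words[:i] + words[i + 1:]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


VECTORS = context_vectors(DOCS)


def similarity(a, b):
    """Similarity of two words, judged by the company they keep."""
    return cosine(VECTORS[a], VECTORS[b])


print(round(similarity("doctor", "physician"), 2))  # 0.72
print(similarity("doctor", "volcano"))              # 0.0
```

In this toy corpus "doctor" and "physician" never appear in the same sentence, yet they score as similar because they occur in similar contexts.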

The program then evaluates essays in two primary ways. In the first, a teacher or professor grades enough essays to provide a good statistical sample, and the software then grades the remainder.

"It takes the combination of words in the student essay and computes its similarity to the combination of words in the comparison essays," Laham said. The student then receives the same grade as the human-graded essays to which it is most closely matched.

"The program has perfect consistency in grading -- an attribute that human graders almost never have," Laham said. "The system does not get bored, rushed, sleepy, impatient or forgetful." In one test, both the Intelligent Essay Assessor and faculty members graded essays from 500 psychology students at CU-Boulder. "The correlation between the two scores was very high -- it was the same correlation as if two humans were reading them," Landauer said.

The software only evaluates knowledge content and is not designed to grade stylistic considerations like grammar and spelling, researchers said. Existing programs already can do those functions.

A second Intelligent Essay Assessor method compares all the student essays to a single professor's or expert's essay, a so-called "gold standard." A third variation can tell students what important subject matter was missing from their essays and where to find it in the textbook.
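The "gold standard" comparison and the missing-content feedback can be sketched together. This assumes simple word overlap rather than the system's own analysis, and the expert essay, the textbook section names and their key terms are all hypothetical.

```python
import math
from collections import Counter

STOPWORDS = {"the", "and"}


def content_words(text):
    """Count the non-stopword words in the text, ignoring case."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def score_against_gold(essay, gold):
    """Similarity of a student essay to a single expert essay."""
    return cosine(content_words(essay), content_words(gold))


def missing_topics(essay, textbook_sections):
    """Return the sections whose key terms never appear in the essay."""
    present = set(content_words(essay))
    return [
        section
        for section, terms in textbook_sections.items()
        if not present & terms
    ]


# Hypothetical expert essay and textbook index.
GOLD = "the heart pumps blood through arteries and veins and valves prevent backflow"
SECTIONS = {
    "Section 2.1: Chambers and valves": {"valves", "chambers", "backflow"},
    "Section 2.2: Arteries and veins": {"arteries", "veins"},
}

essay = "the heart pumps blood through arteries"
print(round(score_against_gold(essay, GOLD), 2))  # 0.75
print(missing_topics(essay, SECTIONS))  # ['Section 2.1: Chambers and valves']
```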

Previous methods of automatic essay scoring simply counted words and then analyzed mechanics and aspects of grammatical style, the researchers said.

Essay length and essay quality are strongly correlated, the researchers said, because students who know a lot write a lot.

The amount of content also counts in the Intelligent Essay Assessor, but it is measured by concepts, not by the number of words. The researchers recommend setting an essay word limit to eliminate length as a factor.

Because the system does not analyze surface form, it is possible that someone could include all the right words in an essay -- in random order -- and get a good grade, they said. The system flags essays that are unusual for that or other reasons so a human can check them. But the team discovered an even better safeguard while trying to fool the system.

"If you wrote a good essay and scrambled the words you would get a good grade," Landauer said. "But try to get the good words without writing a good essay!

"We've tried to write bad essays and get good grades and we can sometimes do it if we know the material really well. The easiest way to cheat this system is to study hard, know the material and write a good essay." - 30 -

###

