How well can AI grade?

Key points:

AI is revolutionary–but can it effectively assess students’ writing?

As an experiment, over the summer, I asked my students if they would let me put their work through artificial intelligence (AI) for feedback and a grade. Because I was using GPT within PowerNotes, a platform that doesn’t deliver students’ data to OpenAI, their work would not be used to train its large language model, and all data would be deleted after 30 days.

I told my students I would compare the AI grading and feedback with my own, and only share the AI’s reaction if they wanted to see it. I wasn’t sure if anyone would agree, but six out of 23 students said yes, and they were all curious to see what the AI had to say. I ran the process with GPT-3.5 first and then with GPT-4. Here’s how it played out, from creating rubrics to drawing conclusions.

Creating rubrics and instructions

The first thing I did was create an organizational structure in PowerNotes that really leaned into the tool’s features. I created topics called Instructions, Rubric, Student Work, and AI Feedback for each assignment I was going to process. After I created the topics, I went into the Instructions topic and pasted in the instructions for the activity. Then I copied and pasted my assessment matrix into the Rubric topic. To provide maximum clarity for the AI, I reformatted and clearly labeled each criterion, what to look for, and how many points it was worth. (I initially tried skipping that extra step, and it didn’t go well.)
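The rubric reformatting described above can be sketched in a few lines of Python. This is only an illustration: the `Criterion` class, the `format_rubric` function, and the two sample criteria are hypothetical and are not the actual rubric or anything PowerNotes provides.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str       # label for the criterion
    look_for: str   # what the grader should look for
    points: int     # maximum points the criterion is worth


def format_rubric(criteria):
    """Render each criterion as a clearly labeled plain-text block."""
    blocks = []
    for number, c in enumerate(criteria, start=1):
        blocks.append(
            f"Criterion {number}: {c.name}\n"
            f"What to look for: {c.look_for}\n"
            f"Points possible: {c.points}"
        )
    total = sum(c.points for c in criteria)
    blocks.append(f"Total points possible: {total}")
    return "\n\n".join(blocks)


# Hypothetical sample criteria, used only to show the output shape.
sample = [
    Criterion("Research question", "States a focused research question", 50),
    Criterion("APA citation", "Cites the article in APA 7 format", 50),
]
rubric_text = format_rubric(sample)
print(rubric_text)
```

Labeling every field the same way for every criterion leaves the model less room to guess which number is a point value and which text is a description.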

Asking GPT-3.5 for feedback and scores

Because PowerNotes let me select the actual text I wanted it to focus on, I kept the prompts simple. The first prompt I used was, “Use the Instructions to provide feedback on the Student Work.” I selected the entire Instructions topic and the one piece of student work I wanted feedback on. Each student submission was labeled “Student Work” and assigned a number. I went through this process for two assignments and two discussion boards. Then I put my feedback alongside the AI feedback into a table for students to review.

My second prompt was, “Use the Rubric to score the Student Work.” I reviewed the AI feedback and left notes where the AI was wrong or gave feedback that wasn’t relevant.

I also instructed the AI to “Act as a Rhetoric and Composition professor” and “Act as an Information Literacy Professor,” but these prompts only led the AI to alter some word choices, not change the content of the feedback.
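For readers working outside PowerNotes, the two prompts above could be paired with the selected text roughly as follows. This is a minimal sketch under assumptions: `build_request` and the `Label:` layout are hypothetical, the `...` strings are placeholders for pasted text, and nothing here reflects how PowerNotes actually passes selections to the model.

```python
FEEDBACK_PROMPT = "Use the Instructions to provide feedback on the Student Work."
SCORE_PROMPT = "Use the Rubric to score the Student Work."


def build_request(prompt, selections):
    """Combine labeled text selections and a prompt into one request string."""
    parts = []
    for label, text in selections.items():
        # Each selection keeps its topic label so the prompt can refer to it by name.
        parts.append(f"{label}:\n{text.strip()}")
    parts.append(f"Prompt: {prompt}")
    return "\n\n".join(parts)


# Placeholder text stands in for the pasted instructions, rubric, and submission.
feedback_request = build_request(
    FEEDBACK_PROMPT,
    {"Instructions": "...", "Student Work 1": "..."},
)
score_request = build_request(
    SCORE_PROMPT,
    {"Rubric": "...", "Student Work 1": "..."},
)
print(feedback_request)
```

Keeping feedback and scoring as separate requests mirrors the two-prompt process described above, where each prompt sees only the material it names.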

Grading AI’s ability to grade

If I had to pick one word to describe AI’s performance as a grader, I would say “inconsistent.” Specifically, the AI:

Suggested “improvements” that were already in the students’ work: The AI told one student that they needed to include the research question, which they had. It also claimed that several students hadn’t fulfilled the requirement of addressing bias or credibility, which they had.

Gave a score and feedback on things that weren’t there: For example, the rubric for one of the assignments asked for a screenshot of search results. I didn’t include the screenshots when I ran the student work through, but the AI assumed they were there.

Made up things it didn’t know: The AI informed one student, “For reference citation in APA 7 format, make sure you capitalize the first letter of all major words in the article title, not just the first word.” This was one of several pieces of objectively wrong information it shared.

Gave unsound writing advice: The second assignment required three paragraphs: the first to introduce the topic, the second to explain how the article they were analyzing helped them begin to answer their research question, and the third to reflect on what they still wanted to know and how they might find it out. The AI often asked for more detail in student work that already balanced details from the article with what the article had taught them, and in other cases it offered no suggestions where more detail was needed.

Why I’m better at feedback than AI

At this point in its development, AI can’t be specific and contextual because it has no personal experience. From a teaching and learning perspective, that’s a problem. AI’s lack of specificity was most obvious when students had incorrect APA citations. Where the AI made general recommendations, I could tell them what they needed to change and how to change it. Here’s an example:

My Feedback	AI Feedback
For your APA, the journal title and volume number should be italicized: “Journal of Communication Disorders, 65, 35-42…”	The selected article citation is formatted correctly in APA 7 format, but it would have been helpful if the student had provided a brief summary of the article and explained why it was chosen over other articles found in their search.

I gave the AI the context of the assignment, but it didn’t have the context of the students. In the example below, I was both responding to this student’s assignment and referencing their previous discussion board post.

My Feedback	AI Feedback
Online education is near and dear to me, so I’m looking forward to reading the article you picked. I agree that engagement is useful (it often determines successful achievement of learning outcomes) in this case. Something that wasn’t quite clear, and the nuance might be useful, is are you interested in how Covid impacted enrollment at the height of the pandemic or its ongoing impact? As we move forward, you’ll be evaluating this article in a deeper dive, so think.	Additionally, the student may want to consider revising their preliminary topic statement. While it is a good starting point, it may be too broad to effectively guide their research. They could focus on a particular aspect of online higher education, such as the impact on student engagement or the effectiveness of online teaching methods.

The AI clearly missed that the student had talked about engagement because it didn’t have the context of the previous discussion. For AI to learn the context of my students, I would need to give it access to “read” an entire course, making the content, student work, conferencing, and even in-person interactions part of the AI’s database. That feels like a level of FERPA issues I would rather not grapple with.

There were also a number of times when I could give helpful tips for working through common research problems, such as the fact that search terms change a great deal. The AI gave general suggestions, but I gave a specific one:

My Feedback	AI Feedback
I’m glad the shift you made in your search phrases was useful across both, as well. When I do this kind of work, I like to keep track of these shifts (in part so I don’t shift back to something I already know doesn’t work!).	To improve searching skills, I suggest exploring different databases and trying out different keywords and search statements. It’s important to be patient and persistent in searching for relevant sources.

Finally, and this is obviously a major drawback when it comes to grading, AI doesn’t do math well. For example, after taking 1 point off in one criterion worth 50 points and 2 points off in another criterion worth 50 points, it offered this final score: 247/250 or 47.75/50, two figures that do not agree with each other.
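The arithmetic in that example is easy to check by hand or in code. The sketch below assumes the rubric total of 250 that appears in the AI’s own reported score (the full list of criteria isn’t given here), and the function names are hypothetical.

```python
from fractions import Fraction


def earned_points(deductions, total_possible):
    """Subtract every deduction from the rubric total."""
    return total_possible - sum(deductions)


def rescale(earned, total_possible, new_scale):
    """Convert a score out of total_possible to a score out of new_scale."""
    return Fraction(earned, total_possible) * new_scale


# Assumption: the rubric totals 250 points, as in the AI's reported "247/250".
earned = earned_points([1, 2], 250)      # 1 point off plus 2 points off
scaled = rescale(earned, 250, 50)        # the same score on a 50-point scale

# Compare against the 47.75/50 the AI reported.
matches_reported = scaled == Fraction("47.75")

print(f"{earned}/250 = {float(scaled)}/50")
print("Matches the AI's 47.75/50:", matches_reported)
```

Under that assumption, 247/250 works out to 49.4/50, so the AI’s second figure is off by more than a point and a half. Doing the addition outside the model, and asking the AI only for per-criterion deductions, would avoid this class of error.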

Not surprisingly, when I asked students for their preference, they cited inaccurate scoring and feedback as the reasons for preferring my grading to GPT-3.5’s.

Because this is AI, though, by the time I had drawn my conclusions, there was a new version that promised all sorts of improvements. So I put GPT-4 to the test as well.

Does GPT-4 do it better?

In all of the areas I mentioned above, GPT-4 performed better, but still not consistently. In the example below, it was better able to parse and process student work, but it still missed things that the student needed to work on.

My Feedback	GPT-3.5 Feedback	GPT-4 Feedback
For the article justification, you needed a bit more detail — so why are the statistics and references important? As we move forward, you’ll be evaluating this article in a deeper dive.	None.	7) Reason for Choosing Article: Your reasoning for choosing the article is sound, emphasizing its relevance, trustworthiness, recent publication date, and peer-reviewed status.
You also needed a bit more detail in the usefulness of the article you selected. Why was it a good middle ground? How will getting an unbiased perspective help you answer your question?	None.	Reason for Choosing Article: The explanation for choosing the article shows thoughtful consideration. It shows that the student carefully evaluated how this article would contribute to their research.
While it can be useful to have any sort of evidence that supports the connection, don’t forget the value of NOT in your Boolean search. If your search is giving you things you don’t want, you can add a NOT word or phrase to rein the search back into focus.	The student could benefit from using more specific keywords. For example, instead of using “mental health,” they could use more specific terms such as “anxiety,” “depression,” or “stress”	4) Boolean Search Statement: Your Boolean search statement “Pet owners’ AND mental health” and advanced search statement “Pet owners” AND (improv* OR “helping mental health” OR “psychological benefits” OR “emotional well-being” OR “mental well-being”) demonstrate good understanding of this concept.

Sometimes, it did a fantastic job of pulling samples from student work to indicate what they’d done, but it didn’t do it for every student every time.

After a round with GPT-4, I still believe that feedback will always require a human eye, and that AI would need an invasive level of access to our classrooms to get the context it needs to give feedback the way we can. Instructors are still essential, but with AI’s help in certain parts of our processes, we may finally evolve from gatekeepers into collaborators in our students’ writing.

Author

Dr. Catrina Mitchum, University of Maryland Global Campus

Dr. Catrina Mitchum is an adjunct associate professor with University of Maryland Global Campus and co-author of the forthcoming Teaching Literacy Online: Engaging, Analyzing, and Producing in Multiple Media. Her scholarly work has been published in Currents in Teaching and Learning, Composition Forum, and The Journal of Teaching and Learning with Technology, among other journals and edited collections. She was awarded, with other scholars, the CCCC Research Initiative Grant in 2018 and a Digital Learning Tech Seed Grant in 2021, and in 2021-2022, she was an Inclusive Leadership Fellow. She can be reached via LinkedIn.
