Are ChatGPT and Authorship Verification the new King Kong vs Godzilla battle in academic cheating?

Posted on December 9, 2022 | [Eduardo Oliveira]


My own reflections on ChatGPT #1
(perspectives and opinions are my own)

keywords: chatgpt, stylometry, authorship verification, academic cheating, cognitive load

At this stage, you’ve probably heard or read something about ChatGPT, the super duper AI-powered chatbot released a few days ago by OpenAI. ChatGPT provides meaningful and thorough responses to a wide range of questions. The technology behind this chatbot is so powerful (providing such convincing answers) that it has been dominating social media since its release. Are we academics ready for this?

To date (Dec 9, 2022), the new chatbot has exceeded one million users in just over a week. Regardless of your interest in AI, I hope this article can help you understand what’s happening, what this hype is all about, and how to put it in perspective. In the next few paragraphs, I'll share my reflections on how this technology is already impacting universities, what we can do to protect ourselves, and how we can benefit from it. The goal here is to promote conversation.

What's ChatGPT?

Last Monday (Dec 5, 2022), the brilliant Professor Phillip (Phil) Dawson from Deakin University shared a video at ASCILITE from TikTok creator Cleo Abram explaining ChatGPT and I loved it. Let me follow the same approach as Phil and share it with you here as well:

The video from Cleo helps us understand why ChatGPT is booming at the moment! ChatGPT can now generate software source code, stories, tweets, reports, and answers to complex essay or exam questions - all in a very authentic/original way.

👀 Give it a go yourself and explore tons of interactive possibilities with ChatGPT here: ChatGPT Playground.

This powerful technology is available now and will only get better in the future! Its next major update is expected to be just around the corner; many believe it will arrive as soon as next year.

As much as these technologies have been posing several challenges to the educational sector, personally, I don't see a reason to panic nor a need to fight them. Rather, I suggest we collaborate with and learn from them. It's true several publications/posts/videos have already shown that ChatGPT can produce credible academic writing undetectable by anti-plagiarism software. However, I see this moment as a wonderful opportunity to rethink assessment for learning! How can we tap into more higher-order thinking skills in our assignment designs [a good reference for this: A Model of Learning Objectives]? What other innovative strategies can be adopted to assess students' knowledge? What have we been doing to ensure we can demonstrate and promote ethical practices regarding academic integrity in our universities? What strategies can we adopt to promote a culture of academic integrity among our students? How can we teach/encourage students to reference AI-generated texts used in their assignments? So many questions to ponder!

Regarding the last question in the previous paragraph, Scott Aaronson, a computer scientist at the University of Texas at Austin who joined OpenAI to work on AI safety this year, highlighted a few days ago on his blog (Nov 28, 2022) that he and his team are currently working on watermarking GPT outputs:

We don't know yet if initiatives like this (and many new ones becoming available in the future) will make ChatGPT and similar tools (this is not just about ChatGPT... heaps of other cool AI tools are doing great things) 'safer' for academic institutions. Initiatives like this also make me think that many new open-source models may be created to remove watermarks or to generate texts without them. Hence, I don't think we should focus exclusively on what's happening next within these technologies. They will get better, and there will be more AI tools doing better things in the near future. I don't think we can (or should) avoid this.

As controversial as this topic is, I believe we've been given a chance to rethink the way we've been assessing for learning. Our focus should remain on educational practices instead of on features of any newly available AI technology! Remember how hard it was (and still is) to move from traditional paper exams to online open-book ones? And all the concerns about the use of calculators, encyclopedias, books, Google, mobile phones, and other devices and technologies in classrooms? But we've been learning from these challenges! Our educational practices have been evolving. We have new tools to support us in promoting academic integrity. If anything, I believe ChatGPT is putting a spotlight on issues that have been happening for years in the educational sector. The good news is that we're now talking a lot more about them (and many positive things will start happening from these conversations soon).

What lecturers can do to design and assess assignments to avoid academic cheating caused by students' use of ChatGPT

Given what we now know about ChatGPT, it's not hard to imagine that it, as well as GitHub Copilot, Grammarly and other AI-powered technologies, has already been used in academic cheating, which is completely different from what these tools were designed for. I'll say this again for the people in the back: these tools were not designed to promote academic cheating. GitHub Copilot, for example, was created to assist software developers in the development of software solutions. I've been using it for a while as part of some personal code I write to help me analyse data, and I can tell you from my own experience that it has been helping me become a lot more productive in my tasks. I feel my code is clearer, better documented and easier to reuse. I add to the header of my code files which functions and methods were generated through GitHub Copilot (for transparency). Tools like these have existed for many years, so neglecting to consider their usage and popularity among students until now means we (academics) are already late to this party.

On the other hand, we now have access to new and better tools such as Turnitin (similarity checking), MOSS (source code similarity checking) and many others to address issues in academic integrity, deter plagiarism and support continuous conversation around academic cheating. Academic cheating, in this context, is dishonesty in academic work.

To keep it simple, let's think of students generating answers to exams or tasks using AI solutions such as ChatGPT. Unfortunately, these AI-generated answers wouldn't be detected by some of our current similarity checking tools because, again, they're authentic/original and may not be found in any other source available on the Internet.

I asked ChatGPT to help me with this:

Some great suggestions there :) What do you think?

I especially like suggestion #4 as it is related to the research I've been conducting at The University of Melbourne, together with colleagues from CIS, CHSE and the Teaching and Learning Lab. I strongly believe that designing exams that test higher-order thinking skills is a better approach than relying on multiple-choice, short-answer, randomised questions. Combined with oral explanations (for example), this can be a great alternative to traditional exam strategies (just to mention a few options). I'm sure you also have tons of ideas going through your mind right now! Yay!

Let me develop this idea a bit more. Think about a scenario in which we have an undergrad computer science student learning sorting algorithms. These algorithms organise the elements of a list into an order (numerical or lexicographical, ascending or descending) and are important for optimising the efficiency of other algorithms. In this context, we could ask the student to write code to sort a particular list using a particular sorting algorithm during an exam. ChatGPT would perform this task in seconds. No doubt. However, even if the exam is designed that way, we could further assess learning through multiple other approaches, such as: (i) asking the student to explain the code that they submitted; (ii) introducing new tasks to the initial problem and asking them to adjust/extend the current solution; (iii) creating a scenario that introduces bugs/errors into that code and asking students to debug it; (iv) designing a question in which the focus is not on the algorithm itself but on the use of it to solve something more complex (another algorithm that depends on the generated source code, for example, and the impact of that sorting strategy on the whole solution). For all these new assessment questions and forms, students would be required to apply what they know in a deeper way.
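To make point (iii) a bit more concrete, here is a minimal sketch of the kind of debugging task I have in mind. The code (and the planted bug) is purely hypothetical and for illustration only; it is not taken from any actual exam:

```python
# Hypothetical debugging task for point (iii): this is meant to be a standard
# insertion sort (ascending order), but it contains one deliberately planted bug.
# Students are asked to find the bug, fix it, and explain why it breaks the sort.

def insertion_sort(items):
    """Sort a list in ascending order (contains a planted bug)."""
    result = list(items)
    for i in range(1, len(result)):
        key = result[i]
        j = i - 1
        # BUG (planted): the loop should run while `j >= 0`; starting the
        # comparison at j > 0 means the element at index 0 is never shifted,
        # so small values can never reach the front of the list.
        while j > 0 and result[j] > key:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = key
    return result


if __name__ == "__main__":
    # Expected [1, 2, 3, 4, 5]; the buggy version prints [5, 1, 2, 3, 4].
    print(insertion_sort([5, 2, 4, 1, 3]))
```

Answering this kind of question well requires understanding how the algorithm actually behaves, which is much harder to outsource than the original "write a sort" task.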

In this context, since 2017 I've been investigating authorship identification problems and their correlation with different cognitive loads (or complexity levels) as part of the research I conduct at The University of Melbourne. Out of the many academic integrity issues faced in higher education (e.g., plagiarism, collusion, all forms of cheating in examinations, offering or accepting bribes, falsification of information and so on), academic cheating has been my biggest interest because of its close connection with natural language processing (which was part of my master's and PhD). One of the tools to prevent or detect cheating is authorship identification. Automated authorship identification (or attribution) is concerned with identifying the true author of an anonymous document given samples of undisputed documents from a set of candidate authors [1].

In short, together with other research colleagues, I've been trying to identify the extent to which stylistic metrics can be used for authorship identification, that is, modelling writing style to determine whether and how a student's identity can be identified by their writing. This is part of a fascinating study area named Stylometry. Stylometry offers a computer-algorithmic method to quantify text. It is used to analyse static, completed text (i.e., the product) rather than the writing process, which involves keystroke dynamics - synchronous analysis performed while students write their answers (assessing patterns in keystroke dynamics can also be a great approach to minimise issues with academic cheating, but one we didn't investigate much in our current studies). Stylometry is based on the linguistic style of the text produced by the author, in our case, students [2]. The style of a completed text can be characterised by measuring a vast array of stylistic metrics. Stylistic metrics include lexical (e.g., word, sentence or character-based statistics such as vocabulary richness and word-length distributions), syntactic (e.g., function words, punctuation and part-of-speech), and idiosyncratic style markers (e.g., misspellings, grammatical mistakes) [3,4]. By extracting stylistic metrics from students' texts, we aim to identify whether those authentic (or original) texts were written by the students who submitted them as part of course assignments. It's important to mention authenticity and originality here since these won't be identified by anti-plagiarism tools such as Turnitin (they now offer an originality feature as well, but I haven't explored that yet).
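To give a feel for what these metrics look like in practice, here is a minimal Python sketch (not the feature set or pipeline we used in our studies, just an assumed illustration) that computes a few simple lexical and punctuation-based features from a piece of text:

```python
import re
import string

def stylistic_metrics(text):
    """Compute a handful of simple stylistic features from a text sample."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    punctuation = [c for c in text if c in string.punctuation]
    return {
        # Lexical: vocabulary richness (type-token ratio) and average word length
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        # Sentence-level: average sentence length in words
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        # Simple syntactic marker: punctuation usage per word
        "punctuation_per_word": len(punctuation) / len(words) if words else 0.0,
    }

print(stylistic_metrics(
    "Sorting algorithms organise the elements of a list into an order. "
    "They are important for optimising the efficiency of other algorithms!"
))
```

Real stylometry toolkits extract far richer feature sets (function words, part-of-speech patterns, character n-grams and so on), but the idea is the same: summarise a text as a vector of measurable style markers.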

What is Authorship Verification and how can this technology help us against ChatGPT?

Stylometry supports Authorship Verification, which aims at determining whether new texts (answers to questions, reports and so on) were or were not written by the same author (in our context, students) [5]. Authorship verification is more complex than other attribution problems because a single student may intentionally vary their style from text to text for many reasons, or may unconsciously drift stylistically over time (e.g., in vocabulary richness) [5].

In this context, in 2020, we published our findings examining the relationships between cognitive load and the writing product at ASCILITE 2020. Cognitive load reflects the notion that a student's ability to perform a task depends on the cognitive demands of the task and the student's working memory capacity available for task processing [6]. If the cognitive demands required for a given task exceed a student's available working memory capacity, their ability to perform the task will be affected. Students may take longer to process information, use strategies that require less cognitive load, or make more errors [7-10]. Writing is a complex cognitive task, requiring coordination of long-term knowledge, language skills, motor skills, and working memory. We were curious to identify whether students write differently (i.e., change their writing style) depending on the difficulty of the task. So, using Bloom's Taxonomy, we designed six questions at different cognitive loads. Bloom's Taxonomy [11] proposes six educational objectives: (1) remember, e.g., retrieval; (2) understand, e.g., interpret and explain; (3) apply, e.g., execute and implement; (4) analyze, e.g., organize and attribute; (5) evaluate, e.g., critique and make judgements; (6) create, e.g., generate and plan. These categories are thought to demand increasingly higher cognitive load from students [12].

Long story short, in this initial study, we observed seven metrics (or features) from stylometry data across four dimensions to analyse the writing product. The Bayesian linear mixed effects models (BLMMs) showed low effect sizes for almost all questions, indicating that cognitive load has a limited effect on the stylometry metrics. Students' writing products remained stable and consistent across different cognitive loads.

Next, we evaluated the same dataset using one authorship verification method. If students' answers were stable at different cognitive loads in our first experiment, how well would our chosen authorship verification method perform on that same dataset? Would we be able to identify the same student across questions and tasks designed at different cognitive loads in educational settings? To examine the impact of cognitive load on students’ writings, comparisons were drawn between texts from different cognitive load levels, as shown in the image below.

We transformed the probability scores obtained from our authorship verification method into binary answers: scores greater than the calculated threshold were considered a positive answer (i.e., the known and questioned documents are by the same author), and scores lower than the calculated threshold a negative answer (i.e., the known and questioned documents are by different authors). In our studies, to calculate the threshold for an author's text at a certain cognitive load (CL), that text was always compared against the text from the same author at CL 1, and the cosine similarity between these two was used as the threshold. We explain our research method and findings in detail in our article; I'll try to keep this post less technical.
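For readers who prefer code to prose, here is a simplified sketch of that decision rule. The feature vectors and score below are made up, and the helper names are hypothetical; the actual method is described in our article:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def same_author(questioned_vec, known_cl1_vec, verification_score):
    """Binary decision: was the questioned text written by the same student?

    verification_score: probability-like score produced by the authorship
    verification method for the (known, questioned) pair.
    The author-specific threshold is the cosine similarity between the
    questioned text and the same author's CL 1 text.
    """
    threshold = cosine_similarity(questioned_vec, known_cl1_vec)
    return verification_score > threshold

# Toy usage with made-up stylistic feature vectors and a made-up score
questioned = [0.42, 5.1, 18.0, 0.12]
known_cl1 = [0.45, 5.0, 17.5, 0.10]
print(same_author(questioned, known_cl1, verification_score=0.93))
```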

Our findings (and details from our study/investigation) were published at ASCILITE 2022. In short, our results showed that authorship verification methods could provide good results for academic writings with varied cognitive loads; we could identify whether texts were written by the same student or not in most of the investigated cases (80% accuracy).

A few days ago we submitted a new paper with new evaluations of the use of automated authorship verification to validate software engineering students' assessments (e.g., text artefacts and/or reports) through their writing styles. This one is still under review, so I won't anticipate/share as much about it as I'd love to. However, if we are right, the results from this study suggest that the authorship verification approach could be successfully used in software engineering education to mitigate academic cheating issues.

Why is this important and what are the implications of our findings for academic cheating?

These findings have important implications for the evaluation of academic cheating in higher education (and other educational environments). Combined with anti-plagiarism tools such as Turnitin, authorship verification methods can support educators in identifying academic cheating (as we can extract stylistic writing features from students AND from ChatGPT to build consistent and recognisable profiles, much like a 'fingerprint'). ChatGPT and other AI-powered tools are becoming popular among students, but authorship verification methods are also growing in popularity among educational researchers and institutions. In the future, as AI tools become more humanised and powerful, authorship verification methods will also continue to evolve and perform better (as has been happening in the past few years).

What's next for us and our research?

Whether authorship verification in educational settings is used as a tool to educate students, to detect misconduct, or a combination of both, I strongly believe it is here to stay and could be used to promote better education and reflections around ethics and educational issues. Will authorship verification address the current problem? Not completely! I don't think we have a silver bullet for this.

Instead, as I believe AI-powered tools and many new technologies will continue to be available to us all, our focus in the educational context should continue (or shift) to educational practices and processes. How can we include these technologies to promote an authentic assessment-of-learning process?

Here at The University of Melbourne, we are currently working on the development of a dataset that combines real data from the answers of 50+ students with AI-generated answers to the same questions. In a few weeks we hope to have an even better understanding of the extent to which authorship verification methods can detect answers generated by (the same) students, and whether we can identify what's generated by AI. Watch this space :)


Are ChatGPT and Authorship Verification the new King Kong vs Godzilla battle in academic cheating? Maybe! Not ChatGPT specifically, but any similar AI-powered tool. In any case, get your bucket of popcorn - it's a good one to be watching closely from now on.

In the meantime, I hope others will find the information above useful. Please reach out to me with your own reflections and suggestions. I would love to incorporate extra suggestions into this post.

This text was really written/generated by me, Eduardo :) No ChatGPT was used in this article.

References

[1] V. Keselj, F. Peng, N. Cercone, and C. Thomas, “N-gram-based author profiles for authorship attribution,” Proc. of the Pacific Association for Computational Linguistics, pp. 255–264, 2003.

[2] Calix, K., Connors, M., Levy, D., Manzar, H., McCabe, G., & Westcott, S. (2008). Stylometry for e-mail author identification and authentication. Proceedings of CSIS Research Day, Pace University, 1048–1054.

[3] Abbasi, A., & Chen, H. (2008). Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace. ACM Transactions on Information Systems (TOIS), 26(2), 7.

[4] Holmes, D. I., & Kardos, J. (2003). Who was the author? An introduction to stylometry. Chance, 16(2), 5–8.

[5] M. Koppel and J. Schler, “Authorship verification as a one-class classification problem,” ICML ’04 Proceedings of the twenty-first international conference on Machine learning, p. 62, July 2004.

[6] Sweller, J. (1988). Cognitive load during problem solving: Effects on learning. Cognitive Science, 12(2), 257–285.

[7] Beilock, S. L., & DeCaro, M. S. (2007). From poor performance to success under stress: Working memory, strategy selection, and mathematical problem solving under pressure. Journal of Experimental Psychology: Learning, Memory, and Cognition, 33(6), 983.

[8] Groen, G. J., & Parkman, J. M. (1972). A chronometric analysis of simple addition. Psychological Review, 79(4), 329.

[9] Parkman, J. M., & Groen, G. J. (1971). Temporal aspects of simple addition and comparison. Journal of Experimental Psychology, 89(2), 335.

[10] Trezise, K., & Reeve, R. A. (2014). Cognition-emotion interactions: Patterns of change and implications for math problem solving. Frontiers in Psychology, 5, 840. https://doi.org/10.3389/fpsyg.2014.00840

[11] Anderson, L. W., Krathwohl, D. R., Airasian, P. W., Cruikshank, K. A., Mayer, R. E., Pintrich, P. R., … Wittrock, M. C. (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of educational objectives, abridged edition. White Plains, NY: Longman.

[12] Brizan, D. G., Goodkind, A., Koch, P., Balagani, K., Phoha, V. V., & Rosenberg, A. (2015). Utilizing linguistically enhanced keystroke dynamics to predict typist cognition and demographics. International Journal of Human-Computer Studies, 82, 57–68. https://doi.org/10.1016/j.ijhcs.2015.04.005

[13] Crossley, S. A., Kyle, K., & Dascalu, M. (2019). The Tool for the Automatic Analysis of Cohesion 2.0: Integrating semantic similarity and text overlap. Behavior Research Methods, 51(1), 14–27. https://doi.org/10.3758/s13428-018-1142-4

[14] Lu, X., & Ai, H. (2015). Syntactic complexity in college-level English writing: Differences among writers with diverse L1 backgrounds. Journal of Second Language Writing, 29, 16–27. https://doi.org/10.1016/j.jslw.2015.06.003

[15] Hunt, K. W. (1965). Grammatical Structures Written at Three Grade Levels. NCTE Research Report No. 3. Retrieved from https://eric.ed.gov/?id=ED113735

[16] Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990.