With the recent surge in AI-produced content, people across the world are having existential crises trying to figure out what’s real and what’s fake. I mean... have you seen deepfakes recently? ChatGPT has exploded over the last month, reaching over a million users within the first week of its launch. After the hype died down a bit, it left tons of people trying to figure out how to reverse-engineer the product. Can you even do so? Well... not quite. Reverse engineering literal words requires a little bit of backward thinking.
Luckily, you could use some tools to at least produce a little bit of quantifiable evidence to identify if the text you're looking at was produced by ChatGPT. Over the next few minutes, we'll go through how AI detection actually works and show off a few online tools that you can use to help figure out what you're looking at.
What is ChatGPT?
ChatGPT is a casual-speaking, large language model created by OpenAI – the team behind DALL-E 2. You might've heard about it on TikTok, Twitter, or more recently when New York City public schools banned it outright.
The goal of the tool was to combine AI with casual conversation – which the bot does very, very well. You simply ask it any question you want (though you should probably steer clear of anything illegal) and you'll get a customized response. You can ask it things like:
- "What is the meaning of life and how can we find purpose in our existence?"
- "Can you tell me a story about a time-traveling detective who solves a crime in the future?"
- "What are the most significant challenges facing humanity today, and what can we do to solve them?"
- "If you could have a conversation with any historical figure, who would it be and why?"
It's kind of crazy. Here's an example:
How Does Artificial Intelligence Work?
All of these AI text generators work off a TON of data. Imagine training a really smart brain to recognize patterns in text (things like which words typically follow which, sentence structure, and so on) by having it read billions of text articles.
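To make "recognizing patterns in text" concrete, here's a toy sketch in Python: count, over a tiny made-up corpus, which words follow which, and you already have a (very crude) next-word predictor. The corpus and every name here are invented for illustration; real models learn far richer patterns than simple word-pair counts.

```python
from collections import Counter, defaultdict

# A tiny toy corpus standing in for "billions of text articles".
corpus = (
    "the dog sat on the mat . "
    "the dog sat on the rug . "
    "the cat chased the dog ."
).split()

# Count which word tends to follow each word (a "bigram" tally).
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

# The word that most often follows "the" in this toy corpus:
most_likely = following["the"].most_common(1)[0][0]
print(most_likely)  # "dog" - it followed "the" three times here
```

Scale this idea up from word pairs to long contexts and from a few sentences to the whole internet, and you're in the neighborhood of how these models learn.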
AI is like having that computer brain that can actually think and learn. You're not limited by human intelligence anymore. So imagine you have a robot friend named Robby, and you want him to learn how to play a game like chess. First, you would show Robby how the pieces move on the chess board, and then you would let him play against other computer programs or people. As he plays more and more games, he starts to understand how to play chess better and better, just like how you learn new things when you practice.
The important thing that makes Robby "smart" is something called an "algorithm". This is like a set of instructions that tells Robby how to think about the game and make decisions. For example, the algorithm might tell Robby to look for certain patterns in the chess board, or to prioritize protecting certain pieces over others. As Robby plays more games, he gets better at following these instructions and making good moves.
AI uses many different types of algorithms, but one of the most popular approaches is called "machine learning". Machine learning is what allows our theoretical friend Robby to improve on his own, without anyone having to constantly teach him new things. Machine learning algorithms use something called "training data" to help Robby learn. Training data is like a big collection of examples that show Robby what good chess moves look like. The more training data Robby has, the better he gets at understanding how to play chess.
How Does AI Detection for ChatGPT Work?
Educators, HR managers, and even students have been trying to figure out if random paragraphs of text were originally created by ChatGPT.
The issue is that ChatGPT (currently) produces no watermark on the content it generates. If you ask it a question, you can simply copy and paste the answer and do with it as you wish.
But once you have something produced (like a 1,200-word essay about the Industrial Revolution), how do you actually check where it came from?
AI-detection tools analyze how likely each word was to be predicted next, given the previous words to its left.
Remember, this is all about patterns. AI isn't inherently smart – it just recognizes and reproduces existing patterns REALLY well.
Say you were typing a message on your phone and the sentence was "The worst part of my day is when I wake up for _". An AI detection tool will use the context to the left (in this case, the words "The worst part of my day is when I wake up for") to predict the next word you might type.
The AI will think back to all of its training data to find patterns in how words are used in different contexts. It might know, for example, that the word "day" often follows the words "worst part of my". The algorithm then calculates the likelihood of each word being the next predicted word, given the context to the left.
So, in this example, the most probable word to follow that context was: work. Make sense? Now repeat this over an entire paragraph and you'll see words that vary in how predictable they are. The more predictable the text, the higher the chance it was written by a bot – since humans inherently have a lot more creativity and spontaneity in their writing patterns than their robot counterparts.
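Here's a toy sketch of that scoring idea in Python. Real detectors use full neural language models; this stand-in uses simple word-pair counts, and the "training data" and sentences are invented for illustration:

```python
from collections import Counter, defaultdict

# Toy "training data" the detector has seen (an assumption - real
# detectors rely on a large language model, not bigram counts).
training = "i wake up for work every day and i wake up for work again".split()

follows = defaultdict(Counter)
for cur, nxt in zip(training, training[1:]):
    follows[cur][nxt] += 1

def predictability(text: str) -> float:
    """Average probability of each word given the word to its left."""
    words = text.split()
    probs = []
    for cur, nxt in zip(words, words[1:]):
        total = sum(follows[cur].values())
        probs.append(follows[cur][nxt] / total if total else 0.0)
    return sum(probs) / len(probs) if probs else 0.0

# A sentence matching the training patterns scores high (more "AI-like"):
print(predictability("i wake up for work"))     # 1.0
# A sentence that breaks the pattern scores lower (more "human-like"):
print(predictability("i wake up for pancakes"))  # 0.75
```

A detector runs this kind of scoring word by word over the whole passage, then turns the overall predictability into a verdict.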
All of the following tools use some variation of this technique, combined with other algorithms and language-processing models. They basically overanalyze and reverse-engineer WORD PATTERNS!
Here are a few tools that work pretty well:
Method 1: GPTZero
The first tool is the more complicated one: GPTZero. This free tool was recently released by a Princeton CS student and gives you results describing your text's randomness (perplexity) and burstiness (the natural clusters of unpredictable words that humans produce when writing). Simply paste your text and analyze everything you see. You won't get a number or percentage like Originality gives you below, but this is a lot better for in-depth, hyper-individualized testing. Our next tool works better for checking a lot of text quickly. Here's how GPTZero did on an angry-girlfriend text made with ChatGPT.
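If you're curious what "burstiness" might look like as code, here's a rough sketch. GPTZero's actual metric is based on how predictability varies from sentence to sentence; this toy version just measures variation in sentence length, which captures the same intuition (the sample sentences are made up):

```python
import statistics

def burstiness(text: str) -> float:
    """Spread of sentence lengths - a rough stand-in for 'burstiness'.

    GPTZero's real metric looks at how perplexity varies across
    sentences; plain length variation is just an easy proxy.
    """
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return statistics.pstdev(lengths)

uniform = "The day was fine. The sky was blue. The air was warm."
varied = ("Wow. The afternoon unfolded in a haze of coffee and "
          "half-finished ideas. Then rain.")

# Human writing tends to mix short and long sentences, so it scores
# "burstier" than flat, evenly-paced text:
print(burstiness(uniform) < burstiness(varied))  # True
```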
Method 2: Originality
As mentioned earlier, Originality is meant for industry-level testing. Simply paste your suspected ChatGPT content and let Originality do all the work for you. If you want to read an in-depth review of how Originality works, check out this review. You'll get a percentage showing how likely it is that a text was written with AI. Do more testing than just trusting another AI to tell you whether something was made with AI; the truth is that you can't "prove" beyond a doubt that something used AI. More predictability and simplicity means a higher chance of AI. Also, the more text Originality has to analyze, the better it gets at predicting the text's origins.
Alternate Method: GLTR
You can also run the text through GLTR, although it's not trained on the newest language models (so it won't be as in-depth). GLTR is another free tool, and it shows you a "heatmap"-style diagram of predictability. Here's AI content compared to a professional academic article. The darker a word, the more predictable it is:
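GLTR's coloring boils down to ranking each word by how highly the model predicted it, then bucketing the words by rank. Here's a toy two-bucket version of that idea; GLTR itself uses GPT-2 and four color buckets, and the "model" below is just made-up word-pair counts:

```python
from collections import Counter, defaultdict

# Toy bigram "model" (an assumption - GLTR actually ranks words
# using GPT-2's full predicted distribution).
training = "the cat sat on the mat and the cat sat on the rug".split()
follows = defaultdict(Counter)
for cur, nxt in zip(training, training[1:]):
    follows[cur][nxt] += 1

def rank_bucket(cur: str, nxt: str) -> str:
    """Bucket a word by how highly the model ranked it, GLTR-style."""
    ranked = [w for w, _ in follows[cur].most_common()]
    if nxt not in ranked:
        return "unseen"   # GLTR would color a word like this purple
    rank = ranked.index(nxt)
    return "top" if rank == 0 else "likely"  # toy two-bucket version

# Each word in "the cat sat" was the model's top pick - a dark heatmap:
words = "the cat sat".split()
print([rank_bucket(c, n) for c, n in zip(words, words[1:])])
```

A passage full of "top"-bucket words paints the dark, uniform heatmap typical of AI text; human writing scatters into the lighter buckets.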
Does ChatGPT Have a Watermark?
Not yet, but OpenAI has said it's working on one. The plan is to add a hidden signature, similar to a watermark, to the text generated by its models (ChatGPT or any GPT-3 tool). The signature works at the level of tokens, the building blocks of text: a token can be a word, a punctuation mark, or even part of a word. By subtly biasing which tokens the model picks, in a pattern keyed to a secret known only to OpenAI, the output stays coherent and grammatically correct but carries a statistical fingerprint identifying it as model-generated. This is similar to how a watermark can be embedded in a digital image: it isn't visible to the naked eye, but it can be used to identify the source of the image.
Detection then works by analyzing the text's token choices for that fingerprint, using the secret key known only to OpenAI. If the pattern shows up, the text was likely generated by the model, much like analyzing a digital image for a hidden watermark to determine its source. Whether there will be public-facing access to this tool is currently unknown. It might become available specifically to educators or academic administrators who need to test content to determine its true integrity.
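For a feel of how a statistical watermark check could work, here's a heavily simplified sketch. The keyed-hash "green list" idea below is one published approach to text watermarking, not OpenAI's actual (unreleased) scheme; the key, function names, and thresholds are all invented for illustration:

```python
import hashlib

SECRET = "demo-key"  # assumption: the real key would be private to the provider

def is_green(prev_word: str, word: str) -> bool:
    """A keyed hash decides whether `word` is on the 'green list'
    after `prev_word`. A watermarking sampler would nudge the model
    toward green words while generating."""
    digest = hashlib.sha256(f"{SECRET}:{prev_word}:{word}".encode()).digest()
    return digest[0] % 2 == 0  # roughly half of all words are green

def green_fraction(text: str) -> float:
    """Share of word transitions that land on the green list."""
    words = text.split()
    hits = [is_green(c, n) for c, n in zip(words, words[1:])]
    return sum(hits) / len(hits) if hits else 0.0

# Ordinary human text should land near 0.5 (chance level); watermarked
# text, where the sampler preferred green words, would sit well above it.
score = green_fraction("the quick brown fox jumps over the lazy dog")
print(0.0 <= score <= 1.0)  # True
```

The nice property of this design is that only the key holder can run the check, so the watermark is hard for outsiders to detect or scrub.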
It's hard to tell; I don't really know (just figured I'd be honest). The education and professional industries won't be ruined by this – but it'll certainly cause a disruption. Remember to take any result with a grain of salt, as no tool can determine with 100% certainty whether ChatGPT wrote something. Use your best judgment, and always use more than one route when estimating the likelihood that something was written by ChatGPT. Also, this article isn't pro- or anti-AI, but rather an educational piece describing a few ways to detect it. The next few years will certainly be exciting!