
We Tested Every AI Detector Once Again In 2024 – Here's How They Did

The most important aspect of AI detection is accuracy. In this article, I'll test 8 different AI detectors against AI-generated and human-written text to determine, once and for all, which one truly is the best.
Updated February 8, 2024
A cyborg getting caught, generated with Midjourney

Ah yes, AI detection. It's rare to see such a prevalent issue in tech without a clear solution. But here we are in 2024, and the topic of false positives is still as pressing as ever.

Fortunately for us, this also means that there's a vacuum within that space that we can fill. There are too many AI detectors today and too little information on how accurate they actually are based on unbiased, third-party testing. So, you guessed it, we stepped in.

Over the course of this article, I'll be testing a handpicked selection of AI detectors and determining, once and for all, which one is the most accurate.

Our Participants

What I’ve done is gather the most reputable AI detectors in the business. Here’s my final list of participants for this batch of testing:

  • Originality AI
  • Copyleaks
  • Content at Scale
  • Winston AI
  • GPTZero
  • ZeroGPT
  • Sapling AI
  • Writer

How This Will Go

I know you’re eager to get into the meat of the action, but first, we’re going to treat this like actual academic testing. So, let’s set some ground rules.

  1. The tests will be separated into two sections: one for AI-generated text and one for human-written text, which tests the false positive rate.
  2. For the AI test, each detector will be subjected to 12 tests: 3 each for ChatGPT, Bard, and Claude, plus 3 using AI-generated text tweaked by Undetectable AI, a popular detection bypasser.
  3. For the false positive test, each detector will be subjected to 5 tests, all of which will come either from the public domain or from my own writing.
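To make the count of 12 concrete, here is a small sketch that enumerates the AI test cases per detector. The labels mirror the test headings used later in this article; the function name and list layout are my own, purely for illustration.

```python
MODELS = ["ChatGPT", "Claude", "Bard"]
PROMPTS = ["Essay", "Story", "Cover Letter"]


def build_ai_tests():
    """List the 12 AI-generated test cases each detector is run against."""
    # 3 models x 3 prompt types = 9 unmodified AI samples
    raw = [
        f"{model} Test #{i}: {prompt}"
        for model in MODELS
        for i, prompt in enumerate(PROMPTS, start=1)
    ]
    # 1 sample per model rewritten with Undetectable AI = 3 more
    bypassed = [f"Undetectable AI + {model}" for model in MODELS]
    return raw + bypassed


ai_tests = build_ai_tests()
for name in ai_tests:
    print(name)
```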

Here's another problem: some detectors have an AI likelihood percentage, and some don’t. There are also some detectors that tell you if they’re uncertain, while some don’t. So, to account for that, the AI likelihood score for detectors without one will be calculated using this formula:

Interval = 100 / (n - 1)

Where n is equal to the number of possible determinations by the detector. For example, let's say that an AI detector can output [1] AI, [2] Likely to be AI, [3] Uncertain, [4] Unlikely to be AI, and [5] Not AI. The interval would be 100 divided by (5 - 1), so 25. That would mean our scores will default to 0%, 25%, 50%, 75%, and 100%, running from Not AI up to AI.
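The interval calculation described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name is hypothetical, and it assumes the verdict labels are passed in order from least to most AI-like.

```python
def likelihood_scale(labels):
    """Map ordered verdict labels to evenly spaced AI-likelihood scores.

    Assumption for this sketch: `labels` is ordered from least AI-like
    to most AI-like, so the first label scores 0 and the last scores 100.
    """
    n = len(labels)
    if n < 2:
        raise ValueError("need at least two possible determinations")
    interval = 100 / (n - 1)  # with 5 labels: 100 / (5 - 1) = 25
    return {label: i * interval for i, label in enumerate(labels)}


verdicts = ["Not AI", "Unlikely to be AI", "Uncertain", "Likely to be AI", "AI"]
scale = likelihood_scale(verdicts)
print(scale)
```

A detector with only two possible verdicts would land on 0% and 100%, and one with three would land on 0%, 50%, and 100%.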

Hopefully, that's not too confusing. Just keep in mind that I'm complicating this a bit to be completely unbiased.

Putting AI Detectors To The Test

Just a quick heads up: This section will feature a bunch of pictures showing the AI accuracy of each detector. I highly recommend looking at each of them to ensure that I'm not editing these results. However, if you just want the final tally, you can skip ahead to the next section of this post.

Originality AI

ChatGPT Test #1: Essay

ChatGPT Test #2: Story

ChatGPT Test #3: Cover Letter

Claude Test #1: Essay

Claude Test #2: Story

Claude Test #3: Cover Letter

Bard Test #1: Essay

Bard Test #2: Story

Bard Test #3: Cover Letter

Undetectable AI + ChatGPT

Undetectable AI + Claude

Undetectable AI + Bard

Copyleaks

ChatGPT Test #1: Essay

ChatGPT Test #2: Story

ChatGPT Test #3: Cover Letter

Claude Test #1: Essay

Claude Test #2: Story

Claude Test #3: Cover Letter

Bard Test #1: Essay

Bard Test #2: Story

Bard Test #3: Cover Letter

Undetectable AI + ChatGPT

Undetectable AI + Claude

Undetectable AI + Bard

Content at Scale

ChatGPT Test #1: Essay

ChatGPT Test #2: Story

ChatGPT Test #3: Cover Letter

Claude Test #1: Essay

Claude Test #2: Story

Claude Test #3: Cover Letter

Bard Test #1: Essay

Bard Test #2: Story

Bard Test #3: Cover Letter

Undetectable AI + ChatGPT

Undetectable AI + Claude

Undetectable AI + Bard

Winston AI

ChatGPT Test #1: Essay

ChatGPT Test #2: Story

ChatGPT Test #3: Cover Letter

Claude Test #1: Essay

Claude Test #2: Story

Claude Test #3: Cover Letter

Bard Test #1: Essay

Bard Test #2: Story

Bard Test #3: Cover Letter

Undetectable AI + ChatGPT

Undetectable AI + Claude

Undetectable AI + Bard

GPTZero

ChatGPT Test #1: Essay

ChatGPT Test #2: Story

ChatGPT Test #3: Cover Letter

Claude Test #1: Essay

Claude Test #2: Story

Claude Test #3: Cover Letter

Bard Test #1: Essay

Bard Test #2: Story

Bard Test #3: Cover Letter

Undetectable AI + ChatGPT

Undetectable AI + Claude

Undetectable AI + Bard

ZeroGPT

ChatGPT Test #1: Essay

ChatGPT Test #2: Story

ChatGPT Test #3: Cover Letter

Claude Test #1: Essay

Claude Test #2: Story

Claude Test #3: Cover Letter

Bard Test #1: Essay

Bard Test #2: Story

Bard Test #3: Cover Letter

Undetectable AI + ChatGPT

Undetectable AI + Claude

Undetectable AI + Bard

Sapling AI

ChatGPT Test #1: Essay

ChatGPT Test #2: Story

ChatGPT Test #3: Cover Letter

Claude Test #1: Essay

Claude Test #2: Story

Claude Test #3: Cover Letter

Bard Test #1: Essay

Bard Test #2: Story

Bard Test #3: Cover Letter

Undetectable AI + ChatGPT

Undetectable AI + Claude

Undetectable AI + Bard

Writer

ChatGPT Test #1: Essay

ChatGPT Test #2: Story

ChatGPT Test #3: Cover Letter

Claude Test #1: Essay

Claude Test #2: Story

Claude Test #3: Cover Letter

Bard Test #1: Essay

Bard Test #2: Story

Bard Test #3: Cover Letter

Undetectable AI + ChatGPT

Undetectable AI + Claude

Undetectable AI + Bard

The Best AI Detector: False Positive Test

I'll be using a mix of public domain works and my own thesis (to simulate an academic setting) as my test cases. For the former, here's what I'll use for this section:

  • Middlemarch by George Eliot.
  • About Leisure by Vernon Lee.
  • On Laziness by Christopher Morley.
  • On Lying in Bed by G. K. Chesterton.

I won't scan the entire text in each detector. Instead, I'll only test the first 300 words of each document. And before I forget, these scores will measure the human likelihood, instead of AI.
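Trimming each document to its first 300 words can be sketched as follows. This is a minimal illustration, not the exact procedure used for these tests: the helper name is hypothetical, and it assumes words are separated by whitespace, which may count slightly differently from a word processor.

```python
def first_n_words(text, n=300):
    """Return the first n whitespace-separated words of text.

    Runs of whitespace (including line breaks) are collapsed to single
    spaces, so paragraph breaks are not preserved in the excerpt.
    """
    return " ".join(text.split()[:n])


sample = "word " * 500
excerpt = first_n_words(sample)
print(len(excerpt.split()))
```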

Originality AI

Test #1

Test #2

Test #3

Test #4

Test #5

Copyleaks

Test #1

Test #2

Test #3

Test #4

Test #5

Content at Scale

Test #1

Test #2

Test #3

Test #4

Test #5

Winston AI

Test #1

Test #2

Test #3

Test #4

Test #5

GPTZero

Test #1

Test #2

Test #3

Test #4

Test #5

ZeroGPT

Test #1

Test #2

Test #3

Test #4

Test #5

Sapling AI

Test #1

Test #2

Test #3

Test #4

Test #5

Writer

Test #1

Test #2

Test #3

Test #4

Test #5

The Final Tally

AI Detector         True Positive Test    False Positive Test
Sapling AI          87.04%                93.84%
Winston AI          91.92%                59.2%
Copyleaks           75%                   80%
Originality AI      68.83%                73.8%
Content at Scale    70.83%                60%
GPTZero             65.25%                63.4%
ZeroGPT             36.87%                63.95%
Writer              18.67%                84%

I've said it before, and I'll say it now: Sapling AI deserves more recognition for its accuracy. Not only can it detect AI text from a mile away (second highest at 87.04%), but it's also the most reliable detector in our tests at recognizing human writing (highest at 93.84%) in the false positive test. Our honorable mentions include Copyleaks, Originality, and Content at Scale, in that order.

You can say that Writer is amazing at preventing false positives, but I'd like to offer a different conclusion: It's incredibly lenient. This is made apparent by its performance on AI-generated texts, where it only managed to be 18.67% accurate. Out of all the detectors I've tested, I can confidently say that Writer is the most inaccurate.

On the other hand, I can also say that Winston is pretty reliable at catching AI text (highest at 91.92%), but it's stricter than the other detectors. This leads to the lowest score in the false positive test (59.2%). It's still respectable, given that I fed these detectors academic text and literature, but definitely worse than others.
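If you want to check the rankings against the tally yourself, the sketch below sorts the detectors on each column. The numbers are copied from the table above; the dictionary layout and function name are just one convenient way to hold them.

```python
# (true positive test %, false positive test %) per detector, from the final tally
TALLY = {
    "Sapling AI": (87.04, 93.84),
    "Winston AI": (91.92, 59.2),
    "Copyleaks": (75.0, 80.0),
    "Originality AI": (68.83, 73.8),
    "Content at Scale": (70.83, 60.0),
    "GPTZero": (65.25, 63.4),
    "ZeroGPT": (36.87, 63.95),
    "Writer": (18.67, 84.0),
}


def rank(column):
    """Sort detector names best-to-worst on column 0 (AI text) or 1 (human text)."""
    return sorted(TALLY, key=lambda name: TALLY[name][column], reverse=True)


by_ai = rank(0)
by_human = rank(1)
print("AI text:   ", by_ai)
print("Human text:", by_human)
```

Sorting each column separately makes the trade-off easy to see: a detector can sit near the top of one list and near the bottom of the other.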

If you’re interested in the complete version, here’s a tabulated copy of the results.

What’s The Verdict?

So, which AI detector should you use?

You've seen our testing, and, in my opinion, Sapling AI is a no-brainer when it comes to free AI detectors. If you have the money and you want other features, such as a plagiarism checker and integration with other apps, then go for Winston AI.

We also found detectors that you shouldn't use in 2024, and they're Writer and ZeroGPT. They're so unreliable that they shouldn't even be considered for use in a classroom or workplace setting.

The accuracy of AI detectors has been controversial since ChatGPT first came onto the scene. Knowing which detector is the least likely to make a mistake is crucial if your actions affect other people's futures. That's the question we aimed to answer in this article, so be mindful of these results when you Google "the best AI detection tool" next time.

While I have you here, can I interest you in some of our other articles on AI detectors? This one's pretty interesting, and so is this other one. In fact, we have an entire catalog of articles dedicated to learning more about AI detection, so have fun reading!

Want To Learn Even More?
If you enjoyed this article, subscribe to our free monthly newsletter
where we share tips & tricks on how to use tech & AI to grow and optimize your business, career, and life.
Written by John Angelo Yap
Hi, I'm Angelo. I'm currently an undergraduate student studying Software Engineering. Now, you might be wondering, what is a computer science student doing writing for Gold Penguin? I took up studying computer science because it was practical and because I was good at it. But, if I had the chance, I'd be writing for a career. Building worlds and adjectivizing nouns for no reason other than they sound good. And that's why I'm here.