ChatGPT web crawler is here and sites are trying to protect themselves - The UpStream

posted Sunday Aug 13, 2023 by Scott Ertz

Over the past few months, AI, or the falsely named artificial intelligence, has been the topic du jour in the tech world and across all of the media. There have been positive advancements, like the ability to use language processing to summarize an audio file or to create more accurate subtitles for content. But the majority of the "advancements" have fallen on the dark side of tech. Now OpenAI, the company behind ChatGPT, has a web crawler to help it index more of the internet, and publishers are panicked.

The problem with AI

Calling AI "artificial intelligence" is a misnomer. There is no intelligence behind the technology - only an incredibly advanced version of the predictive text you see on your mobile keyboard as you type. Based on the text it has already been given, the system predicts the most likely and reasonable next word to follow. Your phone keyboard does this by looking at your past typing and learning how you behave. AI systems, however, train on a much larger dataset, taking content from anywhere they can find it.
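To make the predictive text comparison concrete, here is a toy sketch of the keyboard-style version of the idea: count which word has followed each word so far, then suggest the most common follower. The function names and the sample sentence are made up for illustration, and real AI systems are far more elaborate than this, so treat it as an analogy rather than a description of how ChatGPT is built.

```python
from collections import Counter, defaultdict


def train(text):
    """Count which word follows each word in the training text."""
    followers = defaultdict(Counter)
    words = text.lower().split()
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers


def predict_next(followers, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical "typing history" used only for this example.
model = train("the cat sat on the mat and the cat slept")
print(predict_next(model, "the"))  # cat
```

The more text the counter has seen, the better its guesses get, which is why the size of the training data matters so much.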

The problem with this idea, of course, is that these systems train on content written by real human writers, change it slightly, and then claim ownership and originality while giving the original authors no credit or compensation for their work. Essentially, AI is a middle schooler writing a book report on a book they didn't read and trying to get around the plagiarism checker their teacher uses by changing just enough words to look original.

One of the things that has made AI less of a threat is that the information in the primary systems is old. That is because, unlike Google and Bing proper, these systems do not crawl the web constantly. They train on a collection of data, and their knowledge stays frozen there until the next set of training data is fed into the system. But OpenAI is hoping to change that with its new web crawler, GPTBot.

GPTBot fears

There has already been a lot of backlash to AI training systems. Creators of all stripes have sued and written about the copyright violations that these systems pose. We've seen complaints from musicians and the RIAA, comedians, including Sarah Silverman, graphic artists, and writers and publishers. But with the creation of an active web crawler, publishers foresee a bigger problem: ChatGPT will not only be able to steal their older written content, it will be able to do it in near real time. This means that recent news topics could potentially be recreated as "new articles" using ChatGPT or Bing Chat.

As a result, publishers have gone into hyperdrive trying to figure out how to handle the looming threat to their livelihoods. Avram has suggested that publishers might put their content behind a paywall, ending the free and open web as we know it. But, for now, publishers are using OpenAI's own documentation about how GPTBot works in order to defeat it.

As it turns out, the GPTBot web crawler will use a pre-defined user agent string, which will give website publishers the ability to either block it entirely or change the content they serve it, in order to trick the system or purposely damage its training data. I'm currently leaning toward the second option, as it not only prevents your content from being stolen, but also helps to defeat the effectiveness of the system as a whole. It could be a fun tech experiment to see how well untraining systems work.
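As a rough illustration of the two options, here is a minimal Python sketch of server-side logic that looks at the User-Agent header and either refuses the request or swaps in different content. It assumes the crawler's user agent string contains the token "GPTBot", as OpenAI's documentation describes. The function names, the `mode` parameter, and the placeholder text are hypothetical and not part of any real framework.

```python
# Assumption: the crawler identifies itself with "GPTBot" somewhere in its
# User-Agent header. Everything else here is a hypothetical illustration.
GPTBOT_TOKEN = "gptbot"


def is_gptbot(user_agent):
    """Return True if the User-Agent header appears to identify GPTBot."""
    return GPTBOT_TOKEN in (user_agent or "").lower()


def respond(user_agent, article_html, mode="block"):
    """Return a (status_code, body) pair for an incoming request.

    mode="block" refuses the crawler outright.
    Any other mode serves placeholder content instead of the real article.
    """
    if not is_gptbot(user_agent):
        # Regular visitors and other crawlers get the real article.
        return 200, article_html
    if mode == "block":
        return 403, "Forbidden"
    return 200, "<p>Placeholder text served to the crawler.</p>"
```

A quick check might look like `respond("Mozilla/5.0 (compatible; GPTBot/1.0)", "<p>Real story</p>")`, which returns a 403 in the default mode. For publishers who only want the first option, OpenAI's documentation also describes a simpler route: a robots.txt rule that disallows the GPTBot user agent.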

Either way, sites should be aware of what is to come.