ChatGPT web crawler is here and sites are trying to protect themselves

posted Sunday Aug 13, 2023 by Scott Ertz

Over the past few months, AI, or the falsely named artificial intelligence, has been the topic du jour in the tech world and across all of the media. There have been positive advancements, like the ability to use language processing to summarize an audio file or to create more accurate subtitles for content. But the majority of the "advancements" have fallen on the dark side of tech. Now OpenAI, the company behind ChatGPT, has a web crawler to help it index more of the internet, and publishers are panicked.

The problem with AI

Calling AI "artificial intelligence" is a misnomer. There is no intelligence behind the technology - only an incredibly advanced version of the predictive text you see on your mobile keyboard as you type. Based on the text that has already been entered, the system predicts the most likely and reasonable next word to follow. Your phone keyboard does this by looking at your past typing and learning how you behave. AI systems, however, train on a far larger dataset, taking content from anywhere they can find it.
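To make the keyboard comparison concrete, here is a toy sketch of next-word prediction in Python. It simply counts which word most often follows another in a small sample text. This is only an illustration of the idea: the function names and the sample sentence are made up for this example, and real systems like ChatGPT use neural networks trained on vastly more data rather than a simple word count.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def predict_next(following, word):
    """Return the most frequently seen follower of `word`, or None if unseen."""
    counts = following.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical sample text standing in for "your past typing".
corpus = "the cat sat on the mat and the cat slept"
model = train_bigrams(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" only once
```

The prediction is only as good as the text it was trained on, which is why the size and source of the training data matter so much.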

The problem with this idea, of course, is that these systems train on content written by real human writers, change it slightly, and then claim ownership and originality, giving the original authors no credit or compensation for their work. Essentially, AI is a middle schooler writing a report on a book they didn't read, trying to get around the plagiarism checker their teacher uses by changing just enough words to look original.

One of the things that has made AI less of a threat is that the information in the primary systems is old. That is because, unlike Google and Bing proper, they do not crawl the web constantly. They train on a fixed collection of data, and their knowledge stays frozen there until the next set of training data is fed into the system. But OpenAI is hoping to change that with its new web crawler, GPTBot.

GPTBot fears

There has already been a lot of backlash to AI training systems. Creators of all stripes have sued and written about the copyright violations that these systems pose. We've seen complaints from musicians and the RIAA, comedians, including Sarah Silverman, graphic artists, and writers and publishers. But with the creation of an active web crawler, publishers foresee a bigger problem where ChatGPT will not only be able to steal their older written content, it will be able to do it in near real time. This means that recent news topics could be recreated as "new articles" using ChatGPT or Bing Chat.

As a result, publishers have gone into hyperdrive trying to figure out how to handle the looming threat to their livelihoods. Avram has suggested that publishers might put their content behind a paywall, ending the free and open web as we know it. But, for now, publishers are using OpenAI's own documentation about how GPTBot works in order to defeat it.

As it turns out, the GPTBot web crawler will use a pre-defined user agent string, which will give website publishers the ability to either block it entirely or serve it altered content in order to trick the system and purposely damage its training data. I'm currently leaning toward the second option, as it not only prevents your content from being stolen, but also helps to defeat the effectiveness of the system as a whole. It could be a fun tech experiment to see how well untraining systems work.
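As a rough sketch of the blocking option, the Python below shows two approaches: a robots.txt rule that asks GPTBot to stay away, and a server-side check on the user agent string. The "GPTBot" token comes from OpenAI's documentation; the function names, the `respond` handler, and the sample user agent strings are assumptions made up for this illustration, not part of any real server framework. Keep in mind that robots.txt is only a request that a crawler may choose to honor, while the server-side check depends on the crawler identifying itself honestly.

```python
from urllib.robotparser import RobotFileParser

# robots.txt rule asking the GPTBot crawler to stay off the whole site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text so the rules can be checked locally."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_gptbot(user_agent):
    """True if the request's User-Agent header mentions the GPTBot token."""
    return "gptbot" in (user_agent or "").lower()


def respond(user_agent, path):
    """Hypothetical request handler: refuse GPTBot, serve everyone else."""
    if is_gptbot(user_agent):
        return 403, "Forbidden"
    return 200, f"content for {path}"


parser = build_parser(ROBOTS_TXT)
print(parser.can_fetch("GPTBot", "/article"))        # rule disallows GPTBot
print(parser.can_fetch("SomeOtherBot", "/article"))  # no rule for other bots
print(respond("Mozilla/5.0 (compatible; GPTBot/1.0)", "/article"))
```

Serving altered content instead of blocking would follow the same pattern: the `is_gptbot` branch would return a different body rather than a 403.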

Either way, sites should be aware of what is to come.

F5 Live: Refreshing Technology

August 13, 2023 - Episode 652

Sunday Aug 13, 2023 (02:06:03)

Description

This week, Cortana is dead, Netflix is bringing games to the TV, ChatGPT is crawling the web, and Disney is raising prices... again.

Participants

Scott Ertz

Host

Scott is a developer who has worked on projects of varying sizes, including all of the PLUGHITZ Corporation properties. He is also known in the gaming world for his time supporting the rhythm game community, through DDRLover and hosting tournaments throughout the Tampa Bay Area. Currently, when he is not working on software projects or hosting F5 Live: Refreshing Technology, Scott can often be found returning to his high school days working with the Foundation for Inspiration and Recognition of Science and Technology (FIRST), mentoring teams and helping with ROBOTICON Tampa Bay. He has also helped found a student software learning group, the ASCII Warriors, currently housed at AMRoC Fab Lab.

Avram Piltch

Host

Avram's been in love with PCs since he played original Castle Wolfenstein on an Apple II+. Before joining Tom's Hardware, for 10 years, he served as Online Editorial Director for sister sites Tom's Guide and Laptop Mag, where he programmed the CMS and many of the benchmarks. When he's not editing, writing or stumbling around trade show halls, you'll find him building Arduino robots with his son and watching every single superhero show on the CW.

Opening

Powered by TeknoAXE

Nifty Gifties

Powered by Microsoft Store

Cortana is dead in Windows 11 but her corpse remains on your PC

Cortana was once one of Microsoft's biggest focuses. Starting as the spotlight feature of Windows Phone, she grew to live within and control everything in the Microsoft ecosystem. But now, Microsoft has changed course, renaming some of the former Cortana services and destroying the rest. And, as the company too often does, it is leaving consumers to clean up the mess.

Piltch Point with Avram Piltch

Powered by PureVPN

Extra Life

Powered by Eksa

Netflix launches mobile game controller, Netflix Games coming to TVs

Netflix has been trying to differentiate itself from the competition. Since Netflix created its streaming business, the industry has become incredibly crowded. Everyone needs a hook, and Netflix has gone all in on gaming as that differentiator. But the mobile game business has not been the company's only intention, and the next phase of Netflix Games hit the App Store this week with a mobile game controller.

News From the Tubes

Powered by Malwarebytes