Over the past few months, there has been a lot of talk about the ethics of how generative AI works on the backend. For those of us who create content online, it can look a lot like plagiarism when a computer system ingests our content and then rearranges it to create what it calls "new content." Creators have been looking for ways to prevent these systems from ingesting their work, and The New York Times has taken matters into its own hands, changing its terms of service and reportedly preparing to sue OpenAI.
The problem with generative AI
The biggest issue with generative AI comes from the core technology underpinning it. Before a generative AI system can work, it has to ingest and process an enormous amount of content. To gather that much material, these systems scrape data from the internet. Text systems read the web and ingest news articles, stories, books, and more. Visual systems ingest content from Google and Bing image searches, as well as sites like DeviantArt.
Using the information they collect from the internet, these systems then generate similar content, but with the words in a new order. They don't actually learn about topics and expound on them. Instead, they work like a middle schooler writing a book report about a book they didn't read: they take something someone else wrote and move the words around just enough to avoid plagiarism claims.
Worse, these generative AI systems give no credit to the original authors or publishers, and they earn revenue from the content they "generate" out of this harvested data. Understandably, authors, artists, publishers, and others have been looking for ways to protect their content and their livelihoods. It looks like The New York Times is leading the charge.
Prohibiting AI training
The New York Times has changed its terms of service to prevent systems from scraping data from its website or mobile applications. The language reads:
(2) use robots, spiders, scripts, service, software or any manual or automatic device, tool, or process designed to data mine or scrape the Content, data or information from the Services, or otherwise use, access, or collect the Content, data or information from the Services using automated means;
(3) use the Content for the development of any software program, including, but not limited to, training a machine learning or artificial intelligence (AI) system.
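Terms of service are a legal deterrent, but sites can also signal crawlers directly through a robots.txt file. OpenAI has published a user-agent string, GPTBot, that its crawler uses and says it honors these rules; Common Crawl's CCBot, whose archives are widely used for AI training, does the same. A minimal sketch of such a robots.txt (the exact directives a given site uses may differ):

```text
# Ask OpenAI's GPTBot crawler not to fetch any page on the site
User-agent: GPTBot
Disallow: /

# Ask Common Crawl's bot (its archives feed many AI training sets) to stay out
User-agent: CCBot
Disallow: /
```

Note that robots.txt is voluntary: it only works against crawlers that choose to respect it, which is part of why publishers are also turning to contract terms and litigation.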
The company is setting itself up to protect its content going forward, but it isn't stopping at future use. In addition to updating its terms of service to prohibit scraping, The New York Times is reportedly planning to file suit against OpenAI, the creator of ChatGPT. The lawsuit would reportedly seek a large penalty, $150K, for each instance of what the paper considers copyright infringement.
If the lawsuit is filed and The New York Times wins, it would force a major change in the way generative AI systems work. OpenAI would need to purge the newspaper's content from its systems, which could effectively require it to wipe its entire database and start over. A loss for OpenAI would also likely prompt other publishers to make the same claims and the same demands of other generative AI systems.