Artificial intelligence

News: Tests show AI-based image generation systems can sometimes generate copies of their training data

Diffusion models memorize individual training examples and generate them at test time. Left: Image from the Stable Diffusion training set (licensed under CC BY-SA 3.0). Right: Stable Diffusion generation when prompted with "Ann Graham Lotz". The reconstructions are almost identical (ℓ2 distance = 0.031). Credit: arXiv (2023). DOI: 10.48550/arXiv.2301.13188
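The ℓ2 distance quoted in the caption is a standard pixel-space similarity measure: a small value means the generated image is nearly pixel-identical to a training image. A minimal sketch of one such measure (root-mean-square pixel distance; the paper's exact normalization may differ, and the images here are random stand-ins):

```python
import numpy as np

def normalized_l2_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Root-mean-square pixel distance between two same-shape images.

    Assumes pixel values in [0, 1]; a value near zero (e.g. ~0.03)
    indicates the two images are nearly identical.
    """
    diff = a.astype(np.float64) - b.astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))

# Two nearly identical 64x64 RGB stand-in images, differing by faint noise:
rng = np.random.default_rng(0)
original = rng.random((64, 64, 3))
near_copy = np.clip(original + rng.normal(0, 0.01, original.shape), 0, 1)
print(normalized_l2_distance(original, near_copy))  # prints a small value (~0.01)
```

Scanning many generations for any that fall below a small threshold of this distance against the training set is one way near-copies like the one in the figure can be flagged.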

A team of computer scientists from Google, DeepMind, ETH Zurich, Princeton University, and UC Berkeley has found that AI-based image generation systems can sometimes generate copies of the images used to train them. The group has published a paper describing their tests of several image generation systems on the arXiv preprint server.

Image generation systems such as Stable Diffusion, Imagen, and DALL-E 2 have been in the news recently because of their ability to generate high-resolution images from natural-language prompts alone. Such systems are trained on vast collections of example images.

In this new effort, the researchers, some of whom were part of the team that created one of the systems, discovered that these systems sometimes make a very important mistake: instead of generating a new image, the system simply spits out one of the images from its training data. This happened fairly often in their tests, with more than 100 such instances found per 1,000 generated images.

This is a problem because training datasets are often scraped from the internet, and many of the images in them are copyrighted. During testing, the team found that roughly 35 percent of the copied images carried explicit copyright notices, while roughly 65 percent carried no explicit notice but appeared likely to fall under general copyright protection.

The researchers note that most AI-based image generation systems include a processing stage in which noise is added, which should prevent training images from being returned verbatim and push the system toward creating something new. They also observed that the system sometimes added noise to a copied image, making it harder to tell that it was a copy.
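The noise-adding stage described above corresponds to the forward step of a diffusion process, in which Gaussian noise is progressively mixed into an image. A minimal illustration of that idea (not the production systems' actual code; the linear variance schedule and sizes here are hypothetical choices for the sketch):

```python
import numpy as np

def forward_diffuse(x0: np.ndarray, t: int, betas: np.ndarray,
                    rng: np.random.Generator) -> np.ndarray:
    """Apply t+1 steps of Gaussian noising to a clean image x0 in one shot.

    Uses the standard closed form
        x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,
    where alpha_bar_t is the cumulative product of (1 - beta) and
    eps ~ N(0, I). Larger t means more of the original is destroyed.
    """
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# Hypothetical linear schedule over 1,000 steps.
betas = np.linspace(1e-4, 0.02, 1000)
rng = np.random.default_rng(0)
image = rng.random((8, 8))  # stand-in for a training image

slightly_noised = forward_diffuse(image, 10, betas, rng)   # still recognizable
heavily_noised = forward_diffuse(image, 999, betas, rng)   # essentially pure noise
```

A memorized output is one where the model, asked to reverse this noising from scratch, reconstructs a specific training image rather than a genuinely new one.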

The team concluded that makers of such products need to add another safeguard to prevent copies from being returned, and they suggest that a simple tagging mechanism should do the trick.

More information:
Nicholas Carlini et al., Extracting Training Data from Diffusion Models, arXiv (2023). DOI: 10.48550/arXiv.2301.13188

© 2023 Science X Network

Citation: Tests show AI-based image generation systems can sometimes generate copies of training data (February 2, 2023). Retrieved February 2, 2023.

This document is protected by copyright. Except for any fair dealing for private study or research purposes, no part may be reproduced without written permission. The content is for reference only.
