This site tells you if your photos are being used to train AI

Image-generating AIs like Stable Diffusion or Midjourney have become very popular in a very short time. However, it is difficult to know exactly which images these artificial intelligences are trained on.

Have your photos on social networks been used to develop artificial intelligence? Put that way, the question may seem absurd. Who would have imagined that photos posted on Facebook or Instagram could be used to teach AIs what a forest looks like?

Yet it’s a fact: image-generating AIs are trained on a huge corpus of photos found on the Internet, perhaps including yours. The question matters even more if you are a creator on social networks and want to make sure your copyright is not being violated. There is a tool to find out whether this is the case: HaveIBeenTrained.

See the database used to develop the AIs

HaveIBeenTrained lets you consult Laion 400M and Laion 5B, two huge databases of 400 million and 5 billion photos, respectively, used to train the Stable Diffusion and Imagen AIs. These are the two largest databases of images described with text, which allows AIs to better connect pictures with the words that describe them.

To find out if one of the paintings you’ve shared online is part of these two huge databases, nothing could be easier: just search by image or by text. A query for “forest image” will show you all the images available in the database that match that description.

Case study on HaveIBeenTrained // Source: HaveIBeenTrained

But HaveIBeenTrained is mainly aimed at artists who have a presence on social networks and whose works may have been absorbed by Laion. The site thus offers artists the ability “to search this database for links to their work and request their removal,” we can read in its description. “We partner with Laion, which compiles these databases, to ensure that future [artificial intelligence] models are not trained on removed works.”

The site is specifically addressed to artists. In early January 2023, three artists, including cartoonist Sarah Andersen, known for her Instagram comics, filed a complaint against Midjourney and Stable Diffusion. According to the complaint, these AIs, which use billions of images from the internet for training, “infringed the copyrights of millions of artists […] who did not consent and did not receive compensation.”

Using HaveIBeenTrained, it’s easy to see that Sarah Andersen’s images do appear in the Laion database.

Sarah Andersen’s comics used for training by artificial intelligences // Source: HaveIBeenTrained

What do we find in this database?

Until now, it was very difficult to know exactly what was in this huge database of 5 billion entries. Laion 400M and Laion 5B are assembled using complex, fully automated procedures that do not sort the images being added. As a result, some of the photos they contain are not necessarily copyright-free.

The photo agency Getty Images recently paid the price: it realized that an artificial intelligence had been trained on so many of its photos that it reproduced the agency’s famous copyright watermark. Getty Images has taken legal action over Stable Diffusion for having “illegally scanned and analyzed millions of copyrighted photographs.”

A quick test shows the variety of what can be found there. There are not only landscape images, but also book covers, advertising images, excerpts from Facebook posts with their authors’ names clearly visible, and even photos of anonymous individuals published on Skyblog.

During our research, we even came across some pornographic photos, proof that these databases contain all kinds of material and that everyone would do well to check what is in them.

