A: Mostly not. Books that are included in training data are mostly out of copyright. There is also a corpus of "unpublished" novels that is a commonly used training set. However, some published books, for example Lord of the Rings, do appear to have found their way into training data. This has been established by researchers feeding LLMs random sentences from books and seeing whether the models can complete them. For a few books, this succeeds at a fairly high rate.
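That completion probe can be sketched in a few lines of code. Everything here is illustrative: `model_complete` is a stand-in for whatever LLM API a researcher would actually call (stubbed below with a canned answer so the scoring logic runs on its own), and the normalization rule is an assumption, not how any particular study scored matches.

```python
def model_complete(prompt: str) -> str:
    """Placeholder for a real LLM completion call (an assumption, not a real API).

    The stub 'memorizes' one opening line from Lord of the Rings and
    returns an unrelated continuation for everything else.
    """
    canned = {
        "When Mr. Bilbo Baggins of Bag End announced":
            " that he would shortly be celebrating his eleventy-first birthday",
    }
    return canned.get(prompt, " something entirely different")


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def probe(prompt: str, truth: str) -> bool:
    """Feed the model the start of a sentence; check whether its
    continuation reproduces the book's actual next words."""
    completion = model_complete(prompt)
    return normalize(completion).startswith(normalize(truth))


def memorization_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (prompt, true continuation) probes the model completes."""
    hits = sum(probe(prompt, truth) for prompt, truth in pairs)
    return hits / len(pairs) if pairs else 0.0
```

With the stub above, a Tolkien probe succeeds and a Melville probe fails, so a two-sentence test set yields a rate of 0.5; against a real model, a high rate across many probes from one book is the signal that the book was in the training data.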
Q: If ChatGPT can summarize my book, does that mean OpenAI stole it?
A: Not at all. Is there a plot summary in Wikipedia? That would definitely figure into how your plot is known to an LLM. Was it widely reviewed? Did people write about your book and post these thoughts on the Internet? These would also figure in. Book-length summarization is still a somewhat challenging task. There is no process of structured book summarization from primary texts going on inside the creation of LLMs. That's not how this works.
Q: If ChatGPT can mimic my prose style, does that mean they stole my book?
A: No. Are you famous enough that other people have written about your prose style? Or are you perceived by people who write online as similar to writers who have that level of fame? That may be part of the explanation. Did you have a LiveJournal or a blog? Do you write a lot online? Do you have a Twitter account? Or are you similar to people who do? These are the likely sources of the vectors that generate an imitation of your prose style.
One problem with how LLMs are created is that they were created mainly by people and organizations who don't value books enough to privilege them over random Internet ranting and shitposting. They care little to nothing about your skill at revision and the economy of your prose style. First-draft prose by the unlettered is just fine with them.
Yes, there are books and scientific articles in the mix. But for the most part, the creators of LLMs tried to avoid commercially published work and instead used corpora such as The Common Crawl. (My blog, which has had a copyright notice on it at all times since its founding in 2003, is part of The Common Crawl.)
Q: Pirated copies of my books exist on the Internet. Does that mean they were used to train language models?
A: No. My general impression is that companies training LLMs avoid using pirated texts. The books that do seem to have ended up in training data are those pirated in many, many copies.
Q: But let's say they DID include my books in the training data. Can ChatGPT now allow total strangers to write books as me?
A: No. Insofar as someone can use an LLM to generate a novel-length lump of prose, put your name on the cover, and sell it on Amazon under your name, that is a problem with Amazon, and mostly unrelated to "theft" of books. The training process mulches the training data, so reconstructing your book from a model would be a bit like reconstructing a tree from a bag of mulch. Emulation of style by LLMs tends to be superficial: it gets into the right subject area but is not very precise beyond that.
To do anything remotely like building an AI thing that can accurately write books as you, one would need to do something called "fine-tuning." I know someone who is trying something like this, to create an AI system that can write as him, but he has a lot of money and a lot of technology and has not yet announced his success.
Q: Is what OpenAI and similar companies are doing "theft" of the work of writers?
A: Maybe. But not in the usual sense of piracy and plagiarism. What they are doing more closely mimics fair use and the way creative people are taught their art in school than the usual crimes against literary property. But the makers of LLMs are laying claim to a large cultural commons that belongs to all of us.
Publishing contracts and DMCA claims are a bit beside the point. The legal rights we collectively need have not yet been defined. These rights need to be defined as soon as possible. Also, we need to imagine how we might collectively be compensated.
Illustration: stenciled spray paint on wood by Kathryn Cramer, Toronto, 2021.