On Thursday morning, news broke that someone was selling student data from the University of Michigan to tech workers who build AI chatbots. An employee at Google DeepMind, the company’s AI research hub, said they’d gotten an offer for recordings of lectures, student discussions, and office hours, as well as essays written by seniors and grad students, all available for a paltry licensing fee. Now, the University says it was all a misunderstanding, that students gave their consent, and there’s nothing to worry about.
Susan Zhang, an engineer at DeepMind, said that she’d received a sponsored LinkedIn message hawking the information, and offering a free sample of the University of Michigan data to prove its worth.
“I’m reaching out because, based on your profile, you may be working with Large Language models (LLM’s) or natural language processing,” the sales message said. “I wanted to let you know that the University of Michigan is licensing academic speech data and student papers that could be very useful for training or tuning LLM’s.”
The message offers data from 85 hours’ worth of lectures, discussion sections, and interviews for $15,595, a second set of 829 papers written by University of Michigan students across various disciplines for $12,595, or a discount package for both data sets at $25,000.
However, the message “was sent out by a new third-party vendor that shared inaccurate information and has since been asked to halt their work,” Colleen Mastony, a University of Michigan spokesperson, said in an email. “No transactions or sharing of content occurred by the vendor. Student data was not and has never been for sale by the University of Michigan.” Mastony didn’t share details about who this vendor was, or what, exactly, was inaccurate about the information they offered.
The University may not be selling the data directly, but it is (or was) being offered for sale by an organization called Catalyst Research Alliance, which claims to partner with the University of Michigan as well as North Carolina State University. The website offers a sample of the data set, which comes with an essay titled “The Democratic Inadequacies of the European Union,” and what appears to be a recording of a class discussion section.
Catalyst Research Alliance and North Carolina State University did not immediately respond to requests for comment.
According to Mastony, the recordings and the papers were contributed by student volunteers who participated in two decades-old research studies, and none of the data included students’ names or any other personally identifiable information. “These particular papers and recordings have long been available for free to academics – again without any identifying information – and have been used as a tool to improve writing and articulation in education,” Mastony said.
“I think it’s worth pursuing which universities are selling student data and what the terms are,” Zhang told Gizmodo in a message on X. “Licensing is better than scraping data without attribution but the attribution pipelines here are likely only built halfway (aka original creators won’t see a dime, whereas the reseller who stores data will capture all the profits).”
Training large language models, the software behind chatbots such as ChatGPT and Bard, requires massive, clearly labeled data sets across various subjects and disciplines. While the University of Michigan data set is small, well-organized content on a narrow swath of subjects could be useful for tuning certain models, particularly tools designed for specific purposes related to academia or formal communication, or for training more general AIs to improve their performance in individual areas of subject matter expertise.
Update 02/15/2024, 5:45 p.m. ET: This story has been updated with comments from the University of Michigan.