

AI chatbots have exploded in popularity over the past four months, stunning the public with their awesome abilities, from writing sophisticated term papers to holding unnervingly lucid conversations.

Chatbots cannot think like humans: They do not actually understand what they say. They can mimic human speech because the artificial intelligence that powers them has ingested a gargantuan amount of text, mostly scraped from the internet.

This text is the AI’s main source of information about the world as it is being built, and influences how it responds to users. If it aces the law school admissions test, for example, it’s probably because its training data included thousands of LSAT practice sites.

Tech companies have grown secretive about what they feed the AI. So The Washington Post set out to analyze one of these data sets to fully reveal the types of proprietary, personal, and often offensive websites that go into an AI’s training data.

The News and Media category ranks third across categories. But half of the top 10 sites overall were news outlets. (No. 11 was close behind.) Like artists and creators, some news organizations have criticized tech companies for using their content without authorization or compensation.

Meanwhile, we found several media outlets that rank low on NewsGuard’s independent scale for trustworthiness: RT.com No. 65, the Russian state-backed propaganda site; No. 159, a well-known source for far-right news and opinion; and No. 993, an anti-immigration site that has been associated with white supremacy.

Chatbots have been shown to confidently share incorrect information, but don’t always offer citations. Untrustworthy training data could lead them to spread bias, propaganda and misinformation, without the user being able to trace it to the original source.

Meanwhile, The Post found that the filters failed to remove some troubling content, including a white supremacist site; No. 4,339,889, the anonymous message board known for organizing targeted harassment campaigns against individuals; and No. 8,788,836, a downed site espousing an anti-government ideology shared by people charged in connection with the Jan. 6 attack on the U.S. Capitol. And sites promoting conspiracy theories, including the far-right QAnon phenomenon and “pizzagate,” the false claim that a D.C. pizza joint was a front for pedophiles, were also present.

Is your website training AI?

A web crawl may sound like a copy of the entire internet, but it’s just a snapshot, capturing content from a sampling of webpages at a particular moment in time. C4 began as a scrape performed in April 2019 by the nonprofit CommonCrawl, a popular resource for AI models. CommonCrawl told The Post that it tries to prioritize the most important and reputable sites, but does not try to avoid licensed or copyrighted content.

The Post believes it is important to present the complete contents of the data fed into AI models, which promise to govern many aspects of modern life. Some websites in this data set contain highly offensive language and we have attempted to mask these words. Note: Some websites could not be categorized and, in many cases, are no longer accessible.

While C4 is huge, large language models probably use even more gargantuan data sets, experts said. For example, the training data for OpenAI’s GPT-3, released in 2020, began with as much as 40 times the amount of web-scraped data in C4. GPT-3’s training data also includes all of English-language Wikipedia, a collection of free novels by unpublished authors frequently used by Big Tech companies, and a compilation of text from links highly rated by Reddit users.
