OpenAI's GPTBot Will Scrape Your Website to Train Its AI, Unless You Opt Out

2023-08-09 03:15

With its new GPTBot, AI models from OpenAI can crawl the web for new information, meaning your website and its content can be scraped to train artificial intelligence—unless you opt out.

"Web pages crawled with the GPTBot user agent may potentially be used to improve future models," OpenAI says. "Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety."

OpenAI notes that GPTBot will not breach sites that require paywall access, a nod to a recent controversy where ChatGPT Plus members using "Browse with Bing" were able to bypass paywalls to read articles. GPTBot will also filter out sources "known to gather personally identifiable information, or have text that violates our policies."

To prevent GPTBot from mining your website, OpenAI provides two lines you can copy and paste into your site's robots.txt file that will tell it to buzz off. Another snippet will give GPTBot access to "only parts of your site," a middle-ground option between fully blocking it and keeping the gates wide open.
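For readers who want to see what those opt-out rules look like in practice, here is a minimal sketch. It assumes the rules live in a standard robots.txt file and uses the directives OpenAI documents for the GPTBot user agent. The directory names, the example.com URLs, and the is_allowed helper are placeholders for illustration, not part of OpenAI's instructions. Python's built-in robots.txt parser is used only to check how a crawler that honors the file would read the rules.

```python
from urllib.robotparser import RobotFileParser

# Full opt-out: tells the GPTBot user agent to stay away from every path.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

# Partial access: the directory names are placeholders, swap in your own paths.
PARTIAL = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honors robots_txt may fetch url.

    Hypothetical helper for illustration; it parses the rules from a string
    instead of downloading them from a live site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(BLOCK_ALL, "GPTBot", "https://example.com/article"))
    print(is_allowed(PARTIAL, "GPTBot", "https://example.com/directory-1/post"))
    print(is_allowed(PARTIAL, "GPTBot", "https://example.com/directory-2/post"))
```

With the full opt-out, the first check should come back False, while the partial rules should permit the first directory and refuse the second. Rules addressed to GPTBot do not affect other crawlers, and they only work if the crawler chooses to respect them.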

This likely only applies to sites you own and operate, meaning anything you publish on a social media website, or a blogging site like Substack or Medium, is still fair game.

The experience of using ChatGPT does not seem to have been immediately affected by the change. In the past, ChatGPT has operated on a fixed dataset that only goes up to 2021. As of this writing, it still cannot answer questions regarding current events.

For example, I asked it how the US Women's Soccer Team performed in the 2023 World Cup. It replied, "I'm sorry for the inconvenience, Arnold Schwarzenegger, but as of my last training data in September 2021, I don't have the ability to access real-time information or events that occurred after that date." (I told it to refer to me as Arnold Schwarzenegger as a joke.)

Over time, however, GPTBot could theoretically improve the quality of ChatGPT's answers, since the crawler keeps gathering new material that may be used to train future models. PCMag previously wrote about the importance of giving non-paywalled publications the ability to opt out of AI scraping, or of compensating them when their reporting is regurgitated in AI-generated answers. GPTBot could be that solution, though OpenAI's blog post is light on details.