Post AdFCu70293MIu3hgOm by AnthonyBaker@mastodon.social
(DIR) Post #AdF3cqcmsGePbgWVU0 by simon@fedi.simonwillison.net
2023-12-27T17:33:56Z
0 likes, 0 repeats
Does this new NY Times lawsuit about OpenAI training on their data mean that the details of that training data might come out in discovery? https://apnews.com/article/nyt-new-york-times-openai-microsoft-6ea53a8ad3efa06ee4643b697df0ba57
(DIR) Post #AdF4fbxrw69mxkGXGS by luis_in_brief@social.coop
2023-12-27T17:45:51Z
0 likes, 0 repeats
@simon Could, though that would likely take years—OpenAI will try to get it killed on summary judgment before that, which could take some time.
(DIR) Post #AdF82QUn2kmFunEcvw by fpbhb@mastodon.social
2023-12-27T18:23:45Z
0 likes, 0 repeats
@simon Is that even possible information theory-wise?
(DIR) Post #AdF8S5cFIR5MgdxXwO by simon@fedi.simonwillison.net
2023-12-27T18:28:25Z
0 likes, 0 repeats
@fpbhb OpenAI should have detailed internal records on exactly what they used to train GPT-4 et al
(DIR) Post #AdF96DLbyA6cFXJOQy by fpbhb@mastodon.social
2023-12-27T18:35:47Z
0 likes, 0 repeats
@simon Ah, misunderstanding. I thought you meant that they would somehow retrieve enough of the training material from the model to have a copyright case. OpenAI's records, if revealed, will be interesting for sure, although I doubt what they did is much different from others.
(DIR) Post #AdFCu70293MIu3hgOm by AnthonyBaker@mastodon.social
2023-12-27T19:18:09Z
0 likes, 0 repeats
@simon The thing with all this is that websites across the internet literally provide a standardized, well-structured JSON API format in their source code that explicitly describes the content, authors, images, videos, keywords, data types and text for use by SEO bots. It’s a red carpet. Pretty sure OpenAI has used a third-party “scraper” that hoovers up this info as one of its sources. They did for GPT-3.
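
For readers unfamiliar with the structured metadata described in the post above, here is a minimal sketch of how a scraper might read it. It assumes the format in question is schema.org JSON-LD embedded in a script tag of type application/ld+json; the post does not name the format, so that is a guess. The sample page, the JsonLdExtractor class and the extract_jsonld function are hypothetical and made up for illustration. This is not a description of any tool OpenAI or anyone else actually used.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Only start buffering inside JSON-LD script tags.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than fail the whole page


def extract_jsonld(html):
    """Return a list of JSON-LD objects found in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    return parser.items


# Invented sample page (an assumption, not taken from any real site).
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "NewsArticle",
 "headline": "Example headline",
 "author": {"@type": "Person", "name": "Jane Doe"},
 "keywords": ["example", "demo"]}
</script>
</head><body><p>Article text.</p></body></html>
"""

if __name__ == "__main__":
    for item in extract_jsonld(SAMPLE_HTML):
        print(item["@type"], "|", item["headline"], "|", item["author"]["name"])
```

The point of the sketch is how little work is involved: the headline, author and keywords arrive already labeled, with no need to guess at the page layout.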