Post AbyBGNnd3Sc6vF63Bg by laprice@mastodon.social
 (DIR) Post #AbyBGNnd3Sc6vF63Bg by laprice@mastodon.social
       2023-11-19T14:00:03Z
       
       0 likes, 0 repeats
       
       @simon what would it take to ship an AI code assistant with the standard Python distribution that was trained on the Python standard library as its base corpus?
       
 (DIR) Post #AbyBGOd1yLjLUf96J6 by simon@fedi.simonwillison.net
       2023-11-19T16:17:25Z
       
       0 likes, 0 repeats
       
       @laprice Not entirely sure what you mean by trained on the Python standard library - do you mean the code itself, or the documentation, or both?
       
       Many models out there today are trained on a TON of Python code that includes the standard library already, because it turns out training on code makes them more effective at non-code tasks as well.
       
 (DIR) Post #AbyBRIIVqoz92mkGB6 by simon@fedi.simonwillison.net
       2023-11-19T16:18:33Z
       
       0 likes, 0 repeats
       
       @laprice An AI model trained exclusively on the Python standard library wouldn't be any good - you need trillions of tokens before models start being able to do useful things with the statistical patterns they've learned.
       
       You could fine-tune an existing model like Llama 2 on the Python standard library, I don't know what effect that would have though.
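       The scale gap here can be made concrete with a quick measurement. A minimal sketch (not from the thread; the ~4-characters-per-token ratio is a rough rule-of-thumb assumption) that counts how much text the installed CPython standard library actually contains:

```python
# Rough estimate of the Python standard library's size as a training corpus.
# Assumption: ~4 characters per token, a common rough heuristic for code/text.
import sysconfig
from pathlib import Path

# Locate the standard library directory for the running interpreter
stdlib_dir = Path(sysconfig.get_paths()["stdlib"])

total_chars = 0
n_files = 0
for py_file in stdlib_dir.rglob("*.py"):
    try:
        total_chars += len(py_file.read_text(encoding="utf-8"))
        n_files += 1
    except (UnicodeDecodeError, OSError):
        continue  # skip unreadable or non-UTF-8 files

approx_tokens = total_chars // 4  # crude chars-per-token heuristic
print(f"{n_files} files, ~{approx_tokens:,} tokens")
```

       The result lands in the millions of tokens - several orders of magnitude short of the trillions that capable base models are trained on, which is why fine-tuning an existing model is the more plausible route.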
       
 (DIR) Post #AbyKaYDwuCjWdH6Dsu by laprice@mastodon.social
       2023-11-19T18:02:07Z
       
       0 likes, 0 repeats
       
       @simon My thinking was to have a module that comes with the Python distribution where someone has already built the model specific to that version of Python as part of the release process, so that people can go straight to building tools with it.
       
       And maybe the answer is to use Llama 2 as a base model. But I wonder if more training data is always better. And the question of what is the smallest dataset that an effective model can be trained on is an interesting one.