Post AUeTTDbitMeGbx73Cq by resing@social.coop
 (DIR) Post #AUa61wEmonLCQknjua by simon@fedi.simonwillison.net
       2023-04-12T16:21:09Z
       
       0 likes, 2 repeats
       
       Dolly 2.0 is a really big deal: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm

       "The first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use"

       My notes so far on trying to run it: https://til.simonwillison.net/llms/dolly-2
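
       A minimal sketch of what running it looks like, assuming the weights are published on Hugging Face as databricks/dolly-v2-12b and that the model repo ships its own pipeline code (the exact invocation in the linked notes may differ):

           import torch
           from transformers import pipeline

           # Load Dolly 2.0 via the Hugging Face pipeline API; trust_remote_code
           # pulls in the instruction-following pipeline shipped in the model repo
           generate_text = pipeline(
               model="databricks/dolly-v2-12b",  # assumed model id on Hugging Face
               torch_dtype=torch.bfloat16,
               trust_remote_code=True,
               device_map="auto",
           )

           res = generate_text("Who was the first man on the moon?")
           print(res[0]["generated_text"])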
       
 (DIR) Post #AUa6C2BZYejDm470oS by simon@fedi.simonwillison.net
       2023-04-12T16:21:25Z
       
       0 likes, 0 repeats
       
       One of the most exciting things about Dolly 2.0 is the fine-tuning instruction set, which was hand-built by 5,000 Databricks employees and released under a CC license

       Here's that training set in Datasette Lite: https://lite.datasette.io/?json=https://github.com/databrickslabs/dolly/blob/master/data/databricks-dolly-15k.jsonl#/data/databricks-dolly-15k?_facet=category
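
       A quick sketch of inspecting that same JSONL file locally, assuming the raw file can be fetched from the GitHub repo linked above and that each record carries instruction, context, response, and category fields:

           import json
           from collections import Counter
           from urllib.request import urlopen

           # Raw-file URL derived from the GitHub repo link above (assumed layout)
           url = ("https://raw.githubusercontent.com/databrickslabs/dolly/"
                  "master/data/databricks-dolly-15k.jsonl")

           with urlopen(url) as f:
               records = [json.loads(line) for line in f]

           # Facet by category, mirroring the ?_facet=category view in Datasette Lite
           print(len(records))
           print(Counter(r["category"] for r in records).most_common())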
       
 (DIR) Post #AUaJj8dhsmz2kbO4PI by ironicsans@mastodon.social
       2023-04-12T18:54:21Z
       
       0 likes, 0 repeats
       
       @simon @film_girl Typo in the training set. Is that column the model's response? I wonder where it pulled the misspelled name from.
       
 (DIR) Post #AUaVOfbzIR2cNgacDY by Defiance@sfba.social
       2023-04-12T21:05:12Z
       
       0 likes, 0 repeats
       
       @simon This is a dataset for use by other product builders. Dolly does not have a user interface like Bing or ChatGPT. Is that right? I love that it's open source! Way more trustworthy output this way.
       
 (DIR) Post #AUaXjEmWRe7r9tS5Vw by simon@fedi.simonwillison.net
       2023-04-12T21:31:41Z
       
       0 likes, 0 repeats
       
       @ironicsans @film_girl no, none of the data in there is generated by a model - it was all manually entered by Databricks staff members
       
 (DIR) Post #AUaXwyj6vxdlQ2POU4 by simon@fedi.simonwillison.net
       2023-04-12T21:34:02Z
       
       0 likes, 0 repeats
       
       @Defiance no UI yet but you can call a Python function with a prompt and get back a response, so plugging it into a UI should be pretty easy
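
       One possible sketch of that (hypothetical, not Simon's code), using Gradio to put a text box in front of the pipeline:

           import torch
           import gradio as gr
           from transformers import pipeline

           # Assumed checkpoint; any Dolly v2 size should work the same way here
           generate_text = pipeline(
               model="databricks/dolly-v2-12b",
               torch_dtype=torch.bfloat16,
               trust_remote_code=True,
               device_map="auto",
           )

           def answer(prompt: str) -> str:
               # Call the model with a prompt and hand back its text response
               return generate_text(prompt)[0]["generated_text"]

           # Minimal web UI: one text input box, one text output box
           gr.Interface(fn=answer, inputs="text", outputs="text").launch()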
       
 (DIR) Post #AUaY8Yf1fizo3I1Gng by Defiance@sfba.social
       2023-04-12T21:34:19Z
       
       0 likes, 0 repeats
       
       @simon Thanks!
       
 (DIR) Post #AUasgTPhgY3WhvIn0S by jadp@mastodon.social
       2023-04-13T01:26:05Z
       
       0 likes, 0 repeats
       
       @simon @donmelton the fine-tuning data set is open source but I can’t find any mention of the original training set. Do you know anything about that?
       
 (DIR) Post #AUatQWLV7E5L7eIFKS by simon@fedi.simonwillison.net
       2023-04-13T01:34:29Z
       
       0 likes, 0 repeats
       
       @jadp @donmelton I believe the training set for Pythia is "The Pile" - some details in the Pythia paper https://arxiv.org/pdf/2304.01373.pdf and on https://pile.eleuther.ai/ - it's 825GB of data from a bunch of sources, most fully described in https://arxiv.org/pdf/2101.00027.pdf
       
 (DIR) Post #AUauOgoBLRKOwkOot6 by jadp@mastodon.social
       2023-04-13T01:45:33Z
       
       0 likes, 0 repeats
       
       @simon @donmelton thank you very much for replying
       
 (DIR) Post #AUayn9E7jl0fqqS3WK by bigbee@techhub.social
       2023-04-13T02:34:51Z
       
       0 likes, 0 repeats
       
       @simon Databricks was started by the team that developed Spark, and they have done a lot of work on GPU optimization. You may have better luck firing up a Databricks instance on AWS or Azure.
       
 (DIR) Post #AUbFVaoS8IdvPqev20 by heathr@octodon.social
       2023-04-13T05:41:58Z
       
       0 likes, 0 repeats
       
       @simon @pomeranian99 surprising no one, they gave it a female name
       
 (DIR) Post #AUbIi6EtPpOeDkgJNo by callionica@mastodon.social
       2023-04-13T06:17:50Z
       
       0 likes, 0 repeats
       
       @simon It’s really interesting to see your research, but my first response was that this is a really slow way to search Wikipedia!
       
 (DIR) Post #AUbx3HhlRwBEwNbn96 by simon@fedi.simonwillison.net
       2023-04-13T13:49:54Z
       
       0 likes, 0 repeats
       
       @callionica wow, that really is a way closer match than I would have expected! I know Wikipedia is in the "Pile" training set they used, but I would expect it to be much more of a rephrasing than that, due to this effect: https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web
       
 (DIR) Post #AUbxEEdJD94AbBkJyC by simon@fedi.simonwillison.net
       2023-04-13T13:51:16Z
       
       0 likes, 0 repeats
       
       @callionica the Pile also has a Common Crawl scrape of the web. I wonder if this happened because a bunch of other sites republish content from Wikipedia, so it ended up in there multiple times, increasing the statistical likelihood that it would be combined like that
       
 (DIR) Post #AUd4DpPRswiNCryaNk by resing@social.coop
       2023-04-14T02:44:58Z
       
       0 likes, 0 repeats
       
       @simon very cool. I’ve tried a couple of ways of accessing it in Azure so far with no success, but I’ve never tried Hugging Face before this
       
 (DIR) Post #AUdUnuSJCOAG5eBBcO by fell@ma.fellr.net
       2023-04-14T07:42:41Z
       
       0 likes, 0 repeats
       
       @simon It surprises me that, for something as performance-critical as LLMs, people use an inefficient language like #Python, where everything, especially GPU access, goes through multiple abstraction layers.
       
 (DIR) Post #AUeTTDbitMeGbx73Cq by resing@social.coop
       2023-04-14T19:02:37Z
       
       0 likes, 0 repeats
       
       @simon I found this Google Colab setup that answers the "first man on the moon" prompt in a second, but only with the short answer. Using the 8GB version. No setup required! https://colab.research.google.com/drive/1A8Prplbjr16hy9eGfWd3-r34FOuccB2c?usp=sharing#scrollTo=qQIBoZHdGF4I
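
       For reference, a hedged sketch of what a setup like that might look like, assuming the "8GB version" refers to one of the smaller Dolly v2 checkpoints such as databricks/dolly-v2-3b:

           import time
           import torch
           from transformers import pipeline

           # Smaller checkpoint (an assumption about what the "8GB version" means)
           # that is more likely to fit in a free Colab GPU's memory
           generate_text = pipeline(
               model="databricks/dolly-v2-3b",
               torch_dtype=torch.float16,
               trust_remote_code=True,
               device_map="auto",
           )

           # Time a single prompt, since the post above mentions response speed
           start = time.time()
           res = generate_text("Who was the first man on the moon?")
           print(f'{res[0]["generated_text"]} ({time.time() - start:.1f}s)')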
       
 (DIR) Post #AUh3EWUUO4ZDUzwOuG by mori@mastodon.au
       2023-04-16T00:52:29Z
       
       0 likes, 0 repeats
       
       @simon This toot got quoted in Ars Technica. Nice work ☺️

       8th paragraph: https://arstechnica.com/information-technology/2023/04/a-really-big-deal-dolly-is-a-free-open-source-chatgpt-style-ai-model/