Post AUeTTDbitMeGbx73Cq by resing@social.coop
(DIR) Post #AUa61wEmonLCQknjua by simon@fedi.simonwillison.net
2023-04-12T16:21:09Z
0 likes, 2 repeats
Dolly 2.0 is a really big deal: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm

"The first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use"

My notes so far on trying to run it: https://til.simonwillison.net/llms/dolly-2
(DIR) Post #AUa6C2BZYejDm470oS by simon@fedi.simonwillison.net
2023-04-12T16:21:25Z
0 likes, 0 repeats
One of the most exciting things about Dolly 2.0 is the fine-tuning instruction set, which was hand-built by 5,000 Databricks employees and released under a CC license.

Here's that training set in Datasette Lite: https://lite.datasette.io/?json=https://github.com/databrickslabs/dolly/blob/master/data/databricks-dolly-15k.jsonl#/data/databricks-dolly-15k?_facet=category
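[Editor's note] The linked databricks-dolly-15k.jsonl file is newline-delimited JSON with `instruction`, `context`, `response`, and `category` fields. A minimal sketch of reproducing the `_facet=category` view locally, using a few inlined records in the same shape so it runs without downloading the real file (the sample records here are made up for illustration):

```python
import json
from collections import Counter

# A few records in the same shape as databricks-dolly-15k.jsonl
# (fields: instruction, context, response, category) -- inlined so the
# sketch is self-contained; the real file has ~15,000 lines.
sample_jsonl = """\
{"instruction": "Who was the first man on the moon?", "context": "", "response": "Neil Armstrong", "category": "open_qa"}
{"instruction": "Summarize this paragraph.", "context": "Some text...", "response": "A summary.", "category": "summarization"}
{"instruction": "Name three citrus fruits.", "context": "", "response": "Lemon, lime, orange.", "category": "brainstorming"}
"""

def facet_by_category(jsonl_text):
    """Count records per category, like Datasette's _facet=category view."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return Counter(r["category"] for r in records)

print(facet_by_category(sample_jsonl))
```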
(DIR) Post #AUaJj8dhsmz2kbO4PI by ironicsans@mastodon.social
2023-04-12T18:54:21Z
0 likes, 0 repeats
@simon @film_girl Typo in the training set. Is that column the model's response? I wonder where it pulled the misspelled name from.
(DIR) Post #AUaVOfbzIR2cNgacDY by Defiance@sfba.social
2023-04-12T21:05:12Z
0 likes, 0 repeats
@simon This is a dataset for use by other product builders. Dolly does not have a user interface like Bing or ChatGPT. Is that right? I love that it's open source! Way more trustworthy output this way.
(DIR) Post #AUaXjEmWRe7r9tS5Vw by simon@fedi.simonwillison.net
2023-04-12T21:31:41Z
0 likes, 0 repeats
@ironicsans @film_girl no, none of the data in there is generated by a model - it was all manually entered by Databricks staff members
(DIR) Post #AUaXwyj6vxdlQ2POU4 by simon@fedi.simonwillison.net
2023-04-12T21:34:02Z
0 likes, 0 repeats
@Defiance no UI yet but you can call a Python function with a prompt and get back a response, so plugging it into a UI should be pretty easy
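[Editor's note] A sketch of what "call a Python function with a prompt" looks like, based on the usage described on the databricks/dolly-v2 Hugging Face model card; the model name, dtype, and hardware assumptions are illustrative, and the 12B weights need a large GPU (the smaller dolly-v2-3b checkpoint is an easier first try):

```python
# Sketch: calling Dolly 2.0 as a plain Python function via Hugging Face
# transformers. The `ask` helper only depends on the standard pipeline
# output shape (a list of dicts with a "generated_text" key).

def ask(generate, prompt):
    """Send a prompt to a text-generation pipeline callable and return
    the generated text string."""
    return generate(prompt)[0]["generated_text"]

if __name__ == "__main__":
    import torch
    from transformers import pipeline

    generate_text = pipeline(
        model="databricks/dolly-v2-12b",  # or dolly-v2-3b for less VRAM
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,           # loads Databricks' instruct pipeline
        device_map="auto",
    )
    print(ask(generate_text, "Who was the first person to walk on the moon?"))
```

Plugging this into any web UI is then just a matter of wiring `ask` up to a request handler.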
(DIR) Post #AUaY8Yf1fizo3I1Gng by Defiance@sfba.social
2023-04-12T21:34:19Z
0 likes, 0 repeats
@simon Thanks!
(DIR) Post #AUasgTPhgY3WhvIn0S by jadp@mastodon.social
2023-04-13T01:26:05Z
0 likes, 0 repeats
@simon @donmelton the fine-tuning data set is open source but I can’t find any mention of the original training set. Do you know anything about that?
(DIR) Post #AUatQWLV7E5L7eIFKS by simon@fedi.simonwillison.net
2023-04-13T01:34:29Z
0 likes, 0 repeats
@jadp @donmelton I believe the training set for Pythia is "The Pile" - some details in the Pythia paper https://arxiv.org/pdf/2304.01373.pdf and on https://pile.eleuther.ai/ - it's 825GB of data from a bunch of sources, most fully described in https://arxiv.org/pdf/2101.00027.pdf
(DIR) Post #AUauOgoBLRKOwkOot6 by jadp@mastodon.social
2023-04-13T01:45:33Z
0 likes, 0 repeats
@simon @donmelton thank you very much for replying
(DIR) Post #AUayn9E7jl0fqqS3WK by bigbee@techhub.social
2023-04-13T02:34:51Z
0 likes, 0 repeats
@simon Databricks was started by the team that developed Spark, and they have done a lot of work on GPU optimization. You may have better luck firing up a Databricks instance on AWS or Azure.
(DIR) Post #AUbFVaoS8IdvPqev20 by heathr@octodon.social
2023-04-13T05:41:58Z
0 likes, 0 repeats
@simon @pomeranian99 surprising no one, they gave it a female name
(DIR) Post #AUbIi6EtPpOeDkgJNo by callionica@mastodon.social
2023-04-13T06:17:50Z
0 likes, 0 repeats
@simon It’s really interesting to see your research, but my first response was that it's a really slow way to search Wikipedia!
(DIR) Post #AUbx3HhlRwBEwNbn96 by simon@fedi.simonwillison.net
2023-04-13T13:49:54Z
0 likes, 0 repeats
@callionica wow, that really is a way closer match than I would have expected! I know Wikipedia is in the "Pile" training set they used, but I would expect it to be much more of a rephrasing than that, due to this effect: https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web
(DIR) Post #AUbxEEdJD94AbBkJyC by simon@fedi.simonwillison.net
2023-04-13T13:51:16Z
0 likes, 0 repeats
@callionica the Pile also has a Common Crawl scrape of the web. I wonder if this happened because a bunch of other sites republish content from Wikipedia, so it ended up in there multiple times, increasing the statistical likelihood it would be reproduced like that.
(DIR) Post #AUd4DpPRswiNCryaNk by resing@social.coop
2023-04-14T02:44:58Z
0 likes, 0 repeats
@simon very cool. I’ve tried a couple of ways of accessing it in Azure so far with no success, but I’ve never tried Hugging Face before this
(DIR) Post #AUdUnuSJCOAG5eBBcO by fell@ma.fellr.net
2023-04-14T07:42:41Z
0 likes, 0 repeats
@simon It surprises me that for something as performance-critical as LLMs, people use an inefficient language like #Python, where everything, especially GPU access, goes through multiple abstraction layers.
(DIR) Post #AUeTTDbitMeGbx73Cq by resing@social.coop
2023-04-14T19:02:37Z
0 likes, 0 repeats
@simon I found this Google Colab setup that returns a response to the first-man-on-the-moon prompt in a second, but only with the short answer. Using the 8GB version. No setup required! https://colab.research.google.com/drive/1A8Prplbjr16hy9eGfWd3-r34FOuccB2c?usp=sharing#scrollTo=qQIBoZHdGF4I
(DIR) Post #AUh3EWUUO4ZDUzwOuG by mori@mastodon.au
2023-04-16T00:52:29Z
0 likes, 0 repeats
@simon This toot got quoted in Ars Technica. Nice work ☺️ 8th paragraph: https://arstechnica.com/information-technology/2023/04/a-really-big-deal-dolly-is-a-free-open-source-chatgpt-style-ai-model/