LLM Python library now provides tools for working with embeddings
Author : simonw
Score : 18 points
Date : 2023-09-04 20:37 UTC (2 hours ago)
(HTM) web link (simonwillison.net)
(TXT) w3m dump (simonwillison.net)
| haxton wrote:
| Curious to know what value you've seen out of these clusters. In
| my experience k-means clustering was very lackluster. Having to
| define the number of clusters was a big pain point too.
|
| You almost certainly want a graph-like structure (overlapping
| communities rather than clusters).
|
| But unsupervised clustering was almost entirely ineffective for
| every use case I had :/
| simonw wrote:
| I only got the clustering working this morning, so aside from
| playing around with it a bit I've not had any results that have
| convinced me it's a tool I should throw at lots of different
| problems.
|
| I mainly like it as another example of the kind of things you
| can use embeddings for.
|
| My implementation is very naive - it's just this:
|
|     sklearn.cluster.MiniBatchKMeans(n_clusters=n, n_init="auto")
|
| I imagine there are all kinds of improvements that could be
| made to this kind of thing.
|
| I'd love to understand if there's a good way to automatically
| pick an interesting number of clusters, as opposed to picking a
| number at the start.
|
| https://github.com/simonw/llm-cluster/blob/main/llm_cluster....
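|
| A minimal sketch of how that one-liner might be used to group
| embedded items by cluster, assuming scikit-learn 1.2 or later (for
| n_init="auto"). The helper names group_by_label and cluster_ids are
| hypothetical and are not taken from llm-cluster:

```python
from collections import defaultdict


def group_by_label(ids, labels):
    """Map each cluster label to the list of item IDs assigned to it."""
    groups = defaultdict(list)
    for item_id, label in zip(ids, labels):
        groups[int(label)].append(item_id)
    return dict(groups)


def cluster_ids(ids, vectors, n):
    """Fit MiniBatchKMeans on the vectors and group the IDs by cluster.

    Assumption: `vectors` is a list of equal-length lists of floats,
    one per ID, in the same order as `ids`.
    """
    # Imported lazily so the grouping helper works without scikit-learn.
    from sklearn.cluster import MiniBatchKMeans

    model = MiniBatchKMeans(n_clusters=n, n_init="auto")
    model.fit(vectors)
    return group_by_label(ids, model.labels_)
```

| The number of clusters n still has to be chosen up front here,
| which is the pain point raised above.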
| haxton wrote:
| The elbow method is a good place to start for finding the number
| of clusters.
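|
| A rough sketch of the elbow method, assuming scikit-learn 1.2 or
| later for the fitting step. It fits one model per candidate k,
| records the inertia (within-cluster sum of squares), and picks the
| k where the curve bends most sharply. The function names and the
| "farthest from the chord" heuristic are illustrative assumptions,
| one of several ways to locate the elbow:

```python
def inertias_for(vectors, ks):
    """Fit one MiniBatchKMeans per candidate k and return the inertias."""
    # Imported lazily so pick_elbow can be used without scikit-learn.
    from sklearn.cluster import MiniBatchKMeans

    return [
        MiniBatchKMeans(n_clusters=k, n_init="auto", random_state=0)
        .fit(vectors)
        .inertia_
        for k in ks
    ]


def pick_elbow(ks, inertias):
    """Return the k whose (k, inertia) point lies farthest below the
    straight line joining the first and last points, after scaling
    both axes to the range [0, 1]."""
    if len(ks) != len(inertias) or len(ks) < 3:
        raise ValueError("need at least three (k, inertia) pairs")
    k_span = ks[-1] - ks[0]
    i_span = inertias[0] - inertias[-1]
    if k_span == 0 or i_span == 0:
        return ks[0]  # flat curve: no elbow to find
    best_k, best_gap = ks[0], -1.0
    for k, inertia in zip(ks, inertias):
        x = (k - ks[0]) / k_span               # 0 at first k, 1 at last k
        y = (inertia - inertias[-1]) / i_span  # 1 at first k, 0 at last k
        gap = (1.0 - x) - y                    # distance below the chord y = 1 - x
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k
```

| Usage would look like pick_elbow(ks, inertias_for(vectors, ks)) with
| ks = list(range(2, 21)). The result is only a starting point: on
| real embeddings the curve is often smooth, with no sharp bend.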
| simonw wrote:
| There's a lot of stuff in this release.
|
| Don't miss the new llm-cluster plugin, which can both calculate
| clusters from embeddings and use another LLM call to generate a
| name for each cluster: https://github.com/simonw/llm-cluster
|
| Example usage:
|
| Fetch all issues, embed them and store the embeddings and content
| in SQLite:
|
|     paginate-json 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
|       | jq '[.[] | {id: .id, title: .title}]' \
|       | llm embed-multi llm-issues - \
|         --database issues.db \
|         --model sentence-transformers/all-MiniLM-L6-v2 \
|         --store
|
| Group those in 10 clusters and generate a summary for each one
| using a call to GPT-4:
|
|     llm cluster llm-issues --database issues.db 10 --summary --model gpt-4
| quickthrower2 wrote:
| I would change the title to: Python Library
| "llm" now provides tools for working with embeddings
|
| I initially was trying to parse that, thinking "is this an open
| AI thing?". Of course the answer is just a click away, but people
| might miss this if they are interested in Python coding and AI.
| dang wrote:
| OK, we've put Python library up there.
| simonw wrote:
| Looks like you missed my reply by seconds pointing out that
| it's not just a Python library, it's also a CLI tool:
| https://news.ycombinator.com/item?id=37385788
| quickthrower2 wrote:
| Aah! Sorry about that, both of you. I didn't think dang
| would see this and simon would update the title and sanity
| check it.
| simonw wrote:
| It's not just a Python library though: it's also a CLI tool.
|
| I put a bunch of work into getting it into Homebrew so that
| people who aren't Python developers can "brew install llm" and
| start using it.
|
| Details on the CLI here:
| https://llm.datasette.io/en/stable/usage.html and
| https://llm.datasette.io/en/stable/embeddings/cli.html
| thatcherthorn wrote:
| This is a fantastic library. I plan to use some of the search
| functionality with a system that tries to figure out how to
| manipulate/work with/add features to existing code.
___________________________________________________________________
(page generated 2023-09-04 23:00 UTC)