[HN Gopher] Show HN: Pykoi - a Python library for LLM data colle...
___________________________________________________________________
Show HN: Pykoi - a Python library for LLM data collection and fine
tuning
Hi HN, pykoi is an open-source Python library for ML scientists.
pykoi makes it easier to collect data for LLMs, to use that data
for finetuning, and to compare models to each other (e.g. your
model pre- and post-finetuning, or your model vs OpenAI vs
Claude).

The library comes from pain points we experienced in LLM
development:

1. Collecting feedback data from users isn't as easy as it could
be. (The current process usually involves sharing Excel files of
annotated responses back and forth, offering no insight into how
users actually engage with your models.)

2. RLHF remains complicated to carry out. By _complicated_, we mean
it requires a lot of steps, hundreds of configs, lengthy setups,
etc.

3. Comparing models to each other _as they're used_ (that is,
independent from academic metrics) is full of friction. The
current approach: spin up a model, ask questions, write them down.
Repeat for other models, then compare.

At a high level, we think that the active learning process should
be closed-loop: data collection, fine tuning, and inference all
feed from the same system. This library is our first step in that
direction. The project is still very early but we hope that some
of it is useful. Note, we're fully open-source, and actively adding
features!

Website: https://www.cambioml.com/pykoi
GitHub: https://github.com/CambioML/pykoi

We would love your feedback!
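
To make the closed-loop idea concrete, here is a rough sketch of what "data collection, fine tuning, and inference all feed from the same system" could look like using only Python's standard library. This is not pykoi's API: the table layout and the helper names (open_store, record, approval_by_model, finetuning_examples) are hypothetical, chosen only for illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a single feedback store shared by all models.

    Hypothetical schema for illustration only; not pykoi's actual storage.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feedback ("
        "id INTEGER PRIMARY KEY, "
        "model TEXT NOT NULL, "
        "question TEXT NOT NULL, "
        "answer TEXT NOT NULL, "
        "vote INTEGER NOT NULL CHECK (vote IN (-1, 1)))"
    )
    return conn


def record(conn, model, question, answer, vote):
    """Store one user judgment (vote is +1 or -1) on a model's answer."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO feedback (model, question, answer, vote) "
            "VALUES (?, ?, ?, ?)",
            (model, question, answer, vote),
        )


def approval_by_model(conn):
    """Compare models by the fraction of their answers that were upvoted."""
    rows = conn.execute(
        "SELECT model, AVG(vote = 1) FROM feedback "
        "GROUP BY model ORDER BY model"
    ).fetchall()
    return dict(rows)


def finetuning_examples(conn):
    """Export upvoted question/answer pairs as prompt/completion records."""
    rows = conn.execute(
        "SELECT question, answer FROM feedback WHERE vote = 1 ORDER BY id"
    )
    return [{"prompt": q, "completion": a} for q, a in rows]
```

The same table serves both purposes: approval_by_model compares models as they are used, and finetuning_examples turns the upvoted rows into training pairs. A real setup would also need the answers to come from live inference, which this sketch leaves out.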
Author : jaredwilber
Score : 49 points
Date : 2023-08-11 17:12 UTC (5 hours ago)
(HTM) web link (www.cambioml.com)
(TXT) w3m dump (www.cambioml.com)
| lmeyerov wrote:
| i was curious b/c we're building a lot of this inhouse for
| louie.ai just out of need
|
| using the current version seems unclear for us:
|
| * we need to own the data & database, and align with our
| regular+vector infra -- where do they live here?
|
| * we spend a lot of time on security annotations as the data
| isn't just for training but feeding back live in RAG, and in both
| cases, need rich expressivity for partitioning for sharing&tuning
| between different users/teams.. this seems to assume one big
| pile?
___________________________________________________________________
(page generated 2023-08-11 23:00 UTC)