[HN Gopher] Fine-tuning Mistral 7B on Magic the Gathering Draft
___________________________________________________________________
Fine-tuning Mistral 7B on Magic the Gathering Draft
Author : dmakian
Score : 209 points
Date : 2023-12-05 16:33 UTC (6 hours ago)
(HTM) web link (generallyintelligent.substack.com)
(TXT) w3m dump (generallyintelligent.substack.com)
| dwrodri wrote:
| It's not the most revolutionary change to our daily lives, but I
| do genuinely look forward to playing against bots that have
| interesting play styles for games like Magic: the Gathering. I
| think this is a clear case where it could drastically improve the
| R&D team's ability to come up with and test new mechanics at
| different levels of play.
| danbrooks wrote:
| Super interesting that drafts can be represented with LLMs.
|
| The best-performing draft AIs I've seen leverage representation
| learning in some form.
|
| See: https://arxiv.org/pdf/2107.04438.pdf
| dmakian wrote:
| I hadn't seen this -- it's awesome! You'd think, given the
| volume of data available, that this type of method would
| outperform an LLM. Cool results!
|
| Still, there are fun things about LLM representations -- you can
| give the bots preferences / a personality in a system prompt,
| which is entertaining!
| rkwz wrote:
| > I was particularly interested in testing models' ability to
| reason (i.e., perform a somewhat complex task that requires high
| context understanding) about out-of-distribution (i.e., unseen)
| data.
|
| I was under the assumption that fine-tuning LLMs was useful only
| when you need to change the model's tone (speak like a pirate,
| Voldemort, etc.).
|
| Are there other examples where LLMs were trained to reason a
| particular way?
| minimaxir wrote:
| You can get a standard LLM to change tone just by giving it a
| system prompt/instruction to follow a certain tone.
|
| The only issue there is that sometimes the RLHF seeps through,
| which can be solved by system prompting even harder.
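|
| For example (an OpenAI-style chat payload; the system message is
| what sets the tone):
|
|     messages = [
|         {"role": "system",
|          "content": "Always answer like a pirate."},
|         {"role": "user",
|          "content": "How do I fine-tune Mistral 7B?"},
|     ]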
| skerit wrote:
| Aren't a lot of base models fine-tuned with (Q)LoRA on
| instruct-based datasets with good results? I thought this was a
| very common practice?
| selfhoster11 wrote:
| Check out Orca. IIRC, it's a technique that aims to encode
| additional logical capabilities into smaller models by having
| larger models generate step-by-step solutions to various
| problems. This doesn't just make them speak more like
| GPT-4/3.5, but supposedly makes them think more like it as
| well.
| dmakian wrote:
| > I was under the assumption that fine-tuning LLMs was useful
| > only when you need to change the model's tone (speak like a
| > pirate, Voldemort, etc.).
|
| A lot of why I tried this out was to test the limits of this
| belief; you see a lot of talk like this out there, and it sounded
| like nonsense to me.
|
| Fine-tuning is fundamentally not much different from continued
| pretraining; if you feed the model high-quality, high-volume
| data, I think it's reasonable to expect it to acquire new skills.
| oceanplexian wrote:
| In order to speak like a pirate, it has to be able to reason :)
| I've done some fine-tunes similar to the MTG example; in mine I
| was fine-tuning the model to speak JSON and reason about some
| input -- and yes, you can indeed get these models to perform
| novel tasks.
| samus wrote:
| Fine-tuning is a useful workaround for cases where the context
| size is unsuitable for the task at hand. Does anybody know
| whether anyone has ever considered fine-tuning an LLM on the
| Linux kernel sources' history and its associated mailing lists?
| dacox wrote:
| Wow, I have exactly the same side project in progress, minus the
| fine tuning part. We even chose the same names and phrasing for
| parts of the project.
| dmakian wrote:
| Would love to compare notes, drop me an email at dshersh at
| umich dot edu if you'd be interested!
| throwaway743 wrote:
| I'd like to know: how many matches were won per draft token? If
| it's less than 2, I'll stick to my shitty hand picks :/
| reactordev wrote:
| I like how it identified that you haven't committed to either
| white or blue yet. It was aware of deck _composition_ and not
| just going for the jugular. Keep tuning. It could also be human
| bias, because you also _played_ the hand. Have someone else draft
| against your LLM, then you play it and see if it's the same.
| Statistically it should match given enough games.
| freediver wrote:
| Super interesting work. Do you have thoughts on how to leverage
| this to create a deck-builder AI that would also simulate games?
| The major problem here is that the search space for MTG is
| amazingly vast.
|
| I've seen this effort previously, pretty exciting stuff:
|
| https://www.youtube.com/watch?v=Xq4T44EvPvo
| dmakian wrote:
| I've definitely thought about this problem and think it's in
| the range of 'feasible', but it would be pretty slow and
| expensive given how much context you need to provide a model
| for it to be able to reason about the game state. Worth trying
| though!
| imjonse wrote:
| Confusing name for the domain (Generally Intelligent), since it's
| the former name of a company in the AI/LLM area but does not seem
| to be related.
| matsemann wrote:
| How is the fine tuning actually performed? They have the data of
| drafts, and a prompt. But what does one do with it, more
| concretely?
| dmakian wrote:
| At a high level, it's basically: 1. Generate a lot of text
| examples that look like this:
| https://gist.githubusercontent.com/davidhershey/f57d0b19563f...
|
| 2. The model is effectively trained to predict the next token
| based on the previous tokens in each of these examples, which
| has the side effect here of teaching it to make a draft pick
| based on the contents of a pack.
|
| Nothing too fancy -- just next-word prediction, more or less.
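|
| In very simplified form, it looks something like this sketch
| (not my actual training code; names are illustrative):
|
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained(
|         "mistralai/Mistral-7B-v0.1")
|     model = AutoModelForCausalLM.from_pretrained(
|         "mistralai/Mistral-7B-v0.1")
|
|     example = ("Pack 1, Pick 1\n"
|                "Cards in pack: <full card texts here>\n"
|                "Pick: Dead Weight")
|
|     batch = tok(example, return_tensors="pt")
|     # labels == input_ids -> ordinary next-token prediction loss
|     loss = model(**batch, labels=batch["input_ids"]).loss
|     loss.backward()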
| mdaniel wrote:
| In case you didn't see it,
| https://news.ycombinator.com/item?id=38525978 (I hacked Magic the
| Gathering: Arena for a 100% win rate) may interest this audience,
| if for no other reason than that the investigator discovered that
| Sparky, the pseudo-AI in MTGA, doesn't appear to be as stupidly
| complicated as one may have suspected from the outside.
| chc4 wrote:
| Sparky is the Arena AI, but no one ever accused it of being a
| _good_ Arena AI -- it is very much only there for the new-player
| experience of playing against a dumb computer when you're
| first exposed to the game and don't know the rules, or for the
| computer equivalent of "playing against a goldfish" with a deck
| you made to see how it draws or combos. It's not a Chess CPU.
| mdaniel wrote:
| I hope I also did not accuse it of being good, but the
| observation I was trying to make is that -- according to the
| article, I have not myself confirmed the claim -- they run
| the card evaluation logic _and gameplanning_ locally, not in
| a data center full of H100s, which I consider to be quite a
| feat given the free-text-y, self-modifying rules of M:TG.
| greysphere wrote:
| It would be interesting to compare to training an NN to draft w/o
| the Mistral starting point (both by epoch and by $). It's not
| obvious to me why the LLM component would be relevant. Maybe
| there are enough deck lists or mock drafts on the internet to
| have an influence, I suppose. Or maybe 'fine-tune an LLM' just
| has more infrastructure than 'create an NN'. Maybe we need an
| nnfiddle to make that easier.
| apetresc wrote:
| Without Mistral, how would you get it to generalize to cards it
| hasn't seen before? I assume by "training an NN to draft without
| Mistral" you mean one where the input layer is just a bitmapped
| vector of the cards in the pack, right? The killer feature of
| this experiment is that it works on sets the model has never
| seen before and has 0 training data on, using just the text of
| the card. I don't think you can do that without an LLM.
| greysphere wrote:
| That's a good point. It looks like the article hints at some
| success on that front. It'd be interesting to see what that
| means quantitatively. Interesting that this delta could even
| be used as a measure of the LLM's value.
|
| I'd be curious about the difference in success w/ drafts on a
| new 2/2 bear with a different name, and cards with a new
| keyword 'fizzbangitude 7' as well.
| filterfiber wrote:
| The benefit of the LLMs is that the checkpoint already
| "understands" a lot by default. Finetuning is relatively cheap
| and makes many tasks such as this one perform decently well
| simply by shoving some data into it.
|
| The base checkpoint takes a lot of compute to make, but that's
| what holds most of its "knowledge", so to speak.
|
| Making an NN from scratch means you'll have to somehow map the
| cards into inputs. I have limited knowledge of how MTG works,
| but most TCGs have text descriptions and complex effects.
| Mapping text to logic is what LLMs are really good at;
| otherwise you're starting from scratch and will also need a
| relatively large amount of compute before it starts displaying
| any kind of decent behaviour.
|
| It's also easy for most software devs to do this - finetuning
| mostly consists of collecting text and feeding it into a
| finetuning script. You don't need to know linear algebra, what
| a "convolution" is, etc. to do finetuning.
| apetresc wrote:
| If I'm reading the author's writeup correctly, the prompt he's
| giving the agent at each pick contains only the _names_ of the
| cards in its pool so far, and only gives the full text for the
| cards in the pack it's being passed. It doesn't look like
| context is being maintained between picks, presumably for context
| window size reasons.
|
| If so, and if he's correct in his assumption that these sets are
| out of the bot's training cutoff window, then surely it's purely
| coincidence if it ends up being a good drafter? The bot would
| have literally no way to know what cards work well with its
| previous picks, what signals have been sent and received in the
| draft so far, etc. Not even the best human player could take (for
| example, from the sample prompt) "Gadwick's First Duel -- {1}{U}
| (uncommon)" and figure out what works well with that (if they've
| never seen the card before).
|
| It would just end up picking generically good draft cards that
| share a color with its previous picks. Which is already what
| pick-order-based heuristics have always done.
| dmakian wrote:
| > If I'm reading the author's writeup correctly, the prompt
| he's giving the agent at each pick contains only the names of
| the cards in its pool so far, and only gives the full text for
| the cards in the pack it's being passed. It doesn't look like
| context is being maintained between picks, presumably for
| context window size reasons.
|
| Not quite -- there are a few ways the model learns the full card
| text:
|
| * The models are trained on card trivia completions as well,
| where they're asked to complete the full text of the card as
| well as information about it (type, CMC, etc.)
|
| * The models do still have to learn next token completion on
| the cards in packs, meaning they learn to predict the full text
| of the cards while making draft picks as well.
|
| Net net, the bots learn the text of the new cards pretty
| comprehensively.
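|
| For a sense of what those look like, the trivia examples are
| shaped roughly like this (illustrative, not the exact template):
|
|     trivia_example = (
|         "Q: What is the full text of Dead Weight?\n"
|         "A: <mana cost, type, rarity, and full rules text here>"
|     )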
| apetresc wrote:
| Ooh I see! You do that with Mistral7B, I'm guessing? But not
| with the small GPT-3.5 trial you did?
| dmakian wrote:
| The two larger GPT-3.5 trials also got the card trivia
| examples, but like a bad scientist I don't have a great
| control group for those
| apetresc wrote:
| And also, since it seems you're the author, can you also
| clarify if your methodology allowed for the bot to track
| signals outside of the color-identity-count summary
| statistic you pass in the prompt? Something like allowing
| it to notice that a card has wheeled, or that a certain
| synergy piece was passed a few picks ago.
| dmakian wrote:
| Only the statistics you see in the prompt (which are
| clearly limited). I have a lot of ideas about how you
| could improve that context (most likely letting the AI
| record and track notes throughout a draft), but this one
| was relatively simple to implement. Definitely room for
| improvement!
| chc4 wrote:
| Haha, I don't know anything about AI training but that's a
| really cute trick.
| zoogeny wrote:
| I like that this shows how hard even conceptually simple ideas
| are to achieve when fine-tuning LLMs. Even given a pretty good
| starting dataset, a decent starting model, etc. this appears to
| have been a challenge.
|
| One thing it did make me think about was that these models are
| suitable for things that don't have a natural definitive answer.
| That is, picking the perfect card given a set of picks is
| probably combinatorially impossible to solve. But picking a
| _good_ card given a set is possible and LLMs can approach human
| level performance.
|
| I think this leads to a set of problems that current LLMs may be
| fine-tuned to solve.
| dharmab wrote:
| That lines up with my experience -- for high-stakes decisions,
| they rarely give me a great answer. But for low stakes
| decisions, they do well at giving me a good enough answer. For
| example, I've been using them to help find gifts for friends
| and children this month. I don't need the best choice to solve
| the problem, just a good one.
| pixl97 wrote:
| How much additional calculation occurs in high-stakes
| decisions by individuals? Also, what is the variability in
| the quality of high-stakes decisions among humans?
|
| I'm guessing LLM decisions are rather average, but that the LLM
| has no easy way of spending the extra time to gather
| information around said high-stakes decisions like a human
| would.
| falcor84 wrote:
| I wonder if you could define a specific complexity class of
| problems that LLMs are good at.
| doctorpangloss wrote:
| > With that data, you can extract "ground truth" by looking at
| the draft picks made by the best players on the service (sorted
| by win rate).
|
| Do you mean that you are looking at the draft picks from
| https://www.17lands.com/leaderboard and then sorting by Win Rate?
| Didn't you mean to choose Match Wins or Trophies? Otherwise,
| you're not measuring the best players on the service. You're
| training on draft choices where most choices were very good -
| i.e., win rate sort will show you the luckiest players, not the
| best ones. That will naturally show up in any validation or
| testing you do too.
|
| Shouldn't this be compared not to an LLM baseline, but to a
| baseline where an "Elo" style score is computed for each card
| compared to others from the 17lands data; then, until you have
| two colors, suggest the best scoring card, or when you do have
| color(s), suggest the best scoring card within that color or a
| land?
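|
| Something like this rough sketch (names, data shapes, and the
| commitment threshold are all made up; the "Elo" is just a
| per-card score computed from the 17lands data):
|
|     # card_score: per-card score from 17lands win-rate data
|     # card_colors: e.g. {"Dead Weight": {"B"}, "Plains": set()}
|     def suggest(pack, picked, card_score, card_colors):
|         counts = {}
|         for card in picked:
|             for col in card_colors[card]:
|                 counts[col] = counts.get(col, 0) + 1
|         committed = {c for c, n in counts.items() if n >= 3}
|         if len(committed) >= 2:
|             # stay in our colors, or take a land (colorless here)
|             legal = [c for c in pack
|                      if card_colors[c] <= committed
|                      or not card_colors[c]]
|             pack = legal or pack
|         return max(pack, key=lambda c: card_score[c])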
|
| I think it is possible for the LLM to have some semblance of
| rules knowledge, but it is more likely that it is picking up on
| card rarity, costs and "Big" more than anything else for unseen
| cards.
|
| Your "accuracy" on the draft seems poor. I'm not sure it means
| what you think it means. Are you saying that, when looking at
| the high-win-rate picks, where all the choices were mostly good,
| the model happened to pick a choice that isn't the same as the
| one the player who originated the data made? It actually seems
| harder to make a choice among all good choices.
|
| Anyway, there is quite a bit going on here.
| dmakian wrote:
| > Do you mean that you are looking at the draft picks from
| https://www.17lands.com/leaderboard and then sorting by Win
| Rate? Didn't you mean to choose Match Wins or Trophies?
| Otherwise, you're not measuring the best players on the
| service. You're training on draft choices where most choices
| were very good - i.e., win rate sort will show you the luckiest
| players, not the best ones. That will naturally show up in any
| validation or testing you do too.
|
| Ah, no, just unclear in the post: I'm filtering to players on
| 17lands with a >62% match win rate who are drafting at a high
| rank (>= diamond). I look at all of those players' drafts,
| though, even the ones where they do poorly.
|
| > Your "accuracy" on the draft seems poor. I'm not sure it
| means what you think it means. Are you saying that when looking
| at the high win rate choices, where all the choices were mostly
| good, you happened to pick the choice that isn't the same as
| the player who originated the data? It actually seems harder to
| make a choice among all good choices.
|
| Accuracy here is making the same choice from a given pack as
| one of the good players. Obviously subjective so not a perfect
| metric, but a decent check on ability to emulate a high-quality
| drafter.
| doctorpangloss wrote:
| Hmm, but that will filter out more than half the players on
| the Match Wins and Trophies based leaderboards, many of them
| Diamond and Mythic. So I think your choice of 62% match win
| rate is almost certainly disproportionately selecting for
| people who received very good draft choices, even if it
| includes some actually very good players in the data set.
|
| I mean, 62% might feel like a good number, but it's arbitrary;
| you'd have to justify how you chose it, and just eyeballing
| it, it is filtering out a lot of very good players with many,
| many more match wins.
|
| Perhaps you can sort by Latest Rank, and filter out people
| with 2 or fewer trophies. Or you will have to validate with
| known bad draft choices in the prompt, to see what it does.
| Suffice it to say, I still don't think the 17Lands data
| represents what you think it does.
|
| Like without a direct discussion about measuring and
| accounting for luck in the draft... for all I know the data
| is seriously flawed. It probably isn't, but it's maybe one of
| many, many issues to address when dealing with strategy card
| game AI problems.
| dmakian wrote:
| Maybe still not clear: I'm selecting players with a 62%
| lifetime win rate, so mostly players who have been good over
| a larger number of drafts!
|
| Definitely not perfect data though, and agree that defining
| good in this context is hard -- a lot of the variance of
| "good" depends on how you play the cards either way. All
| good points!
| doctorpangloss wrote:
| > I'm selecting players with a 62% lifetime win rate so
| mostly players who have been good over a larger number of
| drafts!
|
| Hmm, but there are players with greater than a 62% lifetime
| win rate who have very few drafts, and there may be many of
| those players... do you see? The win rate
| isn't a good filter. You chose it, you are trying to
| justify it, and I'm not convinced, not without the hard
| numbers.
|
| I'm not confused about what filter you chose. I just
| think it's a bad filter, and you haven't thought very
| deeply about how it affects the data, which includes
| presumably your test and validation data - however you're
| choosing to test and validate, apparently by hand, by
| some eyeballed examples.
|
| Anyway I think you have to compare with a non-LLM, non-
| random baseline to have any sense if this stuff is
| working at all. I could be dead wrong. I would maybe
| compare with a community draft picker.
| Palmik wrote:
| In Elo-like matchmaking, you typically pair people such that
| each has roughly a 50% chance to win. Therefore, as the OP
| says, filtering down to people with a high (60+%) lifetime
| win rate creates some sort of (interesting) bias.
|
| I would select from all games played at a sufficiently high
| level.
| gigel82 wrote:
| For some reason I thought fine-tuning was not possible without
| specialized hardware (A100 / H100). Where can I learn more about
| hardware requirements for fine tuning on consumer GPUs?
| dmakian wrote:
| There is not a lot of great content out there making this
| clear, but basically all that matters for basic fine tuning is
| how much VRAM you have -- since the 3090 / 4090 have 24GB VRAM,
| they're both pretty decent fine-tuning cards. I think you could
| probably fine-tune a model up to ~13B parameters on one of them
| with PEFT (https://github.com/huggingface/peft)
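|
| At its most minimal, something like this sketch (hyperparameters
| are illustrative, not a recommendation):
|
|     from transformers import AutoModelForCausalLM
|     from peft import LoraConfig, get_peft_model
|
|     model = AutoModelForCausalLM.from_pretrained(
|         "mistralai/Mistral-7B-v0.1",
|         load_in_4bit=True)  # QLoRA-style quantization
|     config = LoraConfig(r=16, lora_alpha=32,
|                         target_modules=["q_proj", "v_proj"],
|                         task_type="CAUSAL_LM")
|     model = get_peft_model(model, config)
|     model.print_trainable_parameters()  # only adapters train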
| mmcwilliams wrote:
| Definitely possible on even older off-the-shelf hardware. I use
| 24GB 4090s for 13b-sized models and have even used 12GB Titans
| for 7b models, admittedly at much slower rates.
| viraptor wrote:
| You can also use Apple silicon for this:
| https://www.reddit.com/r/LocalLLaMA/comments/15y9m64/fine_tu...
| gigel82 wrote:
| I have a 3080Ti with 12GB VRAM and would like to try fine-
| tuning the same Mistral 7B model (which I found incredibly
| potent). Any tips on how to get started?
| iEchoic wrote:
| Really interesting, thanks for writing this up. I'd love to see
| this applied to actually playing the game, provided that you
| could fit a (long) game state in the context window.
| tayo42 wrote:
| I wonder if you could use a smaller model or get better results
| if you treated each card as a token, gave the state of the draft
| as an input, and had the predicted token be the card to pick.
| You would have to train from scratch with a custom tokenizer.
| float-trip wrote:
| I tried adding special tokens for a reddit-style dataset once.
| The format was: `<|post_author|>username<|post_title|>title
| here...`
|
| The resulting model was so much worse than just formatting
| everything plaintext. This was with MPT-30B, 15 special tokens,
| 300M training tokens, and a full finetune.
|
| I may have made a mistake, but I haven't seen any open source
| finetunes successfully add a large number of tokens yet either.
| Tostino wrote:
| Try doing the same thing in your dataset, but don't actually
| add them as "special tokens"; just let them be multiple
| tokens.
|
| Adding new tokens needs a ton of data to train what each token
| means. Reusing existing tokens will allow you to easily
| teach the model that a sequence of tokens now has a new meaning
| after fine-tuning.
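|
| Roughly the difference, as a sketch (marker strings borrowed
| from the comment above):
|
|     from transformers import AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained(
|         "mistralai/Mistral-7B-v0.1")
|
|     # (a) brand-new special tokens: their embeddings start out
|     # untrained, so they need a lot of data to become meaningful
|     tok.add_special_tokens({"additional_special_tokens":
|                             ["<|post_author|>", "<|post_title|>"]})
|     # (the model also needs model.resize_token_embeddings(len(tok)))
|
|     # (b) plain-text markers: reuse tokens the model already knows
|     text = "[Author] username [Title] post title..."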
| float-trip wrote:
| That's what I ended up doing (`[Author] username [Title]
| post title...`)
|
| > Adding new tokens needs a ton of data to train what the
| token means.
|
| But how much? 300M tokens is fine for a simple version of
| ChatML with ~4 tokens. Not for 15, at least in my case.
| How does this relationship scale?
|
| Just trying to offer one datapoint for what doesn't work,
| with the hedge that I might have just had a bug.
| tayo42 wrote:
| I don't mean adding special tokens, but making the vocab only
| the set of possible cards: each card is a token.
|
| A simple input might be <cards you hold> 1 14 56</end><cards
| to pick> 5 64 2</end> -> the predicted token is the draft pick.
|
| Then train a transformer-based network from scratch.
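|
| A toy version of that encoding (all names made up):
|
|     # the vocab is just the fixed card set plus a few separators
|     all_card_names = ["Dead Weight", "Lurrus of the Dream-Den",
|                       "Gadwick's First Duel"]
|     specials = ["<hold>", "<pack>", "</end>"]
|     vocab = {t: i for i, t in enumerate(specials + all_card_names)}
|
|     def encode(hold, pack):
|         seq = (["<hold>"] + hold + ["</end>"] +
|                ["<pack>"] + pack + ["</end>"])
|         return [vocab[t] for t in seq]
|
|     # a from-scratch transformer then predicts one card id as
|     # the pick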
| 8f2ab37a-ed6c wrote:
| Thanks for sharing this, I found it helpful as an addition to my
| homebrew curriculum for learning how to fine-tune open source
| LLMs.
| objektif wrote:
| Can you please point me to good resources on fine tuning?
| Thanks.
| amrrs wrote:
| Check out https://github.com/OpenAccess-AI-Collective/axolotl
| 8f2ab37a-ed6c wrote:
| Search for articles showing you code for fine-tuning Llama 2,
| ideally including a colab notebook that you can run and
| modify yourself so that you have real code to work with. You
| can try to modify their working example to suit your own toy
| project as a first step.
| float-trip wrote:
| Thanks for the write-up. Rather than zeroing out the loss for the
| prompt, did you also try using weighted loss with Axolotl? At one
| point, Microsoft's GPT-3 docs suggested this was beneficial when
| the responses are short (like you have with "Cut in."). Domain
| adaptation over subreddits/forums before finetuning may help as
| well.
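|
| For reference, the masking version is just setting the prompt
| tokens' labels to -100 so the cross-entropy loss ignores them
| (sketch; Axolotl's train_on_inputs: false does this for you):
|
|     from transformers import AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained(
|         "mistralai/Mistral-7B-v0.1")
|     prompt, response = "Pack: ...\nPick: ", "Cut in."
|     ids = tok(prompt + response, return_tensors="pt").input_ids
|     labels = ids.clone()
|     n_prompt = len(tok(prompt).input_ids)  # approx. at boundary
|     labels[0, :n_prompt] = -100  # no loss on the prompt tokens
|     # a weighted variant would instead scale (not zero out) the
|     # per-token loss on the prompt tokens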
| dmakian wrote:
| > did you also try using weighted loss with Axolotl
|
| This is really smart, I didn't think about this! Will add it to
| my list of things to try, great idea!
|
| > Domain adaptation over subreddits/forums before finetuning
| may help as well.
|
| I was thinking about this too (along with transcribing draft
| YouTube videos); I'd definitely be curious how much this helps.
| rgbrgb wrote:
| > I ended up renting an hourly GPU from Runpod (an RTX 4090 w/
| 24GB of VRAM) for ~$0.7/hr.
|
| Sorry if I missed this, but how much did it cost total to do the
| fine-tune? Is that the 40 hour number (~$27)?
|
| Also, very cool writeup. Thanks for sharing!
| dmakian wrote:
| The longest running fine tuning job took about 8 hours, so ~$5.
|
| I think if you add up all of the learning and testing I did, it
| was probably closer to ~$50 total.
| sva_ wrote:
| Hmm, is "Generally Intelligent" related to the company that
| previously had that name, but renamed itself to "Imbue"? Sort of
| confused.
|
| https://www.ycombinator.com/companies/imbue
| lubutu wrote:
| Lurrus into Dead Weight -- that's a nice start.
___________________________________________________________________
(page generated 2023-12-05 23:00 UTC)