[HN Gopher] NanoGPT
___________________________________________________________________
NanoGPT
Author : trekhleb
Score : 1270 points
Date : 2023-01-11 08:34 UTC (14 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| cpdomina wrote:
| To train small gpt-like models, there's also aitextgen:
| https://github.com/minimaxir/aitextgen
| minimaxir wrote:
| As the creator of aitextgen, I'm mixed on continuing support
| since there doesn't seem to be as much demand as expected for
| _small_ GPT models given the success and cost-effectiveness of
| GPT-3/ChatGPT, unfortunately.
|
| I still have a few ideas there (including another secret
| approach at better text generation) but it's hard to determine
| ROI.
| mboof wrote:
| I think what you have created still has great demand. It gives
| devs who do not have the budget or need for the gigantic
| models something to train and use for their own specific
| language tasks.
|
| Not everyone is trying to replicate ChatGPT results for
| certain tasks.
| jamesfisher wrote:
| For casual readers like me: are there examples of what this can
| do once trained? E.g. it mentions training on Shakespeare, but
| gives no examples of fake Shakespearean.
| naasking wrote:
| The repo seems to imply that it matches GPT-2, so I imagine any
| analyses of GPT-2 will give you a good idea.
| kwerk wrote:
| I'm not easily finding GPT-2 use cases. Any query guidance?
| visarga wrote:
| The GPT family of models shines above 100B parameters.
| Almost nobody uses GPT2 today. It's too weak.
|
| If you want to go with a <1B model, you use a BERT, which is
| bidirectional, or a T5, which is easier to fine-tune on other
| tasks.
| fredoliveira wrote:
| Something that immediately comes to mind is text
| summarization. By now, though, you'll be used to better results
| from GPT-3 or more recent models.
| programmarchy wrote:
| Does anyone know the main differences between GPT-2 and
| GPT-3? Are there significant architectural changes, or is the
| advancement primarily from training?
| naasking wrote:
| If you google "GPT-2 vs GPT-3" you'll find lots of
| overviews and comparisons, like:
|
| * https://www.kdnuggets.com/2021/02/gpt2-gpt3-openai-showdown....
|
| * https://bakztfuture.substack.com/p/the-chasm-between-gpt-2-a...
| programmarchy wrote:
| Thanks. Sounds like they 10x'ed the number of parameters,
| which made some "magic leap" that isn't yet well
| understood, and fed it more data to train it on more
| specialized domains.
| naasking wrote:
| Yes, although Chinchilla seems to imply that training
| data size matters a lot more than parameter count, and
| nanoGPT author is trying to reproduce that here:
|
| https://github.com/karpathy/nanoGPT/blob/master/scaling_laws...
| karpathy wrote:
| I was also a bit surprised that the Chinchilla numbers
| and tables don't reproduce and that there are calculation
| bugs in the paper (e.g. the FLOPs calculation in the
| paper is wrong), especially because the paper has been so
| impactful in the field. Maybe people are focusing on the
| broad themes of the paper (e.g. scale model and data
| approx. in tandem) and just roughly interpolating the
| main Figure, without sweating the details. The
| corresponding authors responded very kindly at first and
| I was able to bring the results closer but now they went
| dark. Still hoping to make things match, if others in LLM
| space can spot any issues in my own reproduction please
| let me know.
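| For reference, the back-of-the-envelope formula most of the
| scaling-laws literature (including the nanoGPT notebook) relies
| on is training compute C ~= 6 * N * D FLOPs for N parameters and
| D tokens; a minimal sketch with illustrative numbers, not the
| paper's:
|
|       # Rough training-compute estimate: C ~= 6 * N * D FLOPs
|       # (forward pass ~2*N*D, backward pass ~4*N*D).
|       def train_flops(n_params: float, n_tokens: float) -> float:
|           return 6.0 * n_params * n_tokens
|
|       # Illustrative: a 124M-param GPT-2 on ~300B tokens.
|       print(f"{train_flops(124e6, 300e9):.2e}")  # ~2.23e+20 FLOPs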
| programmarchy wrote:
| Oh, that's really interesting, and makes sense
| intuitively. From the abstract:
|
| > We find that current large language models are
| significantly under-trained, a consequence of the recent
| focus on scaling language models whilst keeping the
| amount of training data constant ... the model size and
| the number of training tokens should be scaled equally:
| for every doubling of model size the number of training
| tokens should also be doubled.
|
| Assuming the GPT-3 authors know this, one could surmise
| they 10x'ed the number of training tokens also.
|
| Edit: Should have kept reading. Sounds like GPT-3 was
| found to be undertrained.
| aravindgp wrote:
| Thank you Andrej Karpathy for the work on AI and GPT models. It
| really helped me solve a problem as an entrepreneur. I started
| making my first few grand from AI.
| srge wrote:
| May I ask how? Consulting?
| imranq wrote:
| I would love to see a minInstructGPT or a minRetro, or maybe
| something that combines instruction and retrieval into a readable
| codebase!
| sharemywin wrote:
| To me this is the important quote:
|
| Unlike OpenWebText this will run in seconds. Finetuning takes
| very little time, e.g. on a single GPU just a few minutes. Run an
| example finetuning like:
| bilsbie wrote:
| Really cool. Can anyone answer these questions:
|
| Should I use this or minGPT?
|
| It says it needs an 8XA100 40GB node. What is that, and where
| do I acquire one?
|
| Could someone else train this and then send me the model? What
| would be required to run it as opposed to training it?
| vanpelt wrote:
| A100s are Nvidia GPUs. You can rent them from providers like
| AWS or Lambda Labs. The readme has instructions for downloading
| the original GPT2 weights from OpenAI. You can also train a
| very simple version on a smaller dataset from your laptop, as
| described in the README.
|
| If you just want to play with a similar but much better model,
| go to https://chat.openai.com
| nprateem wrote:
| If I trained this on a 30,000 word document could it give me a
| summary? Or would there be no need to train it in that case, and
| I could just tell it "Summarise this: <insert 30,000 word
| document>"?
| londons_explore wrote:
| The context window (block size) of this model is 1024 tokens.
| Tokens map approximately to words. So you can't ask it to
| summarize anything much over 1024 words.
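| To check how many tokens a piece of text occupies, a minimal
| sketch using the tiktoken package (which implements the GPT-2
| BPE that nanoGPT hard-codes; "document.txt" is a placeholder):
|
|       import tiktoken
|
|       enc = tiktoken.get_encoding("gpt2")  # GPT-2 BPE, vocab_size=50257
|       text = open("document.txt").read()
|       print(len(enc.encode(text)))  # must fit in the 1024-token block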
| nprateem wrote:
| Yeah, that's the issue I was thinking of: how to get it to
| summarise large documents. Does anyone have any ideas?
| londons_explore wrote:
| People have had some success with the following process:
|
| Divide your 30,000-word document into a hundred 300-word
| chunks. For each chunk, give as input:
| Please summarize the following text into 50 words:
| [chunk]
|
| Join all the outputs together, and you now have a shorter
| document. Repeat the process recursively.
|
| You can improve the results by doing the process again, but
| this time giving some context: Please
| summarize the following text, an extract of a document
| about [1st attempt at a summary], into 50 words:
| [chunk]
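| A minimal sketch of that recursive approach, assuming an LLM you
| can call for completions (the complete() helper below is
| hypothetical -- wire it to whatever model/API you use):
|
|       def complete(prompt: str) -> str:
|           # Hypothetical helper: send the prompt to your model of
|           # choice and return its text output.
|           raise NotImplementedError
|
|       def chunks(words, size=300):
|           return [" ".join(words[i:i + size])
|                   for i in range(0, len(words), size)]
|
|       def summarize(text: str, context: str = "") -> str:
|           words = text.split()
|           if len(words) <= 300:
|               return text
|           about = f", an extract of a document about {context}," if context else ""
|           parts = [complete(f"Please summarize the following text{about}"
|                             f" into 50 words:\n{c}") for c in chunks(words)]
|           # Join the short summaries and recurse until it fits.
|           return summarize(" ".join(parts), context)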
| londons_explore wrote:
| You can also use "Please suggest a section title for the
| following text".
|
| Then that title can be used in the 2nd round, for example
| using a query of the form "The following is an extract
| from the _Introduction_ section of a document about _The
| benefits and disadvantages of nuclear power in sweden_ :"
| generalizations wrote:
| I imagine you could do even better by finetuning the
| neural net on the document before asking for the
| recursive summary. Then it has all the information to
| work with, albeit in a compressed form.
| londons_explore wrote:
| 30,000 words wouldn't be enough to train this from scratch -
| you'd ideally train from hundreds of millions of words at
| least.
|
| 30,000 words _would_ be enough to _finetune_ an existing model.
| If you did that, then the model would output text similar to
| the finetuning data. For example, if you finetuned it on
| Shakespeare, then you might be able to use the model to make a
| new play, in Shakespeare's style.
| ProjectArcturis wrote:
| If you finetuned it on the text of Shakespeare's plays, how
| would it link that text to the string "Shakespeare"?
| londons_explore wrote:
| It still has the knowledge from the main training on data
| from across the whole internet, so would still know the
| word Shakespeare...
|
| But you're right - the model finetuned on Shakespeare would
| be good at writing a new play in the style of Shakespeare,
| but would be bad at giving a critique of Shakespeare's
| works.
| gpt-4 wrote:
| Is there a list of datasets like
| https://skylion007.github.io/OpenWebTextCorpus/ ?
| grogenaut wrote:
| Somewhat off topic: does someone know how Bing might integrate
| ChatGPT into search? Is it to understand the prompt and filter
| results, taking the question and summarizing it to search the
| index? Is it to summarize all the documents into an index and
| search that? Or to just be like ChatGPT is now and use it to
| generate new results from its knowledge base? I'm trying to
| connect the dots between a generative model like these and how
| it would influence search in the future. Or is the Lucene-style
| index search on its way out in a generative world?
| legutierr wrote:
| > The code itself is plain and readable: train.py is a ~300-line
| boilerplate training loop and model.py a ~300-line GPT model
| definition, which can optionally load the GPT-2 weights from
| OpenAI. That's it.
|
| What's the best source for these weights?
| benjamincburns wrote:
| Kaggle or HuggingFace
| siquick wrote:
| Excuse my ignorance but what can a layman do with this?
| taneq wrote:
| Become less lay?
| cs702 wrote:
| Andrej doesn't _need_ to do this.
|
| He's done it because he evidently _loves_ it, and wants to
| _share_ his hard-earned knowledge with the rest of the world.
|
| He may be a product of the ivory tower, but he's been _in the
| trenches_. He knows firsthand how f-ing hard it is to ship a
| product.
|
| And here he is, sharing useful _personal code_ with everyone.
|
| This github repo has now collected ~4K stars and its predecessor
| (minGPT) has collected ~11K stars over the past couple of years.
| In my experience, the number of people who clone, copy, view or
| otherwise use code from a repo is one to two orders of magnitude
| larger than the number of people who star it, so we can safely
| say that Andrej has helped at least a few hundred thousand -- but
| likely more than a million -- individuals around the world learn
| how to build and tinker with GPT models.
|
| Remarkably, as I write this, no one else here has said thank you
| yet, so let me say it on everyone's behalf:
|
| THANK YOU ANDREJ.
|
| --
|
| EDITS: I changed the wording in response to latexr's comments
| below.
| canadianfella wrote:
| [dead]
| LeoPanthera wrote:
| I love _italics_. They're _good_.
| cs702 wrote:
| In hindsight, yes, I may have overused them out of
| excitement. Sorry! :-)
| mrg3_2013 wrote:
| Thoughtful post! Everything so true! I am always amazed by
| individuals who truly are educators of the world.
| isoprophlex wrote:
| Pedantry time!
|
| A million people building GPT models means that one in 8000
| humans on earth has built one. That seems wildly off.
|
| LinkedIn has about 100,000 profiles of data scientists. Assume
| generously that the actual number is 10x higher. Not correcting
| for the fact that a data scientist isn't always a machine
| learning expert, etc etc, there's just no way every single one
| of them even KNOWS what a GPT-like model is.
| cs702 wrote:
| Not only building. Also tinkering, using, testing out of
| curiosity, etc. There are around ~30 million software
| developers worldwide today (source: googled it). Around ~7
| million of them are registered users of Github (source:
| googled it). 1M+ seems likely to me.
|
| BTW, I appreciate that you preceded your comment with
| "Pedantry time!" -- nice gesture :-)
| latexr wrote:
| Edit: the OP has updated their wording to make it clear they
| meant any kind of viewing or usage. I don't think any of us
| would disagree more people use code than star repos. Original
| comment left below with original quote, since this has gotten a
| number of replies that would stop making sense with a larger
| edit.
|
| > Normally, the number of people who clone or copy code from a
| repo is one to two orders of magnitude larger than the number
| of people who take the time to star it
|
| Intuitively, I'm having trouble believing that. Starring takes
| _considerably_ less effort than cloning or copying code. The
| "time to star" is a literal second, maybe two if you have to
| scroll up.
|
| From anecdotal observation, repos with more forks and/or
| external contributors than stars are far from the norm. I've
| seen many mention they star repos as a way of bookmarking
| things they seldom go back to, or as an easy way to send kudos
| to the developer even when they don't use the project.
|
| In no way is this a comment on the value of Andrej's work (I'm
| not familiar with it). I am only interested in the source of
| your "orders of magnitude" claim, which if proven will update
| my mental model of the coding community.
| chirau wrote:
| How many projects have you starred and how many have you
| cloned?
|
| Whilst starring is simpler, the incentive is much lower than
| that of cloning. Especially for projects you just want to use
| and not contribute to or follow.
|
| In my many years of work, I have starred fewer than 50
| repos. I am sure I have cloned more than a thousand.
| latexr wrote:
| > How many projects have you starred and how many have you
| cloned?
|
| I seldom star, but neither you nor I can be extrapolated to
| the general community. I have thousands of stars in some
| repos, and I know a significant number of those users don't
| even code, let alone clone repos or copy code, they're
| interested in the final product. They have GitHub accounts
| because it's the way to report bugs or make feature
| requests.
|
| The OP made a claim. All I'm looking to understand is if it
| has data backing it up or it's just a gut feeling, because
| if it's the former I'll have learned something and made a
| correction of my mental model of the world. Sharing more
| anecdotes will leave us stuck in the same situation.
| wongarsu wrote:
| If I want to use a repository, my first step is to either
| download a released binary or clone the repository. Forking
| is much further down the line for me, when I've used the
| code, encountered a problem, fixed it, and decided to polish
| the fix up to make a PR. I star something when I either have
| used it and like it, or when I think I want to use it in the
| future and want to bookmark it (though the former more often
| than the latter). I have given out about 50% more stars than
| I've forked, and have probably cloned an order of magnitude
| more than I've forked or starred.
|
| Of course not everyone is the same, but I'd be surprised if
| overall clones were less than an order of magnitude more than
| forks or stars, and find two or even three orders of
| magnitude believable depending on the target group of the
| repo.
| cs702 wrote:
| _Exactly._ I would add that the number of clones (not
| forks) and file/page views is viewable only by the owner
| of the repo, so we can only guess. (If you own a github
| repo, you can see the most recent number of clones and page
| views by clicking on Insights -> Traffic.)
|
| My estimate of "one to two orders of magnitude" is based on
| anecdotal evidence. I edited my comment to reflect as much.
| sjadoinqwoeihad wrote:
| I checked my 5-year-old repository of ~300 stars. It gets
| ~100 unique clones a month. So even if the average were half
| that, one order of magnitude would be quite an accurate
| approximation.
|
| I think the biggest difference with a clone and a star is
| that a star requires an account and some vested interest in
| the social network of Github. Anyone who is not interested in
| the social aspect can just bookmark it.
|
| I guess this differs quite a lot by target demographic. A
| tool for GPT will probably get a lot more stars than a plugin
| for some consumer software simply because it is more targeted
| for the audience of people who have Github accounts.
| [deleted]
| cs702 wrote:
| Thank you for sharing your anecdata. In my experience, the
| number of clones per month is much higher at first, and
| then decays gradually until it settles into a stable run-
| rate, so it's likely that you've had _more than_ 100 x 12 x
| 5 clones over those five years -- i.e., between one and two
| orders of magnitude more than the number of stars, 300.
| jefftk wrote:
| Another data point: icdiff is 13y old with 4k stars and
| 200 unique clones in the past month.
|
| (This is a tool that most people install and run without
| any interaction with GitHub, since it is in package
| managers)
| londons_explore wrote:
| Some repos have code that 'phones home' when run. For
| example, checking for updates or security vulnerabilities.
|
| By checking the usage statistics on that server, you can get
| an idea how many users there are, and typically it's far
| higher than the number of stars.
| latexr wrote:
| That just tells us that more people _use_ the code than
| star the repo. I don't think that'd be a surprise to
| anyone. The claim was that more people clone and copy code
| from the repo than the ones who star it, which is a
| different matter from the number of users.
| cs702 wrote:
| Thank you for clarifying. I meant _use_. The number of
| clones and the number of file/page views are proxies for
| that. So is the number of installs via pip, conda, and
| other Python package management systems, in this case. I
| updated my comment to reflect as much.
| hahamrfunnyguy wrote:
| I've starred maybe 2-3 repositories over the past 15 years,
| contributed to probably a half dozen and used hundreds (if
| not more) in my applications. To me using means using that
| project in an application you develop. Typically I get them
| from NPM or Nuget and I contribute when a) the project owner
| thinks my feature idea is a good idea or b) I run into a bug
| that I can fix.
|
| Starring is just not that useful to me so I can see why users
| or contributors would be much higher. I typically star repos
| if it's an unpopular or old repository that doesn't have NPM
| or Nuget packages.
| [deleted]
| adam_arthur wrote:
| I'm all for thanking open source contributors, but your
| excessively prostrating wording is a bit much for me.
| cs702 wrote:
| If I overdid it, I'm sorry. I promise it wasn't intentional.
| My comment was spur-of-the-moment, motivated by sincere
| gratitude :-)
| 1986108 wrote:
| [flagged]
| idiotsecant wrote:
| can't tell if you're making some kind of clever quip or if
| this is some random spambot just entering a random reverse
| DNS lookup line.
| tomComb wrote:
| Him doing this is not like when your average bloke does it.
|
| He appears to be building a business and maintaining his
| profile. And there is nothing wrong with that - I admire him
| for pursuing his career in this positive and helpful way.
|
| But random folks do this sort of thing everyday with no such
| career goals and little recognition, so I'm not sure it is this
| specific contribution that needs to be called out.
| krisoft wrote:
| > I'm not sure it is this specific contribution that needs to
| be called out.
|
| I go the other way. I would like to thank anyone who releases
| open source code, whether they cause big ripples or not.
| modeless wrote:
| What business is he building?
| homarp wrote:
| see also 'Cramming: Training a Language Model on a Single GPU in
| One Day' https://arxiv.org/abs/2212.14034 and
| https://github.com/JonasGeiping/cramming
| waiseristy wrote:
| So, are there any of these projects that aren't vendor locked to
| NVIDIA and are able to train large models with limited GPU RAM
| space?
|
| I don't mind letting my machine churn for 2-3 weeks. But I'm not
| looking to buy another $1000 GPU just because CUDA is the only
| compute library researchers understand.
| jgalt212 wrote:
| So is MSFT now extra grossly overpaying for ChatGPT?
| surume wrote:
| Thank you so much for this! It is so impressive and I'm sure it
| took a lot of hard work!
|
| Is it able to re-write articles? And where could I find a guide
| on how to train it?
| arturventura wrote:
| This is really good, and I was really excited by it but then I
| read:
|
| > running on a single 8XA100 40GB node in 38 hours of training
|
| This is a $40-80k machine. Not a diss, but I would love to see an
| advance that would allow anyone with a high end computer to be
| able to improve on this model. Before that happens this whole
| field is going to be owned by big corporations.
| windexh8er wrote:
| But how often do you need to run this? You can run 8xA100 on
| Lambda Labs [0] (no affiliation) for $8.80/hr. So you should be
| able to run the entire training for less than $350.
|
| [0] https://lambdalabs.com/service/gpu-cloud#pricing
| throwawaymaths wrote:
| They are acknowledged at the bottom for supporting Andrej's
| research!!
| anigbrowl wrote:
| Well, he does include instructions for running it on a personal
| computer, which looks like what I'm gonna be doing next week.
|
| Besides the rental options discussed below, these Nvidia boxen
| don't look too big, so either used ones will be available for
| cheap relatively soon, or you could just locate and liberate
| one in Promethean fashion.
| ProjectArcturis wrote:
| That's to train it from scratch, though, right? If you preload
| the GPT2 weights you don't need to do this. You can just give
| it additional training on your texts.
| aidos wrote:
| I don't know anything about this, but is that this instance
| type on AWS? p4d.24xlarge
| Tepix wrote:
| If you can fit the training into 24GB, a used RTX 3090 for
| $700-$800 seems like a good deal at the moment. They are about
| 45-65% as fast as the A100 according to
| https://bizon-tech.com/gpu-benchmarks/NVIDIA-RTX-3090-vs-NVI...
|
| So if you buy two of these cards it will take 12-13 days
| instead of 38 hours but only require a $2500 PC.
|
| James Betker, who created tortoise TTS, built his own $15k
| machine with 8x RTX 3090 and trained the models with it. He now
| works for OpenAI...
| jph00 wrote:
| A couple of weeks ago a new paper came out that shows how to
| train a high quality language model on a single GPU in one day.
|
| https://arxiv.org/abs/2212.14034
| wongarsu wrote:
| It's a $33/hour machine on AWS, so about $1250 for one training
| run. Not cheap, but easily in the reach of startups and
| educational or research institutions.
|
| Edit: or about $340 if you get the 8xA100 instance from
| lambdalabs, in the realm of normal hobby spending
| belter wrote:
| Or $9/hour if you use Spot :-)
|
| https://aws.amazon.com/ec2/spot/pricing/
| snerbles wrote:
| Hopefully your progress gets saved in time when the spot
| instance inevitably gets terminated in the midst of
| training.
| acetabulum wrote:
| If you use Horovod Elastic, I think you can avoid this
| problem working across a cluster of Spot instances.
|
| https://horovod.readthedocs.io/en/stable/elastic_include.htm...
| belter wrote:
| "Managed Spot Training..."
|
| "...Spot instances can be interrupted, causing jobs to
| take longer to start or finish. You can configure your
| managed spot training job to use checkpoints. SageMaker
| copies checkpoint data from a local path to Amazon S3.
| When the job is restarted, SageMaker copies the data from
| Amazon S3 back into the local path. The training job can
| then resume from the last checkpoint instead of
| restarting...."
|
| https://docs.aws.amazon.com/sagemaker/latest/dg/model-manage...
| bobbyi wrote:
| If you're doing something new/ custom (which you presumably
| are if you aren't using someone else's prebuilt model), it
| could take a lot of runs to figure out the best training data
| and finetune settings.
|
| (I assume. I've never worked with GPT, but have done similar
| work in other domains).
| weird-eye-issue wrote:
| After training don't you have to keep it running if you want
| to use it?
| wongarsu wrote:
| Just download the model and run it on something much
| smaller and cheaper. Bigger models like GPT-J are a bit of
| a pain to run, but GPT2-sized models run just fine on
| consumer GPUs.
| bilsbie wrote:
| What's required to run the model?
| wongarsu wrote:
| The biggest GPT2 (1.5B params) takes about 10GB VRAM,
| meaning it runs on an RTX 2080 Ti, or the 12GB version of
| the RTX 3080.
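| For reference, a minimal sketch of running that locally with the
| Hugging Face transformers package ("gpt2-xl" is the 1.5B
| checkpoint; loading in fp16 roughly halves the memory needed for
| the weights):
|
|       import torch
|       from transformers import GPT2LMHeadModel, GPT2Tokenizer
|
|       tok = GPT2Tokenizer.from_pretrained("gpt2-xl")
|       model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
|       model = model.half().to("cuda").eval()  # fp16 to save VRAM
|
|       ids = tok("O Romeo, Romeo,", return_tensors="pt").input_ids.to("cuda")
|       with torch.no_grad():
|           out = model.generate(ids, max_new_tokens=50,
|                                do_sample=True, top_k=40)
|       print(tok.decode(out[0]))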
| renewiltord wrote:
| What's the largest language model I can run on a 3090
| with 24 GiB RAM?
| lossolo wrote:
| Depends on precision: you can run a ~5B model with fp32
| precision or a ~11B fp16 model max. Int8 is really bad for
| real-world use cases, so I'm not mentioning it.
|
| But if you are looking to get the performance of ChatGPT or
| GPT-3, then don't waste your time; all small GPT-3-like
| LLMs (below at least 60B params) are useless for any real-
| world use case, they are just toys.
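| The arithmetic behind those limits is mostly just parameter
| count times bytes per parameter (plus headroom for activations
| and the KV cache); a rough sketch:
|
|       # weights ~= n_params * bytes_per_param; add ~20% headroom.
|       def weight_gb(n_params, bytes_per_param):
|           return n_params * bytes_per_param / 1e9
|
|       for n in (5e9, 11e9):
|           print(f"{n/1e9:.0f}B params:",
|                 f"fp32={weight_gb(n, 4):.0f}GB",
|                 f"fp16={weight_gb(n, 2):.0f}GB",
|                 f"int8={weight_gb(n, 1):.0f}GB")
|       # 5B params: fp32=20GB fp16=10GB int8=5GB
|       # 11B params: fp32=44GB fp16=22GB int8=11GB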
| renewiltord wrote:
| Okay, thank you. Perfect response.
| haldujai wrote:
| If you specifically mean a general LLM trained on a
| general language corpus with instruction finetuning this
| is correct.
|
| Fortunately very few real world use cases need to be this
| general.
|
| If you are training a LLM on a domain specific corpus or
| finetuning on specific downstream tasks even relatively
| tiny models at 330m params are definitely useful and not
| "toys" and can be used to accurately perform tasks such
| as semantic text search, document summarization and named
| entity recognition.
| lossolo wrote:
| > If you specifically mean a general LLM trained on a
| general language corpus with instruction finetuning this
| is correct.
|
| Yes, that's what I meant.
|
| > If you are training a LLM on a domain specific corpus
| or finetuning on specific downstream tasks even
| relatively tiny models at 330m params are definitely
| useful and not "toys" and can be used to accurately
| perform tasks such as semantic text search, document
| summarization and named entity recognition.
|
| Agree, BERT family is a good example here.
| JustSomeNobody wrote:
| https://github.com/karpathy/nanoGPT#i-only-have-a-macbook
|
| > This creates a much smaller Transformer (4 layers, 4 heads,
| 64 embedding size), runs only on CPU, does not torch.compile
| the model (torch seems to give an error if you try), only
| evaluates for one iteration so you can see the training loop at
| work immediately, and also makes sure the context length is
| much smaller (e.g. 64 tokens), and the batch size is reduced to
| 8. On my MacBook Air (M1) this takes about 400ms per iteration.
| The network is still pretty expensive because the current
| vocabulary is hard-coded to be the GPT-2 BPE encodings of
| vocab_size=50257. So the embeddings table and the last layer
| are still massive. In the future I may modify the code to
| support simple character-level encoding, in which case this
| would fly. (The required changes would actually be pretty
| minimal, TODO)
| anilshanbhag wrote:
| If GPT-2 / nanoGPT needs this setup, just imagine what GPT-3 /
| ChatGPT needs!
| Gigachad wrote:
| Supposedly even running the trained model for ChatGPT is
| extremely expensive, unlike the image generators, which can
| largely be run on a consumer device.
| haldujai wrote:
| If you can't fit the model on your resources, you can leverage
| DeepSpeed's ZeRO-Offload, which will let you train GPT2 on a
| single V100 (32GB).
|
| Alternatively, if you're researching (with the caveat that you
| have to either publish, open source, or share your results in a
| blog post) you can also get access to Google's TPU Research
| Cloud, which gives you a few v3-8s for 30 days (you can't do
| distributed training across devices, but you can run workloads
| in parallel). You can also ask nicely for a pod; I've been
| granted access to a v3-32 for 14 days pretty trivially, which
| (if optimized) has more throughput than 8xA100 on transformer
| models.
|
| TPUs, and more so pods, are a bit harder to work with, and TF
| performs far better than PyTorch on them.
|
| https://www.deepspeed.ai/tutorials/zero-offload/
|
| https://medium.com/analytics-vidhya/googles-tpu-research-clo...
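| If it helps, ZeRO-Offload is mostly a config concern; a minimal
| sketch of the relevant DeepSpeed config fragment (the batch size
| is a placeholder -- check the linked tutorial for the exact keys
| your DeepSpeed version expects):
|
|       # Passed to deepspeed.initialize(); ZeRO stage 2 with the
|       # optimizer state offloaded to CPU RAM.
|       ds_config = {
|           "train_micro_batch_size_per_gpu": 4,
|           "fp16": {"enabled": True},
|           "zero_optimization": {
|               "stage": 2,
|               "offload_optimizer": {"device": "cpu"},
|           },
|       }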
| dceddia wrote:
| I was curious about how much this would be to rent, because the
| cost of those servers is definitely outside my budget! Lambda
| has 8xA100 40GB for $8.80/hr:
| https://lambdalabs.com/service/gpu-cloud#pricing
| Tenoke wrote:
| It seems about as likely as people being able to build cars at
| big-automaker level with just the tools in their garage. More
| compute is going to keep producing better results, at least for
| LLMs.
| base698 wrote:
| You can rent on AWS and other cloud providers.
| liquidk wrote:
| That is a key difference. You can't easily and cheaply rent
| an auto factory, but you're starting to be able to rent an
| LLM training factory once per model, and then run inference
| on that model far more cheaply.
| krisoft wrote:
| So if I see it right, that would be a p4d.24xlarge instance,
| which goes for about $32.77 an hour nowadays, so the total
| training would be about $1245. Not cheap, but certainly not a
| nation-state budget.
|
| Edit: I just noticed Lambda Labs. It seems they ask $8.80 per
| hour for an instance of this caliber. That puts the total
| training cost around $334. I wonder how come it is that much
| cheaper.
| pavlov wrote:
| I don't know if that's a blocker. Ordinary people commonly rent
| a $40k machine for 38 hours from companies like Avis and Hertz.
|
| If training a large model now costs the same as driving to
| visit grandma, that seems like a pretty good deal.
| [deleted]
| jetrink wrote:
| That's a great comparison. For a real number, I just checked
| Runpod and you can rent a system with 8xA100 for $17/hr or
| ~$700 for 38 hours. Not cheap, but also pretty close to the
| cost of renting a premium vehicle for a few days. I've
| trained a few small models by renting an 1xA5000 system and
| that only costs $0.44/hr, which is perfect for learning and
| experimentation.
| willseth wrote:
| The good news is that, unlike vehicles, the rate for rented
| compute will continue to drop
| amelius wrote:
| It would be great if a tradeoff could be made, though. For
| example, train at 1/10th the speed for 1/10th of the cost.
|
| This could correspond to taking public transport in your
| analogy, and would bring this within reach of most
| students.
| mcbuilder wrote:
| Well, if it used to cost you $1 for 1hr at 1x speed, now
| it will take you 10hr at 0.1x speed and, if my math
| checks out, still cost $1. You need to shrink the model.
| amelius wrote:
| But of course now you run it on your own computer instead
| of in the DC, which changes the numbers. Especially if
| your student dorm has a shared electricity bill :)
| mk_stjames wrote:
| The problem with that is that, currently, the available memory
| scales with the class of GPU... and very large language
| models need 160-320GB of VRAM. So, there sadly isn't
| anything out there that you can load a model this
| large onto except a rack of 8x+ A40s/A100s.
|
| I know there are memory channel bandwidth limits and
| whatnot but I really wish there was a card out there with
| a 3090 sized die but with 96GB of VRAM solely to make it
| easier to experiment with larger models. If it takes 8
| days to train vs. 1, thats fine. having only two of them
| to get 192GB and still fit on a desk and draw normal
| power would be great.
| buildbot wrote:
| Technically this is not true - there are a lot of
| techniques to shard models and store activations between
| layers or even smaller subcomponents of the network. For
| example, you can split the 175B-parameter BLOOM model
| into separate layers, load up a layer, read the previous
| layer's input from disk, and save the output to disk.
|
| And NVIDIA does make cards like you are asking for - the
| A100 is the fast memory offering, the A40 the bulk slower
| memory (though they added the 80GB A100 and did not
| double the A40 to 96GB so this is less true now than the
| P40 vs P100 gen).
|
| Oddly, you can get close to what you are asking for with
| a M1 Mac Studio - 128GB of decently fast memory with a
| GPU that is ~0.5x a 3090 in training.
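| A minimal sketch of that layer-at-a-time pattern (load_layer()
| is a hypothetical helper; libraries like Hugging Face accelerate
| automate this kind of offloading):
|
|       import torch
|
|       def load_layer(i: int) -> torch.nn.Module:
|           # Hypothetical helper: deserialize only layer i from disk.
|           return torch.load(f"layers/layer_{i}.pt")
|
|       def forward_offloaded(hidden: torch.Tensor, n_layers: int):
|           # Only one layer's weights are resident at a time; the
|           # activation tensor is carried (or spilled to disk)
|           # between steps.
|           for i in range(n_layers):
|               layer = load_layer(i)
|               with torch.no_grad():
|                   hidden = layer(hidden)
|               del layer  # free before loading the next layer
|           return hidden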
| amelius wrote:
| I guess this would only become a reality if games started
| requiring these cards.
| londons_explore wrote:
| Slower training tends to be only a little cheaper,
| because most modern architectures parallelize well, and
| they just care about the number of flops.
|
| If you want to reduce cost, you need to reduce the model
| size, and you'll get worse results for less money.
| ofcourseyoudo wrote:
| Similarly maybe we should only let people rent a NanoGPT box
| if they are over 25 and they have to get collision insurance.
| swader999 wrote:
| You have to gas it up and heaven help you if it gets a
| scratch or a scuff.
| speed_spread wrote:
| Great news! Cloud instances energy usage is included in
| their price, and because they're remote and transient it's
| impossible to permanently damage them.
| DesiLurker wrote:
| but you still have to pay for network ingress/egress
| traffic.
| aequitas wrote:
| I think the equivalent of being not careful and getting a
| dent in this context is to leave it open to the internet
| and having a bitcoin miner installed.
| idonotknowwhy wrote:
| A better fit would be if you have unlimited liability,
| like with AWS, and you leak your key pair. Then someone
| runs up a $100k bill spinning up mining instances.
| Aissen wrote:
| You free the instance and the miner is gone.
| iso1631 wrote:
| As you are paying for the resources you use that's fine.
|
| The closest would be if you used some form of software
| bug to actually cause physical damage, certainly not
| impossible, but extremely unlikely compared with actually
| physically damaging a car.
| Apofis wrote:
| Let's not forget that rendering 3D Animations in 3DSMAX or
| Maya used to take days for a single frame for a complex
| scene, and months for a few minutes.
| kzrdude wrote:
| How are universities and colleges dealing with this kind of
| demand for computing power? It must be hard to be able to do
| some courses now.
| CuriouslyC wrote:
| Most decently large colleges have been investing in HPC for a
| while, and started investing in GPU HPC around 2014. You'd be
| surprised what sort of school projects the compute budget
| exists for.
| r3trohack3r wrote:
| I went to a smallish state university; even there we had
| our own HPC center and lab. We had a proper (IIRC) 6-row
| HPC data center across campus, and I had a continuous
| budget available to me as an undergraduate research
| assistant for building Beowulf clusters for the graduate
| programs to run assignments on. I once got an allowance to
| buy 15 Raspberry Pis to build an ARM cluster.
| TrackerFF wrote:
| As far as research groups go - they get funds (project
| grants, donations, etc.) to purchase machines and parts, and
| then users have to timeshare them.
|
| These machines are pretty much crunching numbers 24/7, and
| your project will get appended to a queue.
| londons_explore wrote:
| 'group project'
| sebastianconcpt wrote:
| What's the applicability? Can you give me some examples of what
| this can be used for?
| wongarsu wrote:
| I imagine this might be interesting for domain-specific GPT
| models. Say training it on a mountain of technical
| documentation, or on every fanfiction published on the
| internet, or a sentiment analysis dataset. Of course fine-
| tuning GPT-3 would give better results, but nanoGPT might allow
| you to make a much smaller model that's still good enough, to
| enable cheaper inference.
|
| Also the opportunity to play around with all the parameters
| fairly cheaply to find improvements. The todo section of the
| readme gives a small taste of that. Making bigger models works
| for OpenAI, but maybe the rest of us manage to make small
| models just perform better instead.
| albertTJames wrote:
| Curious to know how close that training loop is to actual
| OpenAI code.
| buzzdenver wrote:
| For an AI noob like me: can you use spot instances to train
| models? They are about 1/3rd the price on AWS compared to on
| demand ones, so it'd make a significant difference.
| yreg wrote:
| Why not? This is the exact use case of what Spot instances seem
| to be for. (Not hosting a service, but just calculating
| something for yourself.)
| satvikchoudhary wrote:
| Yes, you should use them. They can be taken away from you with
| 2 minutes' notice. (It doesn't happen a lot in practice though.
| I have been running a different instance for over a month. AWS
| doesn't force you off if they don't have to.)
|
| If you are going to run a long training job, ensure you are
| creating checkpoints. Be sure to use persistent storage (EBS)
| and check the option so that it doesn't get deleted if the
| instance is stopped; that way your checkpoints remain on disk
| and you can easily restart.
|
| I haven't tried it but prices here are much cheaper.
| https://vast.ai/#pricing
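| A minimal sketch of the checkpoint-and-resume pattern in PyTorch
| (the path and the "every 1000 steps" cadence are placeholders):
|
|       import os
|       import torch
|
|       CKPT = "/mnt/ebs/ckpt.pt"  # persistent volume, survives termination
|
|       def save_ckpt(step, model, optimizer):
|           torch.save({"step": step,
|                       "model": model.state_dict(),
|                       "optim": optimizer.state_dict()}, CKPT)
|
|       def load_ckpt(model, optimizer):
|           if not os.path.exists(CKPT):
|               return 0
|           state = torch.load(CKPT, map_location="cpu")
|           model.load_state_dict(state["model"])
|           optimizer.load_state_dict(state["optim"])
|           return state["step"] + 1
|
|       # In the training loop: start = load_ckpt(model, optimizer),
|       # then call save_ckpt(step, model, optimizer) every 1000 steps.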
| belter wrote:
| Yes you can. In Oregon you could eventually get this instance
| at $9/hour. I say eventually, because of course Spot allocation
| is not guaranteed. (And neither is On Demand... but that is a
| story for another day.)
|
| https://aws.amazon.com/ec2/spot/pricing/
| drjuice wrote:
| [flagged]
| henkdehenker wrote:
| Karpathy is such a boss!
| yreg wrote:
| Is there any trained model for text generation that you can run
| locally yet?
| throwaway743 wrote:
| Plenty. Hugging Face alone has a ton.
| deqwer wrote:
| There's LAION working on an open-source[1] version of ChatGPT.
|
| [1] https://github.com/LAION-AI/Open-Assistant
| Metus wrote:
| This should be way higher up.
| turmeric_root wrote:
| Though their roadmap doc says they're looking into finetuning
| existing GPT-J/T5 models for this task. So you'll probably
| want a 3090 (24GB VRAM) and at least 16GB of CPU RAM to run
| inference if/when the project is complete.
| wongarsu wrote:
| GPT2 can be run locally (on a somewhat beefy consumer GPU)
| karmajuney wrote:
| Can you add some info on what consumer GPU would be needed
| for this? Would a 3080 be able to handle this?
| wongarsu wrote:
| Assuming you get the 12GB version of the 3080. A 2080TI is
| another option. Though you can reduce precision or use one
| of the smaller GPT2 versions to run on smaller cards as
| well.
| minimaxir wrote:
| The original GPT-2 small (the 124M one) can run on a CPU,
| just slowly and not scalably.
| iamflimflam1 wrote:
| I think the link should be: https://github.com/karpathy/nanoGPT
| [deleted]
| taylorius wrote:
| Wow, this is great. I can't wait for the video lecture,
| transformers are an aspect of modern machine learning that I'm
| not completely clear on. Andrej's lectures are brilliant - super
| detailed, and really answer the detailed questions I always have.
| Great stuff!
| theGnuMe wrote:
| How critical are training warmups and is an iteration here the
| same as an epoch?
| Terretta wrote:
| 14 hours ago: https://news.ycombinator.com/item?id=34331919
|
| Curious why HN didn't merge the submission as it usually does. Is
| there a "no, submit this again" option?
| eismcc wrote:
| The other post probably didn't make it to the front page
| marviel wrote:
| I have taken several masters-level courses in Machine Learning --
| and even with those credentials, I cannot recommend Andrej's
| YouTube series, "Neural Networks: Zero to Hero", _enough_. There,
| he teaches you, from scratch, how to build everything from the
| underlying automatic gradient calculation system in PyTorch, all
| the way up to the slower version of this model - `minGPT`.
|
| [1]
| https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThs...
|
| (edit: self-promo: I'm currently working on a Typescript follow-
| through of this same series of video lectures, if you want to
| follow along with stronger types for explanation:
| https://github.com/Marviel/lab-grad)
| randoglando wrote:
| How does it compare to fast.ai? As an engineer looking to learn,
| which should I start with?
| marviel wrote:
| Both are good for different things.
|
| Fast.AI is great, but it takes the top down, vs the bottom
| up, approach. It takes you from a production-level black box
| that you don't understand, down to the details. The benefit
| there is you get good high-level intuition of how it behaves
| at the "let me use this technology for a job" level.
|
| Separately, the fast.ai library is also highly recommendable
| -- it comes with some state-of-the-art image recognition
| models, and its training wrappers are really helpful
| particularly for image-recognition dataset training.
|
| Karpathy's "Neural Networks: Zero to Hero" video series
| starts at the level of individual neurons, and works you up
| to the final product. For some reason both this style, and
| Karpathy's conciseness appeal to me slightly more. I'm also
| super detail-oriented, though -- and any level of "hand
| waving" (even if further explanation comes later) always
| bothers me. He's also got some pretty high-profile industry
| experience which carries some weight with me.
|
| But I'll say that both are really high-quality. --
| ultimately, my recommendation would be to follow whichever
| one speaks most to you personally after the first 1hr or so.
|
| EDIT: Per Jeremy's response below, if you want the bottom-up
| approach but like the fast.ai teaching style, you should
| check out "part 2" of the fast.ai set of tutorials, which is
| exactly that.
| jph00 wrote:
| fast.ai has both - the "part 1" section is top-down, and
| the "part 2" section is bottom up. You can do part 2
| without having done part 1. Part 2 starts with implementing
| matrix multiplication from scratch, then backprop from
| scratch, then SGD from scratch, etc.
|
| There will be a new version of the part 2 course out in a
| few weeks. It even covers stuff like random number
| generation from scratch, convolutions from scratch, etc. It
| gradually works all the way up to Stable Diffusion.
|
| @karpathy's and the fast.ai lessons work well together.
| They cover similar topics from different angles.
|
| (I'm the primary creator of the fast.ai courses.)
| marviel wrote:
| That's awesome! I did not know that part 2 was structured
| this way, and will check it out. Will be really neat to
| see you teach stable diffusion.
|
| Thanks for your work on fast.ai!
| jwithington wrote:
| Jeremy @ Fast.ai says he takes this pedagogical approach
| because it's "proven" to be the best way to learn. He's
| probably right, but I do find it confusing at times because
| in the beginning you're just hitting ctrl + enter on a
| IPYNB haha.
|
| Maybe Karpathy's approach will speak to me more--thanks for
| the recommendation!
| brap wrote:
| I can't believe I just spent 2 and a half hours glued to my
| phone in bed watching this, for absolutely no reason other than
| it was such an interesting intro (to a subject I'm already
| familiar with). Thanks for the recommendation, and thanks
| Andrej for making this!
| jwithington wrote:
| What would I google to figure out how to productionize the output
| of this?
|
| This repo trains a model--how would I prompt it and print the
| generated output?
| mittermayr wrote:
| As someone who's been in software for almost 25 years now, I read
| through this in amazement at how much new stuff still keeps
| coming. This industry never stops, and that makes it such a
| fascinating (but arguably harsh) world to be in.
|
| Looking at this feels like seeing the source code of a 64k demo,
| learning about Mode 13h and trying to replicate it in Turbo
| Pascal.
|
| And, much like the old days of graphics programming, there's a
| good chance all of this knowledge will be mostly irrelevant soon,
| as the abstraction layers tend to come quicker and quicker and
| take care of the hard foundational work underneath. Then it'll be
| many of us here discussing whether or not it was good to have
| been with it from the start, to really get it, or whether playing
| with the highly-abstracted components is all that's needed to
| succeed with it.
|
| Either way, super cool to see the pace here and I loved the "I
| only have a macbook" section.
| eismcc wrote:
| It will be funny to look back from the future and think, wow,
| how did we get anything done with only 40GB RAM
| [deleted]
| lossolo wrote:
| > reproduces GPT-2 (124M) on OpenWebText, running on a single
| 8XA100 40GB node in 38 hours of training
|
| For comparison, GPT-3 has more than 1000x more params (175B) and
| training time was around 2 months on ~1500 V100 GPUs, which cost
| millions of dollars in cloud compute. Gopher with 280B
| params was trained on 4096 TPU-v3 chips, and Microsoft Megatron-
| Turing NLG 530B was trained on 2240 NVIDIA A100 cards (each card
| costs ~15k USD). And the most mind-blowing is PaLM from Google
| with 540B params, trained on 6144 TPU v4 chips, which costs
| around 10-30M USD in cloud compute to train.
| marsven_422 wrote:
| [dead]
| justusthane wrote:
| This is a dumb question about language models in general, not
| necessarily specific to NanoGPT: why is all the focus on
| training? Can I download and run a pre-trained model locally?
| Surely the specs required to run a model are much, much lower
| than those required to train the model?
| ausbah wrote:
| Inference can still be a bottleneck, I think, since you usually
| load the whole thing into memory, which is usually 32-64GB+?
| visarga wrote:
| Language models range from 1 to 300+ GB when loaded. It
| depends on how you load them, if you load in int8 you get 4x
| reduction.
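| A minimal sketch of int8 loading via the transformers package
| plus bitsandbytes (the load_in_8bit flag; the checkpoint name is
| just an example):
|
|       from transformers import AutoModelForCausalLM, AutoTokenizer
|
|       name = "EleutherAI/gpt-j-6B"  # ~24GB in fp32, ~6GB in int8
|       tok = AutoTokenizer.from_pretrained(name)
|       model = AutoModelForCausalLM.from_pretrained(
|           name,
|           device_map="auto",   # spread layers across GPU(s)/CPU
|           load_in_8bit=True,   # requires bitsandbytes
|       )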
| code_runner wrote:
| I believe the training is where the architecture of the model
| is most apparent. You can absolutely download plenty of pre-
| trained models.
|
| You will also _probably_ need to fine tune for a specific use
| case, so a common approach is downloading a pre-trained model
| and fine tuning.
|
| I think including the "from scratch" tuning script is
| educational more than anything else.
| anon291 wrote:
| If you're only using pre-trained models, it's going to be
| harder to differentiate yourself. Training / specialization of
| models is where the moat-building is (due to access to
| different data sets / better ideas). By specializing /
| training, more of the token limit can be used for generation
| rather than prompting / better prompts can be made.
|
| The lower the cost of training, the more profitable any
| resultant business. You can even envision businesses that train
| the model regularly to bring in new knowledge. The cheaper this
| is, the more opportunities open up.
| nerdponx wrote:
| It's the equivalent of building from source versus downloading
| a compiled binary.
|
| Also you can perform "fine tuning" which means you start with a
| trained model and train it further on your own data, allowing
| you to customize the model for specific tasks.
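| A minimal sketch of what that fine-tuning looks like in code
| (GPT-2 small via transformers; "my_corpus.txt" is a placeholder
| for your own data):
|
|       import torch
|       from transformers import GPT2LMHeadModel, GPT2Tokenizer
|
|       tok = GPT2Tokenizer.from_pretrained("gpt2")
|       model = GPT2LMHeadModel.from_pretrained("gpt2").to("cuda").train()
|       opt = torch.optim.AdamW(model.parameters(), lr=3e-5)
|
|       ids = tok(open("my_corpus.txt").read(),
|                 return_tensors="pt").input_ids[0]
|       block = 1024  # GPT-2's context length
|
|       for step in range(1000):
|           i = torch.randint(0, len(ids) - block, (1,)).item()
|           x = ids[i:i + block].unsqueeze(0).to("cuda")
|           loss = model(x, labels=x).loss  # next-token prediction loss
|           loss.backward()
|           opt.step()
|           opt.zero_grad()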
| swader999 wrote:
| Would it be possible to take all my user manuals and past
| customer Q&A and train on just that to produce a customer
| helper chat bot?
| QuadrupleA wrote:
| Doesn't huggingface have dozens of freely available pretrained
| models like this (including various sized implementations of
| GPT2) and isn't the source available on most if you wanted to
| train them yourself?
|
| All I see in the comments is praise for the author as a person,
| so just wondering what's unique about this that's not available
| elsewhere? 730 upvotes and counting, assuming I'm missing
| something...
| moneywoes wrote:
| The shilling seems intense
| minimaxir wrote:
| Additionally, in terms of the streamlining nanoGPT purports to
| offer, HuggingFace's implementations play nice with optimization
| techniques such as ONNX/TensorRT, which will give you better
| performance than anything PyTorch-based, however minimal.
|
| That doesn't mean an ONNX-ed nanoGPT won't be better, but the
| field of optimized text generation isn't as new as people
| claim.
| visarga wrote:
| This is a didactic implementation. If you read the HuggingFace
| repo, it is much more abstracted because they implement many
| models in the same codebase. It's not fast or big, just easier
| to read and tweak.
| isoprophlex wrote:
| True, but the use cases aren't the same. As he did before for
| other models, he has a knack for distilling the code down to
| beautiful, self-contained examples of high didactic value.
|
| It's an order of magnitude easier to grok the basics from this
| repo than from going through (admittedly more ergonomic or
| performant or production-ready) huggingface repos.
| brossinthuon wrote:
| https://news.ycombinator.com/item?id=34336386
| rsiqueira wrote:
| I could not find any sample (prompt and results). Can anyone
| provide samples of its quality, even if it is in a narrow field
| of knowledge or specific use case? I tried GPT2, GPT-J 6B and
| GPT-NeoX 20B (implementation by Fabrice Bellard at
| textsynth.com/playground.html) but I could not find any
| production-quality scenario yet, only cherry-picked simple cases.
| boredemployee wrote:
| That's what I'm really missing in order to decide whether I
| should try it myself or not.
| visarga wrote:
| At this model size, quality is not worth discussing. It is
| clearly in another league from GPT-3.
| lossolo wrote:
| Indeed, it is like comparing the speech of a 2-year-old child
| to that of a college professor.
| awestroke wrote:
| Are there any possible technological or scientific leaps on the
| horizon that would reduce training time by an order of magnitude
| or more? GPT-3 took ~355 GPU-years to train on incredibly
| expensive hardware, which means small players have no chance to
| push the state of the art.
| imtringued wrote:
| As models get bigger, fewer and fewer neurons get activated by
| any given input. If you can somehow predict which neurons get
| activated, you can skip the vast majority of the computational
| load. I have read a paper where they argued that only 0.5% of
| the neurons are actually active in a 200-million-parameter
| model, so you can get a 200x improvement just from that.
|
| What this tells you is that there is very little money in
| optimizing deep learning and that NVIDIA has made it very easy
| to just throw more hardware at the problem.
| CuriouslyC wrote:
| This is hard a-priori, but fairly easy post-facto. Model
| distillation isn't a common practice yet, but it has already
| been demonstrated to be quite effective for specific use
| cases.
| visarga wrote:
| Distillation works but somehow we see very few papers doing
| it at this scale.
| visarga wrote:
| > argued that only 0.5% of the neurons are actually active in
| a 200 million parameter model so you can get a 200x
| improvement just from that
|
| Yes, but you don't know which 0.5% depending on the input
| text.
| londons_explore wrote:
| > very little money in optimizing deep learning
|
| Oh - there are a _lot_ of people working on optimizing AI.
| Amongst hobbyists, academia, and corporations alike.
|
| The thing is, if you come up with a neat optimization that
| saves 30% of compute for the same results, typically instead
| of reducing your compute budget 30%, you instead increase
| your model/data size 30% and get better results.
| narrator wrote:
| Jevons paradox of data and AI. The more efficiently data
| is used, the more demand there is for data.
| antognini wrote:
| Any state of the art model takes about three weeks to
| train.
| visarga wrote:
| More an indication of human patience than task
| difficulty.
| WithinReason wrote:
| Do you have a link to that paper by any chance? By "neurons"
| did they mean weights or activations?
| imtringued wrote:
| Here is a GPU implementation.
|
| https://ieeexplore.ieee.org/document/9635657
|
| It is somewhere from 8x to 25x faster than doing dense
| machine learning. The speedup was higher on the original
| CPU implementation and the GPU paper mentions that if there
| isn't enough shared memory on the GPU it will have to
| switch to an algorithm that has more overhead.
|
| By neurons I actually meant "nodes"
|
| My comment is effectively a summary of this article:
| https://www.kdnuggets.com/2020/03/deep-learning-breakthrough...
|
| Edit: There is a paper for sparse spiking gradient descent
| promising a 150x improvement. I am not sure how practical
| this is because spiking neural network hardware heavily
| limits your model size but here it is:
|
| https://arxiv.org/abs/2105.08810
| omeysalvi wrote:
| I think AI is going to go the way of the hard sciences where
| the age of tinkerers making progress by leaps and bounds in
| their basement is over and incremental progress is going to be
| the domain of universities or large private companies that can
| afford to throw money behind it. I would love to be proven
| wrong and see radical shifts in how people approach these
| problems. Seems like the cycle started and got to this point
| way too soon for AI though
| swalsh wrote:
| Tinkerers can fine tune a model though. Unfortunately most
| fine tuning seems to be outmatched at the next iteration of
| the model.
| mittermayr wrote:
| My take on this is that (good) content is one of the bigger
| problems still, particularly also who exactly the original
| training data belongs to (or where it comes from). There's a
| certain risk (we'll see with GitHub Copilot soon) that it will
| slow down for a bit until the licensing issues are all sorted
| out. This can only be solved (for now) by bringing in public
| funding/data, which universities have always been a very good
| proxy for. Which also means it (usually) should be open
| access to the public, to some extent (and useful for the
| garage folks to catch up a bit). But, once we're past that,
| it'll be all about that giant body of pre-trained data,
| securely kept within the next Facebook or Microsoft,
| amounting to literal data gold (just much higher value at a
| lot less weight).
| make3 wrote:
| small players will never have a chance to push the state of the
| art, as whatever optimization there is will also be applied at
| large scale with more money
| cypress66 wrote:
| A lot of SOTA comes from small players. It just isn't the
| case for LLMs.
| awestroke wrote:
| Good point, but perhaps a leap could take small players into
| territories of language models that are large enough to be
| useful. GPT-3 crossed that threshold
| hankman86 wrote:
| Take a leaf from SETI@home's book and try to come up with a
| distributed, volunteer-based approach to training an open
| source LLM. There is already an enormous amount of suitable
| ML hardware on end user devices.
| Der_Einzige wrote:
| Huggingface actually recently did this, but I think it's
| for inference on their giant models like BLOOM
| belter wrote:
| Model size does not necessarily correlate with quality of
| results.
|
| "Chinchilla (70B) Greatly Outperforms GPT-3 (175B) and Gopher
| (280B)" - https://towardsdatascience.com/a-new-ai-trend-
| chinchilla-70b...
| Der_Einzige wrote:
| I highly doubt this in practice on a large scale. Outside of
| the common phenomena of "most large NNs are undertrained"
| and "less better data is sometimes better than more worse
| data", there are no other obvious mechanisms to explain why a
| smaller model with the same or similar architecture would be
| better than a larger one.
|
| I claim instead that we are still hardly scratching the
| surface with how we evaluate NLP systems. Also, some fields
| have straight up trash evaluation schemes. Summarization and
| ROUGE scores are totally BS and I find the claim that they
| even correlate with high-quality summaries suspect. I say
| this with publications in that subfield, so I have
| personal experience with just how crummy many summarizers are.
| WithinReason wrote:
| _there are no other obvious mechanisms to explain why a
| smaller model with same or similar architecture would be
| better than a larger one._
|
| Overfitting?
| Der_Einzige wrote:
| The consensus seems to be that the majority of LMs are
| _undertrained_ not overfitting though.
| espadrine wrote:
| An interesting outcome of the nanoGPT repo is this struggle
| to exactly match the Chinchilla findings[0], even after
| discussing it with the authors.
|
| A larger discussion is that the scaling laws achieve loss-
| optimal compute time, but the pre-training loss only improves
| predictions on the corpus, which contains texts written by
| people that were wrong or whose prose was lacking. In a real
| system, what you want to optimize for is accuracy,
| composability, inventiveness.
|
| [0]: https://github.com/karpathy/nanoGPT/blob/master/scaling_laws...
| noidiocyallowed wrote:
| Work together and fuck up companies together. That's the way to
| go.
| [deleted]
| abricq wrote:
| Or how to apply communism to software engineering. I like
| that.
|
| More seriously, the risk that a few companies become _even
| more_ powerful thanks to their restricted access to such NNs
| is very frightening. The worst is that, without legal
| restrictions, there is nothing that we can do against it. And
| I doubt that legal restrictions will come in the next months or
| years.
| beepbooptheory wrote:
| Well at that point, some people might have the crazy crazy
| insight that no matter how big the model is, or how many
| GPUs they have, it burns up all the same.
| rjtavares wrote:
| Small players should focus on applications of this tech.
|
| We now know that whatever AI Models succeed in the future,
| they'll be trained by a huge company and finetuned to a
| specific use case. Small companies should be working on use
| cases, and then just upgrade to the latest SOTA model.
| varispeed wrote:
| > Small players should focus on applications of this tech.
|
| That sounds a bit condescending. We are probably at a point
| where the government should intervene and help establish a
| level playing field. Otherwise we are going to see a deeper
| divide between multibillion-dollar businesses conquering
| multiple markets and a sort of neo-fiefdom situation. This is
| not good.
| rjtavares wrote:
| I'm not being condescending at all, we've learned that the
| value in AI is in the applications. If you think government
| should regulate the field, it should be to make AI Models a
| commodity, like electricity.
| tiborsaas wrote:
| It's not that condescending, that's today's reality. Should
| I feel entitled to $600k of training time that may or may not
| work? Do you think the government is a good actor to judge
| whether my qualifications are good enough to grant me
| resources worth a house?
|
| It's quite reasonable for small players to make use of
| models that have already been trained.
| mschuster91 wrote:
| > Do you think the government is a good actor to judge if
| my qualifications are good enough to grant me resources
| worth a house?
|
| Governments already routinely do that for pharmaceutical
| research or for nuclear (fusion) research. In fact,
| almost _all_ major impact research and development was
| funded by the government, mostly the military. Lasers,
| microwaves, silicon, interconnected computers - all
| funded by the US taxpayer, back in the golden times when
| you'd get laughed out of the room if you dared think
| about "small government". And the sums involved were
| ridiculously larger than the worth of a house. We're
| talking of billions of dollars.
|
| Nowadays, R&D funding is way WAY more complex. Some
| things like AI or mRNA vaccines are mostly funded by
| private venture capital, some are funded by large
| philanthropic donors (e.g. Gates Foundation), some by the
| inconceivably enormous university endowments, a lot by
| in-house researchers at large corporations, and a select
| few by government grants.
|
| The result of that complexity:
|
| - professors have to spend an absurd percentage of their
| time "chasing grants" (anecdata, up to 40% [1]) instead
| of actually doing research
|
| - because grants are time-restricted, it's rare to have
| tenure track any more
|
| - because of the time restriction and low grant amounts,
| it's _very_ hard for the support staff as well. In
| Germany and Austria, for example, extremely low paid
| "chain contracts" are common - one contract after
| another, usually for a year, but sometimes as low as half
| a year. It's virtually impossible to have a social life
| if you have to up-root it for every contract because you
| have to take contracts wherever they are, and forget
| about starting a family because it's just so damn
| insecure. The only ones that can make it usually come
| from highly privileged environments: rich parents or,
| rarely, partners that can support you.
|
| Everyone in academia outside of tenured professors
| struggles with surviving, and the system ruthlessly
| grinds people to their bones. It's a _disgrace_.
|
| [1] https://www.johndcook.com/blog/2011/04/25/chasing-
| grants/
| tiborsaas wrote:
| Pharmaceutical or nuclear research doesn't really qualify
| as "small scale" as this thread started. I know there are
| massive amounts of money handed out by governments to fund
| research, but for a 3-guy startup in a garage that's
| probably hopeless. Public money is cursed anyway, it's
| better not to touch it.
|
| I've also read it at many places, that academic research
| funding is way too misaligned. It's a shame, really.
| googlryas wrote:
| Do you think you'll get a global agreement on this? Or
| would china just eat America's lunch then?
| ignoramous wrote:
| Yes, see DeepMind RETRO:
|
| > _In our experiments on the Pile, a standard language modeling
| benchmark, a 7.5 billion parameter RETRO model outperforms the
| 175 billion parameter Jurassic-1 on 10 out of 16 datasets and
| outperforms the 280B Gopher on 9 out of 16 datasets._
|
| https://www.deepmind.com/blog/improving-language-models-by-r...
|
| Though, there hasn't been much follow-up research on it (or
| DeepMind is not publishing it).
|
| Annotated paper:
| https://github.com/labmlai/annotated_deep_learning_paper_imp...
| espadrine wrote:
| The research is still ongoing, although perhaps lower-profile
| than what appears in the press.
|
| RETRO did get press, but it was not the first retrieval
| model, and in fact was not SOTA when it got published; FiD
| was, which later evolved into Atlas[0], published a few
| months ago.
|
| [0]: https://github.com/facebookresearch/atlas
| QuesnayJr wrote:
| There are a couple of cases where small changes in the model
| make training much quicker. For example, the currently leading
| Go AI, KataGo, requires much less time to train than AlphaGo
| did.
| GistNoesis wrote:
| Yes. There are plenty of forward leaps; most of them are not
| new and are just waiting to be integrated or released:
|
| Let's pave the road for SkyNet hard lift-off :
|
| -The first obvious one is use of external knowledge store, aka
| instead of having to store facts in the neural weights where
| they struggle, just store them in a database and teach your
| neural network to use it. (This is also similar to something
| like webgpt where you allow your network to search the web).
| This will allow you to have a network of 1G parameters (and
| external indexes of a few TB) that is as performant as a
| network of 100G parameters, and with better scaling
| properties too. You can probably gain at least 2 orders of
| magnitude there.
|
| -The second leap is better architecture for your neural
| networks: approximating transformers that are quadratic in
| compute by something that is linear in compute (Linformer) or
| n log n in compute (Reformer) can get you an order of
| magnitude by simply reducing your iteration time (a minimal
| sketch of the linear-attention idea follows at the end of
| this list). Similarly, architectures based on sparsity can
| give you faster computation (although some of the gains are
| reduced by the lower efficiency of sparse memory access
| patterns). Using (analog bits) diffusion to generatively
| pre-train a sentence at a time instead of token by token.
| You can probably gain between 1 and 3 orders of magnitude
| here if you write and optimize everything manually (or have
| your advanced network/compiler optimize your code for you).
|
| -The third leap is reduced domain: you don't have a single
| network that you train on everything. Training one network
| per domain allows you to have a smaller network that computes
| faster. It also allows you to focus your training on what
| matters to you: for example, if you want a mathematics
| network, its parameters are not influenced much by showing it
| football pictures. There are at least 2 orders of magnitude
| there.
|
| -The fourth one is external tool usage. It's related to the
| first one, but whereas the first is readily differentiable,
| this one necessitates some reinforcement learning (that's
| what decision transformers are used for).
|
| -Compression: compress everywhere. The bottlenecks are
| memory-bandwidth related. Work in compressed form when
| relevant. One order of magnitude.
|
| -Distributed training: because the memory bandwidth inside a
| GPU is on the order of TB/s whereas transfer to the GPU is on
| the order of 10 GB/s, there is an advantage to having the
| parameters reside on the GPU. But there is a limited quantity
| of memory on the GPU, so distributed training (something like
| petals.ml) lets you increase your aggregate memory bandwidth
| by collaborating. Each actor can probably gain an order of
| magnitude, provided that they can keep bad actors away.
|
| -Use free resources : The other day Steam had 10M users with
| GPU waiting around doing nothing, just release a dwarf fortress
| mod with prettier pictures and use the compute for more
| important tasks.
|
| -Remove any humans in the loop: it's faster to iterate when
| you don't have to rely on any human, either for dataset
| construction or model building.
|
| :)
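|
| To make the second leap concrete, here is a minimal sketch of
| the kernel-feature-map trick behind linear attention (this
| assumes an elu(x)+1 feature map as in the "Transformers are
| RNNs" line of work, non-causal for simplicity; tensors are
| shaped (batch, heads, time, dim)):
|
|       import torch
|       import torch.nn.functional as F
|
|       def linear_attention(q, k, v, eps=1e-6):
|           # positive feature map phi(x) = elu(x) + 1
|           q, k = F.elu(q) + 1, F.elu(k) + 1
|           # (dim, dim) summary instead of a (time, time) map
|           kv = torch.einsum('bhtd,bhte->bhde', k, v)
|           z = torch.einsum('bhtd,bhd->bht', q, k.sum(2))
|           out = torch.einsum('bhtd,bhde->bhte', q, kv)
|           return out / (z.unsqueeze(-1) + eps)
|
| Cost is linear in sequence length because the (time, time)
| attention matrix is never materialized.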
| seydor wrote:
| It should be no issue if it became massively parallelized a
| la SETI. I wonder when Wikimedia or the Apache foundation
| will jump into AI.
| yreg wrote:
| Wikimedia and other organizations that deal with moderation
| might want to keep this technology out of the hands of the
| general public for as long as possible.
| isthisthingon99 wrote:
| How long does it take to train a human? It's useless for two
| years then maybe it can tell you it needs to poop.
|
| The breakthrough will be developing this equivalent in an
| accessible manner and us taking care to train the thing for a
| couple of decades but then it becomes our friend.
| licebmi__at__ wrote:
| Yes, but to be fair, the system that does the training really
| sucks and doesn't scale.
| cactusplant7374 wrote:
| Neither does OpenAI. It costs so much and still delivers so
| little. A human can generate breakthroughs in science and
| tech that can be used to reduce carbon emissions. ChatGPT
| can do no such thing.
| VBprogrammer wrote:
| What percentage of humans make meaningful contributions
| to advancing science or technology? The overwhelming
| majority of us are just worker bees servicing the needs
| of the human population.
| cactusplant7374 wrote:
| I agree with you on this point. It's also arguable that
| less people with a better education system could yield
| the same result with less environmental impact.
|
| But my point, poorly explained, is that whatever ChatGPT
| is, it isn't original or creative thought as a human
| would do it.
|
| Chomsky's example (which is based off Turing): Do
| submarines swim? Yes, they swim -- if that's what you
| mean by swimming.
| jsjohnst wrote:
| > What percentage of humans make meaningful contributions
| to advancing science or technology?
|
| I'm a nobody that you've never heard of and I've arguably
| made meaningful contributions. If that's true, don't you
| think there could be way more people out there than you
| or sibling commenter imply?
| tlb wrote:
| You can't know that. Currently, 8 billion humans generate
| a few scientific breakthroughs per year. You'd have to
| run several billion ChatGPTs for a year with zero
| breakthroughs to have any confidence in such a claim.
| mbrock wrote:
| With billions of GPT output streams, how do you actually
| discover and rank what's significant? Screen them through
| some even more powerful models? I imagine it's like a
| volcano eruption of text where some are absolutely
| brilliant and most is worthless and finding the jewels is
| even more demanding than generating it all.
| tlb wrote:
| Some theories are easily testable. For instance, ask it
| to write some code to efficiently solve traveling
| salesman problems, and then test the code on some sample
| problems. You can score the quality of solutions and time
| taken, and manually inspect the best ones.
| cactusplant7374 wrote:
| At this point there is no framework that suggests GPT
| understands the underlying data. It can't assign meaning
| as a human would. It can't consume hundreds of math
| textbooks and learn the principles of math and then apply
| them more broadly to science textbooks and research
| papers. It can't even reliably add two numbers.
|
| Yes, brute forcing with hard AI can produce many
| thoughts. But the AI wouldn't know they are correct. It
| couldn't explain why. Any discovery would only be
| attributable to randomness. It wouldn't be learning from
| itself and its priors.
| naasking wrote:
| > At this point there is no framework that suggests GPT
| understands the underlying data. It can't assign meaning
| as a human would.
|
| Actually there are many indications that GPT understands
| the data, because its output mostly makes sense. The
| reason it can't assign meaning the way a human would is
| because a human can correlate words with _other sensory
| data_ that GPT doesn't have access to. That's where GPT
| creates nonsense.
|
| Think carefully about what "understanding" means in a
| mechanistic sense. It's a form of compression, and a few
| billion parameters encoding the contents of a large part
| of the internet seems like pretty good compression to me.
| ivanbakel wrote:
| GPT doesn't display understanding of purely abstract
| systems, so I doubt it's an issue of lacking sensory
| information. It can't consistently do arithmetic, for
| example - and I think it would be presumptuous to insist
| that sensory information is a prerequisite for
| mathematics, even though that's how humans arrived at it.
| naasking wrote:
| It's not yet clear why it struggles with arithmetic. It
| could be data-related, could be model-related, although
| scaling both seems to improve the situation.
|
| In any case, GPT could still understand non-abstract
| things just fine. People with low IQ also struggle with
| abstract reasoning, and IQ tests place GPT-3 at around
| 83.
| isthisthingon99 wrote:
| I still think that this will be a major form of AI that is
| accessible to the public at large and it will enable
| productivity improvements at all levels.
|
| I'm not joking, this is really something I think
| will/should happen.
| xur17 wrote:
| Alternatively, are there ways to train on consumer graphics
| cards, similar to SETI@Home or Folding@Home? I would personally
| be happy to donate gpu time, as I imagine many others would as
| well.
| mryab wrote:
| There absolutely are! Check out hivemind
| (https://github.com/learning-at-home/hivemind), a general
| library for deep learning over the Internet, or Petals
| (https://petals.ml/), a system that leverages Hivemind and
| allows you to run BLOOM-176B (or other large language models)
| that is distributed over many volunteer PCs. You can join it
| and host some layers of the model by running literally one
| command on a Linux machine with Docker and a recent enough
| GPU.
|
| Disclaimer: I work on these projects, both are based on our
| research over the past three years
| alfor wrote:
| The cost of moving data from one gpu to the next will destroy
| performance.
|
| The systems are moving in the opposite direction (look at the
| Dojo architecture or Tenstorrent).
|
| The silver lining is that the cost of training will fall
| substantially with those architectures that are not based on
| reusing GPUs.
| breck wrote:
| > Are there any possible technological or scientific leaps on
| the horizon
|
| Yes. From 2017: "Prediction 4: The simplest 2D text encodings
| for neural networks will be TLs. High level TLs will be found
| to translate machine written programs into understandable
| trees."
|
| We have something coming out that is an OOM better than
| anything else out there right now.
| spi wrote:
| What do you mean by "small players have no chance"? OpenAI was
| founded in 2015, it used to be a "small player" which just got
| things right and grew with it - we're not talking of Google or
| Facebook investing a chunk of their billions cash. In Germany,
| AlephAlpha has built their own supercomputer and are training
| similarly sized models. It's expensive for sure, but well
| within the possibilities of startups. In France researchers
| trained the
| similarly sized BLOOM model
| https://huggingface.co/bigscience/bloom. They claim it cost
| between $2 and $4 million.
|
| Sure, a single researcher can't replicate this at their
| university, but even though OpenAI likes to publish it this
| way, we're not really talking about research here. Research was
| inventing the transformer architecture, this is just making it
| bigger by (very smart) engineering choices. It's something
| companies should do (and are doing), not researchers.
| SilverBirch wrote:
| OpenAI was founded in 2015 by a group of billionaires who
| pledged $1Bn of funding. That is hardly a small scrappy start
| up.
| awestroke wrote:
| Microsoft (using Azure DCs) built a supercomputer with 10,000
| V100 GPUs exclusively for OpenAI. [0]
|
| It is estimated that it cost around $5M in compute time to
| train GPT-3.
|
| OpenAI has received billions in investment prior to launching
| GPT-3, including $1B from Microsoft in 2019.
|
| [0]: https://blogs.microsoft.com/ai/openai-azure-
| supercomputer/
| nileshtrivedi wrote:
| > we're not talking of Google or Facebook investing a chunk
| of their billions cash
|
| OpenAI had raised $1B from Microsoft in 2019 and used it to
| train a 175B param model. Now, they have raised $10B and are
| training GPT-4 with 1.5T params. GPUs are capital intensive
| and as long as there are returns to bigger models, that's
| exactly where things will go.
| andy_ppp wrote:
| Will 1.5T parameters be possible to run in the public way
| GPT-3 is? I can't wait to see what happens with this much
| learning!
| awestroke wrote:
| I can't find any source on the 1.5T params number. I'd love
| to read more if you have any links to share. Thanks
| wut42 wrote:
| afaik, gpt-4 is mostly rumours so far, same thing for the
| 1.5T number. gpt-4 is surely coming.
| wnkrshm wrote:
| Maybe it will be called GPT-XP by then, with Microsoft
| owning half of it.
| belter wrote:
| Looking forward to seeing GPT-4 recommend Linux and
| LibreOffice instead of Windows/Office as the logical choice
| of a 250-IQ ML model...
| ben_w wrote:
| In my imagination, OpenAI does what Bungie did when MS
| bought them, and open-sources what used to be their crown
| jewels.
|
| That said, GPT-AlephOne only makes sense if there's a
| preceding GPT-[?].
| egorfine wrote:
| They have got to release GPT-3.11 For Workgroups first.
| awestroke wrote:
| Or GPT-365
| MikeDelta wrote:
| Then they can bring back the talking paperclip, but this
| time actually useful.
| generalizations wrote:
| It _could_ actually work. It would be an incredibly gutsy
| move and I love it, and they'd probably earn a lot of
| respect. They'd get so much press for it. And if it held
| up, it'd probably be one of the things that MS is
| remembered for.
| int_19h wrote:
| Why not ask GPT itself what it wants to be called?
| wut42 wrote:
| Or GPT One.
| taneq wrote:
| GPT-10 will be evergreen and 'the last version of GPT'.
|
| And then three years later GPT-11 will be required to run
| the latest games.
| orbifold wrote:
| I am actually still unclear how AlephAlpha pulled that off
| and who funds them, since they have a rather low profile
| team.
| hdjjhhvvhga wrote:
| > we're not talking of Google or Facebook investing a chunk
| of their billions cash.
|
| On the contrary, in this thread we are mainly talking
| about that.
| davidy123 wrote:
| Could this be distributed? Put all those mining GPUs to work. A
| lot of people like participating in public projects like this.
| I would!
| PartiallyTyped wrote:
| In theory, yes. "Hogwild!" is an approach to distributed
| training: in essence, each worker is given a bunch of data,
| computes the gradient, and sends it to a central authority.
| The authority accumulates the gradients and periodically
| pushes new weights.
|
| There is also Federated Learning which seemed to start taking
| off, but then interest rapidly declined.
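|
| A toy, single-process sketch of that accumulate-and-push loop
| (plain PyTorch, with the workers simulated sequentially; a
| real Hogwild!/parameter-server setup adds asynchrony,
| sharding and fault tolerance):
|
|       import torch
|       import torch.nn as nn
|
|       model = nn.Linear(10, 1)  # "central" weights
|       opt = torch.optim.SGD(model.parameters(), lr=0.1)
|
|       def worker_grads(weights, xb, yb):
|           # a worker copies the weights, returns its gradient
|           local = nn.Linear(10, 1)
|           local.load_state_dict(weights)
|           ((local(xb) - yb) ** 2).mean().backward()
|           return [p.grad.clone() for p in local.parameters()]
|
|       for step in range(100):
|           grads = [worker_grads(model.state_dict(),
|                                 torch.randn(32, 10),
|                                 torch.randn(32, 1))
|                    for _ in range(4)]  # 4 simulated workers
|           opt.zero_grad()
|           for p, *gs in zip(model.parameters(), *grads):
|               p.grad = torch.stack(gs).mean(0)  # accumulate
|           opt.step()                            # push weights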
| naraga wrote:
| Exactly. This is inevitable imho. There is no way people will
| be ok with depending on a few walled-garden models.
| dmit wrote:
| >> GPT-3 took 355 years to train
|
| > Could this be distributed? Put all those mining GPUs to
| work.
|
| Nope. It's a strictly O(n) process. If it weren't for the
| foresight of George Patrick Turnbull in 1668, we would not be
| anywhere close to these amazing results today.
| CyberDildonics wrote:
| Why would an O(n) algorithm not be able to be distributed?
| davidy123 wrote:
| I couldn't find any references to George Patrick Turnbull.
| Is that an ancestor of yours? If so, the comment seems
| rather subjective.
| taneq wrote:
| They're being facetious about the '355 years to train'
| thing. ;)
| davidy123 wrote:
| OK haha good one then. Mine was a bit too subtle.
| pprotas wrote:
| What does "355 years" mean in this context? I assume it's not
| human years
| mellosouls wrote:
| Claimed here, so this is presumably the reference (355 GPU
| Years):
|
| https://lambdalabs.com/blog/demystifying-gpt-3
|
| "We are waiting for OpenAI to reveal more details about the
| training infrastructure and model implementation. But to put
| things into perspective, GPT-3 175B model required 3.14E23
| FLOPS of computing for training. Even at theoretical 28
| TFLOPS for V100 and lowest 3 year reserved cloud pricing we
| could find, this will take 355 GPU-years and cost $4.6M for a
| single training run. Similarly, a single RTX 8000, assuming
| 15 TFLOPS, would take 665 years to run."
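|
| The arithmetic behind those figures, for anyone who wants to
| plug in different hardware (this assumes perfect utilization,
| which real training never reaches):
|
|       total_flops = 3.14e23           # GPT-3 175B training
|       secs_per_year = 365 * 24 * 3600
|
|       def gpu_years(tflops):
|           return total_flops / (tflops * 1e12 * secs_per_year)
|
|       print(gpu_years(28))  # V100 @ 28 TFLOPS     -> ~355
|       print(gpu_years(15))  # RTX 8000 @ 15 TFLOPS -> ~665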
| dx034 wrote:
| That's still including margins of cloud vendors. OpenAI had
| Microsoft providing resources which could do that at much
| lower cost. It still won't be cheap but you'll be way below
| $5m if you buy hardware yourself, given that you're able to
| utilize it long enough. Especially if you set it up in a
| region with low electricity prices, latency doesn't matter
| anyway.
| Manfred wrote:
| Cumulative hours spent across training hardware.
| captainmuon wrote:
| I wonder about this, too. OpenAI's biggest 'moat' is that
| their model takes so many resources to train, not that their
| algorithms are particularly secret.
|
| One idea I had was to not use one single model to learn all
| steps of the task, but to break it up. The human brain has
| dedicated grammar processing parts. It is unclear whether
| something like a universal grammar exists, but we have at least
| an innate sense for rhythm. Applied to NLP, you could heavily
| preprocess the input. Tokenize it, annotate parts of speech.
| Maybe add pronunciation, so the model doesn't have to think
| about weird english spelling rules, and so you can deal with
| audio more easily later. So I would build all these little
| expert-knowledge black boxes and offer them as input to my
| network.
|
| But there is also some inherent resource cost in large language
| models. If you want to store and process the knowledge of the
| world, it is going to be expensive no matter what. Maybe we
| could split the problem into two parts: Understanding language,
| and world knowledge (with some messy middle ground). I believe
| you could replace the world knowledge with a huge graph
| database or triple store. Not just subject-verb-object, but
| with attribution and certainty numbers for every fact. The idea
| would be to query the database at inference time. I don't know
| how to use this in conjunction with a transformer network like
| GPT-3, so you'd likely need a very different architecture.
|
| The big benefit of this would be that it is feasible to train
| the language part without the world knowledge part with much
| less resources. But you have other benefits, too. ChatGPT is
| trained to "win the language game". But as they say, winning
| the argument does not make you right. If you have a clean fact
| database, you can have it weigh statements from trustworthy
| sources higher. You then basically have a nice natural language
| frontend to a logical reasoning system that can respond with
| facts (or better: conclusions).
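|
| For what it's worth, a crude sketch of the "query the fact
| store at inference time" half of that idea (pure Python, with
| a toy triple store and a stand-in generate() callable for the
| language-model frontend; a real system would need retrieval
| wired much more tightly into the decoder):
|
|       # (subject, predicate, object, source, certainty)
|       FACTS = [
|           ("water", "boils_at", "100 C", "textbook", 0.99),
|           ("water", "freezes_at", "0 C", "textbook", 0.99),
|       ]
|
|       def lookup(subject):
|           # most trustworthy facts about a subject first
|           hits = [f for f in FACTS if f[0] == subject]
|           return sorted(hits, key=lambda f: -f[4])
|
|       def answer(question, subject, generate):
|           rows = lookup(subject)
|           facts = "; ".join(f"{s} {p} {o}"
|                             for s, p, o, _, _ in rows)
|           prompt = (f"Facts: {facts}\n"
|                     f"Question: {question}\nAnswer:")
|           return generate(prompt)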
| joaogui1 wrote:
| Their biggest moat is _high-quality_ data: both their
| proprietary datasets (WebText, WebText2, etc.) and now their
| human-annotated data. A secondary moat is their expertise in
| training models with PPO (their RL method); they can get
| results that are quite a bit better than other labs'. I say
| this moat is secondary because it's possible that you can get
| similar results with other RL algorithms (e.g. DeepMind using
| MPO), and because maybe you don't really need RL from human
| feedback at all; just fine-tuning on instructions may be
| enough.
| Metus wrote:
| I find OpenAI having exclusive access to that kind of high-
| quality data more concerning than them having access to
| their current amount of compute and currently trained
| model. A couple of million dollars' worth of compute is
| within the reach of any medium-sized research university,
| larger company, or any country worth mentioning. And seeing
| as Moore's law still applies to GPUs, the cost will only
| fall.
|
| However _high-quality data_ is scarce. I would be willing
| to fund a proper effort to create high-quality data.
| visarga wrote:
| Check out DeepMind RETRO, it's one year old already, but
| exactly what you say:
|
| https://www.deepmind.com/publications/improving-language-
| mod...
| lossolo wrote:
| It's not just about compute; if that were the case, then
| models like BLOOM and OPT, which also have 175 billion
| parameters, would have the same performance for real-world
| use cases as GPT-3, but they don't. Datasets are also very
| important.
| ccozan wrote:
| GPT and the human brain (at least the language/speech part)
| have nothing in common. We, as humans, do not use language in
| a generative way; it is derived from a higher or very low
| level of abstraction (intentions, emotions, etc.) and is
| explicitly used for communicating something. Even this text
| is based on previous knowledge, saved in an abstract way, and
| while writing it I must follow the syntax of the language and
| write in the right order, otherwise you, the person reading
| this, will not understand what I mean. While GPT can generate
| the same text, it has no motivation and no need to
| communicate (whereas I just wanted to feel good by bringing
| some contribution to HN).
|
| So yes, very different architecture.
| naasking wrote:
| These are conceptual "differences" that don't actually
| explain the mechanics of what's going on. For all you know
| "motivation", "intentions", etc. are also just GPT-like
| subsystems, in which case the underlying mechanics are not
| as different as you imply.
| mensetmanusman wrote:
| If those were GPT-like subsystems, humans would be emitting
| MWs of power instead of the ~100W we do now.
|
| Whatever humans have it is many orders of magnitude
| better...
| ben_w wrote:
| That's the hardware it runs on, not the software
| architecture of GPT. I could equally say that transistors
| are faster than synapses by the same ratio that marathon
| runners are faster than continental drift.
| naasking wrote:
| Or biology evolved a better way to do the same or similar
| enough computation that we simply haven't yet discovered.
| ImHereToVote wrote:
| Emotion is just "spiritual" word for a utility function.
| Or terminal goal to be more precise.
| throwuwu wrote:
| It seems to me that a lot of everyday communication is
| rather statistical in nature. We don't necessarily think
| deeply about each word choice but instead fall back on well
| worn patterns and habits. We can be more deliberate about
| how we compose our sentences but most situations don't call
| for it. It makes me wonder if we don't all have a
| generative language model embedded in our brains that
| serves up the most likely next set of words based on our
| current internal state.
| thomastjeffery wrote:
| On top of it not having "motivation" to communicate, it has
| _literally nothing_ to be communicated in the first place.
|
| That's the key difference. We use language to express
| conceptualizations. We have some kind of abstract model
| somewhere that we are translating.
|
| Maybe it isn't a cohesive model either. All I can say for
| certain is that - whatever it is - we are expressing it.
|
| GPT does not express. It parrots. There is no
| conceptualization.
| captainmuon wrote:
| The more experience I get, the more I wonder if this is
| really the case for us. We certainly have some kind of
| abstract model in our heads when thinking deeply about a
| problem. But in many settings - in a work meeting, or
| socially with friends - I think it is a much more
| automatic process. The satisfaction you get when saying
| the right thing, the dread when you say something stupid:
| It is just like playing a game. Maybe the old
| philosophical concept of society as merely "language
| games" is correct after all. A bit silly but I find the
| thought makes annoying meetings a bit more bearable.
|
| But you are of course right with GPT, it has no inner
| life and only parrots. It completely lacks something like
| an inner state, an existence outside of the brief moment
| it is invoked, or anything like reflection. Reminds me of
| the novel "Blindsight" (which I actually haven't read
| yet, but heard good things about!) where there are beings
| that are intelligent, but not conscious.
| thomastjeffery wrote:
| Intelligent but not conscious would still be a few steps
| ahead of GPT.
|
| We can take a concept and refactor it symbolically. GPT
| can't do that. All it does is find symbols that are
| semantically close to other symbols.
| ben_w wrote:
| > and while writing this I must follow the synthax of the
| language or writing the right order otherwise
|
| A good example that is not, word randomised order and
| kombination with Mrs Spelling and fonetic spel-ing prevent
| ye knot that which I wrote you to komprehend.
|
| (My apologies to non-native speakers of English; if someone
| did that to me in German I'd have no clue what was meant).
|
| A better point is that GPT-3's training set is more tokens
| than the number of times an average human synapse fires in
| a lifetime, squeezed into a network with about 3 orders of
| magnitude fewer parameters than the human brain has
| synapses.
|
| It's wrong to model AI as anything like natural
| intelligence, but if someone insists, my go-to comparison
| (with an equivalent for image generators) is this: "Imagine
| someone made a rat immortal, then made it browse the web
| for 50,000 years. It's still a rat, despite being very
| well-trained."
| visarga wrote:
| > GPT and human brain have nothing in common
|
| Here we go again. They must have something in common,
| because for about 90% of the tasks the language model
| agrees with humans, even on novel tasks.
|
| > We, as humans, do not use language in a generative way
|
| Oh, do you want to say we are only doing classification
| from a short list of classes and don't generate open ended
| language? Weird, I speak novel word combinations all the
| time.
| ccozan wrote:
| No, what is meant is that the next word I speak/write after
| the current word is not based on a statistical model, but on
| a world model which includes a language structure based on a
| defined syntax and cultural variety. I actually mean what I
| say, while ChatGPT just parrots around weights and produces
| an output based purely on statistics. There is zero modeling
| that translates into the real world (what we normally call
| "understanding" and "experience").
|
| As was said, a different architecture.
| naraga wrote:
| I think something of Seti@Home kind will come.
| karpathy wrote:
| Wow, fun to find this trending on HN this morning! I am currently
| also working on the associated video lecture (as the next episode
| of my video lecture series here https://karpathy.ai/zero-to-
| hero.html ), where I will build nanoGPT from scratch and aspire
| to spell everything out, as with the earlier videos. Hoping to
| get it out in ~2 weeks or so.
| TheAlchemist wrote:
| Thank you for your amazing work. Between cs231n and your recent
| videos, I've learned a ton - and you have a gift to explain
| things in such an easy and straightforward way, that I'm always
| feeling like an idiot (in a positive way) for not having
| grasped the concept before.
| katsucurry wrote:
| I've found all of your code and lessons on youtube so
| incredibly useful. You're a wonderful teacher and I really
| appreciate all the work you've done with this!
| imranq wrote:
| Just wanted to say thank you for all the incredible work and
| resources you publish. I've lost track of all the different
| skills I've learned from you, from computer vision, RNNs,
| minGPT, even speedcubing :D
| StefanWestfal wrote:
| Openly accessible lectures and knowledge like yours have
| allowed many people, me included, to turn their lives around
| by putting in the effort and developing themselves. Thank
| you.
| subbu wrote:
| Your youtube playlist combined with NanoGPT and your Lex
| Fridman podcast is like having a university level degree with a
| free internship guidance. Thank you!
| goldenshale wrote:
| Bad ass! A great addition would be some content on tuning pre-
| trained language models for particular purposes. It would be
| great to have examples of things like tuning a GPT model
| trained on language and code to take in a context and spit out
| code in my custom API, or using my internal terminology. Not
| sure if this is RL based fine tuning or just a bunch of
| language to code examples in a fine tuning dataset? In essence,
| how can we start using language to control our software?
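|
| For the non-RL variant, something like this minimal sketch is
| roughly what I have in mind (it assumes the HuggingFace GPT-2
| checkpoint and a couple of made-up prompt/code pairs; the
| nanoGPT finetuning path may look different):
|
|       import torch
|       from transformers import (AutoModelForCausalLM,
|                                 AutoTokenizer)
|
|       tok = AutoTokenizer.from_pretrained("gpt2")
|       model = AutoModelForCausalLM.from_pretrained("gpt2")
|       opt = torch.optim.AdamW(model.parameters(), lr=3e-5)
|
|       pairs = [  # toy "language -> my API" examples
|           ("list all users", "api.users.list()"),
|           ("delete user 42", "api.users.delete(id=42)"),
|       ]
|
|       model.train()
|       for epoch in range(3):
|           for prompt, code in pairs:
|               text = (f"### Instruction: {prompt}\n"
|                       f"### Code: {code}")
|               ids = tok(text, return_tensors="pt").input_ids
|               loss = model(ids, labels=ids).loss  # LM loss
|               loss.backward()
|               opt.step()
|               opt.zero_grad()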
| karpathy wrote:
| Ty agree, most people practically speaking will be interested
| in finetuning rather than from-scratch pretraining. I
| currently have some language about it in readme but I agree
| this should get more focus, docs, examples, etc.
| marviel wrote:
| Your tutorials are effective and concise. Thank you for them!
| Accessible, from-scratch knowledge on these topics is essential
| at this time in history and you're really making a dent in that
| problem.
| eternalban wrote:
| Thank you for sharing your knowledge. Anything that can be done
| to democratize machine learning is an invaluable social
| service. Hats off to you.
| dsabanin wrote:
| Thank you for your great work!
| highfrequency wrote:
| Appreciate the work to make GPT training accessible!
|
| Do you leave hyperparams (like learning rate, batch size) the
| same when switching from 8xA100 to fewer GPUs, or do these need
| to be adjusted?
|
| Separately, when going from 8xA100 GPU to a single A100 GPU, in
| the worst case we can expect the same model performance after
| training 8x as long correct? (And likely a bit better because
| we get more gradient updates in with smaller batch size)
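|
| My working assumption (just the "linear scaling rule"
| heuristic from the large-minibatch SGD literature, nothing
| nanoGPT-specific; the baseline numbers below are
| hypothetical):
|
|       # keep the effective batch via gradient accumulation,
|       # or scale the learning rate with the batch size
|       base_gpus, base_lr = 8, 6e-4   # hypothetical baseline
|       gpus = 1
|       grad_accum = base_gpus // gpus  # 8 micro-steps/update
|       lr = base_lr * gpus / base_gpus  # ...or shrink the LR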
| de_nied wrote:
| Thank you for your constant contributions.
| moralestapia wrote:
| While doing my PhD some years ago (it wasn't a PhD on AI, but
| very much related) I trained several models with the usual
| stack back then (pytorch and some others in TF). I realized
| that a lot of this stack could be rewritten in much simpler
| terms without sacrificing much fidelity and/or performance in
| the end.
|
| Submissions like yours and other projects like this one
| (recently featured here as well) ->
| https://github.com/ggerganov/whisper.cpp, makes it pretty clear
| to me that this intuition is correct.
|
| There's a couple tools I created back then that could push
| things further towards this direction, unfortunately they're
| not mature enough to warrant a release but the ideas they
| portray are worth taking a look at (IMHO) and I'll be happy to
| share them. If there's interest on your side (or anyone reading
| this thread) I'd love to talk more about it.
| gtoubassi wrote:
| +1. I've benefited greatly from your content, e.g. your CNN
| lecture was incredibly accessible [0]. I still find
| transformers stubbornly elude my intuitions despite reading
| many descriptions. I would very much appreciate your video
| lecture on this topic.
|
| [0] I think https://www.youtube.com/watch?v=LxfUGhug-iQ
| misza222 wrote:
| Thanks for your work Andrej! I've been doing earlier lectures
| and this is absolutely fantastic educational content!
| cs702 wrote:
| Andrej: thank you!
|
| --
|
| To the mod (dang): IMHO Andrej's comment should probably be at
| the top of the page, not my comment. UPDATE: Looks like that's
| done. Thank you :-)
___________________________________________________________________
(page generated 2023-01-11 23:00 UTC)