[HN Gopher] NanoGPT
       ___________________________________________________________________
        
       NanoGPT
        
       Author : trekhleb
       Score  : 1270 points
       Date   : 2023-01-11 08:34 UTC (14 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | cpdomina wrote:
       | To train small gpt-like models, there's also aitextgen:
       | https://github.com/minimaxir/aitextgen
        
         | minimaxir wrote:
         | As the creator of aitextgen, I'm mixed on continuing support
         | since there doesn't seem to be as much demand as expected for
         | _small_ GPT models given the success and cost-effectiveness of
         | GPT-3 /ChatGPT, unfortunately.
         | 
         | I still have a few ideas there (including another secret
         | approach at better text generation) but it's hard to determine
         | ROI.
        
           | mboof wrote:
            | I think what you have created still has great demand. It
            | gives devs who do not have the budget or need for the
            | gigantic models something to train and use for their own
            | specific language tasks.
            | 
            | Not everyone is trying to replicate ChatGPT results for
            | certain tasks.
        
       | jamesfisher wrote:
       | For casual readers like me: are there examples of what this can
       | do once trained? E.g. it mentions training on Shakespeare, but
       | gives no examples of fake Shakespearean.
        
         | naasking wrote:
         | The repo seems to imply that it matches GPT-2, so I imagine any
         | analyses of GPT-2 will give you a good idea.
        
           | kwerk wrote:
           | I'm not easily finding GPT-2 use cases. Any query guidance?
        
             | visarga wrote:
             | The GPT family of models shines above 100B parameters.
             | Almost nobody uses GPT2 today. It's too weak.
             | 
             | If you want to go with <1B model, you use a BERT which is
             | bidirectional or a T5 that is easier to fine-tune on other
             | tasks.
        
             | fredoliveira wrote:
             | Something that immediately comes to mind is text
             | summarization. You'll by now be used to better results from
             | GPT-3 or recent models, though.
        
           | programmarchy wrote:
           | Does anyone know the main differences between GPT-2 and
           | GPT-3? Are there significant architectural changes, or is the
           | advancement primarily from training?
        
             | naasking wrote:
             | If you google "GPT-2 vs GPT-3" you'll find lots of
             | overviews and comparisons, like:
             | 
             | * https://www.kdnuggets.com/2021/02/gpt2-gpt3-openai-
             | showdown....
             | 
             | * https://bakztfuture.substack.com/p/the-chasm-between-
             | gpt-2-a...
        
               | programmarchy wrote:
               | Thanks. Sounds like they 10x'ed the number of parameters,
               | which made some "magic leap" that isn't yet well
               | understood, and fed it more data to train it on more
               | specialized domains.
        
               | naasking wrote:
               | Yes, although Chinchilla seems to imply that training
               | data size matters a lot more than parameter count, and
               | nanoGPT author is trying to reproduce that here:
               | 
               | https://github.com/karpathy/nanoGPT/blob/master/scaling_l
               | aws...
        
               | karpathy wrote:
               | I was also a bit surprised that the Chinchilla numbers
               | and tables don't reproduce and that there are calculation
               | bugs in the paper (e.g. the FLOPs calculation in the
               | paper is wrong), especially because the paper has been so
               | impactful in the field. Maybe people are focusing on the
               | broad themes of the paper (e.g. scale model and data
               | approx. in tandem) and just roughly interpolating the
               | main Figure, without sweating the details. The
               | corresponding authors responded very kindly at first and
               | I was able to bring the results closer but now they went
               | dark. Still hoping to make things match, if others in LLM
               | space can spot any issues in my own reproduction please
               | let me know.
        
               | programmarchy wrote:
               | Oh, that's really interesting, and makes sense
               | intuitively. From the abstract:
               | 
               | > We find that current large language models are
               | significantly under-trained, a consequence of the recent
               | focus on scaling language models whilst keeping the
               | amount of training data constant ... the model size and
               | the number of training tokens should be scaled equally:
               | for every doubling of model size the number of training
               | tokens should also be doubled.
               | 
               | Assuming the GPT-3 authors know this, one could surmise
               | they 10x'ed the number of training tokens also.
               | 
               | Edit: Should have kept reading. Sounds like GPT-3 was
               | found to be undertrained.
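                | 
                | To get a feel for the arithmetic, here is a sketch
                | using the Chinchilla heuristics (~20 training tokens
                | per parameter, compute ~ 6*N*D FLOPs); these are
                | approximations, not figures from this thread:
                | 
                |     # Back-of-envelope Chinchilla-style estimate.
                |     def chinchilla(n_params):
                |         tokens = 20 * n_params   # ~20 tokens/param
                |         flops = 6 * n_params * tokens  # C ~= 6*N*D
                |         return tokens, flops
                | 
                |     tokens, flops = chinchilla(175e9)  # GPT-3 size
                |     print(f"{tokens:.2e} tokens, {flops:.2e} FLOPs")
                |     # ~3.5e12 tokens vs the ~3e11 GPT-3 actually saw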
        
       | aravindgp wrote:
        | Thank you Andrej Karpathy for the work on AI and GPT models. It
        | really helped me solve a problem as an entrepreneur. I started
        | making my first few grand from AI.
        
         | srge wrote:
          | May I ask how? Consulting?
        
       | imranq wrote:
       | I would love to see a minInstructGPT or a minRetro, or maybe
       | something that combines instruction and retrieval into a readable
       | codebase!
        
       | sharemywin wrote:
       | To me this is the important quote:
       | 
       | Unlike OpenWebText this will run in seconds. Finetuning takes
       | very little time, e.g. on a single GPU just a few minutes. Run an
       | example finetuning like:
        
       | bilsbie wrote:
       | Really cool. Can anyone answer these questions:
       | 
       | Should I use this or minGPT?
       | 
        | It says it needs an 8XA100 40GB node. What is that and where do
        | I acquire it?
       | 
       | Could someone else train this and then send me the model? What
       | would be required to run it as opposed to training it?
        
         | vanpelt wrote:
          | A100s are Nvidia GPUs. You can rent them from providers like
          | AWS or LambdaLabs. The readme has instructions for downloading
          | the original GPT2 weights from OpenAI. You can also train a
          | very simple version on a smaller dataset from your laptop as
          | described in the README.
          | 
          | If you just want to play with a similar but much better
          | model, go to https://chat.openai.com
        
       | nprateem wrote:
       | If I trained this on a 30,000 word document could it give me a
       | summary? Or would there be no need to train it in that case, and
       | I could just tell it "Summarise this: <insert 30,000 word
       | document>"?
        
         | londons_explore wrote:
         | The context window (block size) of this model is 1024 symbols.
         | Symbols approximately map to words. So you can't ask it to
         | summarize anything over 1024 words.
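          | 
          | If you want to check whether a given text fits, you can count
          | tokens with the GPT-2 BPE encoding, e.g. via tiktoken;
          | "document.txt" is just a placeholder:
          | 
          |     import tiktoken
          |     enc = tiktoken.get_encoding("gpt2")
          |     text = open("document.txt").read()
          |     print(len(enc.encode(text)), "tokens")
          |     # must fit within the 1024-token block size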
        
           | nprateem wrote:
           | Yeah that's the issue I was thinking of, how to get it to
           | summarise large documents. Has anyone any ideas?
        
             | londons_explore wrote:
             | People have had some success with the following process:
             | 
              | Divide your 30,000 word document into a hundred 300-word
              | chunks. For each chunk, give as input:
              | 
              |     Please summarize the following text into 50 words:
              |     [chunk]
             | 
             | Join all the outputs together, and you now have a shorter
             | document. Repeat the process recursively.
             | 
              | You can improve the results by doing the process again,
              | but this time giving some context:
              | 
              |     Please summarize the following text, an extract of
              |     a document about [1st attempt at a summary], into
              |     50 words:
              |     [chunk]
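              | 
              | In code, the whole loop is roughly the sketch below;
              | summarize(prompt) is a placeholder for whatever model or
              | API call you actually use:
              | 
              |     # Recursive chunked summarization (sketch).
              |     def shrink(words, chunk=300, target=50):
              |         parts = []
              |         for i in range(0, len(words), chunk):
              |             piece = " ".join(words[i:i + chunk])
              |             prompt = ("Please summarize the following"
              |                       f" text into {target} words:\n"
              |                       + piece)
              |             parts.append(summarize(prompt))
              |         return " ".join(parts).split()
              | 
              |     words = open("doc.txt").read().split()
              |     while len(words) > 1024:   # context window
              |         words = shrink(words)
              |     print(" ".join(words))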
        
               | londons_explore wrote:
               | You can also use "Please suggest a section title for the
               | following text".
               | 
               | Then that title can be used in the 2nd round, for example
               | using a query of the form "The following is an extract
               | from the _Introduction_ section of a document about _The
                | benefits and disadvantages of nuclear power in Sweden_:"
        
               | generalizations wrote:
               | I imagine you could do even better by finetuning the
               | neural net on the document before asking for the
               | recursive summary. Then it has all the information to
               | work with, albeit in a compressed form.
        
         | londons_explore wrote:
         | 30,000 words wouldn't be enough to train this from scratch -
         | you'd ideally train from hundreds of millions of words at
         | least.
         | 
         | 30,000 words _would_ be enough to _finetune_ an existing model.
         | If you did that, then the model would output text similar to
          | the finetuning data. For example, if you finetuned it on
          | Shakespeare, then you might be able to use the model to make
          | a new play, in Shakespeare's style.
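          | 
          | If it helps to see the shape of that, here is a minimal
          | finetuning sketch using the Hugging Face transformers Trainer
          | rather than nanoGPT's own script; "my_corpus.txt" and the
          | hyperparameters are placeholders:
          | 
          |     from transformers import (
          |         GPT2LMHeadModel, GPT2TokenizerFast, TextDataset,
          |         DataCollatorForLanguageModeling, Trainer,
          |         TrainingArguments)
          | 
          |     # "my_corpus.txt" is your ~30,000 word text file.
          |     tok = GPT2TokenizerFast.from_pretrained("gpt2")
          |     model = GPT2LMHeadModel.from_pretrained("gpt2")
          |     ds = TextDataset(tokenizer=tok, block_size=1024,
          |                      file_path="my_corpus.txt")
          |     collator = DataCollatorForLanguageModeling(tok, mlm=False)
          |     args = TrainingArguments(
          |         output_dir="out", num_train_epochs=3,
          |         per_device_train_batch_size=2)
          |     Trainer(model=model, args=args, train_dataset=ds,
          |             data_collator=collator).train()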
        
           | ProjectArcturis wrote:
           | If you finetuned it on the text of Shakespeare's plays, how
           | would it link that text to the string "Shakespeare"?
        
             | londons_explore wrote:
             | It still has the knowledge from the main training on data
             | from across the whole internet, so would still know the
             | word Shakespeare...
             | 
              | But you're right - the model finetuned on Shakespeare
              | would be good at writing a new play in the style of
              | Shakespeare, but would be bad at giving a critique of
              | Shakespeare's works.
        
       | gpt-4 wrote:
       | Is there a list of datasets like
       | https://skylion007.github.io/OpenWebTextCorpus/ ?
        
       | grogenaut wrote:
        | Somewhat off topic: does someone know how Bing might integrate
        | ChatGPT into search? Is it to understand the prompt and filter
        | results, taking the question and summarizing it to search the
        | index? Is it to summarize all the documents into an index and
        | search that? Or to just be like ChatGPT is now and use it to
        | generate new results from its knowledge base? I'm trying to
        | connect the dots between generative models like these and how
        | they would influence search in the future. Or is the
        | Lucene-style index search on its way out in a generative world?
        
       | legutierr wrote:
       | > The code itself is plain and readable: train.py is a ~300-line
       | boilerplate training loop and model.py a ~300-line GPT model
       | definition, which can optionally load the GPT-2 weights from
       | OpenAI. That's it.
       | 
       | What's the best source for these weights?
        
         | benjamincburns wrote:
         | Kaggle or HuggingFace
        
       | siquick wrote:
       | Excuse my ignorance but what can a layman do with this?
        
         | taneq wrote:
         | Become less lay?
        
       | cs702 wrote:
       | Andrej doesn't _need_ to do this.
       | 
       | He's done it because he evidently _loves_ it, and wants to
       | _share_ his hard-earned knowledge with the rest of the world.
       | 
       | He may be a product of the ivory tower, but he's been _in the
       | trenches_. He knows firsthand how f-ing hard it is to ship a
       | product.
       | 
       | And here he is, sharing useful _personal code_ with everyone.
       | 
        | This GitHub repo has now collected ~4K stars and its
        | predecessor (minGPT) has collected ~11K stars over the past
        | couple of years.
       | In my experience, the number of people who clone, copy, view or
       | otherwise use code from a repo is one to two orders of magnitude
       | larger than the number of people who star it, so we can safely
       | say that Andrej has helped at least a few hundred thousand -- but
       | likely more than a million -- individuals around the world learn
       | how to build and tinker with GPT models.
       | 
       | Remarkably, as I write this, no one else here has said thank you
       | yet, so let me say it on everyone's behalf:
       | 
       | THANK YOU ANDREJ.
       | 
       | --
       | 
       | EDITS: I changed the wording in response to latexr's comments
       | below.
        
         | canadianfella wrote:
         | [dead]
        
         | LeoPanthera wrote:
          | I love _italics_. They're _good_.
        
           | cs702 wrote:
           | In hindsight, yes, I may have overused them out of
           | excitement. Sorry! :-)
        
         | mrg3_2013 wrote:
         | Thoughtful post! Everything so true! I am always amazed by
         | individuals who truly are educators of the world.
        
         | isoprophlex wrote:
         | Pedantry time!
         | 
         | A million people building GPT models means that one in 8000
         | humans on earth has built one. That seems wildly off.
         | 
          | LinkedIn has about 100,000 profiles of data scientists. Assume
         | generously that the actual number is 10x higher. Not correcting
         | for the fact that a data scientist isn't always a machine
         | learning expert, etc etc, there's just no way every single one
         | of them even KNOWS what a GPT-like model is.
        
           | cs702 wrote:
           | Not only building. Also tinkering, using, testing out of
           | curiosity, etc. There are around ~30 million software
           | developers worldwide today (source: googled it). Around ~7
           | million of them are registered users of Github (source:
           | googled it). 1M+ seems likely to me.
           | 
           | BTW, I appreciate that you preceded your comment with
           | "Pedantry time!" -- nice gesture :-)
        
         | latexr wrote:
         | Edit: the OP has updated their wording to make it clear they
         | meant any kind of viewing or usage. I don't think any of us
         | would disagree more people use code than star repos. Original
         | comment left below with original quote, since this has gotten a
         | number of replies that would stop making sense with a larger
         | edit.
         | 
         | > Normally, the number of people who clone or copy code from a
         | repo is one to two orders of magnitude larger than the number
         | of people who take the time to star it
         | 
         | Intuitively, I'm having trouble believing that. Starring takes
         | _considerably_ less effort than cloning or copying code. The
         | "time to star" is a literal second, maybe two if you have to
         | scroll up.
         | 
         | From anecdotal observation, repos with more forks and/or
         | external contributors than stars are far from the norm. I've
          | seen many mentioning they star repos as a way of bookmarking
          | things they seldom go back to, or as an easy way to send
          | kudos to the developer even when they don't use the project.
         | 
         | In no way is this a comment on the value of Andrej's work (I'm
         | not familiar with it). I am only interested in the source of
         | your "orders of magnitude" claim, which if proven will update
         | my mental model of the coding community.
        
           | chirau wrote:
           | How many projects have you starred and how many have you
           | cloned?
           | 
           | Whilst starring is simpler, the incentive is much lower than
           | that of cloning. Especially for projects you just want to use
           | and not contribute to or follow.
           | 
            | In my many years of work, I have starred fewer than 50
            | repos. I am sure I have cloned more than a thousand.
        
             | latexr wrote:
             | > How many projects have you starred and how many have you
             | cloned?
             | 
             | I seldom star, but neither you nor I can be extrapolated to
             | the general community. I have thousands of stars in some
             | repos, and I know a significant number of those users don't
             | even code, let alone clone repos or copy code, they're
             | interested in the final product. They have GitHub accounts
             | because it's the way to report bugs or make feature
             | requests.
             | 
             | The OP made a claim. All I'm looking to understand is if it
             | has data backing it up or it's just a gut feeling, because
             | if it's the former I'll have learned something and made a
             | correction of my mental model of the world. Sharing more
             | anecdotes will leave us stuck in the same situation.
        
           | wongarsu wrote:
           | If I want to use a repository, my first step is to either
           | download a released binary or clone the repository. Forking
           | is much further down the line for me, when I've used the
           | code, encountered a problem, fixed it, and decided to polish
           | the fix up to make a PR. I star something when I either have
           | used it and like it, or when I think I want to use it in the
           | future and want to bookmark it (though the former more often
           | than the latter). I have given out about 50% more stars than
           | I've forked, and have probably cloned an order of magnitude
           | more than I've forked or starred.
           | 
           | Of course not everyone is the same, but I'd be surprised if
           | overall clones were less than an order of magnitude more than
           | forks or stars, and find two or even three orders of
           | magnitude believable depending on the target group of the
           | repo.
        
             | cs702 wrote:
              | _Exactly._ I would add that the number of clones (not
              | forks) and file/page views is viewable only by the
              | owner of the repo, so we can only guess. (If you own a
              | GitHub repo, you can see the most recent number of
              | clones and page views by clicking on Insights ->
              | Traffic.)
             | 
             | My estimate of "one to two orders of magnitude" is based on
             | anecdotal evidence. I edited my comment to reflect as much.
        
           | sjadoinqwoeihad wrote:
            | I checked my 5 year old repository of ~300 stars. It gets
            | ~100 unique clones a month. So if the long-run average were
            | half of that, then one order of magnitude would be quite an
            | accurate approximation.
           | 
            | I think the biggest difference between a clone and a star
            | is that a star requires an account and some vested interest
            | in the social network of GitHub. Anyone who is not
            | interested in the social aspect can just bookmark it.
           | 
           | I guess this differs quite a lot by target demographic. A
           | tool for GPT will probably get a lot more stars than a plugin
           | for some consumer software simply because it is more targeted
           | for the audience of people who have Github accounts.
        
             | [deleted]
        
             | cs702 wrote:
             | Thank you for sharing your anecdata. In my experience, the
             | number of clones per month is much higher at first, and
             | then decays gradually until it settles into a stable run-
             | rate, so it's likely that you've had _more than_ 100 x 12 x
             | 5 clones over those five years -- i.e., between one and two
             | orders of magnitude the number of stars, 300.
        
               | jefftk wrote:
               | Another data point: icdiff is 13y old with 4k stars and
               | 200 unique clones in the past month.
               | 
               | (This is a tool that most people install and run without
               | any interaction with GitHub, since it is in package
               | managers)
        
           | londons_explore wrote:
           | Some repos have code that 'phones home' when run. For
           | example, checking for updates or security vulnerabilities.
           | 
           | By checking the usage statistics on that server, you can get
           | an idea how many users there are, and typically it's far
           | higher than the number of stars.
        
             | latexr wrote:
             | That just tells us that more people _use_ the code than
             | star the repo. I don't think that'd be a surprise to
             | anyone. The claim was that more people clone and copy code
             | from the repo than the ones who star it, which is a
             | different matter from the number of users.
        
               | cs702 wrote:
               | Thank you for clarifying. I meant _use_. The number of
                | clones and the number of file/page views are proxies for
               | that. So is the number of installs via pip, conda, and
               | other Python package management systems, in this case. I
               | updated my comment to reflect as much.
        
           | hahamrfunnyguy wrote:
            | I've starred maybe 2-3 repositories over the past 15 years,
           | contributed to probably a half dozen and used hundreds (if
           | not more) in my applications. To me using means using that
           | project in an application you develop. Typically I get them
           | from NPM or Nuget and I contribute when a) the project owner
           | thinks my feature idea is a good idea or b) I run into a bug
           | that I can fix.
           | 
           | Starring is just not that useful to me so I can see why users
           | or contributors would be much higher. I typically star repos
           | if it's an unpopular or old repository that doesn't have NPM
           | or Nuget packages.
        
           | [deleted]
        
         | adam_arthur wrote:
         | I'm all for thanking open source contributors, but your
         | excessively prostrating wording is a bit much for me.
        
           | cs702 wrote:
           | If I overdid it, I'm sorry. I promise it wasn't intentional.
           | My comment was spur-of-the-moment, motivated by sincere
           | gratitude :-)
        
           | 1986108 wrote:
           | [flagged]
        
             | idiotsecant wrote:
             | can't tell if you're making some kind of clever quip or if
             | this is some random spambot just entering a random reverse
             | DNS lookup line.
        
         | tomComb wrote:
         | Him doing this is not like when your average bloke does it.
         | 
         | He appears to be building a business and maintaining his
         | profile. And there is nothing wrong with that - I admire him
          | for pursuing his career in this positive and helpful way.
         | 
         | But random folks do this sort of thing everyday with no such
         | career goals and little recognition, so I'm not sure it is this
         | specific contribution that needs to be called out.
        
           | krisoft wrote:
           | > I'm not sure it is this specific contribution that needs to
           | be called out.
           | 
           | I go the other way. I would like to thank anyone who releases
           | open source code, whether they cause big ripples or not.
        
           | modeless wrote:
           | What business is he building?
        
       | homarp wrote:
       | see also 'Cramming: Training a Language Model on a Single GPU in
       | One Day' https://arxiv.org/abs/2212.14034 and
       | https://github.com/JonasGeiping/cramming
        
       | waiseristy wrote:
       | So, are there any of these projects that aren't vendor locked to
       | NVIDIA and are able to train large models with limited GPU RAM
       | space?
       | 
        | I don't mind letting my machine churn for 2-3 weeks. But I'm
        | not looking to buy another $1000 GPU just because CUDA is the
        | only compute library researchers understand.
        
       | jgalt212 wrote:
       | So is MSFT now extra grossly overpaying for ChatGPT?
        
       | surume wrote:
       | Thank you so much for this! It is so impressive and I'm sure it
       | took a lot of hard work!
       | 
       | Is it able to re-write articles? And where could I find a guide
       | on how to train it?
        
       | arturventura wrote:
       | This is really good, and I was really excited by it but then I
       | read:
       | 
       | > running on a single 8XA100 40GB node in 38 hours of training
       | 
       | This is a $40-80k machine. Not a diss, but I would love to see an
       | advance that would allow anyone with a high end computer to be
       | able to improve on this model. Before that happens this whole
       | field is going to be owned by big corporations.
        
         | windexh8er wrote:
          | But how often do you need to run this? You can run 8xA100 on
          | LambdaLabs [0] (no affiliation) for $8.80/hr. So you should
          | be able to run the entire training for less than $350.
         | 
         | [0] https://lambdalabs.com/service/gpu-cloud#pricing
        
           | throwawaymaths wrote:
            | They are acknowledged at the bottom for supporting Andrej's
            | research!!
        
         | anigbrowl wrote:
         | Well, he does include instructions for running it on a personal
         | computer, which looks like what I'm gonna be doing next week.
         | 
          | Besides the rental options discussed below, these Nvidia
          | boxen don't look too big, so either used ones will be
          | available for cheap relatively soon, or you could just locate
          | and liberate one in Promethean fashion.
        
         | ProjectArcturis wrote:
         | That's to train it from scratch, though, right? If you preload
         | the GPT2 weights you don't need to do this. You can just give
         | it additional training on your texts.
        
         | aidos wrote:
         | I don't know anything about this, but is that this instance
         | type on AWS? p4d.24xlarge
        
         | Tepix wrote:
         | If you can fit the training into 24GB, a used RTX 3090 for
         | $700-$800 seems like a good deal at the moment. They are about
         | 45-65% as fast as the A100 according to https://bizon-
         | tech.com/gpu-benchmarks/NVIDIA-RTX-3090-vs-NVI...
         | 
         | So if you buy two of these cards it will take 12-13 days
         | instead of 38 hours but only require a $2500 PC.
         | 
         | James Betker, who created tortoise TTS, built his own $15k
         | machine with 8x RTX 3090 and trained the models with it. He now
         | works for OpenAI...
        
         | jph00 wrote:
         | A couple of weeks ago a new paper came out that shows how to
         | train a high quality language model on a single GPU in one day.
         | 
         | https://arxiv.org/abs/2212.14034
        
         | wongarsu wrote:
         | It's a $33/hour machine on AWS, so about $1250 for one training
         | run. Not cheap, but easily in the reach of startups and
         | educational or research institutions.
         | 
         | Edit: or about $340 if you get the 8xA100 instance from
         | lambdalabs, in the realm of normal hobby spending
        
           | belter wrote:
           | Or $9/hour if you use Spot :-)
           | 
           | https://aws.amazon.com/ec2/spot/pricing/
        
             | snerbles wrote:
             | Hopefully your progress gets saved in time when the spot
             | instance inevitably gets terminated in the midst of
             | training.
        
               | acetabulum wrote:
               | If you use Horovod Elastic, I think you can avoid this
               | problem working across a cluster of Spot instances.
               | 
               | https://horovod.readthedocs.io/en/stable/elastic_include.
               | htm...
        
               | belter wrote:
               | "Managed Spot Training..."
               | 
               | "...Spot instances can be interrupted, causing jobs to
               | take longer to start or finish. You can configure your
               | managed spot training job to use checkpoints. SageMaker
               | copies checkpoint data from a local path to Amazon S3.
               | When the job is restarted, SageMaker copies the data from
               | Amazon S3 back into the local path. The training job can
               | then resume from the last checkpoint instead of
               | restarting...."
               | 
               | https://docs.aws.amazon.com/sagemaker/latest/dg/model-
               | manage...
        
           | bobbyi wrote:
           | If you're doing something new/ custom (which you presumably
           | are if you aren't using someone else's prebuilt model), it
           | could take a lot of runs to figure out the best training data
           | and finetune settings.
           | 
           | (I assume. I've never worked with GPT, but have done similar
           | work in other domains).
        
           | weird-eye-issue wrote:
           | After training don't you have to keep it running if you want
           | to use it?
        
             | wongarsu wrote:
             | Just download the model and run it on something much
             | smaller and cheaper. Bigger models like GPT-J are a bit of
             | a pain to run, but GPT2-sized models run just fine on
             | consumer GPUs.
        
               | bilsbie wrote:
               | What's required to run the model?
        
               | wongarsu wrote:
               | The biggest GPT2 (1.5B params) takes about 10GB VRAM,
               | meaning it runs on a RTX 2080 TI, or the 12GB version of
               | the RTX 3080
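                | 
                | As a sketch, running the 1.5B model locally with the
                | transformers library looks roughly like this (half
                | precision roughly halves the weight memory; model name
                | and prompt are just examples):
                | 
                |     import torch
                |     from transformers import pipeline
                | 
                |     gen = pipeline("text-generation",
                |                    model="gpt2-xl", device=0,
                |                    torch_dtype=torch.float16)
                |     out = gen("The meaning of life is",
                |               max_new_tokens=50)
                |     print(out[0]["generated_text"])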
        
               | renewiltord wrote:
               | What's the largest language model I can run on a 3090
               | with 24 GiB RAM?
        
               | lossolo wrote:
                | Depends on precision: you can run a ~5B model with
                | fp32 precision or an ~11B fp16 model max. Int8 is
                | really bad for real-world use cases, so I'm not
                | mentioning it.
                | 
                | But if you are looking to get the performance of
                | ChatGPT or GPT-3, then don't waste your time: all
                | small GPT-3-like LLMs (below at least 60B params) are
                | useless for any real world use case; they are just
                | toys.
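                | 
                | (For reference, the arithmetic behind those limits is
                | roughly parameter count times bytes per parameter,
                | weights only; activations and framework overhead come
                | on top:
                | 
                |     def weight_gb(params_billion, bytes_per_param):
                |         return params_billion * bytes_per_param
                | 
                |     print(weight_gb(5, 4))   # ~5B  fp32 -> ~20 GB
                |     print(weight_gb(11, 2))  # ~11B fp16 -> ~22 GB
                |     print(weight_gb(11, 1))  # ~11B int8 -> ~11 GB
                | 
                | so a 24 GB card is already nearly full before any
                | activations are allocated.)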
        
               | renewiltord wrote:
               | Okay, thank you. Perfect response.
        
               | haldujai wrote:
               | If you specifically mean a general LLM trained on a
               | general language corpus with instruction finetuning this
               | is correct.
               | 
               | Fortunately very few real world use cases need to be this
               | general.
               | 
               | If you are training a LLM on a domain specific corpus or
               | finetuning on specific downstream tasks even relatively
               | tiny models at 330m params are definitely useful and not
               | "toys" and can be used to accurately perform tasks such
               | as semantic text search, document summarization and named
               | entity recognition.
        
               | lossolo wrote:
               | > If you specifically mean a general LLM trained on a
               | general language corpus with instruction finetuning this
               | is correct.
               | 
               | Yes, that's what I meant.
               | 
               | > If you are training a LLM on a domain specific corpus
               | or finetuning on specific downstream tasks even
               | relatively tiny models at 330m params are definitely
               | useful and not "toys" and can be used to accurately
               | perform tasks such as semantic text search, document
               | summarization and named entity recognition.
               | 
               | Agree, BERT family is a good example here.
        
         | JustSomeNobody wrote:
         | https://github.com/karpathy/nanoGPT#i-only-have-a-macbook
         | 
         | > This creates a much smaller Transformer (4 layers, 4 heads,
         | 64 embedding size), runs only on CPU, does not torch.compile
         | the model (torch seems to give an error if you try), only
         | evaluates for one iteration so you can see the training loop at
         | work immediately, and also makes sure the context length is
         | much smaller (e.g. 64 tokens), and the batch size is reduced to
         | 8. On my MacBook Air (M1) this takes about 400ms per iteration.
         | The network is still pretty expensive because the current
         | vocabulary is hard-coded to be the GPT-2 BPE encodings of
         | vocab_size=50257. So the embeddings table and the last layer
         | are still massive. In the future I may modify the code to
         | support simple character-level encoding, in which case this
         | would fly. (The required changes would actually be pretty
         | minimal, TODO)
        
         | anilshanbhag wrote:
          | If GPT-2 / nanoGPT needs this setup, just imagine what GPT-3
          | / ChatGPT needs!
        
           | Gigachad wrote:
           | Supposedly even running the trained model for ChatGPT is
           | extremely expensive unlike the image generators which can
           | largely be run on a consumer device.
        
         | haldujai wrote:
         | If you can't fit the model on your resources you can leverage
         | DeepSpeed's ZeRO-offload which will let you train GPT2 on a
         | single V100 (32gb).
         | 
         | Alternatively, if you're researching (with the caveat that you
         | have to either publish, open source or share your results in a
         | blog post) you can also get access to Google's TPU research
         | cloud which gives you a few v3-8s for 30 days (can't do
         | distributed training on devices but can run workloads in
         | parallel). You can also ask nicely for a pod, I've been granted
         | access to a v3-32 for 14 days pretty trivially which (if
         | optimized) has more throughput than 8xA100 on transformer
         | models.
         | 
          | TPUs, and even more so pods, are a bit harder to work with,
          | and TF performs far better than PyTorch on them.
         | 
         | https://www.deepspeed.ai/tutorials/zero-offload/
         | 
         | https://medium.com/analytics-vidhya/googles-tpu-research-clo...
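          | 
          | For a rough idea of what ZeRO-Offload looks like in practice,
          | the DeepSpeed config is just a dict along these lines (a
          | sketch; batch size and the exact options are placeholders,
          | see the DeepSpeed docs for the full set):
          | 
          |     ds_config = {
          |         "train_batch_size": 8,
          |         "fp16": {"enabled": True},
          |         "zero_optimization": {
          |             "stage": 2,
          |             "offload_optimizer": {"device": "cpu"},
          |         },
          |     }
          |     # engine, opt, _, _ = deepspeed.initialize(
          |     #     model=model, config=ds_config,
          |     #     model_parameters=model.parameters())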
        
         | dceddia wrote:
         | I was curious about how much this would be to rent, because
         | definitely the cost of those servers is outside the budget!
         | Lambda has 8xA100 40gb for $8.80/hr:
         | https://lambdalabs.com/service/gpu-cloud#pricing
        
         | Tenoke wrote:
          | It seems about as likely as people being able to build cars
          | at big-automaker level with just the tools in their garage.
          | More compute is going to keep producing better results, at
          | least for LLMs.
        
         | base698 wrote:
         | You can rent on AWS and other cloud providers.
        
           | liquidk wrote:
            | That is a key difference. You can't easily and cheaply rent
            | an auto factory, but you're starting to be able to rent an
            | LLM "training factory" once per model, and then run
            | inference on the result much more cheaply.
        
           | krisoft wrote:
           | So if I see it right that would be a p4d.24xlarge instance.
           | Which goes for about $32.77 an hour nowadays so the total
           | training would be about $1245. Not cheap, but certainly not a
           | nation state budget.
           | 
           | Edit: i just noticed lambda lab. It seems they ask $8.8 per
           | hour for an instance of this caliber. That puts the total
           | training cost around $334. I wonder how come it is that much
           | cheaper.
        
         | pavlov wrote:
         | I don't know if that's a blocker. Ordinary people commonly rent
         | a $40k machine for 38 hours from companies like Avis and Hertz.
         | 
         | If training a large model now costs the same as driving to
         | visit grandma, that seems like a pretty good deal.
        
           | [deleted]
        
           | jetrink wrote:
           | That's a great comparison. For a real number, I just checked
           | Runpod and you can rent a system with 8xA100 for $17/hr or
           | ~$700 for 38 hours. Not cheap, but also pretty close to the
           | cost of renting a premium vehicle for a few days. I've
           | trained a few small models by renting an 1xA5000 system and
           | that only costs $0.44/hr, which is perfect for learning and
           | experimentation.
        
             | willseth wrote:
             | The good news is that, unlike vehicles, the rate for rented
             | compute will continue to drop
        
             | amelius wrote:
             | It would be great if a tradeoff could be made, though. For
             | example, train at 1/10th the speed for 1/10th of the cost.
             | 
             | This could correspond to taking public transport in your
             | analogy, and would bring this within reach of most
             | students.
        
               | mcbuilder wrote:
                | Well, if it used to cost you $1 for 1hr at 1x speed,
                | now it will take you 10hr at 0.1x speed and, if my
                | math checks out, still cost $1. You need to shrink
                | the model.
        
               | amelius wrote:
               | But of course now you run it on your own computer instead
               | of in the DC, which changes the numbers. Especially if
               | your student dorm has a shared electricity bill :)
        
               | mk_stjames wrote:
               | The problem with that is currently, the available memory
               | scales with the class of GPU.... and very large language
               | models need 160-320GB of VRAM. So, there sadly isn't
               | anything out there that you can load up a model this
               | large on except a rack of 8x+ A40s/A100s.
               | 
               | I know there are memory channel bandwidth limits and
               | whatnot but I really wish there was a card out there with
               | a 3090 sized die but with 96GB of VRAM solely to make it
                | easier to experiment with larger models. If it takes 8
                | days to train vs. 1, that's fine. Having only two of
                | them to get 192GB and still fit on a desk and draw
                | normal power would be great.
        
               | buildbot wrote:
                | Technically this is not true - there are a lot of
                | techniques to shard models and store activations
                | between layers or even smaller subcomponents of the
                | network. For example, you can split the 175B
                | parameter BLOOM model into separate layers, load up a
                | layer, read the previous layer's input from disk, and
                | save the output to disk.
               | 
               | And NVIDIA does make cards like you are asking for - the
               | A100 is the fast memory offering, the A40 the bulk slower
               | memory (though they added the 80GB A100 and did not
               | double the A40 to 96GB so this is less true now than the
               | P40 vs P100 gen).
               | 
                | Oddly, you can get close to what you are asking for
                | with an M1 Mac Studio - 128GB of decently fast memory
                | with a GPU that is ~0.5x a 3090 in training.
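                | 
                | A toy version of the layer-by-layer idea might look
                | like this (make_layer and the per-layer shard files
                | are hypothetical; real offloading frameworks handle
                | this for you):
                | 
                |     import torch
                | 
                |     def run_sharded(x, make_layer, n_layers):
                |         h = x
                |         for i in range(n_layers):
                |             layer = make_layer(i)  # empty module
                |             sd = torch.load(f"layer_{i}.pt",
                |                             map_location="cpu")
                |             layer.load_state_dict(sd)
                |             with torch.no_grad():
                |                 h = layer(h)
                |             del layer  # free memory before next one
                |         return h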
        
               | amelius wrote:
               | I guess this would only become a reality if games started
               | requiring these cards.
        
               | londons_explore wrote:
               | Slower training tends to be only a little cheaper,
               | because most modern architectures parallelize well, and
               | they just care about the number of flops.
               | 
               | If you want to reduce cost, you need to reduce the model
               | size, and you'll get worse results for less money.
        
           | ofcourseyoudo wrote:
           | Similarly maybe we should only let people rent a NanoGPT box
           | if they are over 25 and they have to get collision insurance.
        
           | swader999 wrote:
           | You have to gas it up and heaven help you if it gets a
           | scratch or a scuff.
        
             | speed_spread wrote:
              | Great news! Cloud instances' energy usage is included in
              | their price, and because they're remote and transient,
              | it's impossible to permanently damage them.
        
               | DesiLurker wrote:
               | but you still have to pay for network ingress/egress
               | traffic.
        
               | aequitas wrote:
               | I think the equivalent of being not careful and getting a
               | dent in this context is to leave it open to the internet
               | and having a bitcoin miner installed.
        
               | idonotknowwhy wrote:
                | A better fit would be if you have unlimited liability,
                | like with AWS, and you leak your key pair. Then
                | someone runs up a $100k bill setting up mining
                | instances.
        
               | Aissen wrote:
               | You free the instance and the miner is gone.
        
               | iso1631 wrote:
               | As you are paying for the resources you use that's fine.
               | 
               | The closest would be if you used some form of software
               | bug to actually cause physical damage, certainly not
               | impossible, but extremely unlikely compared with actually
               | physically damaging a car.
        
           | Apofis wrote:
           | Let's not forget that rendering 3D Animations in 3DSMAX or
           | Maya used to take days for a single frame for a complex
           | scene, and months for a few minutes.
        
         | kzrdude wrote:
         | How are universities and colleges dealing with this kind of
         | demand for computing power? It must be hard to be able to do
         | some courses now.
        
           | CuriouslyC wrote:
           | Most decently large colleges have been investing in HPC for a
           | while, and started investing in GPU HPC around 2014. You'd be
           | surprised what sort of school projects the compute budget
           | exists for.
        
             | r3trohack3r wrote:
              | I went to a smallish state university; even there we had
              | our own HPC center and lab. We had a proper (IIRC) 6-row
              | HPC data center across campus, and there was a continuous
              | budget available to me as an undergraduate research
              | assistant for building Beowulf clusters for the graduate
              | programs to run assignments on. I once got an allowance
              | to buy 15 Raspberry Pis to build an ARM cluster.
        
           | TrackerFF wrote:
           | As far as research groups go - they get funds (project
           | grants, donations, etc.) to purchase machines and parts, and
           | then users have to timeshare them.
           | 
           | These machines are pretty much crunching numbers 24/7, and
           | your project will get appended to a queue.
        
           | londons_explore wrote:
           | 'group project'
        
       | sebastianconcpt wrote:
        | What's the applicability? Can you give me some examples of what
        | this can be used for?
        
         | wongarsu wrote:
         | I imagine this might be interesting for domain-specific GPT
         | models. Say training it on a mountain of technical
         | documentation, or on every fanfiction published on the
         | internet, or a sentiment analysis dataset. Of course fine-
         | tuning GPT3 would give better results, but nanoGPT might allow
         | you to make a much smaller model that's still good enough, to
         | enable cheaper inference.
         | 
         | Also the opportunity to play around with all the parameters
         | fairly cheaply to find improvements. The todo section of the
         | readme gives a small taste of that. Making bigger models works
         | for OpenAI, but maybe the rest of us manage to make small
         | models just perform better instead.
        
       | albertTJames wrote:
        | Curious to know how close that training loop is to actual
        | OpenAI code.
        
       | buzzdenver wrote:
       | For an AI noob like me: can you use spot instances to train
       | models? They are about 1/3rd the price on AWS compared to on
       | demand ones, so it'd make a significant difference.
        
         | yreg wrote:
          | Why not? This is exactly the use case Spot instances seem to
          | be for. (Not hosting a service, but just calculating
          | something for yourself.)
        
         | satvikchoudhary wrote:
          | Yes, you should use them. They can be taken away from you
          | with 2 minutes' notice. (It doesn't happen a lot in practice
          | though. I have been running a different instance for over a
          | month. AWS doesn't force you off if they don't have to.)
          | 
          | If you are going to run a long training job, ensure you are
          | creating checkpoints. Be sure to use persistent storage
          | (EBS), and check the option so the volume doesn't get deleted
          | if the instance is stopped; that way your checkpoints remain
          | on the disk and you can easily restart.
         | 
         | I haven't tried it but prices here are much cheaper.
         | https://vast.ai/#pricing
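          | 
          | The checkpointing part is just the usual PyTorch pattern,
          | roughly like the sketch below (model, optimizer, train_step
          | and the paths are stand-ins for your own training loop; the
          | checkpoint lives on the persistent volume):
          | 
          |     import os, torch
          | 
          |     ckpt = "/ebs/checkpoint.pt"   # persistent EBS volume
          |     start = 0
          |     if os.path.exists(ckpt):
          |         state = torch.load(ckpt, map_location="cpu")
          |         model.load_state_dict(state["model"])
          |         optimizer.load_state_dict(state["optim"])
          |         start = state["iter"] + 1
          | 
          |     for it in range(start, max_iters):
          |         train_step(it)            # your training step
          |         if it % 500 == 0:         # save periodically
          |             torch.save({"model": model.state_dict(),
          |                         "optim": optimizer.state_dict(),
          |                         "iter": it}, ckpt)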
        
         | belter wrote:
          | Yes you can. In Oregon you could eventually get this instance
          | at $9. I say eventually, because of course Spot allocation is
          | not guaranteed. (And neither is On Demand... but that is a
          | story for another day.)
         | 
         | https://aws.amazon.com/ec2/spot/pricing/
        
       | drjuice wrote:
       | [flagged]
        
       | henkdehenker wrote:
       | Karpathy is such a boss!
        
       | yreg wrote:
       | Is there any trained model for text generation that you can run
       | locally yet?
        
         | throwaway743 wrote:
         | Plenty. Huggingface alone has a ton
        
         | deqwer wrote:
          | There's LAION working on an open source[1] version of ChatGPT
         | 
         | [1] https://github.com/LAION-AI/Open-Assistant
        
           | Metus wrote:
           | This should be way higher up.
        
           | turmeric_root wrote:
           | Though their roadmap doc says they're looking into finetuning
           | existing GPT-J/T5 models for this task. So you'll probably
           | want a 3090 (24GB VRAM) and at least 16GB of CPU RAM to run
           | inference if/when the project is complete.
        
         | wongarsu wrote:
         | GPT2 can be run locally (on a somewhat beefy consumer GPU)
        
           | karmajuney wrote:
           | Can you add some info on what consumer GPU would be needed
           | for this? Would a 3080 be able to handle this?
        
             | wongarsu wrote:
             | Assuming you get the 12GB version of the 3080. A 2080TI is
             | another option. Though you can reduce precision or use one
             | of the smaller GPT2 versions to run on smaller cards as
             | well.
        
           | minimaxir wrote:
           | The original GPT-2 small (the 124M one) can run on a CPU,
           | just slowly and not scalably.
        
       | iamflimflam1 wrote:
       | I think the link should be: https://github.com/karpathy/nanoGPT
        
         | [deleted]
        
       | taylorius wrote:
       | Wow, this is great. I can't wait for the video lecture,
       | transformers are an aspect of modern machine learning that I'm
       | not completely clear on. Andrej's lectures are brilliant - super
       | detailed, and really answer the detailed questions I always have.
       | Great stuff!
        
       | theGnuMe wrote:
       | How critical are training warmups and is an iteration here the
       | same as an epoch?
        
       | Terretta wrote:
       | 14 hours ago: https://news.ycombinator.com/item?id=34331919
       | 
       | Curious why HN didn't merge the submission as it usually does. Is
       | there a "no, submit this again" option?
        
         | eismcc wrote:
         | The other post probably didn't make it to the front page
        
       | marviel wrote:
        | I have taken several masters-level courses in Machine Learning
        | -- and even with those credentials, I cannot recommend Andrej's
        | YouTube series, "Neural Networks: Zero to Hero", _enough_.
        | There, he teaches you, from scratch, how to build everything
        | from the underlying automatic gradient calculation system in
        | PyTorch, all the way up to the slower version of this model -
        | `minGPT`.
       | 
       | [1]
       | https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThs...
       | 
       | (edit: self-promo: I'm currently working on a Typescript follow-
       | through of this same series of video lectures, if you want to
       | follow along with stronger types for explanation:
       | https://github.com/Marviel/lab-grad)
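        | 
        | To give a flavor of what the first lecture builds, a compressed
        | sketch of a scalar autograd value in the spirit of micrograd
        | (illustrative only, not Andrej's code) looks like this:
        | 
        |     class Value:
        |         def __init__(self, data, children=()):
        |             self.data = data
        |             self.grad = 0.0
        |             self._backward = lambda: None
        |             self._prev = set(children)
        | 
        |         def __add__(self, other):
        |             out = Value(self.data + other.data, (self, other))
        |             def _backward():
        |                 self.grad += out.grad
        |                 other.grad += out.grad
        |             out._backward = _backward
        |             return out
        | 
        |         def __mul__(self, other):
        |             out = Value(self.data * other.data, (self, other))
        |             def _backward():
        |                 self.grad += other.data * out.grad
        |                 other.grad += self.data * out.grad
        |             out._backward = _backward
        |             return out
        | 
        |         def backward(self):
        |             topo, seen = [], set()
        |             def build(v):
        |                 if v not in seen:
        |                     seen.add(v)
        |                     for c in v._prev:
        |                         build(c)
        |                     topo.append(v)
        |             build(self)
        |             self.grad = 1.0
        |             for v in reversed(topo):
        |                 v._backward()
        | 
        |     x = Value(2.0)
        |     y = x * x + x
        |     y.backward()       # x.grad == 5.0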
        
         | randoglando wrote:
          | How does it compare to fast.ai? As an engineer looking to
          | learn, what should I start with?
        
           | marviel wrote:
           | Both are good for different things.
           | 
           | Fast.AI is great, but it takes the top down, vs the bottom
           | up, approach. It takes you from a production-level black box
           | that you don't understand, down to the details. The benefit
           | there is you get good high-level intuition of how it behaves
           | at the "let me use this technology for a job" level.
           | 
           | Separately, the fast.ai library is also highly recommendable
           | -- it comes with some state-of-the-art image recognition
           | models, and its training wrappers are really helpful
           | particularly for image-recognition dataset training.
           | 
           | Karpathy's "Neural Networks: Zero to Hero" video series
           | starts at the level of individual neurons, and works you up
           | to the final product. For some reason both this style, and
           | Karpathy's conciseness appeal to me slightly more. I'm also
           | super detail-oriented, though -- and any level of "hand
           | waving" (even if further explanation comes later) always
           | bothers me. He's also got some pretty high-profile industry
           | experience which carries some weight with me.
           | 
           | But I'll say that both are really high-quality. --
           | ultimately, my recommendation would be to follow whichever
           | one speaks most to you personally after the first 1hr or so.
           | 
           | EDIT: Per Jeremy's response below, if you want the bottom-up
           | approach but like the fast.ai teaching style, you should
           | check out "part 2" of the fast.ai set of tutorials, which is
           | exactly that.
        
             | jph00 wrote:
             | fast.ai has both - the "part 1" section is top-down, and
             | the "part 2" section is bottom up. You can do part 2
             | without having done part 1. Part 2 starts with implementing
             | matrix multiplication from scratch, then backprop from
             | scratch, then SGD from scratch, etc.
             | 
             | There will be a new version of the part 2 course out in a
             | few weeks. It even covers stuff like random number
             | generation from scratch, convolutions from scratch, etc. It
             | gradually works all the way up to Stable Diffusion.
             | 
             | @karpathy's and the fast.ai lessons work well together.
             | They cover similar topics from different angles.
             | 
             | (I'm the primary creator of the fast.ai courses.)
        
               | marviel wrote:
               | That's awesome! I did not know that part 2 was structured
               | this way, and will check it out. Will be really neat to
               | see you teach stable diffusion.
               | 
               | Thanks for your work on fast.ai!
        
             | jwithington wrote:
             | Jeremy @ Fast.ai says he takes this pedagogical approach
             | because it's "proven" to be the best way to learn. He's
              | probably right, but I do find it confusing at times
              | because in the beginning you're just hitting ctrl + enter
              | on an IPYNB haha.
             | 
             | Maybe Karpathy's approach will speak to me more--thanks for
             | the recommendation!
        
         | brap wrote:
         | I can't believe I just spent 2 and a half hours glued to my
         | phone in bed watching this, for absolutely no reason other than
         | it was such an interesting intro (to a subject I'm already
         | familiar with). Thanks for the recommendation, and thanks
         | Andrej for making this!
        
       | jwithington wrote:
       | What would I google to figure out how to productionize the output
       | of this?
       | 
       | This repo trains a model--how would I prompt it and print the
       | generated output?
        
       | mittermayr wrote:
       | As someone who's been in software for almost 25 years now, I read
       | through this in amazement of how much new stuff still keeps
       | coming in. This industry never stops and that makes it such a
       | fascinating (but arguably harsh) world to be in.
       | 
       | Looking at this feels like seeing the source code of a 64k demo,
       | learning about Mode 13h and trying to replicate it in Turbo
       | Pascal.
       | 
       | And, much like the old days of graphics programming, there's a
       | good chance all of this knowledge will be mostly irrelevant soon,
       | as the abstraction layers tend to come quicker and quicker and
       | take care of the hard foundational work underneath. Then it'll be
       | many of us here discussing whether or not it was good to have
       | been with it from the start, to really get it, or whether playing
       | with the highly-abstracted components is all that's needed to
       | succeed with it.
       | 
       | Either way, super cool to see the pace here and I loved the "I
       | only have a macbook" section.
        
         | eismcc wrote:
         | It will be funny to look back from the future and think, wow,
         | how did we get anything done with only 40GB RAM
        
         | [deleted]
        
       | lossolo wrote:
       | > reproduces GPT-2 (124M) on OpenWebText, running on a single
       | 8XA100 40GB node in 38 hours of training
       | 
       | For comparison, GPT-3 has more than 1000x more params (175B), and
       | training time was around 2 months on ~1500 V100 GPUs, which amounts
       | to millions of dollars in cloud compute. Gopher with 280B
       | params was trained on 4096 TPU-v3 chips, Microsoft Megatron-
       | Turing NLG 530B trained on 2240 NVIDIA A100 cards (each card
       | costs ~15k USD). And the most mind blowing is PaLM from Google
       | with 540B params and trained on 6144 TPU v4, which costs around
       | 10-30M USD in cloud compute to train.
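       | 
       | As a rough back-of-the-envelope comparison based only on the
       | figures above (a sketch, not exact accounting, since GPU
       | generations and utilization differ):
       | 
       |     # rough GPU-hour comparison from the numbers quoted above
       |     nanogpt_gpu_hours = 8 * 38            # 8xA100 for 38 h  ~= 304
       |     gpt3_gpu_hours = 1500 * 2 * 30 * 24   # ~1500 V100s, ~2 months ~= 2.16M
       |     print(gpt3_gpu_hours / nanogpt_gpu_hours)   # ~7100x more GPU-hours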
        
       | marsven_422 wrote:
       | [dead]
        
       | justusthane wrote:
       | This is a dumb question about language models in general, not
       | necessarily specific to NanoGPT: why is all the focus on
       | training? Can I download and run a pre-trained model locally?
       | Surely the specs required to run a model are much, much lower
       | than those required to train the model?
        
         | ausbah wrote:
         | Inference can still be a bottleneck, I think, since you usually
         | load the whole thing into memory, which is often 32-64GB+.
        
           | visarga wrote:
           | Language models range from 1 to 300+ GB when loaded. It
           | depends on how you load them, if you load in int8 you get 4x
           | reduction.
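           | 
           | To make the arithmetic concrete (a rough sketch; real memory
           | use also includes activations, KV caches and framework
           | overhead):
           | 
           |     def model_gb(n_params, bytes_per_param):
           |         # parameters x bytes per parameter, in GB
           |         return n_params * bytes_per_param / 1e9
           | 
           |     print(model_gb(124e6, 4))   # GPT-2 small in fp32: ~0.5 GB
           |     print(model_gb(175e9, 4))   # 175B model in fp32: ~700 GB
           |     print(model_gb(175e9, 1))   # same model in int8: ~175 GB (the 4x)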
        
         | code_runner wrote:
         | I believe the training is where the architecture of the model
         | is most apparent. You can absolutely download plenty of pre-
         | trained models.
         | 
         | You will also _probably_ need to fine tune for a specific use
         | case, so a common approach is downloading a pre-trained model
         | and fine tuning.
         | 
         | I think including the "from scratch" tuning script is
         | educational more than anything else.
        
         | anon291 wrote:
         | If you're only using pre-trained models, it's going to be
         | harder to differentiate yourself. Training / specialization of
         | models is where the moat-building is (due to access to
         | different data sets / better ideas). By specializing /
         | training, more of the token limit can be used for generation
         | rather than prompting / better prompts can be made.
         | 
         | The lower the cost of training, the more profitable any
         | resultant business. You can even envision businesses that train
         | the model regularly to bring in new knowledge. The cheaper this
         | is, the more opportunities open up.
        
         | nerdponx wrote:
         | It's the equivalent of building from source versus downloading
         | a compiled binary.
         | 
         | Also you can perform "fine tuning" which means you start with a
         | trained model and train it further on your own data, allowing
         | you to customize the model for specific tasks.
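         | 
         | A minimal sketch of that fine-tuning loop in plain PyTorch (the
         | tiny stand-in model, the checkpoint path and the random batches
         | are placeholders, not actual nanoGPT code):
         | 
         |     import torch
         |     import torch.nn.functional as F
         | 
         |     # stand-in LM; in practice you would build the real GPT here
         |     model = torch.nn.Sequential(
         |         torch.nn.Embedding(50257, 64), torch.nn.Linear(64, 50257))
         |     model.load_state_dict(torch.load("pretrained.pt"))  # trained weights
         |     opt = torch.optim.AdamW(model.parameters(), lr=3e-5)  # small LR
         | 
         |     for step in range(100):
         |         x = torch.randint(0, 50257, (8, 128))     # your own data here
         |         logits = model(x[:, :-1])                 # predict the next token
         |         loss = F.cross_entropy(logits.reshape(-1, 50257),
         |                                x[:, 1:].reshape(-1))
         |         opt.zero_grad(); loss.backward(); opt.step()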
        
       | swader999 wrote:
       | Would it be possible to take all my user manuals and past
       | customer Q&A and train on just that to produce a customer
       | helper chat bot?
        
       | 1986108 wrote:
       | 638c7215
        
       | QuadrupleA wrote:
       | Doesn't huggingface have dozens of freely available pretrained
       | models like this (including various sized implementations of
       | GPT2) and isn't the source available on most if you wanted to
       | train them yourself?
       | 
       | All I see in the comments is praise for the author as a person,
       | so just wondering what's unique about this that's not available
       | elsewhere? 730 upvotes and counting, assuming I'm missing
       | something...
        
         | moneywoes wrote:
         | The shilling seems intense
        
         | minimaxir wrote:
         | Additionally, in terms of the streamlining nanoGPT purports to
         | offer, HuggingFace's implementations play nice with optimization
         | techniques such as ONNX/TensorRT, which will give you better
         | performance than anything PyTorch-based, even if the difference
         | is minimal.
         | 
         | That doesn't mean an ONNX-ed nanoGPT won't be better, but the
         | field of optimized text generation isn't as new as people
         | claim.
        
         | visarga wrote:
         | This is a didactic implementation. If you read the HuggingFace
         | repo, it is much more abstracted, on account of implementing many
         | models in the same codebase. It's not fast or big, just easier
         | to read and tweak.
        
         | isoprophlex wrote:
         | True, but the use cases aren't the same. As he did before for
         | other models, he has a knack for distilling the code down to
         | beautiful, self-contained examples of high didactic value.
         | 
         | It's an order of magnitude easier to grok the basics from this
         | repo than from going through (admittedly more ergonomic or
         | performant or production-ready) huggingface repos.
        
       | brossinthuon wrote:
       | https://news.ycombinator.com/item?id=34336386
        
       | rsiqueira wrote:
       | I could not find any sample (prompt and results). Can anyone
       | provide samples of its quality, even if it is in a narrow field
       | of knowledge or specific use case? I tried GPT2, GPT-J 6B and
       | GPT-NeoX 20B (implementation by Fabrice Bellard at
       | textsynth.com/playground.html) but I could not find any
       | production-quality scenario yet, only cherry-picked simple cases.
        
         | boredemployee wrote:
         | That's what I'm really missing in order to decide whether I
         | should try it myself or not.
        
         | visarga wrote:
         | At this model size, quality is not worth discussing. It is
         | clearly in another league from GPT-3.
        
           | lossolo wrote:
           | Indeed, it is like comparing the speech of a 2-year-old child
           | to that of a college professor.
        
       | awestroke wrote:
       | Are there any possible technological or scientific leaps on the
       | horizon that would reduce training time by an order of magnitude
       | or more? GPT-3 took 355 years to train with incredibly expensive
       | hardware, which means small players have no chance to push the
       | state of the art
        
         | imtringued wrote:
         | As models get bigger, fewer and fewer neurons are activated by
         | any given input. If you can somehow predict which neurons get
         | activated you can skip the vast majority of the computational
         | load. I have read a paper where they argued that only 0.5% of
         | the neurons are actually active in a 200 million parameter
         | model so you can get a 200x improvement just from that.
         | 
         | What this tells you is that there is very little money in
         | optimizing deep learning and that NVIDIA has made it very easy
         | to just throw more hardware at the problem.
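         | 
         | A toy illustration of the idea (not the paper's actual method,
         | which tries to predict the active set up front): if an oracle
         | told you which ~0.5% of output neurons matter, you would only
         | need those rows of the weight matrix.
         | 
         |     import torch
         | 
         |     d_in, d_out = 4096, 8192
         |     x = torch.randn(d_in)
         |     W = torch.randn(d_out, d_in)
         | 
         |     dense = torch.relu(W @ x)                        # full layer
         |     idx = torch.topk(dense, d_out // 200).indices    # "oracle" active set
         |     sparse = torch.relu(W[idx] @ x)                  # ~0.5% of the rows
         |     print(torch.allclose(dense[idx], sparse))        # matches the dense result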
        
           | CuriouslyC wrote:
           | This is hard a-priori, but fairly easy post-facto. Model
           | distillation isn't a common practice yet, but it has already
           | been demonstrated to be quite effective for specific use
           | cases.
        
             | visarga wrote:
             | Distillation works but somehow we see very few papers doing
             | it at this scale.
        
           | visarga wrote:
           | > argued that only 0.5% of the neurons are actually active in
           | a 200 million parameter model so you can get a 200x
           | improvement just from that
           | 
           | Yes, but you don't know which 0.5% it is; it depends on the
           | input text.
        
           | londons_explore wrote:
           | > very little money in optimizing deep learning
           | 
           | Oh - there are a _lot_ of people working on optimizing AI.
           | Amongst hobbyists, academia, and corporations alike.
           | 
           | The thing is, if you come up with a neat optimization that
           | saves 30% of compute for the same results, typically instead
           | of reducing your compute budget 30%, you instead increase
           | your model/data size 30% and get better results.
        
             | narrator wrote:
             | Jevons paradox of data and AI. The more efficiently data
             | is used, the more demand there is for data.
        
               | antognini wrote:
               | Any state of the art model takes about three weeks to
               | train.
        
               | visarga wrote:
               | More an indication of human patience than task
               | difficulty.
        
           | WithinReason wrote:
           | Do you have a link to that paper by any chance? By "neurons"
           | did they mean weights or activations?
        
             | imtringued wrote:
             | Here is a GPU implementation.
             | 
             | https://ieeexplore.ieee.org/document/9635657
             | 
             | It is somewhere from 8x to 25x faster than doing dense
             | machine learning. The speedup was higher on the original
             | CPU implementation and the GPU paper mentions that if there
             | isn't enough shared memory on the GPU it will have to
             | switch to an algorithm that has more overhead.
             | 
             | By neurons I actually meant "nodes"
             | 
             | My comment is effectively a summary of this article:
             | https://www.kdnuggets.com/2020/03/deep-learning-
             | breakthrough...
             | 
             | Edit: There is a paper for sparse spiking gradient descent
             | promising a 150x improvement. I am not sure how practical
             | this is because spiking neural network hardware heavily
             | limits your model size but here it is:
             | 
             | https://arxiv.org/abs/2105.08810
        
         | omeysalvi wrote:
         | I think AI is going to go the way of the hard sciences where
         | the age of tinkerers making progress by leaps and bounds in
         | their basement is over and incremental progress is going to be
         | the domain of universities or large private companies that can
         | afford to throw money behind it. I would love to be proven
         | wrong and see radical shifts in how people approach these
         | problems. Seems like the cycle started and got to this point
         | way too soon for AI though
        
           | swalsh wrote:
           | Tinkerers can fine tune a model though. Unfortunately most
           | fine tuning seems to be outmatched at the next iteration of
           | the model.
        
           | mittermayr wrote:
           | My take on this is that (good) content is one of the bigger
           | problems still, particularly also who exactly the original
           | training data belongs to (or where it comes from). There's a
           | certain risk (we'll see with GitHub Copilot soon) that it will
           | slow down for a bit until the licensing issues are all sorted
           | out. This can only be solved (for now) by bringing in public
           | funding/data, which universities have always been a very good
           | proxy for. Which also means it (usually) should be open
           | access to the public, to some extent (and useful for the
           | garage folks to catch up a bit). But, once we're past that,
           | it'll be all about that giant body of pre-trained data,
           | securely kept within the next Facebook or Microsoft,
           | amounting to literal data gold (just much higher value at a
           | lot less weight).
        
         | make3 wrote:
         | small players will never have a chance to push the state of the
         | art, as whatever optimization there is will also be applied at
         | large scale with more money
        
           | cypress66 wrote:
           | A lot of SOTA comes from small players. It just isn't the
           | case for LLMs.
        
           | awestroke wrote:
           | Good point, but perhaps a leap could take small players into
           | territories of language models that are large enough to be
           | useful. GPT-3 crossed that threshold
        
           | hankman86 wrote:
           | Take a leaf from Seti@Home's book and try to come up with a
           | distributed, volunteer-based approach to training an open
           | source LLM. There is already an enormous amount of suitable
           | ML hardware on end user devices.
        
             | Der_Einzige wrote:
             | Huggingface actually recently did this, but I think it's
             | for inference on their giant models like BLOOM
        
         | belter wrote:
         | Model size does not necessarily correlate with quality of
         | results.
         | 
         | "Chinchilla (70B) Greatly Outperforms GPT-3 (175B) and Gopher
         | (280B)" - https://towardsdatascience.com/a-new-ai-trend-
         | chinchilla-70b...
        
           | Der_Einzige wrote:
           | I highly doubt this in practice on a large scale. Outside of
           | the common phenomena of "most large NNs are undertrained"
           | and "less, better data is sometimes better than more, worse
           | data", there are no other obvious mechanisms to explain why a
           | smaller model with the same or similar architecture would be
           | better than a larger one.
           | 
           | I claim instead that we are still hardly scratching the
           | surface with how we evaluate NLP systems. Also, some fields
           | have straight up trash evaluation schemes. Summarization and
           | ROUGE scores are totally BS, and I find the claim that they
           | even correlate with high-quality summaries suspect. I say this
           | as someone with publications in that subfield, so I have
           | personal experience with just how crummy many summarizers are.
        
             | WithinReason wrote:
             | _there are no other obvious mechanisms to explain why a
             | smaller model with same or similar architecture would be
             | better than a larger one._
             | 
             | Overfitting?
        
               | Der_Einzige wrote:
               | The consensus seems to be that the majority of LMs are
               | _undertrained_, not overfit, though.
        
           | espadrine wrote:
           | An interesting outcome of the nanoGPT repo is this struggle
           | to exactly match the Chinchilla findings[0], even after
           | discussing it with the authors.
           | 
           | A larger discussion is that the scaling laws achieve loss-
           | optimal compute time, but the pre-training loss only improves
           | predictions on the corpus, which contains texts written by
           | people that were wrong or whose prose was lacking. In a real
           | system, what you want to optimize for is accuracy,
           | composability, inventiveness.
           | 
           | [0]: https://github.com/karpathy/nanoGPT/blob/master/scaling_
           | laws...
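           | 
           | For reference, the commonly cited Chinchilla rule of thumb is
           | roughly 20 training tokens per parameter (an approximation of
           | the fitted scaling law, not the exact formula):
           | 
           |     for n_params in (124e6, 1.5e9, 70e9):
           |         tokens = 20 * n_params        # compute-optimal token budget
           |         print(f"{n_params:.3g} params -> ~{tokens:.3g} tokens")
           |     # e.g. 70e9 params -> ~1.4e12 tokens, i.e. Chinchilla's 1.4T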
        
         | noidiocyallowed wrote:
         | Work together and fuck up companies together. That's the way to
         | go.
        
           | [deleted]
        
           | abricq wrote:
           | Or how to apply communism to software engineering. I like
           | that.
           | 
           | More seriously, the risk that a few companies become _even
           | more_ powerful thanks to their restricted access to such NNs
           | is very frightening. The worst part is that, without legal
           | restrictions, there is nothing we can do against it. And
           | I doubt that legal restrictions will come in the next months /
           | years.
        
             | beepbooptheory wrote:
             | Well at that point, some people might have the crazy crazy
             | insight that no matter how big the model is, or how many
             | GPUs they have, it burns up all the same.
        
         | rjtavares wrote:
         | Small players should focus on applications of this tech.
         | 
         | We now know that whatever AI Models succeed in the future,
         | they'll be trained by a huge company and finetuned to a
         | specific use case. Small companies should be working on use
         | cases, and then just upgrade to the latest SOTA model.
        
           | varispeed wrote:
           | > Small players should focus on applications of this tech.
           | 
           | That sounds a bit condescending. We are probably at a point
           | from which the government should intervene and help establish
           | a level playing field. Otherwise we are going to see a deeper
           | divide between multibillion businesses conquering multiple
           | markets and a sort of neo-fiefdom situation. This is not good.
        
             | rjtavares wrote:
             | I'm not being condescending at all, we've learned that the
             | value in AI is in the applications. If you think government
             | should regulate the field, it should be to make AI Models a
             | commodity, like electricity.
        
             | tiborsaas wrote:
             | It's not that condescending; that's today's reality. Should
             | I feel entitled to $600k of training time that may or may not
             | work? Do you think the government is a good actor to judge
             | whether my qualifications are good enough to grant me
             | resources worth a house?
             | 
             | It's quite reasonable to make use of models already trained
             | for small players.
        
               | mschuster91 wrote:
               | > Do you think the government is a good actor to judge if
               | my qualifications are good enough to grant me resources
               | worth a house?
               | 
               | Governments already routinely do that for pharmaceutical
               | research or for nuclear (fusion) research. In fact,
               | almost _all_ major impact research and development was
               | funded by the government, mostly the military. Lasers,
               | microwaves, silicon, interconnected computers - all
               | funded by the US taxpayer, back in the golden times when
               | you'd get laughed out of the room if you dared think
               | about "small government". And the sums involved were
               | ridiculously larger than the worth of a house. We're
               | talking of billions of dollars.
               | 
               | Nowadays, R&D funding is way WAY more complex. Some
               | things like AI or mRNA vaccines are mostly funded by
               | private venture capital, some are funded by large
               | philanthropic donors (e.g. Gates Foundation), some by the
               | inconceivably enormous university endowments, a lot by
               | in-house researchers at large corporations, and a select
               | few by government grants.
               | 
               | The result of that complexity:
               | 
               | - professors have to spend an absurd percentage of their
               | time "chasing grants" (anecdata, up to 40% [1]) instead
               | of actually doing research
               | 
               | - because grants are time-restricted, it's rare to have
               | tenure track any more
               | 
               | - because of the time restriction and low grant amounts,
               | it's _very_ hard for the support staff as well. In
               | Germany and Austria, for example, extremely low paid
               | "chain contracts" are common - one contract after
               | another, usually for a year, but sometimes as low as half
               | a year. It's virtually impossible to have a social life
               | if you have to up-root it for every contract because you
               | have to take contracts wherever they are, and forget
               | about starting a family because it's just so damn
               | insecure. The only ones that can make it usually come
               | from highly privileged environments: rich parents or,
               | rarely, partners that can support you.
               | 
               | Everyone in academia outside of tenured professors
               | struggles with surviving, and the system ruthlessly
               | grinds people to their bones. It's a _disgrace_.
               | 
               | [1] https://www.johndcook.com/blog/2011/04/25/chasing-
               | grants/
        
               | tiborsaas wrote:
               | Pharmaceutical or nuclear research doesn't really
               | qualify as "small scale", which is what this thread
               | started with. I know there are massive amounts of money
               | handed out by governments to fund research, but for a
               | 3-guy startup in a garage that's probably hopeless.
               | Public money is cursed anyway; it's better not to touch it.
               | 
               | I've also read in many places that academic research
               | funding is way too misaligned. It's a shame, really.
        
             | googlryas wrote:
             | Do you think you'll get a global agreement on this? Or
             | would china just eat America's lunch then?
        
         | ignoramous wrote:
         | Yes, see DeepMind RETRO:
         | 
         | > _In our experiments on the Pile, a standard language modeling
         | benchmark, a 7.5 billion parameter RETRO model outperforms the
         | 175 billion parameter Jurassic-1 on 10 out of 16 datasets and
         | outperforms the 280B Gopher on 9 out of 16 datasets._
         | 
         | https://www.deepmind.com/blog/improving-language-models-by-r...
         | 
         | Though, there hasn't been much follow-up research on it (or
         | DeepMind is not publishing it).
         | 
         | Annotated paper:
         | https://github.com/labmlai/annotated_deep_learning_paper_imp...
        
           | espadrine wrote:
           | The research is still ongoing, although perhaps lower-profile
           | than what appears in the press.
           | 
           | RETRO did get press, but it was not the first retrieval
           | model, and in fact was not SOTA when it got published; FiD
           | was, which later evolved into Atlas[0], published a few
           | months ago.
           | 
           | [0]: https://github.com/facebookresearch/atlas
        
         | QuesnayJr wrote:
         | There are a couple of cases where small changes in the model
         | make training much quicker. For example, the currently leading
         | Go AI, KataGo, requires much less time to train than AlphaGo
         | did.
        
         | GistNoesis wrote:
         | Yes. There are plenty of forward leaps; most of them are not new
         | and are just waiting to be integrated or released:
         | 
         | Let's pave the road for SkyNet hard lift-off :
         | 
         | -The first obvious one is the use of an external knowledge store:
         | instead of having to store facts in the neural weights, where
         | they struggle, just store them in a database and teach your
         | neural network to use it. (This is also similar to something
         | like webgpt where you allow your network to search the web).
         | This will allow you to have a network of 1G parameters (and
         | external indexes of a few TB) that is as performant as a
         | network of 100G parameters, and with better scaling property
         | too. You can probably gain at least 2 orders of magnitude
         | there.
         | 
         | -The second leap is better architecture of your neural
         | networks: approximating transformers, which are quadratic in
         | compute, with something linear (Linformer) or n log n
         | (Reformer) can get you an order of magnitude faster by
         | simply reducing your iteration time. Similarly using some
         | architectures based on sparsity can give you faster computation
         | (although some of the gains are reduced by lesser efficiency of
         | sparse memory access pattern). Using (analog bits) Diffusion to
         | Generatively PreTrain sentences at a time instead of token by
         | token. You can probably gain between 1 and 3 order of magnitude
         | here if you write and optimize everything manually (or have
         | your advanced network/compiler optimize your code for you)
         | 
         | -The third leap is reduced domain : You don't have a single
         | network that you train on everything. Training one network by
         | domain allows you to have a smaller network that computes
         | faster. But also it allows you to focus your training on what
         | matters to you : for example if you want to have a mathematics
         | network, its parameters are not influenced a lot by showing it
         | football pictures. There is at least 2 orders of magnitude
         | there.
         | 
         | -The fourth one is external tool usage. It's kind of related to
         | the first one, but whereas the first one is readily
         | differentiable, this one necessitates some Reinforcement
         | Learning (that's what decision transformers are used for).
         | 
         | -Compression : compress everywhere. The bottlenecks are memory
         | bandwidth related. Work in compressed form when relevant. One
         | order of magnitude
         | 
         | -Distributed training : Because the memory bandwidth inside
         | a GPU is on the order of TB/s whereas the transfer to the GPU
         | is on the order of 10GB/s. There is an advantage to have the
         | parameters reside on the GPU but there is a limited quantity of
         | memory in the GPU, so distributed training (something like
         | petals.ml) allows you to increase your memory bandwidth by
         | collaborating. So each actor can probably gain an order of
         | magnitude. Provided that they can keep bad actors away.
         | 
         | -Use free resources : The other day Steam had 10M users with
         | GPUs waiting around doing nothing; just release a Dwarf Fortress
         | mod with prettier pictures and use the compute for more
         | important tasks.
         | 
         | -Remove any humans in the loop : it's faster to iterate when
         | you don't have to rely on any human, either for dataset
         | construction or model building
         | 
         | :)
        
         | seydor wrote:
         | It should be no issue if it became massively parallelized a la
         | SETI. I wonder when Wikimedia or the Apache Foundation will jump
         | into AI
        
           | yreg wrote:
           | Wikimedia and other organizations that deal with moderation
           | might want to keep this technology out of the hands of the
           | general public for as long as possible.
        
         | isthisthingon99 wrote:
         | How long does it take to train a human? It's useless for two
         | years then maybe it can tell you it needs to poop.
         | 
         | The breakthrough will be developing this equivalent in an
         | accessible manner and us taking care to train the thing for a
         | couple of decades but then it becomes our friend.
        
           | licebmi__at__ wrote:
           | Yes, but to be fair, the system that does the training really
           | sucks and doesn't scale.
        
             | cactusplant7374 wrote:
             | Neither does OpenAI. It costs so much and still delivers so
             | little. A human can generate breakthroughs in science and
             | tech that can be used to reduce carbon emissions. ChatGPT
             | can do no such thing.
        
               | VBprogrammer wrote:
               | What percentage of humans make meaningful contributions
               | to advancing science or technology? The overwhelming
               | majority of us are just worker bees servicing the needs
               | of the human population.
        
               | cactusplant7374 wrote:
               | I agree with you on this point. It's also arguable that
               | fewer people with a better education system could yield
               | the same result with less environmental impact.
               | 
               | But my point, poorly explained, is that whatever ChatGPT
               | is, it isn't original or creative thought as a human
               | would do it.
               | 
               | Chomsky's example (which is based off Turing): Do
               | submarines swim? Yes, they swim -- if that's what you
               | mean by swimming.
        
               | jsjohnst wrote:
               | > What percentage of humans make meaningful contributions
               | to advancing science or technology?
               | 
               | I'm a nobody that you've never heard of and I've arguably
               | made meaningful contributions. If that's true, don't you
               | think there could be way more people out there than you
               | or sibling commenter imply?
        
               | tlb wrote:
               | You can't know that. Currently, 8 billion humans generate
               | a few scientific breakthroughs per year. You'd have to
               | run several billion ChatGPTs for a year with zero
               | breakthroughs to have any confidence in such a claim.
        
               | mbrock wrote:
               | With billions of GPT output streams, how do you actually
               | discover and rank what's significant? Screen them through
               | some even more powerful models? I imagine it's like a
               | volcano eruption of text where some are absolutely
               | brilliant and most is worthless and finding the jewels is
               | even more demanding than generating it all.
        
               | tlb wrote:
               | Some theories are easily testable. For instance, ask it
               | to write some code to efficiently solve traveling
               | salesman problems, and then test the code on some sample
               | problems. You can score the quality of solutions and time
               | taken, and manually inspect the best ones.
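               | 
               | A sketch of that kind of automatic scoring (assuming the
               | generated code exposes, say, a solve(points) function that
               | returns a tour as a permutation of city indices):
               | 
               |     import math, random, time
               | 
               |     def tour_length(pts, tour):
               |         return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               |                    for i in range(len(tour)))
               | 
               |     def score(solve, n=200, trials=5):
               |         length, elapsed = 0.0, 0.0
               |         for _ in range(trials):
               |             pts = [(random.random(), random.random()) for _ in range(n)]
               |             t0 = time.perf_counter()
               |             tour = solve(pts)                     # generated solver under test
               |             elapsed += time.perf_counter() - t0
               |             assert sorted(tour) == list(range(n)) # visits every city once
               |             length += tour_length(pts, tour)
               |         return length / trials, elapsed / trials  # lower is better on both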
        
               | cactusplant7374 wrote:
               | At this point there is no framework that suggests GPT
               | understands the underlying data. It can't assign meaning
               | as a human would. It can't consume hundreds of math
               | textbooks and learn the principles of math and then apply
               | them more broadly to science textbooks and research
               | papers. It can't even reliably add two numbers.
               | 
               | Yes, brute forcing with hard AI can produce many
               | thoughts. But the AI wouldn't know they are correct. It
               | couldn't explain why. Any discovery would only be
               | attributable to randomness. It wouldn't be learning from
               | itself and its priors.
        
               | naasking wrote:
               | > At this point there is no framework that suggests GPT
               | understands the underlying data. It can't assign meaning
               | as a human would.
               | 
               | Actually there are many indications that GPT understands
               | the data, because its output mostly makes sense. The
               | reason it can't assign meaning the way a human would is
               | because a human can correlate words with _other sensory
               | data_ that GPT doesn't have access to. That's where GPT
               | creates nonsense.
               | 
               | Think carefully about what "understanding" means in a
               | mechanistic sense. It's a form of compression, and a few
               | billion parameters encoding the contents of a large part
               | of the internet seems like pretty good compression to me.
        
               | ivanbakel wrote:
               | GPT doesn't display understanding of purely abstract
               | systems, so I doubt it's an issue of lacking sensory
               | information. It can't consistently do arithmetic, for
               | example - and I think it would be presumptuous to insist
               | that sensory information is a prerequisite for
               | mathematics, even though that's how humans arrived at it.
        
               | naasking wrote:
               | It's not yet clear why it struggles with arithmetic. It
               | could be data-related, could be model-related, although
               | scaling both seems to improve the situation.
               | 
               | In any case, GPT could still understand non-abstract
               | things just fine. People with low IQ also struggle with
               | abstract reasoning, and IQ tests place GPT-3 at around
               | 83.
        
             | isthisthingon99 wrote:
             | I still think that this will be a major form of AI that is
             | accessible to the public at large and it will enable
             | productivity improvements at all levels.
             | 
             | I'm not joking, this is really something I think
             | will/should happen.
        
         | xur17 wrote:
         | Alternatively, are there ways to train on consumer graphics
         | cards, similar to SETI@Home or Folding@Home? I would personally
         | be happy to donate gpu time, as I imagine many others would as
         | well.
        
           | mryab wrote:
           | There absolutely are! Check out hivemind
           | (https://github.com/learning-at-home/hivemind), a general
           | library for deep learning over the Internet, or Petals
           | (https://petals.ml/), a system that leverages Hivemind and
           | allows you to run BLOOM-176B (or other large language models)
           | that is distributed over many volunteer PCs. You can join it
           | and host some layers of the model by running literally one
           | command on a Linux machine with Docker and a recent enough
           | GPU.
           | 
           | Disclaimer: I work on these projects, both are based on our
           | research over the past three years
        
           | alfor wrote:
           | The cost of moving data from one gpu to the next will destroy
           | performance.
           | 
           | The systems are moving in the opposite direction (look at the
           | Dojo architecture or Tenstorrent).
           | 
           | The silver lining is that the cost of training will fall
           | substantially with those architectures that are not based on
           | reusing GPUs.
        
         | breck wrote:
         | > Are there any possible technologal or scientific leaps on the
         | horizon
         | 
         | Yes. From 2017: "Prediction 4: The simplest 2D text encodings
         | for neural networks will be TLs. High level TLs will be found
         | to translate machine written programs into understandable
         | trees."
         | 
         | We have something coming out that is an OOM better than
         | anything else out there right now.
        
         | spi wrote:
         | What do you mean by "small players have no chance"? OpenAI was
         | founded in 2015, it used to be a "small player" which just got
         | things right and grew with it - we're not talking of Google or
         | Facebook investing a chunk of their billions cash. In Germany,
         | AlephAlpha has built their own supercomputer and are training
         | similar sized models. It's expensive for sure, but well in the
         | possibilities of startups. In France researchers trained the
         | similarly sized BLOOM model
         | https://huggingface.co/bigscience/bloom. They claim it cost
         | between $2 and $4 million.
         | 
         | Sure, a single researcher can't replicate this at their
         | university, but even though OpenAI likes to publish it this
         | way, we're not really talking about research here. Research was
         | inventing the transformer architecture, this is just making it
         | bigger by (very smart) engineering choices. It's something
         | companies should do (and are doing), not researchers.
        
           | SilverBirch wrote:
           | OpenAI was founded in 2015 by a group of billionaires who
           | pledged $1Bn of funding. That is hardly a small scrappy start
           | up.
        
           | awestroke wrote:
           | Microsoft (using Azure DCs) built a supercomputer with 10,000
           | V100 GPUs exclusively for OpenAI. [0]
           | 
           | It is estimated that it cost around $5M in compute time to
           | train GPT-3.
           | 
           | OpenAI has received billions in investment prior to launching
           | GPT-3, including $1B from Microsoft in 2019.
           | 
           | [0]: https://blogs.microsoft.com/ai/openai-azure-
           | supercomputer/
        
           | nileshtrivedi wrote:
           | > we're not talking of Google or Facebook investing a chunk
           | of their billions cash
           | 
           | OpenAI had raised $1B from Microsoft in 2019 and used it to
           | train a 175B param model. Now, they have raised $10B and are
           | training GPT-4 with 1.5T params. GPUs are capital intensive
           | and as long as there are returns to bigger models, that's
           | exactly where things will go.
        
             | andy_ppp wrote:
             | Will 1.5T parameters be possible to run in the public way
             | GPT-3 is? I can't wait to see what happens with this much
             | learning!
        
             | awestroke wrote:
             | I can't find any source on the 1.5T params number. I'd love
             | to read more if you have any links to share. Thanks
        
               | wut42 wrote:
               | afaik, gpt-4 is mostly rumours so far, same thing for the
               | 1.5T number. gpt-4 is surely coming.
        
               | wnkrshm wrote:
               | Maybe it will be called GPT-XP by then, with Microsoft
               | owning half of it.
        
               | belter wrote:
               | Looking forward to seeing GPT-4 recommend Linux and
               | LibreOffice instead of Windows/Office as the logical choice
               | out of a 250-IQ ML model...
        
               | ben_w wrote:
               | In my imagination, OpenAI does what Bungie did when MS
               | bought them, and open-sources what used to be their crown
               | jewels.
               | 
               | That said, GPT-AlephOne only makes sense if there's a
               | preceding GPT-[?].
        
               | egorfine wrote:
               | They have got to release GPT-3.11 For Workgroups first.
        
               | awestroke wrote:
               | Or GPT-365
        
               | MikeDelta wrote:
               | Then they can bring back the talking paperclip, but this
               | time actually useful.
        
               | generalizations wrote:
               | It _could_ actually work. It would be an incredibly gutsy
               | move and I love it, and they'd probably earn a lot of
               | respect. They'd get so much press for it. And if it held
               | up, it'd probably be one of the things that MS is
               | remembered for.
        
               | int_19h wrote:
               | Why not ask GPT itself what it wants to be called?
        
               | wut42 wrote:
               | Or GPT One.
        
               | taneq wrote:
               | GPT-10 will be evergreen and 'the last version of GPT'.
               | 
               | And then three years later GPT-11 will be required to run
               | the latest games.
        
           | orbifold wrote:
           | I am actually still unclear how AlephAlpha pulled that off
           | and who funds them, since they have a rather low profile
           | team.
        
           | hdjjhhvvhga wrote:
           | > we're not talking of Google or Facebook investing a chunk
           | of their billions cash.
           | 
         | On the contrary, in this thread we are mainly talking
           | about that.
        
         | davidy123 wrote:
         | Could this be distributed? Put all those mining GPUs to work. A
         | lot of people like participating in public projects like this.
         | I would!
        
           | PartiallyTyped wrote:
           | In theory, yes. "Hogwild!" is an approach to distributed
           | training, in essence, each worker is given a bunch of data,
           | they compute the gradient and send that to a central
           | authority. The authority accumulates the gradients and
           | periodically pushes new weights.
           | 
           | There is also Federated Learning which seemed to start taking
           | off, but then interest rapidly declined.
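           | 
           | A minimal sketch of that central-authority loop (toy PyTorch,
           | written synchronously for clarity; real Hogwild!-style
           | training is asynchronous and lock-free):
           | 
           |     import torch
           |     import torch.nn as nn
           | 
           |     model = nn.Linear(10, 1)                   # shared "central" weights
           |     opt = torch.optim.SGD(model.parameters(), lr=0.1)
           | 
           |     def worker_grad(weights, xb, yb):
           |         local = nn.Linear(10, 1)               # worker copies the weights
           |         local.load_state_dict(weights)
           |         nn.functional.mse_loss(local(xb), yb).backward()
           |         return [p.grad.clone() for p in local.parameters()]
           | 
           |     for step in range(100):
           |         grads = [worker_grad(model.state_dict(),
           |                              torch.randn(32, 10), torch.randn(32, 1))
           |                  for _ in range(4)]            # pretend: 4 remote workers
           |         for p, *gs in zip(model.parameters(), *grads):
           |             p.grad = torch.stack(gs).mean(0)   # accumulate the gradients
           |         opt.step(); opt.zero_grad()            # push new weights next round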
        
           | naraga wrote:
           | Exactly. This is inevitable imho. There is no way people will
           | be ok depending on a few walled-garden models.
        
           | dmit wrote:
           | >> GPT-3 took 355 years to train
           | 
           | > Could this be distributed? Put all those mining GPUs to
           | work.
           | 
           | Nope. It's a strictly O(n) process. If it weren't for the
           | foresight of George Patrick Turnbull in 1668, we would not be
           | anywhere close to these amazing results today.
        
             | CyberDildonics wrote:
             | Why would an O(n) algorithm not be able to be distributed?
        
             | davidy123 wrote:
             | I couldn't find any references to George Patrick Turnbull.
             | Is that an ancestor of yours? If so, the comment seems
             | rather subjective.
        
               | taneq wrote:
               | They're being facetious about the '355 years to train'
               | thing. ;)
        
               | davidy123 wrote:
               | OK haha good one then. Mine was a bit too subtle.
        
         | pprotas wrote:
         | What does "355 years" mean in this context? I assume it's not
         | human years
        
           | mellosouls wrote:
           | Claimed here, so this is presumably the reference (355 GPU
           | Years):
           | 
           | https://lambdalabs.com/blog/demystifying-gpt-3
           | 
           | "We are waiting for OpenAI to reveal more details about the
           | training infrastructure and model implementation. But to put
           | things into perspective, GPT-3 175B model required 3.14E23
           | FLOPS of computing for training. Even at theoretical 28
           | TFLOPS for V100 and lowest 3 year reserved cloud pricing we
           | could find, this will take 355 GPU-years and cost $4.6M for a
           | single training run. Similarly, a single RTX 8000, assuming
           | 15 TFLOPS, would take 665 years to run."
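           | 
           | The arithmetic behind that figure, for anyone checking:
           | 
           |     flops_needed = 3.14e23      # quoted training compute for GPT-3 175B
           |     v100_flops = 28e12          # theoretical V100 throughput, FLOP/s
           |     gpu_seconds = flops_needed / v100_flops
           |     print(gpu_seconds / (365 * 24 * 3600))  # ~355 GPU-years on one V100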
        
             | dx034 wrote:
             | That's still including the margins of cloud vendors. OpenAI had
             | Microsoft providing resources which could do that at much
             | lower cost. It still won't be cheap but you'll be way below
             | $5m if you buy hardware yourself, given that you're able to
             | utilize it long enough. Especially if you set it up in a
             | region with low electricity prices, latency doesn't matter
             | anyway.
        
           | Manfred wrote:
           | Cumulative hours spent across training hardware.
        
         | captainmuon wrote:
         | I wonder about this, too. OpenAI's biggest 'moat' is that their
         | model takes so many resources to train, not that their
         | algorithms are particularly secret.
         | 
         | One idea I had was to not use one single model to learn all
         | steps of the task, but to break it up. The human brain has
         | dedicated grammar processing parts. It is unclear whether
         | something like a universal grammar exists, but we have at least
         | an innate sense for rhythm. Applied to NLP, you could heavily
         | preprocess the input. Tokenize it, annotate parts of speech.
         | Maybe add pronunciation, so the model doesn't have to think
         | about weird english spelling rules, and so you can deal with
         | audio more easily later. So I would build all these little
         | expert-knowledge black boxes and offer them as input to my
         | network.
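         | 
         | For that preprocessing step, off-the-shelf NLP tooling already
         | gets you part of the way. A sketch with spaCy (assuming its
         | small English model is installed):
         | 
         |     import spacy
         | 
         |     nlp = spacy.load("en_core_web_sm")   # needs spaCy + this model installed
         |     doc = nlp("Tokenize it, annotate parts of speech.")
         |     print([(tok.text, tok.pos_) for tok in doc])
         |     # these annotations could be offered to the network as extra inputs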
         | 
         | But there is also some inherent resource cost in large language
         | models. If you want to store and process the knowledge of the
         | world, it is going to be expensive no matter what. Maybe we
         | could split the problem into two parts: Understanding language,
         | and world knowledge (with some messy middle ground). I believe
         | you could replace the world knowledge with a huge graph
         | database or triple store. Not just subject-verb-object, but
         | with attribution and certainty numbers for every fact. The idea
         | would be to query the database at inference time. I don't know
         | how to use this in conjunction with a transformer network like
         | GPT-3, so you'd likely need a very different architecture.
         | 
         | The big benefit of this would be that it is feasible to train
         | the language part without the world knowledge part with much
         | less resources. But you have other benefits, too. ChatGPT is
         | trained to "win the language game". But as they say, winning
         | the argument does not make you right. If you have a clean fact
         | database, you can have it weigh statements from trustworthy
         | sources higher. You then basically have a nice natural language
         | frontend to a logical reasoning system that can respond with
         | facts (or better: conclusions).
        
           | joaogui1 wrote:
           | Their biggest moat is _high-quality_ data: both their
           | proprietary datasets (WebText, WebText2 etc.), but also now
           | their human-annotated data. A secondary moat is their
           | expertise with training models using PPO (their RL method);
           | they can get results that are considerably better than other
           | labs'. I say this moat is secondary because it's possible that you
           | can get similar results with other RL algorithms (e.g.
           | DeepMind using MPO) and because maybe you don't really need
           | RL from Human Feedback, and just fine-tuning on instructions
           | is enough
        
             | Metus wrote:
             | I find OpenAI having exclusive access to that kind of high-
             | quality data more concerning than them having access to
             | their current amount of compute and currently trained
             | model. A couple of million dollars worth of compute is in
             | the realm of any medium sized research university, larger
             | company or any country worth mentioning. And seeing as
             | Moore's law still applies to GPUs, the cost will only fall.
             | 
             | However _high-quality data_ is scarce. I would be willing
             | to fund a proper effort to create high-quality data.
        
           | visarga wrote:
           | Check out DeepMind RETRO, it's one year old already, but
           | exactly what you say:
           | 
           | https://www.deepmind.com/publications/improving-language-
           | mod...
        
           | lossolo wrote:
           | It's not just about compute; if that were the case, then
           | models like BLOOM and OPT, which also have 175 billion
           | parameters, would have the same performance for real-world
           | use cases as GPT-3, but they don't. Datasets are also very
           | important.
        
           | ccozan wrote:
           | GPT and the human brain ( at least the language / speech part )
           | have nothing in common. We, as humans, do not use language in
           | a generative way; it is derived from a higher or very low level
           | of abstraction ( intentions, emotions, etc ) and is explicitly
           | used for communicating something. Even this text is based on
           | previous knowledge, saved in an abstract way, and while
           | writing this I must follow the synthax of the language or
           | writing the right order otherwise, you, the person who reads
           | this, will not understand what I mean. While GPT can generate
           | the same text, it does not have a motivation and has no need
           | to communicate ( while I just wanted to feel good by bringing
           | some contribution on HN ).
           | 
           | So yes, very different architecture.
        
             | naasking wrote:
             | These are conceptual "differences" that don't actually
             | explain the mechanics of what's going on. For all you know
             | "motivation", "intentions", etc. are also just GPT-like
             | subsystems, in which case the underlying mechanics are not
             | as different as you imply.
        
               | mensetmanusman wrote:
               | If they were GPT-like subsystems, humans would be emitting
               | MWs of power instead of the ~100W we do now.
               | 
               | Whatever humans have, it is many orders of magnitude
               | better...
        
               | ben_w wrote:
               | That's the hardware it runs on, not the software
               | architecture of GPT. I could equally say that transistors
               | are faster than synapses by the same ratio that marathon
               | runners are faster than continental drift.
        
               | naasking wrote:
               | Or biology evolved a better way to do the same or similar
               | enough computation that we simply haven't yet discovered.
        
               | ImHereToVote wrote:
               | Emotion is just a "spiritual" word for a utility function.
               | Or a terminal goal, to be more precise.
        
             | throwuwu wrote:
             | It seems to me that a lot of everyday communication is
             | rather statistical in nature. We don't necessarily think
             | deeply about each word choice but instead fall back on well
             | worn patterns and habits. We can be more deliberate about
             | how we compose our sentences but most situations don't call
             | for it. It makes me wonder if we don't all have a
             | generative language model embedded in our brains that
             | serves up the most likely next set of words based on our
             | current internal state.
        
             | thomastjeffery wrote:
             | On top of it not having "motivation" to communicate, it has
             | _literally nothing_ to be communicated in the first place.
             | 
             | That's the key difference. We use language to express
             | conceptualizations. We have some kind of abstract model
             | somewhere that we are translating.
             | 
             | Maybe it isn't a cohesive model either. All I can say for
             | certain is that - whatever it is - we are expressing it.
             | 
             | GPT does not express. It parrots. There is no
             | conceptualization.
        
               | captainmuon wrote:
               | The more experience I get, the more I wonder if this is
               | really the case for us. We certainly have some kind of
               | abstract model in our heads when thinking deeply about a
               | problem. But in many settings - in a work meeting, or
               | socially with friends - I think it is a much more
               | automatic process. The satisfaction you get when saying
               | the right thing, the dread when you say something stupid:
               | It is just like playing a game. Maybe the old
               | philosophical concept of society as merely "language
               | games" is correct after all. A bit silly but I find the
               | thought makes annoying meetings a bit more bearable.
               | 
               | But you are of course right with GPT, it has no inner
               | life and only parrots. It completely lacks something like
               | an inner state, an existence outside of the brief moment
               | it is invoked, or anything like reflection. Reminds me of
               | the novel "Blindsight" (which I actually haven't read
               | yet, but heard good things about!) where there are beings
               | that are intelligent, but not conscious.
        
               | thomastjeffery wrote:
               | Intelligent but not conscious would still be a few steps
               | ahead of GPT.
               | 
               | We can take a concept and refactor it symbolically. GPT
               | can't do that. All it does is find symbols that are
               | semantically close to other symbols.
        
             | ben_w wrote:
             | > and while writing this I must follow the synthax of the
             | language or writing the right order otherwise
             | 
             | A good example that is not, word randomised order and
             | kombination with Mrs Spelling and fonetic spel-ing prevent
             | ye knot that which I wrote you to komprehend.
             | 
             | (My apologies to non-native speakers of English; if someone
             | did that to me in German I'd have no clue what was meant).
             | 
             | A better point is that GPT-3's training set is more tokens
             | than the number of times an average human synapse fires in
             | a lifetime, squeezed into a network with about 3 orders of
             | magnitude fewer parameters than the human brain has
             | synapses.
             | 
             | It's wrong to model AI as anything like natural
             | intelligence, but if someone insists, my go-to comparison
             | (with an equivalent for image generators) is this: "Imagine
             | someone made a rat immortal, then made it browse the web
             | for 50,000 years. It's still a rat, despite being very
             | well-trained."
        
             | visarga wrote:
             | > GPT and human brain have nothing in common
             | 
             | Here we go again. They must have something in common,
             | because for about 90% of the tasks the language model
             | agrees with humans, even on novel tasks.
             | 
             | > We, as humans, do not use language in a generative way
             | 
             | Oh, do you want to say we are only doing classification
             | from a short list of classes and don't generate open ended
             | language? Weird, I speak novel word combinations all the
             | time.
        
               | ccozan wrote:
               | No, what is meant is that the next word I speak/write
               | after the current word is not based on a statistical
               | model, but on a world model which includes a language
               | structure based on a defined syntax and cultural variety.
               | I actually mean what I say, while ChatGPT just parrots
               | around weights and produces an output based purely on
               | statistics. There is zero modeling which translates into
               | the real world ( what we normally call "understanding" and
               | "experience" ).
               | 
               | As was said, a different architecture.
        
         | naraga wrote:
          | I think something of the SETI@home kind will come along.
        
       | karpathy wrote:
       | Wow, fun to find this trending on HN this morning! I am currently
       | also working on the associated video lecture (as the next episode
       | of my video lecture series here https://karpathy.ai/zero-to-
       | hero.html ), where I will build nanoGPT from scratch and aspire
       | to spell everything out, as with the earlier videos. Hoping to
       | get it out in ~2 weeks or so.
        
         | TheAlchemist wrote:
          | Thank you for your amazing work. Between cs231n and your recent
          | videos, I've learned a ton - and you have a gift for explaining
          | things in such an easy and straightforward way that I always
          | feel like an idiot (in a positive way) for not having grasped
          | the concept before.
        
         | katsucurry wrote:
         | I've found all of your code and lessons on youtube so
         | incredibly useful. You're a wonderful teacher and I really
         | appreciate all the work you've done with this!
        
         | imranq wrote:
         | Just wanted to say thank you for all the incredible work and
         | resources you publish. I've lost track of all the different
         | skills I've learned from you, from computer vision, RNNs,
         | minGPT, even speedcubing :D
        
         | StefanWestfal wrote:
          | Openly accessible lectures and knowledge like yours have
          | allowed many people, me included, to turn their lives around
          | by putting in the effort and developing themselves. Thank you.
        
         | subbu wrote:
          | Your YouTube playlist combined with NanoGPT and your Lex
          | Fridman podcast episode is like having a university-level
          | degree with free internship guidance. Thank you!
        
         | goldenshale wrote:
          | Badass! A great addition would be some content on tuning pre-
          | trained language models for particular purposes. It would be
          | great to have examples of things like tuning a GPT model
          | trained on language and code to take in a context and spit out
          | code against my custom API, or using my internal terminology.
          | Not sure if this needs RL-based fine-tuning or just a bunch of
          | language-to-code examples in a fine-tuning dataset. In essence,
          | how can we start using language to control our software?
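          | 
          | To make the supervised version concrete, here is a minimal
          | sketch of what I have in mind: continue training a pretrained
          | model on a small file of domain-specific text. (This uses the
          | plain Hugging Face GPT-2 API rather than nanoGPT's own code,
          | and the corpus filename is made up.)
          | 
          |     import torch
          |     from transformers import GPT2LMHeadModel
          |     from transformers import GPT2TokenizerFast
          | 
          |     tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
          |     model = GPT2LMHeadModel.from_pretrained("gpt2")
          |     model.train()
          | 
          |     # hypothetical corpus: API calls, internal terminology
          |     text = open("my_api_examples.txt").read()
          |     ids = tokenizer(text, return_tensors="pt").input_ids[0]
          | 
          |     block_size = 256
          |     opt = torch.optim.AdamW(model.parameters(), lr=3e-5)
          | 
          |     for step in range(200):  # short finetuning run
          |         i = torch.randint(0, len(ids) - block_size, (1,)).item()
          |         x = ids[i : i + block_size].unsqueeze(0)
          |         # labels == inputs gives the next-token LM loss
          |         loss = model(x, labels=x).loss
          |         opt.zero_grad()
          |         loss.backward()
          |         opt.step()
          | 
          | The RLHF route would layer a reward model and RL on top of
          | something like this; for "emit code against my API", the plain
          | supervised route above is usually the first thing to try.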
        
           | karpathy wrote:
           | Ty agree, most people practically speaking will be interested
           | in finetuning rather than from-scratch pretraining. I
           | currently have some language about it in readme but I agree
           | this should get more focus, docs, examples, etc.
        
         | marviel wrote:
         | Your tutorials are effective and concise. Thank you for them!
         | Accessible, from-scratch knowledge on these topics is essential
         | at this time in history and you're really making a dent in that
         | problem.
        
         | eternalban wrote:
         | Thank you for sharing your knowledge. Anything that can be done
         | to democratize machine learning is an invaluable social
         | service. Hats off to you.
        
         | dsabanin wrote:
         | Thank you for your great work!
        
         | highfrequency wrote:
         | Appreciate the work to make GPT training accessible!
         | 
         | Do you leave hyperparams (like learning rate, batch size) the
         | same when switching from 8xA100 to fewer GPUs, or do these need
         | to be adjusted?
         | 
          | Separately, when going from 8x A100 GPUs to a single A100 GPU,
          | in the worst case can we expect the same model performance
          | after training 8x as long? (And likely a bit better, because
          | we get more gradient updates in with the smaller batch size.)
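          | 
          | My guess for the single-GPU case, as a toy sketch (generic
          | PyTorch, not nanoGPT's actual settings): accumulate gradients
          | over 8 micro-batches per optimizer step so the effective batch
          | per update matches the 8-GPU run; then the learning rate can
          | usually stay as-is and wall-clock time grows roughly 8x.
          | 
          |     import torch
          |     import torch.nn as nn
          |     import torch.nn.functional as F
          | 
          |     model = nn.Linear(16, 1)  # stand-in for the real network
          |     optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)
          |     accumulation_steps = 8    # replaces 8 data-parallel GPUs
          | 
          |     optimizer.zero_grad()
          |     for i in range(800):
          |         # one micro-batch of random toy data
          |         x, y = torch.randn(4, 16), torch.randn(4, 1)
          |         # scale so accumulated grads match one big batch
          |         loss = F.mse_loss(model(x), y) / accumulation_steps
          |         loss.backward()       # grads sum across micro-batches
          |         if (i + 1) % accumulation_steps == 0:
          |             optimizer.step()  # one update per 8 micro-batches
          |             optimizer.zero_grad()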
        
         | de_nied wrote:
         | Thank you for your constant contributions.
        
         | moralestapia wrote:
         | While doing my PhD some years ago (it wasn't a PhD on AI, but
         | very much related) I trained several models with the usual
         | stack back then (pytorch and some others in TF). I realized
         | that a lot of this stack could be rewritten in much simpler
         | terms without sacrificing much fidelity and/or performance in
         | the end.
         | 
          | Submissions like yours and other projects like this one
          | (recently featured here as well) ->
          | https://github.com/ggerganov/whisper.cpp, make it pretty clear
          | to me that this intuition is correct.
         | 
          | There are a couple of tools I created back then that could
          | push things further in this direction. Unfortunately, they're
          | not mature enough to warrant a release, but the ideas they
          | embody are worth a look (IMHO) and I'd be happy to share them.
          | If there's interest on your side (or from anyone reading this
          | thread), I'd love to talk more about it.
        
         | gtoubassi wrote:
         | +1. I've benefited greatly from your content, e.g. your CNN
         | lecture was incredibly accessible [0]. I still find
         | transformers stubbornly elude my intuitions despite reading
         | many descriptions. I would very much appreciate your video
         | lecture on this topic.
         | 
         | [0] I think https://www.youtube.com/watch?v=LxfUGhug-iQ
        
         | misza222 wrote:
         | Thanks for your work Andrej! I've been doing earlier lectures
         | and this is absolutely fantastic educational content!
        
         | cs702 wrote:
         | Andrej: thank you!
         | 
         | --
         | 
         | To the mod (dang): IMHO Andrej's comment should probably be at
         | the top of the page, not my comment. UPDATE: Looks like that's
         | done. Thank you :-)
        
       ___________________________________________________________________
       (page generated 2023-01-11 23:00 UTC)