[HN Gopher] The idea maze for AI startups (2015)
___________________________________________________________________
The idea maze for AI startups (2015)
Author : gmays
Score : 86 points
Date : 2023-06-28 14:28 UTC (8 hours ago)
(HTM) web link (cdixon.org)
(TXT) w3m dump (cdixon.org)
| cornercasechase wrote:
| Chris Dixon just spent 2 years investing in and hyping NFTs. AI
| is a legitimate innovation that can do without all of the "Web3"
| con artists driving any aspect of it.
|
| Source: https://cdixon.org
| rootusrootus wrote:
| > just spent 2 years
|
| So ... after he wrote this blog post?
| ramraj07 wrote:
    | In my opinion it's legitimate to question the discretion of
    | anyone who put their entire weight behind cryptocrappency,
    | especially NFTs.
| avarun wrote:
| Maybe. What about questioning the discretion of anyone who
| uses childish terminology like "cryptocrappency" and ad
| hominems instead of making an actual point?
| arberx wrote:
| There's one for that:
|
| https://cdixon.org/2010/01/03/the-next-big-thing-will-start-...
| throwaway164023 wrote:
| Chris is an absolute grifter - I read his book on web 3 and was
| so underwhelmed by his depth of thought.
|
| Better to ignore folks who have no experience in building
| cutting-edge products - he's just an average philosopher turned
| VC because it pays more.
| tough wrote:
| He's launching a new book on web3 soon, isn't he?
|
| Wondering if you're talking about that new one or a previous
| one
| anotherpaulg wrote:
| I think there's a new approach for "How do you get the data?"
| that wasn't available when this article was written in 2015. The
| new text and image generative models can now be used to
| synthesize training datasets.
|
| I was working on a typing autocorrect project and needed a
| corpus of "text messages". Most of the traditional NLP corpora,
| like those available through NLTK [0], aren't suitable. But it was
| easy to script ChatGPT to generate thousands of believable text
| messages by throwing random topics at it.
|
| Similarly, you can synthesize a training dataset by giving GPT
| the outputs/labels and asking it to generate a variety of inputs.
| For sentiment analysis... "Give me 1000 negative movie reviews"
| and "Now give me 1000 positive movie reviews".
|
| The Alpaca folks used GPT-3 to generate high-quality instruction-
| following datasets [1] based on a small set of human samples.
|
| Etc.
|
| [0] https://www.nltk.org/nltk_data/
|
| [1] https://crfm.stanford.edu/2023/03/13/alpaca.html
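
The data-synthesis recipe described in the comment above can be sketched in a few lines. This is a minimal sketch, not the commenter's actual script: it assumes the 2023-era OpenAI chat-completions REST endpoint and an `OPENAI_API_KEY` environment variable, and the model name, prompt wording, and batch size are all illustrative.

```python
import json
import os
import urllib.request

API_URL = "https://api.openai.com/v1/chat/completions"

def build_request(sentiment: str, n: int, model: str = "gpt-3.5-turbo") -> dict:
    """JSON request body asking the chat model for n labeled movie reviews."""
    prompt = (f"Give me {n} {sentiment} movie reviews, "
              "one per line, with no numbering.")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,  # high temperature for more varied samples
    }

def synthesize_reviews(sentiment: str, n: int = 20):
    """Call the API and pair each returned line with its sentiment label."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_request(sentiment, n)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        content = json.load(resp)["choices"][0]["message"]["content"]
    return [(line.strip(), sentiment) for line in content.splitlines()
            if line.strip()]

if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    # Collect both halves of the sentiment dataset as (text, label) pairs.
    dataset = synthesize_reviews("negative") + synthesize_reviews("positive")
```

Throwing random topics or personas into the prompt, as the comment suggests, is one way to keep the generated corpus from collapsing onto a few stereotyped examples.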
| __loam wrote:
| Sampling an AI output when the distribution you want is human
| data is incredibly stupid.
| alex_lav wrote:
| It's funny, for my lil startup, "How do you get the data" is
| now _less_ tech than ever. I pay an hourly wage to a human to
| generate/transcribe it. This method is both much more cost
| effective and scalable than tech-enabled alternatives.
| Imnimo wrote:
| An interesting question is, if you can get ChatGPT to generate
| high quality data for you, should you just cut out the middle-
| model and be using ChatGPT as your classifier?
|
| The answer probably depends a lot on your specific problem
| domain and constraints, but a non-trivial amount of the time
| the answer will be that your task could be solved by a wrapper
| around the ChatGPT API.
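
A wrapper of that kind mostly reduces to two pure functions: constructing a constrained prompt, and mapping the model's free-text reply back onto a fixed label set. A minimal sketch (the prompt wording and fallback behavior are assumptions; the actual chat API call is elided):

```python
def classify_prompt(text: str, labels: list[str]) -> str:
    """Prompt asking the model to pick exactly one of the allowed labels."""
    return ("Classify the following text as one of: "
            + ", ".join(labels)
            + ".\nReply with just the label.\n\nText: " + text)

def parse_label(reply: str, labels: list[str], default=None):
    """Map the model's reply back onto the label set.

    Naive substring match -- the first label found wins, and anything
    unrecognized falls back to a default. Real wrappers need stricter parsing.
    """
    reply = reply.strip().lower()
    for label in labels:
        if label.lower() in reply:
            return label
    return default
```

Send `classify_prompt(...)` to the chat endpoint and run the reply through `parse_label`; the rest of the "classifier" is retry logic and rate limiting.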
| danuker wrote:
| > should you just cut out the middle-model and be using
| ChatGPT as your classifier?
|
| And hope OpenAI forever provides the service, and at a
| reasonable price, latency, and volume?
| arbuge wrote:
| Besides that, there's the issue of efficiency.
|
      | Better quality training data might enable you to build a
      | leaner, more efficient model that is far cheaper to
      | implement and run than the expensive model used to generate
      | the data to train it.
|
| See for example: https://twitter.com/SebastienBubeck/status
| /16713263696268533...
| og_kalu wrote:
| >should you just cut out the middle-model and be using
| ChatGPT as your classifier?
|
| Oh you certainly could.
|
| See here: GPT-3.5 outperforming elite crowdworkers on MTurk
| for Text annotation https://arxiv.org/abs/2303.15056
|
    | GPT-4 going toe to toe with experts (and significantly
    | outperforming crowdworkers) on NLP tasks
|
| https://www.artisana.ai/articles/gpt-4-outperforms-elite-
| cro...
|
    | I guess it will take some time before the reality really
    | sinks in, but the days of the artificial SOTA being obviously
    | behind human efforts for NLP have come and gone.
| nfmcclure wrote:
| You definitely can use LLMs to do your modeling. But
| sometimes you need very fast, cheap, and smaller models
    | instead. Also, there's research out there showing that using
    | LLMs to generate training data for targeted & specific models
    | may result in better performance.
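
The "small, fast model trained on LLM-generated data" route can be illustrated with a dependency-free bag-of-words classifier. The four training examples below stand in for a (much larger) synthetic dataset and are invented for illustration:

```python
import math
from collections import Counter, defaultdict

class TinyNB:
    """Multinomial naive Bayes over bag-of-words: small, fast, no dependencies."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter(labels)   # label -> number of documents
        self.vocab = set()
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        words = text.lower().split()
        best, best_lp = None, float("-inf")
        total_docs = sum(self.doc_counts.values())
        for label, n_docs in self.doc_counts.items():
            lp = math.log(n_docs / total_docs)  # class prior
            denom = sum(self.counts[label].values()) + len(self.vocab)
            for w in words:
                # Laplace smoothing so unseen words don't zero out the class
                lp += math.log((self.counts[label][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

# Stand-ins for LLM-generated, LLM-labeled training examples.
train = [("terrible boring plot", "negative"),
         ("awful waste of time", "negative"),
         ("wonderful acting great story", "positive"),
         ("loved it brilliant film", "positive")]
model = TinyNB().fit([t for t, _ in train], [l for _, l in train])
```

Once distilled into something like this, each prediction is a few dictionary lookups rather than an API round trip, which is the latency/cost argument the comment is making.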
| atleastoptimal wrote:
  | Is synthesized data high quality, or does it just seem high
  | quality?
| carbocation wrote:
| Appears to be susceptible to model collapse[1], depending on
| how you do it.
|
| 1 = https://arxiv.org/abs/2305.17493v2
| xigency wrote:
| Not super on point, but wow that boilerplate at the end really
| goes above and beyond at saying "this is my personal blog, and
| just, like, my opinion man."
| wanderingstan wrote:
  | I suspect it's due to his writing about crypto, where all the
  | legal/regulatory risks around securities and financial advice
  | can be high.
| icpmacdo wrote:
| The linked document from Balaji's startup engineering course is
| extremely useful
|
| https://spark-public.s3.amazonaws.com/startup/lecture_slides...
| lporto wrote:
| Archived slides: https://github.com/ladamalina/coursera-startup
| tester457 wrote:
| Coursera link is broken unfortunately
| carlossouza wrote:
| Interesting read. I'd argue the most successful AI-based products
| are the ones that settle for 80-90% accuracy and "Create a fault-
| tolerant UX."
|
| Then, the question becomes: how to create a great fault-tolerant
| UX?
|
| There are some nice recent cases... Github Copilot is one...
| chrisdbanks wrote:
| How much has changed since 2015. With NLP and ML it used to be
| so hard to create high-quality datasets. It was a case of
| rubbish in, rubbish out. Now LLMs have solved that problem. It
| seems that if you put a huge amount of data into a big enough
| model, an emergent ability is discerning the wheat from the
| chaff. Certainly in the NLP space, the days of crowd-sourced
| datasets seem to be over, replaced by few-shot learning. So much
| value has been unlocked.
| [deleted]
| luckyt wrote:
| I think the author's point broadly still holds -- you can get
| further with more engineering resources and data, whether
| you're using 2015 era models or 2023 retrieval-augmented LLMs
| and fine-tuning. Just that now you can accomplish a lot more
| quickly with a ChatGPT prompt.
| dweekly wrote:
| There's an interesting dark side to this as well, which is that
| in 2023 when you think you are crowdsourcing data you may
| actually just be tasking it to ChatGPT. A lot of turkers just
| turn around and use an LLM!
| __loam wrote:
    | Which is, of course, absolutely terrible for the quality of
    | the dataset you're trying to produce.
| RC_ITR wrote:
| This is an interesting post (and an interesting reminder that
| even Bitcoin maximalists had other things on their minds in
| 2015).
|
| I would argue that the first step of the maze makes _a ton_ of
| sense for the voice recognition/image classification/driving
| use-cases of 2015 that had binary outcomes, but nowadays, what
| would it even mean for an LLM to be right 80% of the time? 8/10
| words are predicted correctly? It can speak correctly on 80% of
| topics?
|
| The reason people are so jazzed about generative AI is that it's
| not autonomously doing a task - it's helping a human operator by
| making (sometimes very useful) guesses on their behalf. It's much
| more of a _tool_ than a solution (even if _a lot_ of people want
| it to be a solution).
___________________________________________________________________
(page generated 2023-06-28 23:00 UTC)