[HN Gopher] The idea maze for AI startups (2015)
       ___________________________________________________________________
        
       The idea maze for AI startups (2015)
        
       Author : gmays
       Score  : 86 points
       Date   : 2023-06-28 14:28 UTC (8 hours ago)
        
 (HTM) web link (cdixon.org)
 (TXT) w3m dump (cdixon.org)
        
       | cornercasechase wrote:
       | Chris Dixon just spent 2 years investing in and hyping NFTs. AI
       | is a legitimate innovation that can do without all of the "Web3"
       | con artists driving any aspect of it.
       | 
       | Source: https://cdixon.org
        
         | rootusrootus wrote:
         | > just spent 2 years
         | 
         | So ... after he wrote this blog post?
        
           | ramraj07 wrote:
           | In my opinion it's legitimate to question the discretion of
           | anyone who put their entire weight behind cryptocrappency
           | especially NFT.
        
             | avarun wrote:
             | Maybe. What about questioning the discretion of anyone who
             | uses childish terminology like "cryptocrappency" and ad
             | hominems instead of making an actual point?
        
         | arberx wrote:
         | There's one for that:
         | 
         | https://cdixon.org/2010/01/03/the-next-big-thing-will-start-...
        
       | throwaway164023 wrote:
       | Chris is an absolute grifter - I read his book on web 3 and was
       | so underwhelmed by his depth of thought.
       | 
        | Better to ignore folks who have no experience in building
        | cutting-edge products - he's just an average philosopher turned
        | VC because it pays more.
        
         | tough wrote:
         | He's launching a new book on web3 soon, isn't he?
         | 
         | Wondering if you're talking about that new one or a previous
         | one
        
       | anotherpaulg wrote:
       | I think there's a new approach for "How do you get the data?"
       | that wasn't available when this article was written in 2015. The
       | new text and image generative models can now be used to
       | synthesize training datasets.
       | 
        | I was working on a typing autocorrect project and needed a
        | corpus of "text messages". Most of the traditional NLP corpora,
        | like those available through NLTK [0], aren't suitable. But it was
       | easy to script ChatGPT to generate thousands of believable text
       | messages by throwing random topics at it.
       | 
       | Similarly, you can synthesize a training dataset by giving GPT
       | the outputs/labels and asking it to generate a variety of inputs.
       | For sentiment analysis... "Give me 1000 negative movie reviews"
       | and "Now give me 1000 positive movie reviews".
       | 
       | The Alpaca folks used GPT-3 to generate high-quality instruction-
       | following datasets [1] based on a small set of human samples.
       | 
       | Etc.
       | 
       | [0] https://www.nltk.org/nltk_data/
       | 
       | [1] https://crfm.stanford.edu/2023/03/13/alpaca.html
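        | The labels-first prompting described above can be sketched as
        | plain prompt construction (hypothetical topic list and batch
        | sizes; the actual chat-completion call is stubbed out, so plug
        | in whichever client you use):

```python
# Sketch: build prompts that ask an LLM for labeled synthetic data.
# The LLM call itself is left out; this only prepares the requests.
import random

# Hypothetical topic list, used to vary prompts so the synthesized
# outputs don't all look the same.
TOPICS = ["acting", "plot", "pacing", "soundtrack", "cinematography"]

def make_prompt(label: str, n: int, topic: str) -> str:
    """One generation request for `n` reviews with a fixed label."""
    return (f"Give me {n} {label} movie reviews focusing on the {topic}. "
            "Return one review per line, no numbering.")

def synthesize_requests(labels, per_label=1000, batch=50, seed=0):
    """Yield (prompt, label) pairs covering `per_label` examples each."""
    rng = random.Random(seed)
    for label in labels:
        for _ in range(per_label // batch):
            yield make_prompt(label, batch, rng.choice(TOPICS)), label

# Each yielded prompt would go to the chat API; the paired label is
# attached to every line of the response to form the dataset.
requests = list(synthesize_requests(["negative", "positive"]))
```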
        
         | __loam wrote:
         | Sampling an AI output when the distribution you want is human
         | data is incredibly stupid.
        
         | alex_lav wrote:
         | It's funny, for my lil startup, "How do you get the data" is
         | now _less_ tech than ever. I pay an hourly wage to a human to
         | generate/transcribe it. This method is both much more cost
         | effective and scalable than tech-enabled alternatives.
        
         | Imnimo wrote:
         | An interesting question is, if you can get ChatGPT to generate
         | high quality data for you, should you just cut out the middle-
         | model and be using ChatGPT as your classifier?
         | 
         | The answer probably depends a lot on your specific problem
         | domain and constraints, but a non-trivial amount of the time
         | the answer will be that your task could be solved by a wrapper
         | around the ChatGPT API.
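          | As an illustration, such a wrapper can be as thin as a prompt
          | template plus response parsing (a sketch, not a real client;
          | the backend is injected so a stub stands in for the API):

```python
# Sketch: "ChatGPT as your classifier" is just prompt + parse.
# `complete` is any callable taking a prompt and returning text,
# so the real API client (or a test stub) can be swapped in.
from typing import Callable

LABELS = ("positive", "negative")

def classify(text: str, complete: Callable[[str], str]) -> str:
    prompt = ("Classify the sentiment of this movie review as exactly "
              "one word, 'positive' or 'negative'.\n"
              f"Review: {text}\nAnswer:")
    answer = complete(prompt).strip().lower()
    # Tolerate off-script replies instead of crashing.
    return answer if answer in LABELS else "unknown"

# Stub backend standing in for the real chat API call:
stub = lambda prompt: " Positive "
print(classify("Loved every minute of it.", stub))  # → positive
```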
        
           | danuker wrote:
           | > should you just cut out the middle-model and be using
           | ChatGPT as your classifier?
           | 
           | And hope OpenAI forever provides the service, and at a
           | reasonable price, latency, and volume?
        
             | arbuge wrote:
             | Besides that, there's the issue of efficiency.
             | 
              | Better-quality training data might enable you to build a
              | leaner, more efficient model that is far cheaper to
              | implement and run than the expensive model used to
              | generate the data to train it.
             | 
             | See for example: https://twitter.com/SebastienBubeck/status
             | /16713263696268533...
        
           | og_kalu wrote:
           | >should you just cut out the middle-model and be using
           | ChatGPT as your classifier?
           | 
           | Oh you certainly could.
           | 
           | See here: GPT-3.5 outperforming elite crowdworkers on MTurk
           | for Text annotation https://arxiv.org/abs/2303.15056
           | 
            | GPT-4 going toe to toe with experts (and significantly
            | outperforming crowdworkers) on NLP tasks
           | 
           | https://www.artisana.ai/articles/gpt-4-outperforms-elite-
           | cro...
           | 
            | I guess it will take some time before the reality really
            | sinks in, but the days of the artificial SOTA being obviously
            | behind human efforts for NLP have come and gone.
        
           | nfmcclure wrote:
           | You definitely can use LLMs to do your modeling. But
           | sometimes you need very fast, cheap, and smaller models
           | instead. Also there's research out there showing that using
           | LLM to generate training data for targeted & specific models
           | may result in better performance.
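            | That distill-into-a-smaller-model idea can be sketched with
            | a tiny pure-Python Naive Bayes trained on stand-in
            | "LLM-generated" examples (illustrative data, not a real
            | synthesized set):

```python
# Sketch: train a small, fast local classifier on synthetic
# labeled text, standing in for LLM-generated training data.
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: iterable of (text, label) pairs."""
    priors, counts, totals, vocab = (
        Counter(), defaultdict(Counter), Counter(), set())
    for text, label in examples:
        priors[label] += 1
        for w in text.lower().split():
            counts[label][w] += 1
            totals[label] += 1
            vocab.add(w)
    return priors, counts, totals, vocab

def predict(model, text):
    priors, counts, totals, vocab = model
    n = sum(priors.values())
    best, best_lp = None, float("-inf")
    for label in priors:
        lp = math.log(priors[label] / n)
        for w in text.lower().split():
            # Laplace smoothing: unseen words don't zero the score.
            lp += math.log((counts[label][w] + 1) /
                           (totals[label] + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Stand-in for an LLM-synthesized dataset.
data = [("great film loved it", "pos"),
        ("wonderful acting great plot", "pos"),
        ("terrible boring waste", "neg"),
        ("awful plot terrible acting", "neg")]
model = train(data)
print(predict(model, "loved the acting"))  # → pos
```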
        
         | atleastoptimal wrote:
          | Is synthesized data high quality, or does it just seem high
          | quality?
        
         | carbocation wrote:
         | Appears to be susceptible to model collapse[1], depending on
         | how you do it.
         | 
         | 1 = https://arxiv.org/abs/2305.17493v2
        
       | xigency wrote:
       | Not super on point, but wow that boilerplate at the end really
       | goes above and beyond at saying "this is my personal blog, and
       | just, like, my opinion man."
        
         | wanderingstan wrote:
          | I suspect it's due to his writing about crypto, where all the
          | legal/regulatory risks around securities and financial advice
          | can be high.
        
       | icpmacdo wrote:
       | The linked document from Balaji's startup engineering course is
       | extremely useful
       | 
       | https://spark-public.s3.amazonaws.com/startup/lecture_slides...
        
         | lporto wrote:
         | Archived slides: https://github.com/ladamalina/coursera-startup
        
           | tester457 wrote:
           | Coursera link is broken unfortunately
        
       | carlossouza wrote:
       | Interesting read. I'd argue the most successful AI-based products
       | are the ones that settle for 80-90% accuracy and "Create a fault-
       | tolerant UX."
       | 
       | Then, the question becomes: how to create a great fault-tolerant
       | UX?
       | 
       | There are some nice recent cases... Github Copilot is one...
        
       | chrisdbanks wrote:
        | How much has changed since 2015. With NLP and ML it used to be
        | so hard to create high-quality datasets. It was a case of
        | rubbish in, rubbish out. Now LLMs have solved that problem: put
        | a huge amount of data into a big enough model and an emergent
        | ability seems to be discerning the wheat from the chaff.
        | Certainly in the NLP space, the days of crowdsourced datasets
        | seem to be over, replaced by few-shot learning. So much value
        | has been unlocked.
        
         | [deleted]
        
         | luckyt wrote:
         | I think the author's point broadly still holds -- you can get
         | further with more engineering resources and data, whether
         | you're using 2015 era models or 2023 retrieval-augmented LLMs
         | and fine-tuning. Just that now you can accomplish a lot more
         | quickly with a ChatGPT prompt.
        
         | dweekly wrote:
          | There's an interesting dark side to this as well, which is
          | that in 2023, when you think you are crowdsourcing data, you
          | may actually just be delegating it to ChatGPT. A lot of
          | turkers just turn around and use an LLM!
        
           | __loam wrote:
           | Which is of course absolutely terrible for the quality of the
           | dataset you're trying to produce
        
       | RC_ITR wrote:
       | This is an interesting post (and an interesting reminder that
       | even Bitcoin maximalists had other things on their minds in
       | 2015).
       | 
       | I would argue that the first step of the maze makes _a ton_ of
       | sense for the voice recognition /image classification/driving
        | use-cases of 2015 that had binary outcomes, but nowadays, what
       | would it even mean for an LLM to be right 80% of the time? 8/10
       | words are predicted correctly? It can speak correctly on 80% of
       | topics?
       | 
       | The reason people are so jazzed about generative AI is that it's
       | not autonomously doing a task - it's helping a human operator by
       | making (sometimes very useful) guesses on their behalf. It's much
       | more of a _tool_ than a solution (even if _a lot_ of people want
       | it to be a solution).
        
       ___________________________________________________________________
       (page generated 2023-06-28 23:00 UTC)