[HN Gopher] Act-1: Transformer for Actions
       ___________________________________________________________________
        
       Act-1: Transformer for Actions
        
       Author : thesephist
       Score  : 92 points
       Date   : 2022-09-14 20:24 UTC (2 hours ago)
        
 (HTM) web link (www.adept.ai)
 (TXT) w3m dump (www.adept.ai)
        
       | visarga wrote:
       | Related - GPT-3 with a Python interpreter can solve many tasks.
       | It is also language model + a computer, but on a different level.
       | 
       | https://mobile.twitter.com/sergeykarayev/status/156937788144...
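The "language model + a computer" pattern the comment above describes can be sketched in a few lines. `fake_llm` here is a hypothetical stand-in for a real GPT-3 API call; the essential idea is only that the model emits Python and the host executes it:

```python
# Minimal sketch of the "LLM + interpreter" pattern. `fake_llm` is a
# stand-in for a real model call; a production system would send the
# task to an LLM API and receive generated code back.
import io
import contextlib

def fake_llm(task: str) -> str:
    # Hypothetical: pretend the model returned this code for the task.
    return "print(sum(n * n for n in range(1, 11)))"

def solve_with_interpreter(task: str) -> str:
    code = fake_llm(task)
    buf = io.StringIO()
    # WARNING: exec-ing model output is unsafe without sandboxing.
    with contextlib.redirect_stdout(buf):
        exec(code, {"__builtins__": __builtins__})
    return buf.getvalue().strip()

print(solve_with_interpreter("sum of squares of 1..10"))  # → 385
```

The division of labor is the point: the model handles the language, the interpreter handles the computation.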
        
       | codekansas wrote:
       | This is incredible :) Will you be releasing more information
       | about how the system was designed / how data was collected / how
       | actions are executed?
        
         | __sy__ wrote:
         | Second that. Also what guardrails look like for it :)
        
         | tasdfqwer0897 wrote:
         | Yes! We plan on putting out a more detailed technical post
         | soon.
        
       | i_am_toaster wrote:
       | I look forward to seeing the progress made on this in the future,
       | but at this time I don't see any potential in this product.
        
       | anigbrowl wrote:
        | - OK here's my email
        | - Please select all pictures of taxis to prove you are not a
        | robot
        | 
        | ಠ_ಠ
       | 
       | Seriously though, the potential is good. I see several things
       | they're doing right that have the potential to distinguish them
       | from competing offerings.
        
       | skybrian wrote:
       | Wow, what a great way to make a mess online! I can see spammers
       | using it, but who's going to trust this with access to any
       | accounts they care about?
        
         | visarga wrote:
         | I think it's going to be expensive to run and require an
         | account with AdeptAI. Most usage will be to automate office
         | work, RPA style. They could also detect malicious usage as the
         | model sees the pages and takes actions in those pages.
        
       | holoduke wrote:
        | How does the AI alter its model during a process? I thought
        | model weights are pregenerated and not altered once used in a
        | real-life app.
        
         | visarga wrote:
         | Could be prompt based memory, or fine-tuning a small part of
         | the big model.
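The "prompt-based memory" option mentioned above can be sketched simply: the weights stay frozen, and state is carried by prepending prior turns to each new prompt. `call_model` is a hypothetical placeholder for the frozen model:

```python
# Hedged sketch of prompt-based memory: no weight updates; prior
# steps are carried forward inside the prompt itself.
class PromptMemoryAgent:
    def __init__(self, max_turns: int = 20):
        self.history: list[str] = []
        self.max_turns = max_turns

    def call_model(self, prompt: str) -> str:
        # Placeholder: a real agent would query a frozen LLM here.
        return f"ack({len(prompt)} chars of context)"

    def step(self, observation: str) -> str:
        self.history.append(f"USER: {observation}")
        # Truncate to recent turns so the prompt fits the model's
        # fixed context window.
        prompt = "\n".join(self.history[-self.max_turns:])
        action = self.call_model(prompt)
        self.history.append(f"MODEL: {action}")
        return action

agent = PromptMemoryAgent()
agent.step("open the spreadsheet")
agent.step("sort column B")  # this call sees the first exchange
```

This is why the model appears to "learn" within a session without its weights ever changing.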
        
       | mrits wrote:
       | Natural language interfaces are very limited and certainly not
       | the next generation of computing. Granularity of functionality
       | and composable input will always be more efficient as long as the
       | original source is a human. I think the natural language part of
        | your product is the least interesting and certainly not the most
       | impressive.
        
         | jcims wrote:
         | You might not be the target audience though, eh?
        
         | davepeck wrote:
         | > Natural language interfaces are very limited and certainly
         | not the next generation of computing.
         | 
         | I'm game to take the other side of that wager.
         | 
         | My instinct from the last 5 years of advances in ML language
         | research is that we're right at the cusp of having _radically_
         | better natural language interfaces.
         | 
         | I chose a half-decade horizon because "Attention Is All You
         | Need", the paper that introduced the transformer model, was
         | published in 2017. Two of its co-authors are co-founders of
         | Adept.
        
         | amilios wrote:
         | On the flipside, natural language interfaces have the potential
         | to be extremely easy to use for anyone, including non-experts.
         | Anyone can type a message to the computer, without having to
         | learn the specifics of an interface's custom controls. There
         | are different types of efficiency. I'm assuming you're
         | referring to something like 'operational efficiency', while NLI
          | wins on 'ease of adoption'.
        
           | angrais wrote:
           | Consider search engines and searching for content more
           | generally: are you certain that natural language is most
           | often used?
           | 
           | When I search for content, I use key terms to produce refined
           | and better results. If you don't use such terms then what
           | you're looking for may be difficult to find.
        
             | dwrodri wrote:
              | So, I actually thought the opposite: the growth in
              | userbase that most big tech companies saw over the 2010s
              | probably meant the "death" of people crafting queries a
              | la AltaVista. But it turns out I was wrong: Google's own
              | research says the average query is only 2-3 words long,
              | yet it has in fact been getting longer over the years.
             | 
             | Here's my source: https://dl.acm.org/doi/pdf/10.1145/175332
             | 6.1753333?casa_toke...
        
       | bluecoconut wrote:
       | Wow! Love it, this is the most exciting thing I've seen in a
       | while. I'm working on something similar, and it's so great to see
       | others who seem to get-it and are chasing generalization in AI
       | systems!
       | 
       | A few questions:
       | 
       | 1. I'm curious if you're representing the task-operations using
       | RL techniques (as many personal assistant systems seem to be) or
       | if this is entirely a seq2seq transformer style model for
       | predicting actions?
       | 
       | 2. Assumption: Due to scaling of transformers, I assume that this
       | is not directly working on the image data of a screen, and
       | instead is working off of DOM trees; (2a) is this the case? and
       | (2b) if so, are you using purely linear tokenization of the tree
        | or are you using something closer to Evoformer (AlphaFold
        | style) to combine graph neural nets and transformers?
       | 
       | 3. Have you noticed that learning actions and representations of
       | one application transfers well to new applications? or is the
       | quality of the model heavily dependent on app domain?
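The "purely linear tokenization" option in question 2b above can be sketched as a depth-first flattening of a DOM-like tree into a token sequence a seq2seq transformer could consume. The node format and token vocabulary here are illustrative assumptions, not Adept's actual representation:

```python
# Hedged sketch: depth-first linearization of a DOM-like tree into
# a flat token sequence. Tag, attribute, and text tokens are emitted
# in document order so tree structure survives as bracketing tokens.
def linearize(node: dict) -> list[str]:
    tokens = [f"<{node['tag']}>"]
    for key, value in node.get("attrs", {}).items():
        tokens += [f"@{key}", str(value)]
    if "text" in node:
        tokens += node["text"].split()
    for child in node.get("children", []):
        tokens += linearize(child)  # recurse depth-first
    tokens.append(f"</{node['tag']}>")
    return tokens

dom = {
    "tag": "form",
    "children": [
        {"tag": "input", "attrs": {"id": "email"}},
        {"tag": "button", "text": "Sign up"},
    ],
}
print(linearize(dom))
# → ['<form>', '<input>', '@id', 'email', '</input>',
#    '<button>', 'Sign', 'up', '</button>', '</form>']
```

The alternative the question contrasts this with would keep the tree explicit and attend over graph structure rather than a flattened sequence.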
       | 
        | I noticed multiple references to data applications (Excel,
        | Tableau, etc.). My challenge is that large language models and AI
       | systems in general are about to hit a wall in the data domain
       | because they fundamentally don't understand data [1] [2], which
       | will ultimately limit the quality of these capabilities.
       | 
        | I am personally tackling this problem directly. I'm trying to
        | provide more coherent data-aware operations in these systems by
       | building a "foundation model" for tabular data that connects to
       | LLMs (think RETRO style lookups of embeddings (representing
       | columns of data)). I have been prototyping conversational AI
       | systems (mostly Q/A oriented), and have recently been moving
       | towards task oriented operations (right now, transparently, just
       | SQL executors).
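The RETRO-style lookup described above can be sketched as embedding each table column and retrieving the nearest columns for a query. The hash-based embedding here is a toy stand-in for a learned column encoder:

```python
# Hedged sketch of retrieval over column embeddings. The embedding
# function is a deterministic toy; a real system would use a learned
# encoder over the column's name, type, and sampled values.
import math
import hashlib

def embed(text: str, dim: int = 8) -> list[float]:
    # Toy embedding: hash bytes, scale, then L2-normalize.
    digest = hashlib.sha256(text.encode()).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def retrieve(query: str, columns: dict[str, list[float]], k: int = 1):
    qv = embed(query)
    # Rank column names by cosine similarity to the query embedding.
    scored = sorted(
        columns,
        key=lambda name: -sum(a * b for a, b in zip(qv, columns[name])),
    )
    return scored[:k]

index = {name: embed(name) for name in ["price", "city", "list_date"]}
print(retrieve("price", index))  # exact match ranks first: ['price']
```

The appeal of this design for large tables is that the LLM only ever sees a handful of retrieved column embeddings, not the 100k+ rows themselves.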
       | 
       | There seem to be good representations of DOM tree/visual-object
       | models that you all are working with to take reasonable action,
       | however I assume these are limited in scale (N^2 and all), and so
       | I am wondering if you have any opinions on how to extend these
        | systems for data, especially as the windowed context grows
        | (e.g. an Excel sheet with 100k+ rows)?
       | 
       | [1] https://arxiv.org/abs/2106.03253 "Tabular Data: Deep Learning
       | is Not All You Need" [2] https://arxiv.org/abs/2110.01889 "In
       | summary, we think that a fundamental reorientation of the domain
       | may be necessary. For now, the question of whether the use of
       | current deep learning techniques is beneficial for tabular data
       | can generally be answered in the negative"
        
       | colemannugent wrote:
       | So here's the main problem I see with this:
       | 
       | >Anyone who can articulate their ideas in language can implement
       | them
       | 
       | I'd be shocked if even 10% of the users who can't navigate a GUI
       | could accurately describe what they want the software to do. To
       | the user who doesn't know they can use Ctrl-Z to undo, the first
       | half dozen times the AI mangles their inherited spreadsheet might
       | be enough to put them off the idea.
        
         | ffhhj wrote:
          | But those who can articulate will have a very quick
          | automation tool to scrape data from the web.
        
           | tartoran wrote:
            | I've been thinking for a while about a programming
            | language for ordinary people, able to interface with
            | machines through purely casual conversation (not exact
            | commands), and I feel something like it is coming in the
            | next few decades, if not earlier. Imagine being able to
            | casually chat with a widget that understands flawlessly,
            | and with which most devices could communicate as well.
            | This could eventually be used in psychotherapy, in all
            | kinds of automation around humans, and in nefarious ways
            | too. I'm only hopeful for a human augmentation scenario,
            | but there are countless ways it could turn out
            | differently.
        
       | tasdfqwer0897 wrote:
       | Hey, I helped make this! Happy to answer any questions.
        
         | learndeeply wrote:
         | Thanks for answering questions!
         | 
          | Are the examples given in the blog post considered zero-shot
          | learning?
         | 
         | Was the model trained on the websites in the examples given
         | (e.g. on the Redfin site)?
         | 
         | How much labeled data was used?
        
         | tux3 wrote:
         | On an unrelated note, I imagine this can solve recaptchas and
         | other simple non-visual challenges.
         | 
         | Can I make ACT-1 Sybil a few thousand people on mechanical
         | turk?
         | 
          | Can I submit CVs with ACT-1 for entry-level fully remote
          | jobs and have it work for legacy companies, if those
          | companies cannot set up ACT-1 themselves but provide a
          | traditional human jobs interface?
         | 
          | Can I put an interface that interacts with the real world
         | through controls and text on a webpage and have ACT-1 take a
         | physical presence?
        
         | aaaaaaaaaaab wrote:
         | What was the training data?
        
           | version_five wrote:
           | Also, are there benchmark tasks that you either created or
           | that already exist that you evaluated the model on?
           | 
            | PS - please don't let this be used as a way to prevent
            | human interaction. Chatbots are a disaster and literally
            | the worst possible application of ML, as a shitty
            | interface to a menu system. I hope this will be used in a
            | way that is not consumer-hostile and that the company
            | actively resists ignorant business attempts to use it to
            | avoid paying for customer support.
        
           | tasdfqwer0897 wrote:
           | We used a combination of human demonstrations and feedback
           | data! You need custom software both to record the
           | demonstrations and to represent the state of the Tool in a
           | model-consumable way.
        
             | angrais wrote:
              | How were the demonstrations annotated? Did you use
              | human annotators?
        
             | elcomet wrote:
             | How many demonstrations were used ?
             | 
             | And was the feedback data used to train the model with
             | reinforcement learning? Or did you request users to
             | "correct" the action and get a supervised signal?
        
       ___________________________________________________________________
       (page generated 2022-09-14 23:00 UTC)