[HN Gopher] Act-1: Transformer for Actions
___________________________________________________________________
Act-1: Transformer for Actions
Author : thesephist
Score : 92 points
Date : 2022-09-14 20:24 UTC (2 hours ago)
(HTM) web link (www.adept.ai)
(TXT) w3m dump (www.adept.ai)
| visarga wrote:
| Related - GPT-3 with a Python interpreter can solve many tasks.
| It's also a language model plus a computer, just at a different
| level.
|
| https://mobile.twitter.com/sergeykarayev/status/156937788144...
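|
| A minimal sketch of that loop (hypothetical; `complete` is a
| stand-in for whatever LLM completion API you use):
|
|     # Let a language model write Python, then run it and
|     # return the program's output as the answer.
|     import subprocess
|     import tempfile
|
|     def complete(prompt: str) -> str:
|         """Stand-in for a real LLM call (e.g. GPT-3)."""
|         raise NotImplementedError
|
|     def solve_with_interpreter(task: str) -> str:
|         # Ask the model for code that performs the task
|         code = complete(f"# Write Python code to {task}\n")
|         with tempfile.NamedTemporaryFile(
|                 "w", suffix=".py", delete=False) as f:
|             f.write(code)
|             path = f.name
|         # Execute the generated program in a subprocess
|         result = subprocess.run(["python", path],
|                                 capture_output=True,
|                                 text=True, timeout=30)
|         return result.stdout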
| codekansas wrote:
| This is incredible :) Will you be releasing more information
| about how the system was designed / how data was collected / how
| actions are executed?
| __sy__ wrote:
| Seconding that. Also curious what the guardrails look like for
| it :)
| tasdfqwer0897 wrote:
| Yes! We plan on putting out a more detailed technical post
| soon.
| i_am_toaster wrote:
| I look forward to seeing future progress on this, but at this
| time I don't see any potential in this product.
| anigbrowl wrote:
| - OK here's my email - Please select all pictures of taxis to
| prove you are not a robot
|
| Seriously though, the potential is good. I see several things
| they're doing right that have the potential to distinguish them
| from competing offerings.
| skybrian wrote:
| Wow, what a great way to make a mess online! I can see spammers
| using it, but who's going to trust this with access to any
| accounts they care about?
| visarga wrote:
| I think it's going to be expensive to run and require an
| account with AdeptAI. Most usage will be to automate office
| work, RPA-style. They could also detect malicious usage, since
| the model sees the pages it takes actions in.
| holoduke wrote:
| How does the AI alter its model during a process? I thought
| model weights were pretrained and not altered once used in a
| real-life app.
| visarga wrote:
| Could be prompt-based memory, or fine-tuning a small part of
| the big model.
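|
| With prompt-based memory the weights stay frozen and the new
| information lives in the context window. A rough sketch
| (hypothetical names; `model.generate` stands in for any LLM
| call):
|
|     # "Learning" without weight updates: feedback accumulates
|     # in a transcript that is replayed in every prompt, so
|     # the frozen model conditions on past corrections.
|     class PromptMemoryAgent:
|         def __init__(self, model):
|             self.model = model    # frozen, pretrained model
|             self.transcript = []  # accumulated corrections
|
|         def act(self, instruction: str) -> str:
|             prompt = "\n".join(self.transcript
|                                + [f"User: {instruction}",
|                                   "Action:"])
|             return self.model.generate(prompt)
|
|         def correct(self, feedback: str) -> None:
|             # No gradient step; the correction becomes context
|             self.transcript.append(f"Feedback: {feedback}")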
| mrits wrote:
| Natural language interfaces are very limited and certainly not
| the next generation of computing. Granular functionality and
| composable input will always be more efficient as long as the
| original source is a human. I think the natural language part
| of your product is the least interesting and certainly not the
| most impressive.
| jcims wrote:
| You might not be the target audience though, eh?
| davepeck wrote:
| > Natural language interfaces are very limited and certainly
| not the next generation of computing.
|
| I'm game to take the other side of that wager.
|
| My instinct from the last 5 years of advances in ML language
| research is that we're right at the cusp of having _radically_
| better natural language interfaces.
|
| I chose a half-decade horizon because "Attention Is All You
| Need", the paper that introduced the transformer model, was
| published in 2017. Two of its co-authors are co-founders of
| Adept.
| amilios wrote:
| On the flipside, natural language interfaces have the potential
| to be extremely easy to use for anyone, including non-experts.
| Anyone can type a message to the computer, without having to
| learn the specifics of an interface's custom controls. There
| are different types of efficiency. I'm assuming you're
| referring to something like 'operational efficiency', whereas
| NLIs win on 'ease of adoption'.
| angrais wrote:
| Consider search engines and searching for content more
| generally: are you certain that natural language is most
| often used?
|
| When I search for content, I use key terms to produce refined
| and better results. If you don't use such terms then what
| you're looking for may be difficult to find.
| dwrodri wrote:
| So, I actually thought the opposite: the growth in userbase
| that most big tech companies saw over the 2010s probably meant
| the "death" of people crafting queries a la AltaVista. But it
| turns out I was wrong: Google's own research says the average
| query is only 2-3 words long, yet query length has in fact
| been going up over the years.
|
| Here's my source:
| https://dl.acm.org/doi/pdf/10.1145/1753326.1753333?casa_toke...
| bluecoconut wrote:
| Wow! Love it, this is the most exciting thing I've seen in a
| while. I'm working on something similar, and it's so great to
| see others who get it and are chasing generalization in AI
| systems!
|
| A few questions:
|
| 1. I'm curious whether you're representing the task operations
| using RL techniques (as many personal-assistant systems seem
| to) or whether this is entirely a seq2seq transformer-style
| model for predicting actions?
|
| 2. Assumption: due to transformer scaling, I assume this is
| not working directly on the image data of a screen, but
| instead off of DOM trees. (2a) Is this the case? And (2b) if
| so, are you using a purely linear tokenization of the tree
| (sketched below), or something closer to Evoformer
| (AlphaFold-style), combining graph neural nets and
| transformers?
|
| 3. Have you noticed that learned actions and representations
| from one application transfer well to new applications? Or is
| the quality of the model heavily dependent on the app domain?
|
| I noticed multiple references to data applications (Excel,
| Tableau, etc.). My concern is that large language models and
| AI systems in general are about to hit a wall in the data
| domain because they fundamentally don't understand data [1]
| [2], which will ultimately limit the quality of these
| capabilities.
|
| I am personally tackling this problem directly. I'm trying to
| provide more coherent data-aware operations in these systems
| by building a "foundation model" for tabular data that
| connects to LLMs (think RETRO-style lookups of embeddings
| representing columns of data). I have been prototyping
| conversational AI systems (mostly Q&A oriented), and have
| recently been moving towards task-oriented operations (right
| now, transparently, just SQL executors).
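|
| (Roughly, the lookup side of that is sketched below;
| hypothetical, with toy cosine-similarity retrieval:)
|
|     # RETRO-style retrieval over column embeddings: each
|     # table column gets an embedding, and at query time the
|     # nearest columns are fetched and handed to the LLM as
|     # extra context.
|     import numpy as np
|
|     def nearest_columns(query_emb: np.ndarray,
|                         column_embs: np.ndarray,
|                         k: int = 4) -> np.ndarray:
|         # Cosine similarity between the query and every
|         # column embedding (column_embs has shape (n, d))
|         sims = column_embs @ query_emb
|         sims /= (np.linalg.norm(column_embs, axis=1)
|                  * np.linalg.norm(query_emb) + 1e-8)
|         return np.argsort(-sims)[:k]  # top-k column indices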
|
| There seem to be good representations of DOM-tree/visual
| object models that you all are working with to take
| reasonable actions; however, I assume these are limited in
| scale (N^2 attention and all), so I'm wondering if you have
| any opinions on how to extend these systems for data,
| especially as the windowed context grows (e.g. an Excel sheet
| with 100k+ rows)?
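|
| For concreteness on (2b), by "purely linear tokenization" I
| mean something like this sketch (hypothetical schema):
|
|     # Flatten a DOM tree into a token sequence a seq2seq
|     # transformer can consume: depth-first traversal with
|     # bracket tokens preserving the tree structure.
|     from dataclasses import dataclass, field
|
|     @dataclass
|     class Node:
|         tag: str
|         text: str = ""
|         children: list = field(default_factory=list)
|
|     def linearize(node: Node) -> list:
|         tokens = [f"<{node.tag}>"]
|         if node.text:
|             tokens += node.text.split()
|         for child in node.children:
|             tokens += linearize(child)
|         tokens.append(f"</{node.tag}>")
|         return tokens
|
|     # linearize(Node("button", "Search"))
|     #   -> ['<button>', 'Search', '</button>']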
|
| [1] "Tabular Data: Deep Learning is Not All You Need",
| https://arxiv.org/abs/2106.03253
|
| [2] https://arxiv.org/abs/2110.01889: "In summary, we think
| that a fundamental reorientation of the domain may be
| necessary. For now, the question of whether the use of
| current deep learning techniques is beneficial for tabular
| data can generally be answered in the negative."
| colemannugent wrote:
| So here's the main problem I see with this:
|
| >Anyone who can articulate their ideas in language can implement
| them
|
| I'd be shocked if even 10% of the users who can't navigate a GUI
| could accurately describe what they want the software to do. To
| the user who doesn't know they can use Ctrl-Z to undo, the first
| half dozen times the AI mangles their inherited spreadsheet might
| be enough to put them off the idea.
| ffhhj wrote:
| But those who can articulate will have a very quick
| automation tool to scrape data from the web.
| tartoran wrote:
| I've been thinking for a while about a programming language
| for ordinary people, one able to interface with machines
| through purely casual conversation (not exact commands), and
| I feel it's coming in the next few decades, if not earlier.
| Imagine being able to casually chat with a widget that
| understands you flawlessly, and with which most devices could
| communicate as well. This could eventually be used in
| psychotherapy, in all kinds of automation around humans, and
| in nefarious ways too. I'm hopeful for a human-augmentation
| scenario, but there are countless ways it could turn out
| totally different.
| tasdfqwer0897 wrote:
| Hey, I helped make this! Happy to answer any questions.
| learndeeply wrote:
| Thanks for answering questions!
|
| Are the examples given in the blog post considered zero-shot
| learning?
|
| Was the model trained on the websites in the examples given
| (e.g. on the Redfin site)?
|
| How much labeled data was used?
| tux3 wrote:
| On an unrelated note, I imagine this can solve reCAPTCHAs and
| other simple non-visual challenges.
|
| Can I make ACT-1 Sybil a few thousand people on Mechanical
| Turk?
|
| Can I submit CVs with ACT-1 for entry-level fully remote jobs
| and have it work for legacy companies, if those companies
| cannot set up ACT-1 themselves but provide a traditional
| human jobs interface?
|
| Can I put an interface that interacts with the real world
| through controls and text on a webpage and have ACT-1 take on
| a physical presence?
| aaaaaaaaaaab wrote:
| What was the training data?
| version_five wrote:
| Also, are there benchmark tasks that you either created or
| that already exist that you evaluated the model on?
|
| PS - please don't let this be used as a way to prevent human
| interaction. Chatbots are a disaster and literally the worst
| possible application of ML, as a shitty interface to a menu
| system. I hope this will be used in a way that is not
| consumer-hostile and that the company actively resists
| ignorant business attempts to use it to avoid paying for
| customer support.
| tasdfqwer0897 wrote:
| We used a combination of human demonstrations and feedback
| data! You need custom software both to record the
| demonstrations and to represent the state of the Tool in a
| model-consumable way.
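|
| As a rough illustration (not our actual schema; field names
| are made up), one recorded step pairs an observation of the
| Tool's state with the action the human took:
|
|     # One step of a recorded human demonstration. A full
|     # demonstration is a list of these; feedback data adds a
|     # judgment of a model-proposed action.
|     from dataclasses import dataclass
|
|     @dataclass
|     class DemonstrationStep:
|         instruction: str  # natural-language task from the user
|         observation: str  # model-consumable state, e.g. a
|                           # serialized DOM snapshot
|         action: str       # what the human did, e.g.
|                           # "CLICK #submit"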
| angrais wrote:
| How were the demonstrations annotated? Did you use annotations
| at all?
| elcomet wrote:
| How many demonstrations were used?
|
| And was the feedback data used to train the model with
| reinforcement learning? Or did you request users to
| "correct" the action and get a supervised signal?
___________________________________________________________________
(page generated 2022-09-14 23:00 UTC)