[HN Gopher] Understanding HTML with Large Language Models
___________________________________________________________________
Understanding HTML with Large Language Models
Author : PaulHoule
Score : 35 points
Date : 2022-10-11 19:26 UTC (3 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| bootcat wrote:
| Looking forward to the code and the model.
| drothlis wrote:
| Related, "natbot" uses the stock GPT-3 model (no fine-tuning
| apart from the examples in the prompt) to drive a browser:
|
| https://github.com/nat/natbot
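A minimal sketch of the "examples in the prompt" idea described above, assuming a simplified page format and a two-command vocabulary. This is not natbot's actual prompt or code: the `CLICK`/`TYPE` commands, the element `id` markup, and the names `Example`, `build_prompt`, and `parse_command` are all hypothetical. The model call itself is left out; the sketch only covers assembling the few-shot prompt and parsing a reply.

```python
import re
from dataclasses import dataclass


@dataclass
class Example:
    objective: str
    page: str      # simplified page listing (hypothetical format)
    command: str   # the reply we want the model to imitate


# Hand-written demonstrations; with no fine-tuning, these are the only "training".
FEW_SHOT = [
    Example(
        objective="Search for 'hacker news'",
        page='<input id=1 type="text" placeholder="Search"/>\n<button id=2>Go</button>',
        command='TYPE 1 "hacker news"',
    ),
    Example(
        objective="Open the login form",
        page="<link id=3>Sign in</link>\n<link id=4>About</link>",
        command="CLICK 3",
    ),
]

HEADER = 'You control a web browser. Reply with one command: CLICK <id> or TYPE <id> "<text>".'


def build_prompt(objective: str, page: str, examples=FEW_SHOT) -> str:
    """Concatenate the demonstrations, then the new task with an open COMMAND slot."""
    parts = [HEADER]
    for ex in examples:
        parts.append(f"OBJECTIVE: {ex.objective}\nPAGE:\n{ex.page}\nCOMMAND: {ex.command}")
    parts.append(f"OBJECTIVE: {objective}\nPAGE:\n{page}\nCOMMAND:")
    return "\n\n".join(parts)


_CMD = re.compile(r'^(CLICK)\s+(\d+)$|^(TYPE)\s+(\d+)\s+"(.*)"$')


def parse_command(reply: str):
    """Turn the model's text reply into (action, element_id, text_or_None)."""
    m = _CMD.match(reply.strip())
    if m is None:
        raise ValueError(f"unrecognized command: {reply!r}")
    if m.group(1):
        return ("CLICK", int(m.group(2)), None)
    return ("TYPE", int(m.group(4)), m.group(5))
```

The string returned by `build_prompt` would be sent to a text-completion model, and the parsed tuple handed to whatever browser driver is in use.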
| ankrgyl wrote:
| There is a visual demo here:
| https://sites.google.com/view/llm4html/home
|
| This work is very exciting to me for a few reasons:
|
| - HTML is an incredibly rich source of visually structured
| information, with a semi-structured representation. This is as
| opposed to PDFs, which are usually fed into models with a "flat"
| representation (words + bounding boxes). Intuitively, this offers
| the model a more direct way to learn about nested structure, over
| an almost unlimited source of unsupervised pre-training data.
|
| - Many projects (e.g. Pix2Struct
| https://arxiv.org/pdf/2210.03347.pdf, also from Google) operate
| on pixels, which are expensive (both to render and process in the
| transformer). Operating on HTML directly means smaller, faster,
| more efficient models.
|
| - (If open sourced) it will be the first (AFAIK) open pre-trained
| ready-to-go model for the RPA/automation space (there are several
| closed projects). They claim they plan to open source the dataset
| at least, which is very exciting.
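To make the "nested vs. flat" contrast in the first bullet concrete, here is a small sketch using only Python's standard `html.parser`. It is an illustration, not the paper's preprocessing: `PathCollector`, `structured_view`, and `flat_view` are hypothetical names, and the flat view omits bounding boxes because nothing is rendered here.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Record each text node together with the chain of tags enclosing it."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []  # list of (tag_path, text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def structured_view(html: str):
    """Text nodes with their nesting context, as HTML exposes it."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def flat_view(html: str):
    """Words only, roughly what a 'flat' document representation keeps."""
    return [word for _, text in structured_view(html) for word in text.split()]


SAMPLE = (
    '<form><label>Email</label><input name="email">'
    "<div><button>Subscribe</button></div></form>"
)
```

On `SAMPLE`, `structured_view` keeps the fact that "Subscribe" is a button inside a form, while `flat_view` reduces the page to a bare word list.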
|
| I'm particularly excited to extend this and similar work
| (https://arxiv.org/abs/2110.08518) to HTML question answering
| and web scraping.
|
| Disclaimer: I'm the CEO of Impira, which creates OSS
| (https://github.com/impira/docquery) and proprietary
| (http://impira.com/) tools for analyzing business documents. I am
| not affiliated with this project.
| hwers wrote:
| This is Google; they for sure aren't releasing the weights.
| ShamelessC wrote:
| Exciting/scary stuff! A sophisticated enough version could
| carry out any range of tasks that a typical computer
| user/browser could, from just a few sentences, with a somewhat
| high chance of success.
|
| We will overuse this tech, forgetting important processes that
| are perhaps wise to keep a "human backup" for redundancy. Then
| again, RPA is already a case where a "proper" rewrite of some
| multi-program pipeline is impossible.
| ankrgyl wrote:
| This is a "classic" tension. Having worked in the (broader)
| RPA space for a while, I would say that the true north star
| of most processes is (a) rewriting the internal procedures to
| be transformations on data (not UIs) and (b) standardizing
| communication across companies.
|
| There is a lot of momentum to solve (a) with no code, but
| it's slow because processes are impossibly complex. I think
| AI will accelerate this and could result in the "human
| backup" dystopia. On the other hand, AI can also be used to
| generate code, and I'm optimistic that technology like this
| can accelerate humans' ability to encode complex processes
| robustly (as transformations of data) and would be 10 or 100x
| less work than no/low code.
| ShamelessC wrote:
| > On the other hand, AI can also be used to generate code,
| and I'm optimistic that technology like this can accelerate
| humans' ability to encode complex processes robustly (as
| transformations of data) and would be 10 or 100x less work
| than no/low code.
|
| Ah right, lots of angles to consider! A hybrid system would
| certainly be interesting. Let the AI runtime generate and
| evaluate code to perform tasks (e.g. selenium/puppeteer in
| python/java). Upon failure, "escalate permissions" to
| enable DOM control, or full mouse/keyboard to complete the
| task (probably best not to let the thing open up a code-
| editor with M/KB controls though heh)
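The hybrid idea in the last comment (try generated code first, then escalate on failure) can be sketched as an ordered list of tiers. This is only an illustration of the control flow under stated assumptions: the tier names and the stand-in functions are hypothetical, and no real browser, Selenium, or Puppeteer calls are made.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Tier:
    name: str
    run: Callable[[str], bool]  # returns True if the task was completed


def run_with_escalation(task: str, tiers: List[Tier]) -> Optional[str]:
    """Try tiers from least to most privileged; return the first that succeeds."""
    for tier in tiers:
        try:
            if tier.run(task):
                return tier.name
        except Exception:
            # A crash counts as a failure of this tier; move on to the next one.
            continue
    return None


# Stand-in tiers (hypothetical). Real ones would wrap a code runner, a DOM
# driver, and raw mouse/keyboard input respectively.
calls: List[str] = []


def generated_script(task: str) -> bool:
    calls.append("generated_script")
    raise RuntimeError("selector not found")


def dom_control(task: str) -> bool:
    calls.append("dom_control")
    return True


def mouse_keyboard(task: str) -> bool:
    calls.append("mouse_keyboard")
    return True


TIERS = [
    Tier("generated_script", generated_script),
    Tier("dom_control", dom_control),
    Tier("mouse_keyboard", mouse_keyboard),
]
```

Calling `run_with_escalation("submit form", TIERS)` stops at the first tier that succeeds, so the most privileged tier is only reached when everything before it has failed.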
___________________________________________________________________
(page generated 2022-10-11 23:00 UTC)