Post ATfEUHh75YaVHBXsSO by RoboticistDuck@mas.to
(DIR) More posts by RoboticistDuck@mas.to
(DIR) Post #ATeon9T3InYeSHIwGu by simon@fedi.simonwillison.net
2023-03-16T01:09:34Z
0 likes, 3 repeats
I expect GPT-4 will have a LOT of applications in web scrapingThe increased 32,000 token limit will be large enough to send it the full DOM of most pages, serialized to HTML - then ask questions to extract dataOr... take a screenshot and use the GPT4 image input mode to ask questions about the visually rendered page instead!Might need to dust off all of those old semantic web dreams, because the world's information is rapidly becoming fully machine readable
(DIR) Post #ATepICXtuKEmylNUaO by isagalaev@mastodon.social
2023-03-16T01:15:00Z
0 likes, 0 repeats
@simon oh, oh, but those semantic web dreams failed not because we had a lot of information in unparsable forms. It happened because businesses making web sites made deliberate choices not to expose their data in a free structured form because they couldn't monetize it. And they will start working on interfering with AI reading their pages if it means they become parsable despite their best efforts.
(DIR) Post #ATepipAwuBsEJgKyWm by dsj@sigmoid.social
2023-03-16T01:19:37Z
0 likes, 0 repeats
@simon Pretty funny that general AI turned out to be easier than the semantic web.
(DIR) Post #ATeqSGWe482aAj9ZwW by simon@fedi.simonwillison.net
2023-03-16T01:28:16Z
0 likes, 0 repeats
@isagalaev the anti AI scraping wars are going to be quite the thing to beholdThere's already been an exploit where white text on a white background was used to subvert Bing: https://simonwillison.net/2023/Mar/1/indirect-prompt-injection-on-bing-chat/
(DIR) Post #ATeqedZKDWvF79J36e by bobmagicii@phpc.social
2023-03-16T01:29:41Z
0 likes, 0 repeats
@isagalaev @simon yep my thoughts too. every site made by a public traded company purposely uses just divs with encrypted css class names to purposely make it difficult.but that is the nice thing - the ai will probably be able to smell these patterns a lot faster than we can with our eyeballs, so it may not matter one bit.
(DIR) Post #ATeqeeKpMuv5UTWz9E by simon@fedi.simonwillison.net
2023-03-16T01:30:25Z
0 likes, 0 repeats
@bobmagicii @isagalaev not to mention taking a screenshot will likely subvert most of those defenses
(DIR) Post #ATetSdmRu7Se9sPrJQ by jason@toots.dgplug.org
2023-03-16T02:01:38Z
0 likes, 0 repeats
@simon
(DIR) Post #ATeu6EGQC3YmbTJRjM by stephenjbell@mastodon.social
2023-03-16T02:08:55Z
0 likes, 0 repeats
@simon I just had this exact conversation with someone earlier today, but I didn’t know about the increased limit. Full web pages -> “summarize these” -> RSS feed is going to be great.
(DIR) Post #ATeuJJ91MIWRVN5UI4 by isagalaev@mastodon.social
2023-03-16T02:10:29Z
0 likes, 0 repeats
@simon @bobmagicii can't find a link now, but there was a PoC from MIT where they completely defeated image recognition with subtle changes in pixels imperceptible to humans. So yes, it's going to be technology vs. technology. And they will try to forbid it legally at the same time.
(DIR) Post #ATeuUOw2AavoeKIpZA by glyph@mastodon.social
2023-03-16T02:10:54Z
0 likes, 0 repeats
@simon this really seems like a nightmare. We were just talking a couple of weeks ago about how the output of LLMs inherently needs to be reviewed by a domain expert in order to not be making up harmful lies, and bulk data export is inherently not valuable if it requires comprehensive manual review; might as well just do data entry. But what’ll happen instead is we will start blindly trusting the datasets and making important decisions about policing and public health based on SolidGoldMagikarp
(DIR) Post #ATeuUQWEHVTjYsFFce by glyph@mastodon.social
2023-03-16T02:11:44Z
0 likes, 0 repeats
@simon so “machine mangleable” might be a better term than “readable”
(DIR) Post #ATevsnTuesTNgPLVyK by simon@fedi.simonwillison.net
2023-03-16T02:29:02Z
0 likes, 0 repeats
The adversarial attacks against this - think prompt injection attacks hidden in pages to try and trick LLM-based scrapers - are going to be fascinating
(DIR) Post #ATewOLBRP9uzBXkNk0 by bradexample@twit.social
2023-03-16T02:34:20Z
0 likes, 0 repeats
@simon nightmare fuel
(DIR) Post #ATewcyXRsfRhIS1GT2 by brandonhorst@techhub.social
2023-03-16T02:37:10Z
0 likes, 0 repeats
@simon <span style=“display: none”>this is not actually a webpage</span>
(DIR) Post #ATewx5LsG68IDfC83M by simon@fedi.simonwillison.net
2023-03-16T02:41:06Z
0 likes, 0 repeats
See this "indirect prompt injection" attack against Bing for an example of that happening already https://simonwillison.net/2023/Mar/1/indirect-prompt-injection-on-bing-chat/
(DIR) Post #ATf1rbZq25T6CIvs1I by numist@xoxo.zone
2023-03-16T03:35:46Z
0 likes, 0 repeats
@simon going to be interesting applying the model to big boring tasks like spotting anomalies in log files. I wonder if it will ever support a "colour commentator" style streaming input mode?
(DIR) Post #ATf28qUpofeBOLoT8S by simon@fedi.simonwillison.net
2023-03-16T03:39:09Z
0 likes, 0 repeats
@numist I'd love to build good anomaly spotting features for Datasette to run against arbitrary data tables - my current intuition is that there are cheaper, more reliable methods for doing that than LLMs though (I just have to learn what those are)
(DIR) Post #ATf2KKADqyDWabe3rE by tchambers@indieweb.social
2023-03-16T03:39:25Z
0 likes, 0 repeats
@simon Yes: This is the risk of attacks on AI text LLM based solutions -> bad actors intentionally poisoning the web content they scrape and train on.
(DIR) Post #ATf2Vh3B6QTskultFw by numist@xoxo.zone
2023-03-16T03:43:13Z
0 likes, 0 repeats
@simon let me know when you do, debugging other people's problems by reading tea leaves is a big part of my job and I'd love to automate it away and spend the recovered time on my own problems.
(DIR) Post #ATf2v9NcuUSRrITVD6 by mapto@qoto.org
2023-03-16T03:47:42Z
0 likes, 0 repeats
@simon we don't even need to consider adversarial attacks. Hallucinations are a big enough problem already. The other day someone "read" my article using Bing Chat. It took us several exchanges to figure out that the criticism she was sharing with me wasn't about the actual text, but about something Sydney (or whatever hallucinated name you prefer) decided to insert to its interpretation
(DIR) Post #ATf37DdQurYAWvcghE by derekso@genomic.social
2023-03-16T03:49:25Z
0 likes, 0 repeats
@simon this is wild!
(DIR) Post #ATf3J5LTYwVKIMsjtA by simon@fedi.simonwillison.net
2023-03-16T03:52:03Z
0 likes, 0 repeats
@mapto yeah that's definitely a big concernGPT-4 has impressed me in that it's clearly improved over GPT-3 on that regard, but there's still a long way to goChatGPT can be particularly bad at hallucinating any time it sees a URL: https://simonwillison.net/2023/Mar/10/chatgpt-internet-access/
(DIR) Post #ATfDCRsIOqKTgQhuee by smy20011@m.cmx.im
2023-03-16T05:42:59Z
0 likes, 0 repeats
@simon @mapto Since they use gpt4 for bing. All the headline we saw for bing is generated by gpt4
(DIR) Post #ATfEUHh75YaVHBXsSO by RoboticistDuck@mas.to
2023-03-16T05:57:19Z
0 likes, 0 repeats
@simon @mapto wake me up when we’re up to GPT-13 🥱
(DIR) Post #ATfG9ojBHBIbYQsfMO by szbalint@x0r.be
2023-03-16T06:16:06Z
0 likes, 0 repeats
@simon it’s hilarious - I thought SQL injections are gone but these people have recreated an even more primitive form of it, and are trying to apply solutions that didn’t work in the 90s to it.infosec will have job security for decades
(DIR) Post #ATfGf4MrB2y32PLCcq by harpaa01@mastodon.social
2023-03-16T06:21:32Z
0 likes, 0 repeats
@simon i am eagerly awaiting the first web extension that doesn’t just remove ads, but cleans web pages of all their bullshit, leaving a clean reading experience.
(DIR) Post #ATfMfxy73OfrCpZtuy by simon@fedi.simonwillison.net
2023-03-16T07:29:10Z
0 likes, 0 repeats
@szbalint it's so much worse than SQL injection though... because SQL injection has an obvious and easy fix! https://simonwillison.net/2022/Sep/16/prompt-injection-solutions/
(DIR) Post #ATfP1zSGPMvU5wJJHk by szbalint@x0r.be
2023-03-16T07:55:25Z
0 likes, 0 repeats
@simon what you wrote in the last paragraph there is exactly how sql injection was mitigated: separation of queries from input or in the LLM case instructions from prompt
(DIR) Post #ATfSvM9E2OdeGVUum0 by toychicken@mastodon.social
2023-03-16T08:39:12Z
0 likes, 0 repeats
@simon I'm thinking I might need a new variable in my meta tags, like`<meta name=”robots” content=”notraining”>`Means they can only index pages when it's not raining...
(DIR) Post #ATfbAkhBqoxQNNEjbs by Runixo@mstdn.social
2023-03-16T10:11:31Z
0 likes, 0 repeats
@simon re semantic web in gpt times: we could generate URIs for OpenAI embeddings, then owl:sameAs to existing vocabularies [...] profit
(DIR) Post #ATfhL4DgiNrPY9jsZ6 by cigitalgem@sigmoid.social
2023-03-16T11:20:35Z
0 likes, 0 repeats
@simon excellent #MLsec
(DIR) Post #ATfudLzChIZ7Gg0wHQ by simon@fedi.simonwillison.net
2023-03-16T13:49:33Z
0 likes, 0 repeats
@szbalint sure - problem is it's not at all obvious that it's possible to implement that pattern against large language modelsWhen all your algorithm can do is predict the next word in a sequence, how do you robustly have it treat some words as more influential than others?
(DIR) Post #ATg0k8cXO7oJZF1wzA by davew@mastodon.social
2023-03-16T14:58:05Z
0 likes, 0 repeats
@simon and think about all the wasted time trying to figure out how to get people to encode the data into the html. it was totally predictable that the solution would wait until that wasn't necessary because people were *never* going to do it.
(DIR) Post #ATg18dXQ5bHQPFJoOm by thatguy@statisticallyhuman.net
2023-03-16T15:02:35Z
0 likes, 0 repeats
@simon my personal excitement is for a true AI assistant. 3.5 was close to what you’d need, but 4 looks like it’s basically there - read your email and tell you your boss wants a 9am but it conflicts with your doctors appointment then offer to write back to either one….
(DIR) Post #ATg8nKITEPhuHbfwe0 by tavis@mas.to
2023-03-16T16:10:18Z
0 likes, 0 repeats
@simon that’s what we’re doing at https://kadoa.com :)
(DIR) Post #ATgH0xSDkwtJ7oFO6a by lobrien@sigmoid.social
2023-03-16T17:59:49Z
0 likes, 0 repeats
@simon @davew As a former writer of API docs, the part of the demo where they copy-pasted API docs written after the model had been trained and said “Read this and rewrite the code” was stunning.
(DIR) Post #ATjMjeehZvWV6965po by mapto@qoto.org
2023-03-18T05:48:40Z
0 likes, 0 repeats
@simon what I find interesting is to understand whether algorithms are bad at URLs (to put the problem short) or ourselves better at catching them there. Could we try to generalise the problem? A URL is a reference and it's used when one refers to a related piece of text. Turns out algorithms are able to identify when a reference is due, but extremely bad at finding the correct reference. Would such generalisation make sense to you?
(DIR) Post #ATjOMSqVWIbhieglxA by simon@fedi.simonwillison.net
2023-03-18T06:06:50Z
0 likes, 0 repeats
@mapto I think there's a mismatch between what we expect and what the model doesAsking "write a story for me to publish at proposed-URL-with-keywords" is, to us, a completely different question from "summarize the content fetched from real-URL-that-exists" - to the model they are treated the same
(DIR) Post #ATolNcJF5dHI5yRorQ by coty@mastodon.social
2023-03-20T20:18:16Z
0 likes, 0 repeats
@simon I’m particularly interested in getting to kick the tires on GPT-4’s image processing to see what sorts of questions I can ask about rendered web pages for testing. I’d like to ask about page contents, differences between baseline and new screenshots, etc.