[HN Gopher] Show HN: Goopt - Search Engine for a Procedural Simu...
       ___________________________________________________________________
        
       Show HN: Goopt - Search Engine for a Procedural Simulation of the
       Web with GPT-3
        
       Author : joken0x
       Score  : 146 points
       Date   : 2022-02-23 17:32 UTC (5 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | kelseyfrog wrote:
       | I love this. It's the reification of the Dead-Internet Theory - a
       | tangible artifact embodying the feeling that the internet was
       | replaced by its own simulacrum powered by AI.[1] The existence of
       | Goopt is the culmination of DIT as self-fulfilling prophecy. We
        | can almost see the outline of an Internet Turing test beginning
        | to form. How well can we discern the real internet from the
        | fake one? Consequently, what happens when the
       | line becomes so blurred that we lose the ability to perceive the
       | difference?
       | 
       | 1.
       | https://www.theatlantic.com/technology/archive/2021/08/dead-...
        
         | joken0x wrote:
         | I think there will come a point when the traditional internet
         | will also become diluted in artificiality, content from the
         | procedural web will start to creep into the traditional web,
         | and we won't be able to distinguish. It's interesting to think
         | of it in terms of Baudrillard's Simulacra and Simulation.
        
           | kingcharles wrote:
           | This is exactly what a bot would say.
        
           | TechBro8615 wrote:
           | We are still missing a decentralized identity and reputation
           | layer. I imagine in the future, people will sign their
           | comments with a key that proves they're a real human.
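
A minimal sketch of the sign-your-comments idea. Note: this toy uses an HMAC tag purely as a stand-in, since Python's standard library has no public-key signatures; a real identity layer would use something like Ed25519 so anyone could verify without holding the secret key. All names here are invented.

```python
import hashlib
import hmac

def sign_comment(secret_key: bytes, comment: str) -> str:
    # Append a signature tag to the comment. (HMAC stand-in: a real
    # scheme would use a public-key signature such as Ed25519.)
    tag = hmac.new(secret_key, comment.encode(), hashlib.sha256).hexdigest()
    return f"{comment}\n-- sig:{tag}"

def verify_comment(secret_key: bytes, signed: str) -> bool:
    # Split off the tag and recompute it; any edit to the comment,
    # or a different key, makes verification fail.
    comment, _, tag = signed.rpartition("\n-- sig:")
    expected = hmac.new(secret_key, comment.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected)

key = b"demo-key"
signed = sign_comment(key, "I am a real human.")
assert verify_comment(key, signed)
assert not verify_comment(key, signed.replace("human", "robot"))
```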
        
             | klabb3 wrote:
             | This. It's the elephant in the room for so many of our
             | problems related to abuse, trolling, influence campaigns,
             | garbage content, attribution etc etc.
             | 
             | OTOH, is proof of personhood really what we want long term,
             | or is that just a proxy? Hypothetically, if an AI is good &
             | trustworthy enough, why not allow a higher "source rating"
             | for that than low quality human content?
        
               | Shared404 wrote:
               | Relevant XKCD: https://xkcd.com/810/
        
             | nextaccountic wrote:
              | AIs might mine cryptocurrencies on their own, and then buy
              | keys from desperate humans, and post as if they were the
              | owners of the keys.
        
             | jazzyjackson wrote:
             | Robots will of course have keys they can sign with, so
              | perhaps we will rely on government digital IDs such as the
             | Estonian public key system. In that case, governments will
             | have a monopoly on sockpuppet accounts.
             | 
              | Have there been any successors to the web of trust?
        
             | kelseyfrog wrote:
             | I would simply ask GPT-n what to write next, copy-paste it
             | into the text box, sign it with my key, and hit reply. Its
             | ability to write better comments than me would serve to
             | gain me reputation rather than reduce it as so often is the
             | present case.
        
           | jerf wrote:
            | We're past that point. Probably by a couple of years at a
           | minimum. No sarcasm. Content farms are definitely using techs
           | like this. GPT-3 is really good at generating text but still
           | has some characteristic failures, and I encounter content
           | farm web pages (despite my best efforts) that have clearly
           | used it or something like it as a tool. Even just in the last
           | couple of weeks I've been seeing some new, innovative content
           | farms managing to pollute my search results that I've not
           | seen before.
        
             | krzat wrote:
             | Oh boy, we need AI powered blockers that filter out this
             | stuff.
        
               | joken0x wrote:
               | AI fighting against AI, everything will be AI.
        
               | klabb3 wrote:
               | This seems like a terribly difficult problem. I'm sure
               | today there are some give-away signatures, but I put my
               | money on the impersonators in the long run.
        
           | kelseyfrog wrote:
           | > I think there will come a point when the traditional
           | internet will also become diluted in artificiality
           | 
            | This is already a concern in corpus creation for ML/AI
            | projects. Researchers would like very much to have human-only
           | generated content when training models on internet-sourced
           | text. People posting GPT-created output has the potential to
           | taint these corpora and create all sorts of strange loopy
           | feedback.
        
             | joken0x wrote:
              | That's right; human creation is, and will increasingly
              | become, a very precious treasure.
        
             | rm_-rf_slash wrote:
              | Knowing people, it wouldn't take long for incoherent
             | generative syntax to become a meme and reinforce the human
             | corpora with new syntactic slang.
        
           | Rebelgecko wrote:
           | Anathem by Neal Stephenson has a great subplot about this.
           | The information age has been a bit stunted because there's
           | too much crap[1] on the internet. Companies sprung up selling
           | filters that would block websites with low-quality or AI
           | generated information. Eventually these companies realized
           | they could drum up business by generating low quality[2]
           | content themselves, especially if they could get it past
            | their competitors' filters. The end result was that the
            | internet became a convoluted morass of bullshit and lies,
            | from which it was difficult for non-experts to extract
            | useful knowledge.
           | 
           | [1]: Or maybe CRAAP https://en.wikipedia.org/wiki/CRAAP_test
           | 
            | [2]: They quickly realized the trick is to make _high quality_
           | low quality content. 100 pages of gibberish is way less
           | effective than a convincing essay that happens to include a
           | few key falsehoods.
        
             | joken0x wrote:
              | It seems to be a loop of information with no way out, one
              | that devours information and stirs it into piles of
              | garbage that it generates without stopping. Labyrinth and
              | crypt at the same time, growing and churning ever more. I
              | think we have to start a serious effort to collect and
              | preserve human creation; otherwise it will become a hidden
              | treasure buried under layers of artificial garbage. If it
              | is difficult for us to distinguish human content from
              | synthetic, imagine how it will be for future generations
              | who will not even have the living context of our time, and
              | who will already be more accustomed to synthetic content
              | than to properly human content.
        
               | kelseyfrog wrote:
               | How will this change how we evaluate content? Will
               | digital media be socially and culturally devalued? Will
               | print media gain greater status? Live spoken word?
               | Curated content with a reputation layer? Obviously it's
               | pure speculation, but let's indulge for a moment.
        
       | randomstring wrote:
       | Cuil was way ahead of its time.
       | https://news.ycombinator.com/item?id=1255122
       | 
       | http://cuiltheory.wikidot.com/ Cuil Theory
        
         | TehCorwiz wrote:
         | I haven't thought of that in years. I gave up my crusade to
         | revive use of the interrobang(!?) in writing a while ago.
         | 
         | After reviewing the materials I see that Cuil Theory has come a
         | bit further since I last read. I believe that Goopt would be
         | somewhere around -2!? from Cuil theory itself, negative because
         | it's literal reality, but distant because it's an abstract
         | embodiment.
         | 
         | Slightly off-topic. During my cursory reading I see that
         | imaginary Cuil got fleshed out. I'd like a second opinion. The
         | way it reads to me is that 'i!?' is almost the literal
         | definition of solipsism.
        
       | [deleted]
        
       | zuzun wrote:
       | So basically modern Google without ads.
        
         | joken0x wrote:
         | Exactly, content no longer revolves around monetization.
        
         | ffhhj wrote:
         | "When there is no monetization you are the product" (patent
         | pending)
        
       | skybrian wrote:
       | In the event you think you're looking at a simulation of the
       | Internet, maybe start out by checking if news, maps, and weather
       | are realistic, to see how good their world simulation is. Live
       | news video should be interesting too.
       | 
       | But if your browser is compromised so that encryption doesn't
       | work, I think you have bigger problems.
        
       | Geee wrote:
       | This is clearly the future. All information will be generated on
       | the fly and tailored for you. AI can match your level of
       | knowledge, your language, your preferred style etc. AI can
       | simplify / extend topics on demand, and also generate
       | illustrations and videos to help explain topics.
       | 
       | I think most of the current form of pregenerated web with search
       | actually becomes completely unnecessary, and it'll basically stop
       | existing.
        
         | joken0x wrote:
         | You get it, man, that is exactly the question. It is time to
         | think about the many possibilities, problems, dilemmas,
         | paradoxes, etc. It is very interesting and disturbing at the
         | same time.
        
           | Geee wrote:
           | It changes everything. Thanks for coming up with this. I have
           | thought about AI generated content before but not in this
           | way. I just realized that we don't need the content web; we
           | need just raw data sources and AI that generates content on
           | the fly, for the user. The AI works for and is directed by
           | the user; that's why it actually can reduce gibberish and
           | make information more accessible and useful. This sets it
           | apart from the current crop of content generation bots.
        
       | debdut wrote:
       | It's so cool someone made this, but
       | 
       | > The procedural web will be the future of the web. It will offer
       | us infinite content
       | 
       | Yup it'll be infinite "garbage"
        
         | ushakov wrote:
          | the current web isn't too far off with SEO articles and ad-video
         | autoplay
        
           | joken0x wrote:
           | Exactly, it is inevitable, the traditional web will be
           | diluted with the same garbage.
        
             | robbedpeter wrote:
             | Not necessarily - federated media, webs of trust, and
             | diligent curation across many smaller communities could
             | allow for something that replaces Twitter, reddit, and
             | centralized media hubs. Search within that context is
             | easier - p2p/torrent streaming with crypto incentivized
             | seeding can scale distribution.
             | 
             | The current state of adtech and near total surveillance
             | isn't sustainable as more people wake up to the downsides,
             | and as fake crap begins to accumulate.
             | 
             | Decentralization of social media, advertising, e-commerce
             | and other web 2.0 staples will be a natural evolution of
             | technology. The story goes "under Google's model of the
             | walled garden web, SEO, spam, and bots achieved parity in
             | all content metrics except actual meaningfulness to the
             | user." Despite having all the compute and talent you could
             | possibly bring together, Google is failing to uphold its
             | core technology. They incentivized bad faith behavior, and
             | are reaping the consequences of that. The acceleration of
              | SEO hacking and artificial worthless content is
             | asymmetrical to the acceleration of the capabilities and
             | market model Google has created.
             | 
             | A search engine can navigate self selected communities,
             | human curated lists, and creatively bundle lists of lists
             | to achieve high quality results based on actual humans self
             | selecting and acting in their own interests. You can do
             | things with higher quality classification and even provide
             | regex over crawled data without huge technical barriers.
             | Search agents will come about, whether locally or cloud
             | hosted, and will eventually replace centralized engines
             | like Google.
             | 
              | There are non-doomed visions of the future. Maybe we won't
             | suffer a digital trashocalypse.
        
       | mattnewton wrote:
       | > The procedural web will be the future of the web.
       | 
       | Isn't the "procedural web" built of mountains of (hopefully)
       | human written content? How will the system get content about new
       | subjects without the humans writing it? Isn't a system like GPT-3
       | currently limited to reflecting the ground truth data it has
       | seen?
        
         | lumost wrote:
         | For how long? Think of the marketing and censorship
         | opportunities when you can directly tune not just the content
         | that gets seen but also the content itself! Content is still at
         | least somewhat robust to censorship as it's sometimes difficult
         | to remove all references to a banned book. Imagine if banning
         | content also automatically rewrote all references such that
         | they no longer made reference to the content? Or if one could
         | simply pay and have all reviews of a mediocre book changed to
         | make it the greatest book ever?
         | 
         | Note the above is a statement on some of the risks to a
         | procedural web. Not a real market opportunity.
        
           | [deleted]
        
           | visarga wrote:
            | You'd have to use a trusted language model to get the banned
            | information.
        
         | visarga wrote:
         | > Isn't a system like GPT-3 currently limited to reflecting the
         | ground truth data it has seen?
         | 
         | This limitation went away recently. A variant called RETRO
         | (Retrieval-Enhanced Transformer) can use a search engine to
         | take in the exact information up to date [1], assuming you can
         | curate your own text corpus. It's also 25x smaller.
         | 
         | [1] https://deepmind.com/research/publications/2021/improving-
         | la...
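
A toy sketch of the retrieval idea (not DeepMind's actual RETRO implementation): pull the most relevant chunks from a local corpus and condition generation on them. The corpus, the overlap scoring, and the prompt format below are all invented stand-ins.

```python
# Toy retrieval-augmented generation: fetch relevant corpus chunks, then
# prepend them to the prompt for a (hypothetical) language model.
corpus = [
    "RETRO augments a transformer with retrieved text chunks.",
    "GPT-3 is a large autoregressive language model.",
    "Goopt simulates a search engine with generated pages.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank corpus chunks by word overlap with the query -- a crude
    # stand-in for the learned nearest-neighbour search RETRO uses.
    terms = set(query.lower().split())
    return sorted(corpus, key=lambda d: len(terms & set(d.lower().split())),
                  reverse=True)[:k]

def build_prompt(query: str) -> str:
    # A real system feeds retrieved chunks into the model itself;
    # here we just prepend them to the prompt text.
    return "Context: " + " ".join(retrieve(query)) + "\nQuery: " + query

example = build_prompt("how does retro use retrieved chunks")
```

Swapping the corpus for a curated feed is exactly the "limited subset of the web" idea discussed below in the thread.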
        
           | robbedpeter wrote:
           | Give it two years and we might have passable agents running
           | on phones. There'll be a sufficiently powerful and small
            | model that you can use with 8 GB of RAM or less on a desktop
            | within a year.
           | 
           | These first large language models are naive, unoptimized
           | implementations of data structures we're learning to inspect
            | and optimize. Something like RETRO that runs locally with a
           | "just clever enough" service agent is so close to workable. I
           | can't wait to see what happens in ML over the next two years,
           | and who knows what kind of radical evolution the next big
           | algorithm is going to bring.
        
           | mattnewton wrote:
           | That's really cool. But unless I am misunderstanding this,
           | that still puts the burden on the existing web though right,
           | it's just avoiding having to retrain the model? If there is
           | no economical market for humans to produce new content about
           | a topic how will the search engine find the "ground truth"
           | content?
        
             | visarga wrote:
             | You might want to use a limited subset of the web, a
             | curated list of sources or feeds. Apparently 1TB of text
             | could be enough, just need to collect it or download it
             | from a trusted source.
        
               | mattnewton wrote:
               | So, suppose there is a new kind of cocktail that is
               | popular in bars near me that nobody has written about
                | under its new trendy name.
               | 
               | How do I ask this system about the recipe, or the history
               | of the cocktail? Someone has to write an article about
               | it, right? How do they get paid if it gets scraped once
               | and people go to the scraping model for the answer
               | instead of visiting the original article's page?
        
       | zitterbewegung wrote:
        | If you put in your OpenAI key and start running this, they will
        | ban your account, because it will be against their TOS.
       | 
       | With some minor modifications you could port it to goose.ai and
       | it isn't against their TOS.
       | 
       | EDIT: Forking it here https://github.com/zitterbewegung/Goopt to
       | add the functionality above.
        
         | FrenchDevRemote wrote:
         | hey, you beat me to it :D did you manage to get it to work? I
         | tried on my own but got error messages
        
           | zitterbewegung wrote:
            | I'm getting error messages. I think it's an issue where the
            | endpoint is configured to use the old OpenAI value; I think
            | I have to recompile it with TypeScript or something.
        
             | code51 wrote:
             | I made it work but GPT-J doesn't respond in the same manner
             | to the template prompts for search. Goopt cannot use and
             | display GPT output from there as a search result.
        
           | yoland68 wrote:
           | You might have to enable billing, that was the issue for me
        
             | zitterbewegung wrote:
             | No it's a config error.
        
             | zitterbewegung wrote:
             | What did you change ?
        
               | code51 wrote:
                | I set the base and engine in openai-api for goose.ai.
                | Completions return fine; I'm seeing them in the log.
                | However, they are unusable for making Goopt work. The
                | search prompts end with a mock JSON format.
               | 
                | I get this:
                | 
                |     {"content": [{"name": "Keto Diet",
                |         "typeClass": {},
                |         "description": [],
                |         "contentImageUri": null}]}
               | 
                | GPT-J doesn't seem to compose similar broken JSON, so
                | formatResults returns empty.
        
         | dschnurr wrote:
         | OpenAI engineer here. Cloning this project and running it
         | locally with your own API key does not violate any policies.
         | However the way this project is configured, publishing it to
         | the web would expose your API key in the client-side source
         | code, which violates our policies since it would allow your
         | account to be compromised.
        
           | zitterbewegung wrote:
           | Sorry I should have read the TOS
        
         | joken0x wrote:
          | I think that as long as you don't put up a live version (one
          | that doesn't require your own API key) and you don't misuse
          | the results, there shouldn't be a problem.
         | 
         | Even so, it is a good idea to adjust, I'll keep an eye on your
         | fork. Thanks!
        
           | zitterbewegung wrote:
            | Also, for GPT-3 an advanced query mode would be helpful,
            | maybe to give it examples.
        
           | zitterbewegung wrote:
            | It's fine for not being live, but switching over to goose.ai
            | allows you to make it live. I have a home server I could
            | host it on.
        
             | joken0x wrote:
              | I would like to put it live, although it would be
              | necessary to review the terms first.
              | 
              | If you can make any adjustments that respect the terms,
              | contact me and we can look at it together.
        
               | zitterbewegung wrote:
               | I also have GPUs that are mine so that will definitely
               | have any TOS at all but I would have to mirror the OpenAI
               | API.
        
       | aghilmort wrote:
       | very cool / congrats! was recently tweeting about same -- the
       | potential to help humans search the existing web better beyond
       | just keyword search, i.e., query rewriting, summarizing and
       | extending existing content, etc.
       | 
        | there's also the flip side around SEO spam, which is partly why I
        | founded Breeze, a newish topic search engine that leverages
        | curation to hedge against the dark side of human / bot spam, etc.
       | 
       | bottom line, love this, having worked with GPT-3 in past and the
       | direct impact on day job, all things search
        
         | joken0x wrote:
         | Thanks man! It's a good idea, maybe a similar filter or curator
         | will be needed for the procedural web, but for the dark part of
         | the AI; disinformation, meaningless content, etc.
        
       | chrisgp wrote:
       | Does the full version of this require Strong AI to truly replace
       | the internet? What level of AI is necessary to convincingly
       | replicate human understanding and explanation of information?
        
         | joken0x wrote:
          | It is something that is still not clear to me; I see the
          | difficulty of the task, but also the rapid evolution of AI
          | models. Maybe it will surprise us before too long.
        
       | [deleted]
        
       | ameminator wrote:
        | Unfortunately, does not come with a Gwyneth-Paltrow-scented
       | candle.
        
       | fudged71 wrote:
       | This is an incredible idea. There are so many unexplored
       | possibilities when you re-write and re-format the web.
       | 
       | Can you mix procedural and static content? How can you verify
       | accuracy of information? What if you could refine a web page's
       | content just-in-time? Modifying the query and context etc.
       | 
       | Through a lens of Roam/Notion: what if everything were a block
       | that could be individually linked? what if every block could be
       | edited by anybody? what if anyone could add links and annotations
       | across pages? a blend of web and wiki?
        
         | joken0x wrote:
         | * Can you mix procedural and static content? Just the idea of
         | the wiki is interesting here. Perhaps there could be a wiki
         | that stores content in a static way, that is edited by users
         | putting the best content they find on the procedural web. It
         | would be a valuable place to find good ideas or ideas that we
         | might not have thought of but someone else did. This could also
         | serve as feedback for AI models. Although it is also true that
         | we would not be able to distinguish if non-human opinions start
         | to creep in and end up contaminating the site.
         | 
         | * How can you verify accuracy of information? I think this is
         | one of the main difficulties, as the AI would have to
         | understand contexts and have a notion of truth, I think this
         | would already start to touch the capacity of "consciousness".
         | 
         | * What if you could refine a web page's content just-in-time?
         | You will be able to do this for every part that you don't like
         | enough and want something better, or just to see something
         | different.
        
       | amznbyebyebye wrote:
       | Is there any use of ML to distinguish the AI generated dead
       | internet from the real one?
        
         | visarga wrote:
          | For generated headlines, humans are at a coin toss and can't
          | tell them apart, but transformers can reach 85% accuracy.
         | 
         | https://aclanthology.org/2021.nlp4if-1.1.pdf
        
         | [deleted]
        
       | marmarama wrote:
       | If GPT-3 can produce procedurally generated web content this
       | convincing, search engines are screwed, right? We won't be able
       | to find anything useful on any current search engine because
       | there's no straightforward algorithmic way to tell useful content
       | from endless link farms full of utterly convincing but totally
       | useless content.
        
         | skybrian wrote:
          | When you say "this convincing", what are you basing it on?
        
         | Workaccount2 wrote:
          | In the story of the Library of Babel, the librarians live in
          | despair because despite having access to all the world's
          | information, they also have access to all the world's
          | disinformation, and all mixed together there is no way to tell
          | which is which.
        
           | joken0x wrote:
           | Very good reference. It is a problem that will remain.
        
         | jerf wrote:
         | Yes, I think we're still a couple of years from this becoming
         | an intractable problem, but it's absolutely coming.
         | 
         | Startup entrepreneurs in the mood for a Hail Mary play take
         | note. How do you have a web search engine in a world where
         | there no longer exists any algorithm for telling spam apart
         | from real content? "Go back to the original Yahoo" is a decent
         | start but certainly nowhere _near_ a complete answer in 2022!
         | 
         | My guess is that it may not even take the form of what we have
         | today, with an arbitrary text box. Maybe you have to go down to
         | a specific category at least. Who knows. I sure don't. All I
         | can say is that it sure looks to me like the spammers are only
         | a year or two from effective total victory in the current
         | paradigm.
        
         | joken0x wrote:
          | Yes, I think the same. Search engines try to match what exists
          | of previously created content (a finite set, so it will always
          | be a limitation) with our query or need to know, while AI can
          | generate and adjust the answer to what we need or want to
          | know, even for our purpose, intellectual level, etc.
          | Basically, tailored responses.
        
         | [deleted]
        
         | smrtinsert wrote:
         | I'd happily pay 5 bucks a month for a search engine searching
         | only a curated list of sites.
         | 
         | Under such a system any company that begins producing spam
         | could be removed, and we could go back to the lovely days of
         | something simple like page rank being used to provide relevant
         | results.
        
           | TechBro8615 wrote:
           | I would pay for this service too, but only if the list was
           | personal to me, and I could add or remove sites from it.
           | 
           | It would also be cool if I could upload my own crawling
           | modules, so I could index more than just websites.
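
A toy sketch of this curated, user-editable allowlist search (sites, index contents, and function names are all invented for illustration):

```python
from urllib.parse import urlparse

# A user-editable allowlist of hosts; only pages from these sites can
# ever appear in results. The "index" is a hardcoded stand-in for a crawl.
allowlist = {"en.wikipedia.org", "news.ycombinator.com"}
index = [
    ("https://en.wikipedia.org/wiki/Search_engine", "search engine history"),
    ("https://spamfarm.example/keto", "keto diet miracle search engine"),
]

def search(query: str) -> list[str]:
    terms = set(query.lower().split())
    return [url for url, text in index
            if urlparse(url).hostname in allowlist      # curation filter
            and terms & set(text.lower().split())]      # naive term match

# The spam-farm page matches the query but is filtered out by the allowlist:
assert search("search engine") == ["https://en.wikipedia.org/wiki/Search_engine"]
```

A company producing spam is "removed" by deleting its host from `allowlist`; personalizing the engine is just letting each user own their copy of that set.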
        
         | fudged71 wrote:
         | I'm now convinced that Google will show artificial results as
         | an amalgamation of the other results and pocket the ad views
         | for themselves. It's the logical conclusion isn't it? Question
         | is how they would distinguish those results in search.
        
           | kingcharles wrote:
           | It makes perfect sense. I guess the crux of Google Search is
           | to give you an answer to a question. Do you care who gives
           | you the answer as long as it is right?
        
         | moffkalast wrote:
         | At least we can still use Google to search Reddit.
        
           | jay00 wrote:
           | Until all reddit posts will be gpt-3 generated.
        
             | kelseyfrog wrote:
             | It should be possible to train an upvote prediction model
             | conditioned on submission title. This could then be used to
             | optimize GPT-3-family models to produce text which had the
             | highest predicted upvote response. It's a couple-weekend
              | project and I'd be surprised if an AI hobbyist hadn't done
             | it already.
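
A best-of-n sketch of that idea. Both functions are deterministic stubs: a real version would sample candidates from a language model and score them with a trained upvote-prediction model.

```python
import hashlib

def generate_candidates(title: str, n: int = 4) -> list[str]:
    # Stub generator: a real bot would sample n completions from an LM.
    return [f"Candidate {i} about {title!r}" for i in range(n)]

def predicted_upvotes(title: str, comment: str) -> float:
    # Stub predictor: hash the (title, comment) pair into [0, 1).
    digest = hashlib.sha256((title + "\x00" + comment).encode()).digest()
    return digest[0] / 256

def best_comment(title: str) -> str:
    # Best-of-n selection: keep the candidate with the highest score.
    return max(generate_candidates(title),
               key=lambda c: predicted_upvotes(title, c))
```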
        
               | jazzyjackson wrote:
               | In the trivial case, karma farming bots just keep a
               | database of all Reddit history (it is a public dataset,
                | a few hundred gigabytes) and repost the top comments (top
               | threads even) whenever they detect a reposted link (extra
               | points for similarity / reverse image searching)
               | 
               | It's a project I have on the back burner to analyze
               | Reddit history to check what ratio of comments are
               | actually original, and I'd like to build a link
               | aggregator that sorts by novelty.
        
               | kingcharles wrote:
               | I've thought about this too, and the fact that I've not
               | seen such a bot so far is pretty unbelievable. It's not a
               | huge amount of work to code it. Working across the whole
               | of Reddit (or HN for that matter), it would gather an
               | ungodly amount of karma (and awards) in a small amount of
               | time.
        
           | dividuum wrote:
           | You mean this reddit?
           | https://old.reddit.com/r/SubredditSimulator/
        
             | marstall wrote:
             | total gibberish?
        
               | mgdlbp wrote:
                | The original uses Markov chains; it was usurped a couple
                | of years ago by https://old.reddit.com/r/SubSimulatorGPT2/
        
       | sqs wrote:
       | Haha, this is an amazing concept. It feels like a satire or an
       | art piece. I love it, but it kind of gives me the "is the world
       | real?" feeling.
        
       ___________________________________________________________________
       (page generated 2022-02-23 23:00 UTC)