[HN Gopher] Show HN: Goopt - Search Engine for a Procedural Simu...
___________________________________________________________________
Show HN: Goopt - Search Engine for a Procedural Simulation of the
Web with GPT-3
Author : joken0x
Score : 146 points
Date : 2022-02-23 17:32 UTC (5 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| kelseyfrog wrote:
| I love this. It's the reification of the Dead-Internet Theory - a
| tangible artifact embodying the feeling that the internet was
| replaced by its own simulacrum powered by AI.[1] The existence of
| Goopt is the culmination of DIT as self-fulfilling prophecy. We
| can almost see the outline of an Internet Turing test beginning
| to form. How well can we discern the real internet from the fake
| one? Consequently, what happens when the line becomes so blurred
| that we lose the ability to perceive the difference?
|
| 1.
| https://www.theatlantic.com/technology/archive/2021/08/dead-...
| joken0x wrote:
| I think there will come a point when the traditional internet
| will also become diluted in artificiality: content from the
| procedural web will start to creep into the traditional web, and
| we won't be able to distinguish the two. It's interesting to
| think of it in terms of Baudrillard's Simulacra and Simulation.
| kingcharles wrote:
| This is exactly what a bot would say.
| TechBro8615 wrote:
| We are still missing a decentralized identity and reputation
| layer. I imagine in the future, people will sign their
| comments with a key that proves they're a real human.
| klabb3 wrote:
| This. It's the elephant in the room for so many of our
| problems related to abuse, trolling, influence campaigns,
| garbage content, attribution etc etc.
|
| OTOH, is proof of personhood really what we want long term,
| or is that just a proxy? Hypothetically, if an AI is good &
| trustworthy enough, why not allow a higher "source rating"
| for that than low quality human content?
| Shared404 wrote:
| Relevant XKCD: https://xkcd.com/810/
| nextaccountic wrote:
| AIs might mine cryptocurrencies on their own, then buy keys from
| desperate humans and post as if they were the owners of the
| keys.
| jazzyjackson wrote:
| Robots will of course have keys they can sign with, so perhaps we
| will rely on government digital IDs such as the Estonian public
| key system. In that case, governments will have a monopoly on
| sockpuppet accounts.
|
| Have there been any successors to the web of trust?
| kelseyfrog wrote:
| I would simply ask GPT-n what to write next, copy-paste it into
| the text box, sign it with my key, and hit reply. Its ability to
| write better comments than I can would serve to gain me
| reputation rather than reduce it, as is so often the case at
| present.
| jerf wrote:
| We're past that point, probably by a couple of years at a
| minimum. No sarcasm. Content farms are definitely using tech like
| this. GPT-3 is really good at generating text but still has some
| characteristic failures, and I encounter content farm web pages
| (despite my best efforts) that have clearly used it or something
| like it as a tool. Even just in the last couple of weeks I've
| seen some new, innovative content farms, ones I'd not encountered
| before, managing to pollute my search results.
| krzat wrote:
| Oh boy, we need AI powered blockers that filter out this
| stuff.
| joken0x wrote:
| AI fighting against AI, everything will be AI.
| klabb3 wrote:
| This seems like a terribly difficult problem. I'm sure
| today there are some give-away signatures, but I put my
| money on the impersonators in the long run.
| kelseyfrog wrote:
| > I think there will come a point when the traditional
| internet will also become diluted in artificiality
|
| This is already a concern in corpus creation for ML/AI projects.
| Researchers would very much like to have human-only generated
| content when training models on internet-sourced text. People
| posting GPT-created output have the potential to taint these
| corpora and create all sorts of strange feedback loops.
| joken0x wrote:
| That's right; human creation is, and will increasingly become, a
| very precious treasure.
| rm_-rf_slash wrote:
| Knowing people, it wouldn't take long for incoherent generative
| syntax to become a meme and reinforce the human corpora with new
| syntactic slang.
| Rebelgecko wrote:
| Anathem by Neal Stephenson has a great subplot about this.
| The information age has been a bit stunted because there's
| too much crap[1] on the internet. Companies sprung up selling
| filters that would block websites with low-quality or AI
| generated information. Eventually these companies realized
| they could drum up business by generating low quality[2]
| content themselves, especially if they could get it past
| their competitors' filters. The end result was that the internet
| became a convoluted morass of bullshit and lies from which it is
| difficult for non-experts to extract useful knowledge.
|
| [1]: Or maybe CRAAP https://en.wikipedia.org/wiki/CRAAP_test
|
| [2]: They quickly realized the trick is to make _high quality_
| low quality content. 100 pages of gibberish is way less effective
| than a convincing essay that happens to include a few key
| falsehoods.
| joken0x wrote:
| It seems to be a loop of information from which there is no way
| out, one that devours information and stirs it into piles of
| garbage that it generates without stopping: labyrinth and crypt
| at the same time, growing and churning ever more. I think we have
| to start a serious effort to collect and save human creation;
| otherwise it will become a hidden treasure buried under layers of
| artificial garbage. If it can be difficult for us to distinguish
| human content from synthetic content, imagine how hard it will be
| for future generations, who will not even have the living context
| of our time and who will already be more accustomed to synthetic
| content than to properly human content.
| kelseyfrog wrote:
| How will this change how we evaluate content? Will
| digital media be socially and culturally devalued? Will
| print media gain greater status? Live spoken word?
| Curated content with a reputation layer? Obviously it's
| pure speculation, but let's indulge for a moment.
| randomstring wrote:
| Cuil was way ahead of its time.
| https://news.ycombinator.com/item?id=1255122
|
| http://cuiltheory.wikidot.com/ Cuil Theory
| TehCorwiz wrote:
| I haven't thought of that in years. I gave up my crusade to
| revive use of the interrobang(!?) in writing a while ago.
|
| After reviewing the materials, I see that Cuil Theory has come a
| bit further since I last read it. I believe that Goopt would be
| somewhere around -2!? from Cuil theory itself, negative because
| it's literal reality, but distant because it's an abstract
| embodiment.
|
| Slightly off-topic. During my cursory reading I see that
| imaginary Cuil got fleshed out. I'd like a second opinion. The
| way it reads to me is that 'i!?' is almost the literal
| definition of solipsism.
| [deleted]
| zuzun wrote:
| So basically modern Google without ads.
| joken0x wrote:
| Exactly, content no longer revolves around monetization.
| ffhhj wrote:
| "When there is no monetization you are the product" (patent
| pending)
| skybrian wrote:
| In the event you think you're looking at a simulation of the
| Internet, maybe start out by checking if news, maps, and weather
| are realistic, to see how good their world simulation is. Live
| news video should be interesting too.
|
| But if your browser is compromised so that encryption doesn't
| work, I think you have bigger problems.
| Geee wrote:
| This is clearly the future. All information will be generated on
| the fly and tailored for you. AI can match your level of
| knowledge, your language, your preferred style etc. AI can
| simplify / extend topics on demand, and also generate
| illustrations and videos to help explain topics.
|
| I think most of the current form of the web, pregenerated content
| plus search, will become completely unnecessary, and it'll
| basically stop existing.
| joken0x wrote:
| You get it, man, that is exactly the question. It is time to
| think about the many possibilities, problems, dilemmas,
| paradoxes, etc. It is very interesting and disturbing at the
| same time.
| Geee wrote:
| It changes everything. Thanks for coming up with this. I have
| thought about AI generated content before but not in this
| way. I just realized that we don't need the content web; we
| need just raw data sources and AI that generates content on
| the fly, for the user. The AI works for and is directed by
| the user; that's why it actually can reduce gibberish and
| make information more accessible and useful. This sets it
| apart from the current crop of content generation bots.
| debdut wrote:
| It's so cool someone made this, but
|
| > The procedural web will be the future of the web. It will offer
| us infinite content
|
| Yup it'll be infinite "garbage"
| ushakov wrote:
| the current web isn't too far off, with SEO articles and ad-video
| autoplay
| joken0x wrote:
| Exactly, it is inevitable, the traditional web will be
| diluted with the same garbage.
| robbedpeter wrote:
| Not necessarily - federated media, webs of trust, and
| diligent curation across many smaller communities could
| allow for something that replaces Twitter, reddit, and
| centralized media hubs. Search within that context is
| easier - p2p/torrent streaming with crypto incentivized
| seeding can scale distribution.
|
| The current state of adtech and near total surveillance
| isn't sustainable as more people wake up to the downsides,
| and as fake crap begins to accumulate.
|
| Decentralization of social media, advertising, e-commerce
| and other web 2.0 staples will be a natural evolution of
| technology. The story goes "under Google's model of the
| walled garden web, SEO, spam, and bots achieved parity in
| all content metrics except actual meaningfulness to the
| user." Despite having all the compute and talent you could
| possibly bring together, Google is failing to uphold its
| core technology. They incentivized bad faith behavior, and
| are reaping the consequences of that. The acceleration of SEO
| hacking and artificial worthless content is asymmetrical to the
| acceleration of the capabilities and market model Google has
| created.
|
| A search engine can navigate self selected communities,
| human curated lists, and creatively bundle lists of lists
| to achieve high quality results based on actual humans self
| selecting and acting in their own interests. You can do
| things with higher quality classification and even provide
| regex over crawled data without huge technical barriers.
| Search agents will come about, whether locally or cloud
| hosted, and will eventually replace centralized engines
| like Google.
|
| There are non-doomed visions of the future. Maybe we won't suffer
| a digital trashocalypse.
| mattnewton wrote:
| > The procedural web will be the future of the web.
|
| Isn't the "procedural web" built of mountains of (hopefully)
| human written content? How will the system get content about new
| subjects without the humans writing it? Isn't a system like GPT-3
| currently limited to reflecting the ground truth data it has
| seen?
| lumost wrote:
| For how long? Think of the marketing and censorship
| opportunities when you can directly tune not just the content
| that gets seen but also the content itself! Content is still at
| least somewhat robust to censorship as it's sometimes difficult
| to remove all references to a banned book. Imagine if banning
| content also automatically rewrote all references such that
| they no longer made reference to the content? Or if one could
| simply pay and have all reviews of a mediocre book changed to
| make it the greatest book ever?
|
| Note the above is a statement on some of the risks to a
| procedural web. Not a real market opportunity.
| [deleted]
| visarga wrote:
| You'd have to use a trusted language model to get you the banned
| information.
| visarga wrote:
| > Isn't a system like GPT-3 currently limited to reflecting the
| ground truth data it has seen?
|
| This limitation went away recently. A variant called RETRO
| (Retrieval-Enhanced Transformer) can use retrieval from an
| external text database to pull in exact, up-to-date information
| [1], assuming you can curate your own text corpus. It's also 25x
| smaller.
|
| [1] https://deepmind.com/research/publications/2021/improving-
| la...
| robbedpeter wrote:
| Give it two years and we might have passable agents running on
| phones. There'll be a sufficiently powerful and small model that
| you can use with 8 GB of RAM or less on desktop within a year.
|
| These first large language models are naive, unoptimized
| implementations of data structures we're learning to inspect
| and optimize. Something like RETRO that runs locally with a
| "just clever enough" service agent is so close to workable. I
| can't wait to see what happens in ML over the next two years,
| and who knows what kind of radical evolution the next big
| algorithm is going to bring.
| mattnewton wrote:
| That's really cool. But unless I am misunderstanding this, that
| still puts the burden on the existing web though, right? It's
| just avoiding having to retrain the model. If there is no
| economic market for humans to produce new content about a topic,
| how will the search engine find the "ground truth" content?
| visarga wrote:
| You might want to use a limited subset of the web, a
| curated list of sources or feeds. Apparently 1TB of text
| could be enough, just need to collect it or download it
| from a trusted source.
| mattnewton wrote:
| So, suppose there is a new kind of cocktail that is popular in
| bars near me that nobody has written about under its new trendy
| name.
|
| How do I ask this system about the recipe, or the history
| of the cocktail? Someone has to write an article about
| it, right? How do they get paid if it gets scraped once
| and people go to the scraping model for the answer
| instead of visiting the original article's page?
| zitterbewegung wrote:
| If you put in your OpenAI key and start running this, they will
| ban your account because it will be against their TOS.
|
| With some minor modifications you could port it to goose.ai, and
| it wouldn't be against their TOS.
|
| EDIT: Forking it here https://github.com/zitterbewegung/Goopt to
| add the functionality above.
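|
| A minimal sketch of what that change might look like; the
| goose.ai base URL, engine name, and env vars here are my
| assumptions, not something Goopt ships with:
|
|   // Hypothetical: point an OpenAI-style completion call at
|   // goose.ai instead of api.openai.com.
|   const BASE_URL =
|     process.env.GOOSE_BASE_URL ?? "https://api.goose.ai/v1";
|   const ENGINE = "gpt-j-6b";
|
|   async function complete(prompt: string): Promise<string> {
|     const res = await fetch(
|       `${BASE_URL}/engines/${ENGINE}/completions`,
|       {
|         method: "POST",
|         headers: {
|           "Content-Type": "application/json",
|           Authorization: `Bearer ${process.env.GOOSE_API_KEY}`,
|         },
|         body: JSON.stringify({
|           prompt,
|           max_tokens: 256,
|           temperature: 0.7,
|         }),
|       }
|     );
|     const data = await res.json();
|     return data.choices?.[0]?.text ?? "";
|   }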
| FrenchDevRemote wrote:
| hey, you beat me to it :D did you manage to get it to work? I
| tried on my own but got error messages
| zitterbewegung wrote:
| I'm getting error messages. I think it's an issue where the
| endpoint is configured to use the old OpenAI value; I think I
| have to recompile it with TypeScript or something.
| code51 wrote:
| I made it work, but GPT-J doesn't respond in the same manner to
| the template prompts for search, so Goopt cannot use and display
| the GPT output from there as a search result.
| yoland68 wrote:
| You might have to enable billing, that was the issue for me
| zitterbewegung wrote:
| No it's a config error.
| zitterbewegung wrote:
| What did you change ?
| code51 wrote:
| I set the base and engine in openai-api for goose.ai.
| Completions return fine; I'm seeing them in the log. However,
| they're unusable for making Goopt work. The search prompts end
| with a mock JSON format.
|
| I get this: \"content\": [{\"name\": \"Keto Diet\", \n
| \"typeClass\": {}, \n \"description\":[],\n
| \"contentImageUri\": null }]}
|
| GPT-J doesn't seem to compose similar broken JSON, so
| formatResults returns empty.
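|
| Something like this could salvage completions that at least
| contain a parseable block, though it won't fix output that is
| structurally broken; extractJson is a hypothetical helper, not
| something in the Goopt repo:
|
|   // Pull the first {...} block out of a completion and parse it,
|   // returning null if the model didn't produce usable JSON.
|   function extractJson(completion: string): unknown {
|     const match = completion.match(/\{[\s\S]*\}/);
|     if (!match) return null;
|     try {
|       return JSON.parse(match[0]);
|     } catch {
|       return null;
|     }
|   }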
| dschnurr wrote:
| OpenAI engineer here. Cloning this project and running it
| locally with your own API key does not violate any policies.
| However, the way this project is configured, publishing it to the
| web would expose your API key in the client-side source
| code, which violates our policies since it would allow your
| account to be compromised.
| zitterbewegung wrote:
| Sorry I should have read the TOS
| joken0x wrote:
| I think that as long as you don't put up a live version (one
| where visitors don't need their own API key) and you don't misuse
| the results, there shouldn't be a problem.
|
| Even so, it is a good idea to adjust, I'll keep an eye on your
| fork. Thanks!
| zitterbewegung wrote:
| Also, for GPT-3, an advanced query would be helpful, maybe one
| that gives it examples.
| zitterbewegung wrote:
| It's fine as long as it's not live, but switching over to
| goose.ai would allow you to make it live. I have a home server I
| could host it on.
| joken0x wrote:
| I would like to put it live, although it would be necessary to
| review the terms before launching it live.
|
| If you can make any adjustments that respect the terms, contact
| me and we can look at it together.
| zitterbewegung wrote:
| I also have GPUs of my own, so that definitely wouldn't fall
| under any TOS at all, but I would have to mirror the OpenAI API.
| aghilmort wrote:
| very cool / congrats! was recently tweeting about same -- the
| potential to help humans search the existing web better beyond
| just keyword search, i.e., query rewriting, summarizing and
| extending existing content, etc.
|
| there's also the flip side around SEO spam, which is partly why I
| founded Breeze, a newish topic search engine that leverages
| curation to hedge against the dark side of human / bot spam, etc.
|
| bottom line, love this, having worked with GPT-3 in the past and
| seen its direct impact on my day job, all things search
| joken0x wrote:
| Thanks man! It's a good idea; maybe a similar filter or curator
| will be needed for the procedural web, but for the dark side of
| the AI: disinformation, meaningless content, etc.
| chrisgp wrote:
| Does the full version of this require Strong AI to truly replace
| the internet? What level of AI is necessary to convincingly
| replicate human understanding and explanation of information?
| joken0x wrote:
| It is something that is still not clear to me; I see the
| difficulty of the task but also the rapid evolution of AI models.
| Maybe it will surprise us before too long.
| [deleted]
| ameminator wrote:
| Unfortunately, it does not come with a Gwyneth-Paltrow-scented
| candle.
| fudged71 wrote:
| This is an incredible idea. There are so many unexplored
| possibilities when you re-write and re-format the web.
|
| Can you mix procedural and static content? How can you verify
| accuracy of information? What if you could refine a web page's
| content just-in-time? Modifying the query and context etc.
|
| Through a lens of Roam/Notion: what if everything were a block
| that could be individually linked? what if every block could be
| edited by anybody? what if anyone could add links and annotations
| across pages? a blend of web and wiki?
| joken0x wrote:
| * Can you mix procedural and static content? Just the idea of
| the wiki is interesting here. Perhaps there could be a wiki
| that stores content in a static way, that is edited by users
| putting the best content they find on the procedural web. It
| would be a valuable place to find good ideas or ideas that we
| might not have thought of but someone else did. This could also
| serve as feedback for AI models. Although it is also true that
| we would not be able to distinguish if non-human opinions start
| to creep in and end up contaminating the site.
|
| * How can you verify accuracy of information? I think this is one
| of the main difficulties, as the AI would have to understand
| context and have a notion of truth; I think this would already
| start to touch on the capacity for "consciousness".
|
| * What if you could refine a web page's content just-in-time? You
| would be able to do this for every part that you don't find good
| enough and want improved, or just to see something different.
| amznbyebyebye wrote:
| Is there any use of ML to distinguish the AI-generated dead
| internet from the real one?
| visarga wrote:
| For generated headlines, humans are a coin toss; they can't tell
| them apart. But transformers can reach 85% accuracy.
|
| https://aclanthology.org/2021.nlp4if-1.1.pdf
| [deleted]
| marmarama wrote:
| If GPT-3 can produce procedurally generated web content this
| convincing, search engines are screwed, right? We won't be able
| to find anything useful on any current search engine because
| there's no straightforward algorithmic way to tell useful content
| from endless link farms full of utterly convincing but totally
| useless content.
| skybrian wrote:
| When you say "this convincing", what are you basing it on?
| Workaccount2 wrote:
| In the story of the Library of Babel, the librarians live in
| despair because despite having access to all the world's
| information, they also have access to all the world's
| disinformation, and all mixed together there is no way to tell
| which is which.
| joken0x wrote:
| Very good reference. It is a problem that will remain.
| jerf wrote:
| Yes, I think we're still a couple of years from this becoming
| an intractable problem, but it's absolutely coming.
|
| Startup entrepreneurs in the mood for a Hail Mary play, take
| note. How do you have a web search engine in a world where
| there no longer exists any algorithm for telling spam apart
| from real content? "Go back to the original Yahoo" is a decent
| start but certainly nowhere _near_ a complete answer in 2022!
|
| My guess is that it may not even take the form of what we have
| today, with an arbitrary text box. Maybe you have to go down to
| a specific category at least. Who knows. I sure don't. All I
| can say is that it sure looks to me like the spammers are only
| a year or two from effective total victory in the current
| paradigm.
| joken0x wrote:
| Yes, I think the same. Search engines try to match previously
| created content (which is finite, so it will always be a
| limitation) with our query or need to know, while AI can generate
| and adjust the answer to what we need or want to know, even for
| our purpose, intellectual level, etc. Basically, tailored
| responses.
| [deleted]
| smrtinsert wrote:
| I'd happily pay 5 bucks a month for a search engine searching
| only a curated list of sites.
|
| Under such a system any company that begins producing spam
| could be removed, and we could go back to the lovely days of
| something simple like page rank being used to provide relevant
| results.
| TechBro8615 wrote:
| I would pay for this service too, but only if the list was
| personal to me, and I could add or remove sites from it.
|
| It would also be cool if I could upload my own crawling
| modules, so I could index more than just websites.
| fudged71 wrote:
| I'm now convinced that Google will show artificial results as
| an amalgamation of the other results and pocket the ad views
| for themselves. It's the logical conclusion, isn't it? The
| question is how they would distinguish those results in search.
| kingcharles wrote:
| It makes perfect sense. I guess the crux of Google Search is
| to give you an answer to a question. Do you care who gives
| you the answer as long as it is right?
| moffkalast wrote:
| At least we can still use Google to search Reddit.
| jay00 wrote:
| Until all Reddit posts are GPT-3 generated.
| kelseyfrog wrote:
| It should be possible to train an upvote prediction model
| conditioned on submission title. This could then be used to
| optimize GPT-3-family models to produce text which had the
| highest predicted upvote response. It's a couple-weekend
| project and I'd be surprised if an AI hobbyist hadn't done it
| already.
| jazzyjackson wrote:
| In the trivial case, karma farming bots just keep a
| database of all Reddit history (it is a public dataset,
| a few hundred gigabytes) and repost the top comments (top
| threads even) whenever they detect a reposted link (extra
| points for similarity / reverse image searching)
|
| It's a project I have on the back burner to analyze
| Reddit history to check what ratio of comments are
| actually original, and I'd like to build a link
| aggregator that sorts by novelty.
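|
| The trivial case really is just a lookup; here's a toy sketch of
| the idea, with data loading and posting left out and every name
| made up:
|
|   // Map a normalized submission URL to the best-scoring comment
|   // seen under any earlier post of the same link.
|   const bestCommentByUrl = new Map<string, string>();
|
|   function normalize(url: string): string {
|     const u = new URL(url);
|     return `${u.hostname}${u.pathname}`.toLowerCase();
|   }
|
|   function onNewSubmission(url: string): string | null {
|     // If we've seen this link before, "repost" its top comment.
|     return bestCommentByUrl.get(normalize(url)) ?? null;
|   }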
| kingcharles wrote:
| I've thought about this too, and the fact that I've not
| seen such a bot so far is pretty unbelievable. It's not a
| huge amount of work to code it. Working across the whole
| of Reddit (or HN for that matter), it would gather an
| ungodly amount of karma (and awards) in a small amount of
| time.
| dividuum wrote:
| You mean this reddit?
| https://old.reddit.com/r/SubredditSimulator/
| marstall wrote:
| total gibberish?
| mgdlbp wrote:
| The original uses Markov chains, was usurped a couple of
| years ago by https://old.reddit.com/r/SubSimulatorGPT2/
| sqs wrote:
| Haha, this is an amazing concept. It feels like a satire or an
| art piece. I love it, but it kind of gives me the "is the world
| real?" feeling.
___________________________________________________________________
(page generated 2022-02-23 23:00 UTC)