[HN Gopher] Show HN: AskHN
___________________________________________________________________
Show HN: AskHN
Author : kvh
Score : 463 points
Date : 2023-02-22 16:09 UTC (6 hours ago)
(HTM) web link (www.patterns.app)
(TXT) w3m dump (www.patterns.app)
| AndrewKemendo wrote:
| First thing I saw is my answers to someone's question.
|
| Can you cut me a distro of myself?
| [deleted]
| blueicelake121 wrote:
| [flagged]
| gus_massa wrote:
| It looks interesting, but posting it on random threads of HN
| will make users flag your post and mods ban your account.
|
| The post definitely needs more info! Who are you? How do you
| pick the kids? Are you the "teacher", an "organizer", or just
| someone enthusiastic who is related to the project? Programming
| language? Age of the kids? Have you done something similar
| before? Length of the course? Why do you need money?
|
| Try to write a long post answering all those questions and
| perhaps a few more, but not too long. Make a new post, then
| make a comment explaining you are the [teacher or whatever],
| and be ready to reply to the comments in the thread.
|
| Some official suggestions in
| https://news.ycombinator.com/newswelcome.html
| [deleted]
| jaequery wrote:
| Seeing a ton of projects utilizing ChatGPT nowadays. Are the
| project owners basically paying the API costs out of pocket?
| I'd think it would add up pretty quickly, especially after
| hitting the front page of HN.
| PetrBrzyBrzek wrote:
| Hi, thanks for the interesting article. I have a question about
| Pinecone. What is the cost of storing all these vectors?
| tomr75 wrote:
| could you do this for medical journal articles?
| all2 wrote:
| You'd probably need to prepend a prompt that told the bot how
| to analyze experiment design. Maybe have it read a book or 10
| on experiment design. Also a few books on social networks,
| financial motivations and other human factors in science.
| _Then_ let it take a look at journal articles and their
| metadata. In short, you need a way to vet for quality.
| leobg wrote:
| I like the project. Had been wanting to do this myself for a long
| time, because HN has become the first place I go to nowadays for
| answers, and I value the intelligence and experience distilled in
| the comments here.
|
| I do not like that it seems to be effectively an ad.
|
| > Embedding every single one of the 6.5 million eligible
| comments was prohibitively time-consuming and expensive (12
| hours and ~$2,000).
|
| Does anybody understand what he's talking about here? Assuming
| 6.5 million comments and an average token length of 70 we'd be
| looking at $180 ($0.0004 / 1K tokens).
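The commenter's estimate is easy to check; a quick back-of-the-envelope calculation using the figures they give (6.5M comments, 70 tokens each, ada-002's then-current $0.0004 per 1K tokens):

```python
comments = 6_500_000
avg_tokens = 70
price_per_1k = 0.0004  # USD per 1K tokens, the rate quoted in the comment

total_tokens = comments * avg_tokens       # 455,000,000 tokens
cost = total_tokens / 1000 * price_per_1k  # = 182.0
print(f"${cost:.0f}")  # → $182
```

That lands on roughly the commenter's $180 figure, an order of magnitude below the ~$2,000 quoted from the article.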
| jerpint wrote:
| Nice! We built something very similar recently; it's more like
| "Ask your documentation", but a very similar implementation
| otherwise.
|
| See a demo on the huggingface transformers documentation:
| https://huggingface.co/spaces/jerpint/buster
|
| code: https://github.com/jerpint/buster
| jn2clark wrote:
| As did we! It works very well. Article:
| https://github.com/marqo-ai/marqo/blob/mainline/examples/GPT...
| and code: https://github.com/marqo-
| ai/marqo/tree/mainline/examples/GPT...
| freeqaz wrote:
| Starred! We've been looking to build something similar so I
| appreciate you sharing this here.
|
| The only other project that I've seen that's doing something
| close to this is this one: https://github.com/getbuff/Buff
|
| It's a bit more similar to the OP's bot (it's a Discord bot).
|
| Cool to see momentum in this space!
| clark-kent wrote:
| > 4. Index the embeddings in a database
|
| If OP is reading: I'm curious about the database you're using
| to store the embeddings. Pinecone, Weaviate...?
| gk1 wrote:
| From the article:
|
| > The embeddings were then indexed with Pinecone.
| monkeydust wrote:
| Nice work! Been playing with Langchain and was not aware of
| patterns.app.
|
| This whole space is moving so fast it's hard to keep up for
| someone whose immediate day job doesn't revolve around it.
| Congrats.
| motohagiography wrote:
| Nice. I just sort of assumed early on my comments were training
| some future AI, and I hope that in some small way I have been
| able to moderate some of its stupider urges.
|
| A version where you can turn knobs of flavored contributors would
| be pretty funny. I know my comment style is easily identifiable
| and reproducible, and it encodes a certain type of logical
| conjugation, albeit biased with some principles and trigger
| topics, and I think there is enough material on HN that there may
| be such a thing as a distinct, motohagiographic lens. :)
| jawadch93 wrote:
| [dead]
| smsm42 wrote:
| How to get a job at Google? Oh, that's easy, just get a PhD.
|
| Thanks bottie, very use, much helpful.
| tosemlep wrote:
| Did you also ingest dead comments to the corpus?
|
| I would very much like to see the ghost of Terry pop up from time
| to time, to offer his wisdom and unique style of response.
| MuffinFlavored wrote:
| > I trained on a corpus of over 6.5 million Hacker News comments
|
| How long did it take to scrape them and train the "corpus" on
| this content?
| tta wrote:
| https://console.cloud.google.com/marketplace/details/y-combi...
| SalimoS wrote:
| That was mentioned in the article, in the "Ingesting and
| filtering HN corpus" section ... 30 min ...
| gnicholas wrote:
| Love that it includes sources -- this makes it much more valuable
| because you can tell if it's giving useful information or just
| blowing smoke.
| boywitharupee wrote:
| For those who are wondering,
|
| HN data is indexed with embeddings for semantic search. When
| queried, it finds closest article, top comments and summarizes
| with GPT-3.
|
| GPT-3 serves as a rendering tool for compressed comments.
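The flow described above (embed, retrieve nearest comments, summarize) can be sketched in a few lines. This is a minimal sketch, not the author's actual code: toy 3-d vectors stand in for 1536-d ada-002 embeddings, and `llm` is any text-completion callable standing in for the GPT-3 endpoint:

```python
from math import sqrt

def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, docs, k=2):
    # docs: list of (text, embedding); return the k texts closest to the query
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def answer(question, question_vec, docs, llm):
    # stuff the nearest comments into a prompt and let the model summarize
    context = "\n---\n".join(top_k(question_vec, docs))
    prompt = (f"Answer the question using only the HN comments below.\n"
              f"Comments:\n{context}\n\nQuestion: {question}\nAnswer:")
    return llm(prompt)

# Toy example: 3-d "embeddings" stand in for real ada-002 vectors.
docs = [("use a queue", [1.0, 0.1, 0.0]),
        ("rewrite it in Rust", [0.0, 1.0, 0.2]),
        ("add an index", [0.9, 0.2, 0.1])]
nearest = top_k([1.0, 0.0, 0.0], docs)  # the two vectors nearest the query
```

In a real deployment the embeddings come from the OpenAI embedding endpoint and live in a vector store; the ranking and prompt-assembly logic is the same.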
| SilverBirch wrote:
| This might be a dumb question, but is this based on the
| _collective_ wisdom of HN? Because I would say that the
| collective wisdom is just as much in the interaction of the
| comments and the ranking of those comments as it is in the
| comments themselves. If you just ingest all the comments
| wholesale, aren't you rather getting the _average_ wisdom of HN?
| inportb wrote:
| I believe it's always going to be _an average_. The more
| interesting question is how the average is weighted.
| setgree wrote:
| As is often true of GPT responses, there's some nonsense
| interspersed here, e.g. the claim that R has "a more mature
| package universe" than Python.
|
| I think this is false, but if you're reading quickly, it sounds
| cogent enough. As Sarah Constantin observed about GPT2 [0]:
|
| > if you skim text, you miss obvious absurdities. The point is
| OpenAI HAS achieved the ability to pass the Turing test against
| humans on autopilot...The mental motion of "I didn't really parse
| that paragraph, but sure, whatever, I'll take the author's word
| for it" is, in my introspective experience, absolutely identical
| to "I didn't really parse that paragraph because it was bot-
| generated and didn't make any sense so I couldn't possibly have
| parsed it", except that in the first case, I assume that the
| error lies with me rather than the text. This is not a safe
| assumption in a post-GPT2 world. Instead of "default to humility"
| (assume that when you don't understand a passage, the passage is
| true and you're just missing something) the ideal mental action
| in a world full of bots is "default to null" (if you don't
| understand a passage, assume you're in the same epistemic state
| as if you'd never read it at all.)
|
| [0] https://www.skynettoday.com/editorials/humans-not-
| concentrat...
| shanusmagnus wrote:
| I wonder whether the "default to humility" heuristic does
| more harm than good on net, because the people who heed it
| probably shouldn't, and the ones who should won't.
| leereeves wrote:
| > there's some nonsense interspersed here, e.g. the claim that
| R has "a more mature package universe" than Python.
|
| As a programmer, I find R hard to use and not very well
| designed, so I can see why you'd call that nonsense.
|
| But when I was a math student, I found that in some ways R does
| have "a more mature package universe". There were many math
| algorithms that I could find packages for in R and not in
| Python, even as a mere grad student.
| setgree wrote:
| Absolutely, for statistics and visualization I think R and
| its packages are (sometimes) superior. But GPT responses
| don't generally offer those kinds of nuances; the claim is
| that the packages are "more mature," period. And it's for
| good reason that the _most_ mature Python packages, e.g.
| numpy and pandas, are used by data scientists in production
| pretty much everywhere.
| ux-app wrote:
| amazingly, your comment will eventually be added to the
| CHatGPT corpus and at some point down the line may be used
| to add the nuance that's currently lacking :)
| hbn wrote:
| Amazing, an AI that is incapable of picking up on jokes or
| sarcasm!
| DanielBMarkham wrote:
| Related question: I've written probably a million words over my
| lifetime.
|
| Is there an easy way to load up GPT with my thoughts to have it
| be a fake me?
| mittermayr wrote:
| This I think would be a great little SaaS idea to make some
| money. I keep seeing more and more people asking how they can
| transform their data into an interactive archive that responds
| as chat, or with voice.
| flappysolutions wrote:
| I'm quite pleasantly surprised by this response:
|
| Question: what are the biggest fallacies of people that post on
| hackernews?
|
| Answer: One of the biggest fallacies of people that post on
| Hacker News is the nirvana fallacy, which is when someone jumps
| from "thing x is imperfect" to "thing x is worthless" without
| considering any of the benefits. People also tend to be familiar
| with certain technologies and will discount the flaws of existing
| technology vs. the flaws of new technology. Another fallacy is
| the assumption that there is one best tool for doing things - the
| one that supports the largest amount of use cases. Additionally,
| people may use appeals to authority, invent their own solutions
| rather than use existing ones, and assume that the approaches
| they first learned are most correct. Finally, people may assume
| open source software is a gift and that dependencies don't need
| to be tested.
| ducktective wrote:
| It kinda enumerated all possible characteristics of people and
| their approaches to technical issues, not something unique to
| HN.
| egypturnash wrote:
| Is there a way to opt out of one's comments being used for this?
| olivierduval wrote:
| I agree: when I signed up, I never agreed to let anybody use
| what I write for anything they want! I only agreed to let
| everybody read, understand, and interact with what I wrote.
|
| Actually, it makes me feel as bad as learning that CAPTCHAs
| were used to train image recognition models...
|
| I think it could be a good time to reconsider the question of
| consent. I may agree that my words are used to train some
| AI... but 1) I must be asked (kindly) first and 2) it won't be
| free!!! (it may be paid to me or to a service provider like
| HN... but it's NOT unpaid work ;-) )
| [deleted]
| Minor49er wrote:
| Just post a healthy amount of random nonsense along with any of
| your actual posts to dilute the effects
|
| Banana Sebastian housewares fly swimmingly under terrestrial
| Zruodroru'th Memphis Steve Jobs archipelagos
| triyambakam wrote:
| > Banana Sebastian housewares fly swimmingly under
| terrestrial Zruodroru'th Memphis Steve Jobs archipelagos
|
| It's actually more likely to require a bathtub to increase
| the volume of the reticulated lorries, so I really don't
| think a farmer's market is the ideal place.
| jdthedisciple wrote:
| Why would you want to? Genuinely wondering.
|
| I for one am oh so proud that my valuable ramblings contributed
| to this majestic machinery.
| andai wrote:
| If you're willing to pay for the retraining? ;)
| hombre_fatal wrote:
| Yes, don't post on online forums.
| samstave wrote:
| Thats how I decided to opt-out of reddit after 16 years.
| anaganisk wrote:
| Nah, it's not a big deal, it's not like Cambridge Analytica
| will happen again. They're just using your data to train AI.
| Who knows, maybe based on the way you comment, you'll get
| suggestions on which medication you need, or on whether it's
| time for the Red Bull/Starbucks coffee. Nah, all is good.
| Nothing bad will happen in allowing companies to scrape
| comments and build models. They're very ethical. In fact,
| people here are suddenly not so concerned that the model is not
| open. There is no oversight on how the data is being used. They
| are just proud to get answers from a text generator.
| olivierduval wrote:
| The BIG DEAL is not THAT specific instance but the fact that
| the ML crowd think it's OK to take everything without even
| asking permission
| [deleted]
| anaganisk wrote:
| I think I should've put an /s at the end. It's kind of
| strange that I see constant discussions here, and people
| harassing small apps/libraries because their error
| collection is not opt-in. The whole Audacity debacle. But
| data collection for training ML models is perfectly fine,
| because we sure do know how the companies who fund the
| research will get an ROI.
| m3kw9 wrote:
| I thought ChatGPT may have already used Hacker News (like Reddit) for training?
| helsontaveras18 wrote:
| Now that we have this bot to answer questions for us, I think we
| can all go home!
| have_faith wrote:
| It seems to write in the generic "style" of GPT, instead of in
| the style I would recognise as a HN poster. Is that because of
| something baked into how the training process works? It lacks a
| sort of casualness or air of superiority ;)
| clark-kent wrote:
| > ii. Compute embeddings and similarity and choose top K
| comments closest to question
|
| > iii. Put top matching comments into a prompt and ask GPT-3 to
| answer the question using the context
|
| It depends on the Prompt used to ask GPT the question. A prompt
| that instructs GPT to write like a HN poster should fix that.
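A style instruction like that is just a string prepended to the retrieved context. A minimal sketch; the wording and the `build_prompt` helper are illustrative, not the author's actual prompt:

```python
def build_prompt(question, comments):
    # comments: list of retrieved HN comment strings
    style = ("You are a seasoned Hacker News commenter: terse, "
             "opinionated, and fond of caveats.")
    context = "\n---\n".join(comments)
    return (f"{style}\n\nRelevant comments:\n{context}\n\n"
            f"Question: {question}\nAnswer in that style:")

prompt = build_prompt("Python or R?",
                      ["R has great stats packages.",
                       "pandas won in production."])
```

The resulting string is sent as the completion prompt; everything about the "voice" of the answer lives in that first instruction line.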
| sebzim4500 wrote:
| There was no training process, this is just running GPT with
| relevant HN comments as part of the prompt.
|
| If he wanted it to replicate that classic HN feel he would
| either have to extend the prompt with additional examples or,
| better yet, use finetuning.
|
| I guess he could also just randomly sprinkle in some terms like
| 'stochastic parrot' and find a way to shoehorn Tesla FSD into
| every conversation about AI.
| [deleted]
| rpastuszak wrote:
| Last year (pre the chatGPT bonanza) I was using GPT-3 to
| generate some content about attribution bias and the
| responses got much spicier once the prompt started including
| the typical HN poster lingo, like "10x developer":
|
| https://sonnet.io/posts/emotive-
| conjugation/#:~:text=I%27m%2...
|
| My conclusion was that you can use LLMs to automate and scale
| attribution bias.
|
| We did it guys!
| btbuildem wrote:
| > "AskHN" is a GPT-3 bot I trained on a corpus of over 6.5
| million Hacker News comments to represent the collective
| wisdom of the HN community in a single bot.
|
| First sentence of the first paragraph on OP's page
|
| EDIT: it's a bit misleading, further down they describe what
| looks like a semantic-search approach
| agolio wrote:
| Scroll a bit further down and you will see
|
| > 7. Put top matching content into a prompt and ask GPT-3
| to summarize
|
| > 8. Return summary along with direct links to comments
| back to Discord user
| btbuildem wrote:
| Ah got it. Perhaps they should edit the intro then, it's
| misleading.
| stnmtn wrote:
| I agree, that language could be much improved. This is
| not a GPT-like LLM whose training corpus is HN comments,
| which I found to be an extremely interesting idea.
| Instead, it looks like it finds relevant HN threads and
| tells GPT-3 (the existing model) to summarize them.
|
| To be clear, I think this is still very cool, just
| misleading.
| agolio wrote:
| Soon we will see language style transfer vectors, akin to
| the image style transfer at the peak of the ML craze 5-10
| years ago -- so you will be able to take a HN snark
| vector and apply it to regular text, you heard it here
| first ;)
| cookie_monsta wrote:
| There also needs to be at least one question mark at the end of
| a statement?
| britzkopf wrote:
| To truly capture the HN experience, the user should provide a
| parameter for the number of "well actually"'s they want to
| receive. So initial response should demonstrate clear expertise
| and make a great concise point in response to question, and
| then start the cascade of silly nitpicking.
| bradwood wrote:
| I think you'll find "I think you'll find" trumps "well
| actually".
|
| ;)
| vidarh wrote:
| I wish the results were reversed, so I could "well
| actually" your comment, but 'site:news.ycombinator.com
| "well actually"' gives ca. 4k results in Google and
| 'site:news.ycombinator.com "I think you'll find"' gives
| close to 17k results, so you appear to be right.
| actually_a_dog wrote:
| Well, "it turns out that" beats both, with about 26k
| results ;)
| Jimmc414 wrote:
| site:news.ycombinator.com "in my experience" 120K results
| ysavir wrote:
| I am mildly disappointed that none of the phrase pitches
| in this thread were phrased with the given pitch.
| genericone wrote:
| IANAL: unfortunately only 10.6k results, thought I had a
| winner for a second.
| reacharavindh wrote:
| Now that you say it, it will train itself for it while it
| learns from your comments ;-)
| einpoklum wrote:
| > The methodology I used here is a generic, scalable solution for
| distilling a knowledge corpus into an embodied intelligence
|
| The methodology used here is a generic solution for distilling
| a non-generic corpus of utterances into a generic platitude
| machine.
| MikeTheRocker wrote:
| I love this! I used to append "reddit" to my Google search
| queries to get best results, but the quality of dialog over there
| has really dropped in recent years. These days I've switched to
| appending "hackernews", but this is even better.
| leobg wrote:
| Same. I have "site:news.ycombinator.com" as a keyboard shortcut
| on my phone. Use it all the time.
| Cognitron wrote:
| [dead]
| ada1981 wrote:
| Hmm. I thought perhaps he was going to take the questions from
| Ask HN and the top upvoted comments and fine-tune a model with
| those as the prompt/reply pairs.
|
| Curious how that would differ; but it would be an expensive
| endeavour.
| btbuildem wrote:
| My own experiments made me think that the impact of finetuning is
| comparable to that of a molecule in a drop in a bucket.
|
| > "AskHN" is a GPT-3 bot I trained on a corpus of over 6.5
| million Hacker News comments to represent the collective wisdom
| of the HN community in a single bot.
|
| I'm assuming you used the openai fine-tuning pathway to make a
| custom model?
|
| Have you tested the responses on vanilla GPT3 vs your custom
| model?
|
| I'd be curious to see the comparison.
| danuker wrote:
| Yeah, to me it looks like the learning rate was way too low to
| make a difference.
|
| I don't see any of the sublime and succinct snark.
| notahacker wrote:
| Yeah. Also full of GPT-3isms like "ultimately the choice ...
| comes down to the specific project and its ... requirements"
| and not nearly contrarian enough
|
| A bot focused on the output of HNers would insist on
| providing arguments against going through Google's interview
| process in the first place and suggestions that the correct
| answer to "Python or R" should be Haskell or Julia and would
| never suggest prioritising emotional vulnerability or being a
| happy person!
| danuker wrote:
| Thank you for the laffs =)
| clark-kent wrote:
| From the article, they did not use fine-tuning. This is
| semantic search + GPT-3 to provide human-like answers.
| btbuildem wrote:
| Thanks! I missed that part.
|
| The semantic search approach seems to focus the answers
| better than fine-tuning; at the cost of preloading the prompt
| with a lot of tokens, but with the benefit of a more
| constrained response.
| renewiltord wrote:
| ChatGPT and friends always talk like those Microsoft and Apple
| forum responders with 100k reputation.
|
| I see that you are asking about "How to get a job at Google". I
| will help you with "How to get a job at Google". In order to
| solve the problem of "How to get a job at Google" please follow
| the following steps first:
|
| - rewrite your resume in Google Docs
|
| - reinstall Chrome
|
| - apply to the job
|
| Let me know if I can help further with "How to get a job at
| Google". I like using it, but I have to tune my prompts to make
| sure that they don't bullshit me before getting to the point.
| pknerd wrote:
| Can anyone point me to some tutorials on using the GPT-3 model
| on a custom dataset? I am a Python programmer.
| dalmo3 wrote:
| This is nice! The official algolia search is useless.
|
| Otoh, did I miss something or is it only on discord?
| cactusplant7374 wrote:
| I really like Algolia. I usually use it to see if a particular
| link has been submitted. Other times I use it to find relevant
| comments or posts.
| adversaryIdiot wrote:
| I didn't know the API supported downloading its whole database.
| Are you the reason HN has sporadic downtime lately? ;)
| la64710 wrote:
| Is there an LLM that can be self-hosted and fed a corpus
| of data to ingest for question answering? The part I find
| difficult is how to feed (not train) an open LLM with an
| entire dataset that isn't available to the public.
| bayan1234 wrote:
| The hack to solve this is to embed each paragraph in your large
| corpus. Find paragraphs most similar to the user query using
| embeddings. Put the paragraphs and the raw user query into a
| prompt template. Send the final generated prompt to gpt3.
|
| This actually works surprisingly well.
|
| Check out the OpenAI cookbook for examples.
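The hack described above can be sketched end to end. Everything below is a toy stand-in: `toy_embed` replaces the real embedding endpoint (a keyword-count vector instead of a learned 1536-d one), and the final string would be sent to GPT-3:

```python
def chunk_paragraphs(corpus):
    # split a large document into paragraph-sized chunks for embedding
    return [p.strip() for p in corpus.split("\n\n") if p.strip()]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def toy_embed(text):
    # stand-in for a real embedding endpoint (e.g. ada-002):
    # counts of a few keywords instead of a learned dense vector
    vocab = ["gpt", "vector", "prompt"]
    t = text.lower()
    return [t.count(w) for w in vocab]

def retrieve(query, chunks, embed, k=3):
    # rank chunks by similarity to the query (dot product for brevity)
    qv = embed(query)
    return sorted(chunks, key=lambda c: dot(qv, embed(c)), reverse=True)[:k]

TEMPLATE = """Use the context below to answer the question.
Context:
{context}

Question: {question}
Answer:"""

corpus = ("GPT is a language model.\n\n"
          "Vectors live in a vector database.\n\n"
          "Cooking pasta takes ten minutes.")
chunks = chunk_paragraphs(corpus)
question = "how do vector databases work"
prompt = TEMPLATE.format(
    context="\n\n".join(retrieve(question, chunks, toy_embed, k=1)),
    question=question)
```

With a real embedding model the same structure applies; only `toy_embed` and the final completion call change.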
| bilekas wrote:
| "He only went and did it... " !
| georgelyon wrote:
| Am I correct in understanding that this doesn't actually
| _generate_ answers based on HN, but instead finds semantically-
| near comments and sends them verbatim to GPT to summarize? Seems
| like a good enough hack, though I'd love a detailed writeup of
| how to actually specialize an existing LLM with additional
| training data (like HN).
| ilaksh wrote:
| Technically it does give a specific answer to the question, but
| it is based on the semantically similar comments (and the
| question).
|
| The thing people don't realize is that right now there is a
| very large gap between the capabilities of a few models
| including OpenAI's most recent ones, and most of the other
| LLMs. So there are several options for actually training or
| fine-tuning with open models, but actually none of them have
| the language understanding and generation capabilities at the
| level of those new OpenAI models.
|
| As far as I know.
| serjester wrote:
| Agreed, I think the better approach is to do some custom tuning
| but that becomes cost prohibitive very quickly. Not really much
| different than Algolia with a minor GPT-3 integration but neat
| project regardless.
| jerpint wrote:
| The summary itself is still generated, but has all the context
| to do summarization in the prompt.
|
| It's very difficult to otherwise finetune existing LLMs. GPT
| itself is closed-sourced, and doesn't allow for finetuning
| (except via an opaque API and with limited amounts of data).
| Other open models are either very difficult to load in memory
| and/or simply not as expressive as GPT
| redox99 wrote:
| You can literally finetune these OpenAI models using their API.
| In this case it probably wasn't done because the author found
| it too much work and/or too expensive.
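For reference, OpenAI's 2023-era fine-tuning flow takes a JSONL file of prompt/completion pairs, which you then upload via the CLI or API. A sketch of preparing that file; the example pairs are hypothetical Ask HN question/top-answer data, and the upload step is elided:

```python
import json

# Hypothetical question/top-answer pairs as training examples.
# The "\n\n###\n\n" separator and leading space in completions follow
# OpenAI's recommended formatting for prompt/completion fine-tunes.
pairs = [("How do I get a job at Google?\n\n###\n\n",
          " Practice algorithms and apply through a referral.\n"),
         ("Python or R for data science?\n\n###\n\n",
          " Python for production, R for statistics.\n")]

with open("train.jsonl", "w") as f:
    for prompt, completion in pairs:
        f.write(json.dumps({"prompt": prompt,
                            "completion": completion}) + "\n")
```

The resulting file is what `openai api fine_tunes.create -t train.jsonl` consumed at the time; cost scales with the number of training tokens, which is likely why the author avoided it.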
| osigurdson wrote:
| I have an experiment that uses the embeddings to visualize
| clusterings of HN comments (using t-SNE). Not super useful, but
| interesting to view the comments in 3D and see how similar
| ones cluster together into mostly relevant themes.
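For anyone wanting to try the same, a minimal sketch with scikit-learn; random vectors stand in here for real comment embeddings, and perplexity must stay below the sample count:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 1536))  # 50 fake ada-002-sized vectors

# project to 3-D for plotting; perplexity must be < number of samples
coords = TSNE(n_components=3, perplexity=10,
              random_state=0).fit_transform(embeddings)
```

`coords` is a (50, 3) array you can feed to any 3-D scatter plot; with real embeddings, semantically similar comments end up near each other.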
| LeoPanthera wrote:
| I'm a little surprised that Hacker News comments weren't already
| in the GPT-3 training set. I just assumed that OpenAI had
| vacuumed up most of the web already.
| retube wrote:
| I am guessing they already were? But this is 100% pure,
| concentrated HN not contaminated with nonsense from the rest of
| the web :)
| MuffinFlavored wrote:
| Is it exclusively HN comments and nothing else? How does a
| model like that know how to speak English (noun/verb and all
| that) if you are starting from scratch and feeding it nothing
| but HN comments?
| neoromantique wrote:
| I'm sorry to be THAT GUY, but it is addressed in the
| article :)
|
| >GPT embeddings
|
| To index these stories, I loaded up to 2000 tokens worth of
| comment text (ordered by score, max 2000 characters per
| comment) and the title of the article for each story and
| sent them to OpenAI's embedding endpoint, using the
| standard text-embedding-ada-002 model, this endpoint
| accepts bulk uploads and is fast but all 160k+ documents
| still took over two hours to create embeddings. Total cost
| for this part was around $70.
| nkozyra wrote:
| > How does a model like that know how to speak English
|
| Mimicry.
| gorbypark wrote:
| In a nutshell, this is using OpenAI's API to generate
| embeddings for top comments on HN, then also generating an
| embedding for the search term. It can then find the closest
| related comments for the given question by comparing the
| embeddings and then send the actual text to GPT3 to
| summarize. It's a pretty clever way to do it.
| nkozyra wrote:
| I have to assume that targeted/curated LLM training sets will
| have a tendency to be _less_ accurate than very general ones,
| just by the very nature of how they work.
|
| (edited for clarity)
| andai wrote:
| I know it's not quite analogous, but I fine-tuned GPT-3 on
| a small (200 examples) data set and it performed extremely
| poorly compared to the untrained version.
|
| This surprised me, I thought it wouldn't do much better,
| but I wasn't expecting that specializing it on my target
| data would reduce performance! I had fewer examples than
| the minimum OpenAI recommends, so maybe it was a case of
| overfitting or something like that.
| bityard wrote:
| If it's really trained exclusively off of HN comments, I
| expect most of the bot's responses will evade the actual
| question but spend several paragraphs debating the factual
| specifics of every possible related tangential point,
| followed by a thinly-veiled insult questioning the user's
| true motivations.
| jb1991 wrote:
| That had me laughing! Case in point, from a few days ago:
| https://news.ycombinator.com/item?id=34855372
| heleninboodler wrote:
| In no way does a typical HN comment debate _every possible_
| related tangential point. Do we expect a modicum of
| intellectual rigor? Yes. But to say every tangent is
| followed and scrutinized is simply factually untrue.
|
| And several paragraphs? I challenge you to show even a
| large minority of argumentative responses that veer into
| "several" paragraphs. You characterize this as "most of the
| ... responses" but I think that's unfair.
|
| One wonders why you'd resort to such hyperbole unless you
| were deliberately attempting to undermine the value of the
| site.
| GreenWatermelon wrote:
| This is my favorite type of humour.
| [deleted]
| Aromasin wrote:
| If you're not arguing over the semantics, rather than OP's
| clear-enough intent, are you really on HN?
___________________________________________________________________
(page generated 2023-02-22 23:00 UTC)