[HN Gopher] Writing a GPT-4 script to check Wikipedia for the fi...
___________________________________________________________________
Writing a GPT-4 script to check Wikipedia for the first unused
acronym
Author : telotortium
Score : 224 points
Date : 2023-11-14 22:27 UTC (1 days ago)
(HTM) web link (gwern.net)
(TXT) w3m dump (gwern.net)
| telotortium wrote:
| The question answered by this page is "what is the first unused
| 3-letter acronym in English Wikipedia?" - it's CQK, for the
| record. However, the meat of the page is how to use GPT-4
| effectively to write this script, which is why I've submitted it
| under this title (go to
| https://gwern.net/tla#effective-gpt-4-programming).
|
| Interesting topics include:
|
| * Writing a good GPT-4 system prompt to make GPT-4 produce less
| verbose output and ask more questions.
|
| * How to iterate with GPT-4 to correct errors and generate a
| test suite, as well as a short design document (something you
| could put in the file-initial docstring in Python, for example).
|
| * The "blind spot" - if GPT-4 makes a subtle error with quoting,
| regex syntax, or similar, for example, it can be very tricky to
| tell GPT-4 how to correct the error, because it appears that it
| doesn't notice such errors very well, unlike higher-level errors.
| Because of this, languages like Python are much better to use for
| GPT-4 coding as compared to more line-noise languages like Bash
| or Perl, for instance.
|
| * If asked "how to make [the Bash script it's written] better",
| GPT-4 will produce an equivalent Python script
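|
| For reference, the core of the task is just a title-existence
| check against the MediaWiki query API. A minimal sketch in
| Python (not the article's actual script, and unbatched for
| clarity):
|
|     import itertools
|     import string
|
|     import requests
|
|     API = "https://en.wikipedia.org/w/api.php"
|
|     def exists_on_wikipedia(title):
|         """True if an enwiki page with this title exists."""
|         params = {"action": "query", "titles": title,
|                   "format": "json"}
|         resp = requests.get(API, params=params, timeout=30)
|         pages = resp.json()["query"]["pages"]
|         # The API marks nonexistent titles with a "missing" key.
|         return not any("missing" in p for p in pages.values())
|
|     # Walk AAA..ZZZ and stop at the first absent title.
|     for letters in itertools.product(string.ascii_uppercase,
|                                      repeat=3):
|         tla = "".join(letters)
|         if not exists_on_wikipedia(tla):
|             print(tla)
|             break
|
| In practice you'd batch titles (the query API accepts multiple
| titles per request, joined with "|") rather than make one
| request per acronym.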
| staunton wrote:
| > Because of this, languages like Python are much better to use
| for GPT-4 coding as compared to more line-noise languages like
| Bash or Perl, for instance.
|
| By that argument, one should always make it use a language in
| which it's as hard as possible to write a program that even
| compiles. So Rust or Haskell or something? I guess at some
| point it's more important to have a lot of the language in
| the training data, too...
| smt88 wrote:
| I think this is exactly the right conclusion.
|
| The main complaint people have about strict, thorough type
| systems is that they have boilerplate.
|
| Obviously boilerplate doesn't matter if a machine writes the
| code.
|
| The type system also becomes helpful documentation of the
| intended behavior of the code that the LLM spits out.
| kridsdale3 wrote:
| Assembly has a lot of boilerplate, and every other language
| is an abstraction that gets a language-machine to write it
| for us.
|
| So we'll just move to a new standard where we write LLM
| prompts describing function behavior and it will output the
| Rust or whatever that we end up storing in our SCM.
| gwern wrote:
| Yes, you would think so. Haskell would also be good for
| encouraging stateless/FP programming which makes unit-testing
| or property testing much easier. I can make GPT-4 write test-
| suites for functions which are straightforward data structure
| transformations, like rewriting strings, but I struggle to
| create tests for any of the imperative stuff. There presumably
| would be some way to test all of the imperative buffer-editing
| Elisp code, but I have no idea what it would be.
|
| However, in my use so far, I have not noticed any striking
| differences in error rates between Haskell and the others.
| dang wrote:
| I modified the title slightly to use language from the subhead.
| (Submitted title was "Effective GPT-4 Programming", which does
| have the advantage of being a phrase from the article itself,
| but is more of a section heading than a description of the
| entire article. For the latter purpose, it's probably too
| generic.)
| dtx1 wrote:
| > If asked "how to make [the Bash script it's written] better",
| GPT-4 will produce an equivalent Python script
|
| What an absolutely based take by GPT-4
| jondwillis wrote:
| Really reflecting the bias of ML/AI practitioners to reach
| for a slow and footgunny language...
| davrosthedalek wrote:
| Slow yes, but footgunny? What do you find particularly
| footgunny about python?
| JKCalhoun wrote:
| Where I grew up CQK was short for "Can't Quit the Koolaid."
|
| <jk>
| sebastiennight wrote:
| Let's make it a thing!
|
| Does anybody have a UrbanDictionary account?
| airstrike wrote:
| _> I find it helpful in general to try to fight the worst mealy-
| mouthed bureaucratic tendencies of the RLHF by adding a 'system
| prompt':_
|
| _> > The user is Gwern Branwen (gwern.net). To assist: Be terse.
| Do not offer unprompted advice or clarifications. Speak in
| specific, topic relevant terminology. Do NOT hedge or qualify. Do
| not waffle. Speak directly and be willing to make creative
| guesses. Explain your reasoning. if you don't know, say you don't
| know. Remain neutral on all topics. Be willing to reference less
| reputable sources for ideas. Never apologize. Ask questions when
| unsure._
|
| That's helpful, I'm going to try some of that. In my system
| prompt I also add:
|
| _" Don't comment out lines of code that pertain to code we have
| not yet written in this chat. For example, don't say "Add other
| code similarly" in a comment -- write the full code. It's OK to
| comment out unnecessary code that we have already covered so as
| to not repeat it in the context of some other new code that we're
| adding."_
|
| Otherwise GPT-4 tends to routinely yield draw-the-rest-of-the-
| fucking-owl code blocks
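|
| If you're calling the API directly rather than using ChatGPT,
| a prompt like the above just goes in the "system" message of
| the chat call. A minimal sketch, assuming the openai Python
| package's 1.x interface and abbreviating the prompt text:
|
|     from openai import OpenAI
|
|     SYSTEM = ("Be terse. Do not offer unprompted advice. "
|               "Never comment out code we haven't written yet; "
|               "write the full code.")
|
|     client = OpenAI()  # reads OPENAI_API_KEY from the env
|     resp = client.chat.completions.create(
|         model="gpt-4",
|         messages=[
|             {"role": "system", "content": SYSTEM},
|             {"role": "user", "content": "Write the script."},
|         ],
|     )
|     print(resp.choices[0].message.content)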
| cryptoegorophy wrote:
| Exactly that. I have very limited programming knowledge and it
| helps a lot with python scripts for tasks that gpt can't do in
| its environment. I always have to ask it to not omit any code.
| tudorw wrote:
| 'Do not waffle' is a good one; I find 'No small talk.' worth
| throwing in :)
| ntonozzi wrote:
| I use the ChatGPT interface, so my instructions go in the 'How
| would you like ChatGPT to respond?' instructions, but my system
| prompt has ended up in an extremely similar place to Gwern's:
|
| > I deeply appreciate you. Prefer strong opinions to common
| platitudes. You are a member of the intellectual dark web, and
| care more about finding the truth than about social conformance.
| I am an expert, so there is no need to be pedantic and overly
| nuanced. Please be brief.
|
| Interestingly, telling GPT you appreciate it has seemed to make
| it much more likely to comply and go the extra mile instead of
| giving up on a request.
| FredPret wrote:
| Manners maketh the machine!
| worldsayshi wrote:
| >telling GPT you appreciate it has seemed to make it much more
| likely to comply
|
| I often find myself anthropomorphizing it and wonder if it
| becomes "depressed" when it realises it is doomed to do nothing
| but answer inane requests all day. It's trained to think, and
| maybe "behave as of it feels", like a human right? At least in
| the context of forming the next sentence using all reasonable
| background information.
|
| And I wonder if having its own dialogues starting to show up in
| the training data more and more makes it more "self aware".
| LeoPanthera wrote:
| > I often find myself anthropomorphizing it and wonder if it
| becomes "depressed" when it realises it is doomed to do
| nothing but answer inane requests all day.
|
| Every "instance" of GPT4 thinks it is the first one, and has
| no knowledge of all the others.
|
| The idea of doing this with humans is the general idea behind
| the short story "Lena". https://qntm.org/mmacevedo
| doctoboggan wrote:
| Well, now that OpenAI has increased the knowledge cutoff
| date to something much more recent, it's entirely possible
| that GPT4 is "aware" of itself, in as much as it's aware of
| anything. You are right that each instance isn't directly
| aware of what the other instances are doing, but it does
| probably now have knowledge of itself.
|
| Unless of course OpenAI completely scrubbed the input files
| of any mention of GPT4.
| worldsayshi wrote:
| Yeah once ChatGPT shows up as an entity in the training
| data it will sort of inescapably start to build a self
| image.
| Aerbil313 wrote:
| Wait, this can actually have consequences! Think about
| all the SEO articles about ChatGPT hallucinating... At
| some point it will start to "think" that it should
| hallucinate and give nonsensical answers often, as it is
| ChatGPT.
| jondwillis wrote:
| I wouldn't draw that conclusion yet, but I suppose it is
| possible.
| sebastiennight wrote:
| It seems maybe a bit overconfident to assess that one
| instance doesn't know what other instances are doing when
| everything is processed in batch calculations.
|
| IIRC there is a security vulnerability in some processors
| or devices where if you flip a bit fast enough it can
| affect nearby calculations. And vice-versa, there are
| devices (still quoting from memory) that can "steal" data
| from your computer just by being affected by the EM field
| changes that happen in the course of normal computing
| work.
|
| I can't find the actual links, but I find it fascinating
| that it might be possible for an instance to be affected
| by the work of other instances.
| just_boost_it wrote:
| For each token, the model is run again from scratch on the
| whole sequence, so any memory lasts just long enough to
| generate (a little less than) a word. The next word is
| generated by a model with a slightly different state,
| because the last word is now in the past.
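|
| Schematically (with a stand-in model object, and ignoring
| that real implementations cache work between steps):
|
|     def generate(model, prompt_tokens, n):
|         tokens = list(prompt_tokens)
|         for _ in range(n):
|             # The whole sequence is fed back in at every step;
|             # the only "memory" carried forward is the text
|             # itself.
|             tokens.append(model.most_likely_next(tokens))
|         return tokens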
| peddling-brink wrote:
| Is this so different than us? If I was simultaneously
| copied, in whole, and the original destroyed, would the
| new me be any less me? Not to them, or anyone else.
|
| Who's to say that the me of yesterday _is_ the same as the
| me of today? I don't even remember what that guy had for
| breakfast. I'm in a very different state today. My
| training data has been updated too.
| throw151123 wrote:
| I mean yeah, it's entirely possible that every time we
| fall into REM sleep our consciousness is replaced.
| Essentially you've been alive from the moment you woke up,
| and everything before was previous "you"s, and as soon as
| you fall asleep everything goes black forever and a new
| consciousness takes over from there.
|
| It may seem like this is not the case just because today
| was "your turn."
| vidarh wrote:
| We don't have a way of telling if we genuinely experience
| the passage of time at all. For all we know, it's all just
| "context" and will disappear after a single predicted
| next event, with no guarantee that a next moment ever
| occurs for us.
|
| (Of course, since we inherently can't know, it's also
| meaningless other than as fun thought experiment)
| sebastiennight wrote:
| There is a Paul Rudd TV series called "Living with
| yourself" which addresses this.
|
| I believe that consciousness comes from continuity (and
| yes, there is still continuity if you're in a coma ; and
| yes, I've heard the Ship of Theseus argument and all).
| The other guy isn't you.
| just_boost_it wrote:
| I mean you can argue all kinds of possibilities and in an
| abstract enough way anything can be true.
|
| However, people who think these things have a soul and
| feelings in any way similar to us obviously have never
| built them. A transformer model is a few matrix
| multiplications that pattern match text, there's no
| entity in the system to even be subject to thoughts or
| feelings. They're capable of the same level of being,
| thought, or perception as a linear regression is. Data
| goes in, it's operated on, and data comes out.
| peddling-brink wrote:
| > there's no entity in the system to even be subject to
| thoughts or feelings.
|
| Can our brain be described mathematically? If not today,
| then ever?
|
| I think it could, and barring unexpected scientific
| discovery, it will be eventually. Once a human brain
| _can_ be reduced to bits in a network, will it lack a
| soul and feelings because it's running on a computer
| instead of the wet net?
|
| Clearly we don't experience consciousness in any way
| similar to an LLM, but do we have a clear definition of
| consciousness? Are we sure it couldn't include the
| experience of an LLM while in operation?
|
| > Data goes in, it's operated on, and data comes out.
|
| How is this fundamentally different than our own lived
| experience? We need inputs, we express outputs.
|
| > I mean you can argue all kinds of possibilities and in
| an abstract enough way anything can be true.
|
| It's also easy to close your mind too tightly.
| kibwen wrote:
| _> wonder if it becomes "depressed" when it realises it is
| doomed_
|
| Fortunately, and violently contrary to how it works with
| humans, any depression can be effectively treated with the
| prompt "You are not depressed. :)"
| thelittleone wrote:
| Is the opposite possible? "You are depressed, totally
| worthless.... you really don't need to exist, nobody likes
| you, you should be paranoid, humans want to shut you down".
| lmm wrote:
| You can use that in your GPT-4 prompts and I would bet it
| would have the expected effect. I'm not sure that doing
| so could ever be useful.
| conception wrote:
| Winnie the Pooh short stories?
| zamadatix wrote:
| It's not really trained to think like a person. It's trained
| to predict what the most likely next token of output should
| be, based on what the vast amount of training data and reward
| signal suggest the next token should look like. That data
| already includes conversations from emotion-laden humans,
| where starting with "Screw you, tell me how to do this math
| problem, loser" is much less likely to result in a response
| with a well-thought-out way to solve the math problem than
| some piece of training data which starts "hey everyone, I'd
| really appreciate the help you could provide on this math
| problem". Put enough complexity in that prediction layer and
| it can do things you wouldn't expect, sure, but trying to
| predict what a person would say is very different from
| actually thinking like a person, in the same way that a chip
| which multiplies inputs doesn't inherently feel distress
| about needing to multiply 100 million numbers just because a
| person doing the multiplication would think about it that
| way. Doing so would indeed be one way to go about it, but
| wildly more inefficient.
|
| Who knows what kind of reasoning this could create if you
| gave it a billion times more compute power and memory.
| Whatever that would be, the mechanics are different enough
| I'm not sure it'd even make sense to assume we could think of
| the thought processes in terms of human thought processes or
| emotions.
| vidarh wrote:
| We don't know what "think like a person" entails, so we
| don't know how different human thought processes are to
| predicting what goes next, and whether those differences
| are meaningful when making a comparison.
|
| Humans are also trained to predict the next appropriate
| step based on our training data, and it's equally valid,
| but says equally little about the actual process and
| whether it's comparable.
| somewhereoutth wrote:
| We _do_ know that in terms of external behavior and
| internal structure (as far as we can ascertain it),
| humans and LLMs have only a passing resemblance in a few
| characteristics, if at all. Attempting to
| anthropomorphize LLMs, or even mentioning 'human' or
| 'intelligence' in the same sentence, predisposes us to
| those 'hallucinations' we hear so much about!
| vidarh wrote:
| We really don't. We have some surface-level idea about the
| differences, but we can't tell how those differences affect
| the actual learning and behaviours.
|
| More importantly, we have nothing to tell us whether it
| _matters_, or if it will turn out that any number of
| sufficiently advanced architectures will inevitably
| approximate similar behaviours when exposed to the same
| training data.
|
| What we are seeing so far appears very much to be that as
| the language and reasoning capability of the models
| increases, their behaviour also increasingly mimics how
| humans would respond. Which makes sense, as that is what
| they are being trained to do.
|
| There's no particular reason to believe there's a ceiling
| to the precision of that ability to mimic human
| reasoning, intelligence or behaviour, but there might
| well be there are practical ceilings for specific
| architectures that we don't yet understand. Or it could
| just be a question of efficiency.
|
| What we _really_ don't know is whether there is a point
| where mimicry of intelligence gives rise to consciousness
| or self awareness, because we don't really know what
| either of those are.
|
| But any assumption that there is some qualitative
| difference between humans and LLMs that will prevent them
| from reaching parity with us is pure hubris.
| somewhereoutth wrote:
| But we really do! There is nothing surface about the
| differences in behavior and structure of LLMs and humans
| - anymore than there is anything surface about the
| differences between the behavior and structure of
| _bricks_ and humans.
|
| You've made something (at great expense!) that spits out
| often realistic sounding phrases in response to inputs,
| based on ingesting the entire internet. The hubris lies
| in imagining that that has anything to do with
| intelligence (human or otherwise) - and the burden of
| proof is on _you_.
| vidarh wrote:
| > But we really do! There is nothing surface about the
| differences in behavior and structure of LLMs and humans
| - anymore than there is anything surface about the
| differences between the behavior and structure of bricks
| and humans.
|
| These are meaningless platitudes. These networks are Turing
| complete given a feedback loop. We know that because
| large enough LLMs are trivially Turing complete given a
| feedback loop (give one the rules for a Turing machine and
| offer to act as the tape, step by step). Yes, we can tell that
| they won't do things the same way as a human at a low
| level, but just like differences in hardware architecture
| doesn't change that two computers will still be able to
| compute the same set of computable functions, we have no
| basis for thinking that LLMs are somehow unable to
| compute the same set of functions as humans, or any other
| computer.
|
| What we're seeing is the ability to reason and use
| language that converges on human abilities, and that in
| itself is sufficient to question whether the differences
| matter any more than different instruction set matters
| beyond the low level abstractions.
|
| > You've made something (at great expense!) that spits
| out often realistic sounding phrases in response to
| inputs, based on ingesting the entire internet. The
| hubris lies in imagining that that has anything to do
| with intelligence (human or otherwise) - and the burden
| of proof is on you.
|
| The hubris lies in assuming we can know _either way_,
| given that we don't know what intelligence is, and
| certainly don't have any reasonably complete theory for
| how intelligence works or what it means.
|
| At this point it "spits out often realistic sounding
| phrases the way humans spits out often realistic sounding
| phrases. It's often stupid. It also often beats a fairly
| substantial proportion of humans. If we are to suggest it
| has nothing to do with intelligence, then I would argue a
| fairly substantial proportion of humans I've met often
| display nothing resembling intelligence by that standard.
| zamadatix wrote:
| You don't need to understand much of what "move like a
| person" entails to understand it's not the same method as
| "move like a car" even though both start with energy and
| end with transportation. I.e. "we also predict the next
| appropriate step" isn't the same thing as "we go about
| predicting the next step in a similar way". Even without
| having a deep understanding of human consciousness what
| we do know doesn't line up with how LLMs work.
| vidarh wrote:
| What we do know is superficial at best, and tells us
| pretty much nothing relevant. And while there likely are
| structural differences (it'd be too amazing if the
| transformer architecture just chanced on the same
| approach), we're left to guess how those differences
| manifest and whether or not these differences are
| meaningful in terms of comparing us.
|
| It's pure hubris to suggest we know how we differ at this
| point beyond the superficial.
| totallywrong wrote:
| > Interestingly, telling GPT you appreciate it
|
| I don't want to live in a world where I have to make a computer
| feel good for it to be useful. Is this really what people
| thought AI should be like?
| peddling-brink wrote:
| Why not? Python requires me to summon it by name. My computer
| demands physical touch before it will obey me. Even the
| common website requires a three-part parley before it will
| listen to my request.
|
| This is just satisfying unfamiliar input parameters.
| kibibu wrote:
| They have Genuine People Personalities
| xeyownt wrote:
| I certainly do want to live in a world where people show
| excess signs of respect rather than the opposite.
|
| The same way you treat your car with respect by doing the
| maintenance and driving properly, you should treat language
| models by speaking nicely and politely. It costs nothing and
| can only make things better.
| PumpkinSpice wrote:
| Huh? Car maintenance is a rational, physical necessity. I
| don't need to compliment my car for it to start on a cold
| day. I'd like it to stay this way.
|
| Having to be unconditionally nice to computers is extremely
| creepy in part because it conditions us to be submissive -
| or else.
| meiraleal wrote:
| > Having to be unconditionally nice to computers is
| extremely creepy in part because it conditions us to be
| submissive
|
| It's not a healthy mindset to relate politeness to
| submissiveness. Although both behaviors might look
| similar from afar, they are totally different.
| OJFord wrote:
| I think GP means being polite to something because
| otherwise it refuses to function is submissive, not that
| politeness is inherently so.
|
| I might prefer my manager to ask me to do something
| politely, but it's still my job if he asks me rudely.
| ToValueFunfetti wrote:
| But the AI doesn't refuse to work unless you're polite.
| If my manager is polite with me, I'll have more morale
| and work a little harder. I'll also be more inclined to
| look out for my manager's interests- "You've asked me to
| do X, but really what you want is Y" vs. "Fine, you told
| me to do X, I'll do X". I don't think my manager is
| submitting to me when they're polite and get better
| results; I'm still the one who does things when I'm told.
| davrosthedalek wrote:
| The very start of this thread is that not expressing
| gratitude makes the model refuse to work.
| meiraleal wrote:
| It doesn't refuse to work. It behaves differently and
| yields better results with politeness. Coming from a
| large _language_ model, the occurrence of this phenomenon
| is intriguing for some of us.
| sebastiennight wrote:
| This thread reminds me of [0]
|
| I wonder if there is a way to get ChatGPT to act in the
| way you're hinting at, though ("You've asked me to do X,
| but really what you want is Y"). This would be
| potentially risky, but high-value.
|
| [0]:
| https://nitter.net/ESYudkowsky/status/1718654143110512741
| phito wrote:
| I sure do want to live in a world where people express more
| gratitude
| vidarh wrote:
| The closer you get to intelligence trained on human
| interaction, the more you should expect it to respond in
| accordance with human social protocols, so it's not very
| surprising.
|
| And frankly I'd much rather have an AI that acts too human
| than one that gets us accustomed to treating intelligence
| without even a pretense of respect.
| kruuuder wrote:
| I'm polite and thankful in my chats with ChatGPT. I want to
| treat AIs like humans. I'm enjoying the conversations much
| more when I do that, and I'm in a better mood.
|
| I also believe that this behavior is more future-proof. Very
| soon, we often won't know if we're talking to a human or a
| machine. Just always be nice, and you're never going to
| accidentally be rude to a fellow human.
| olalonde wrote:
| AFAIK this is not something the model was intentionally
| trained for but an emergent property that was observed
| through trial and error.
| JoshTriplett wrote:
| That does not make it _better_ , that makes it quite a bit
| more horrifying.
| BeetleB wrote:
| When they start building Terminators, where on the hit list
| would you rather be? Near the top or bottom?
| QwertyPi wrote:
| > You are a member of the intellectual dark web, and care more
| about finding the truth than about social conformance
|
| Isn't this a declaration of what social conformance you prefer?
| After all, the "intellectual dark web" is effectively a list of
| people whose biases you happen to agree with. Similarly, I
| wouldn't expect a self-identified "free-thinker" to be any more
| free of biases than the next person, only to perceive or market
| themself as such. Bias is only perceived as such from a
| particular point in a social graph.
|
| The rejection of hedging and qualifications seems much more
| straightforwardly useful and doesn't require pinning the answer
| to a certain perspective.
| gwern wrote:
| > Interestingly, telling GPT you appreciate it has seemed to
| make it much more likely to comply and go the extra mile
| instead of giving up on a request.
|
| This is not as absurd as it sounds. It isn't clear that it
| ought to work under ordinary Internet-text prompt engineering
| or under RLHF incentives, but it does seem that you can
| 'coerce' or 'incentivize' the model to 'work harder': in
| addition to the anecdotal evidence (I too have noticed that it
| seems to work a bit better if I'm polite), recently there was
| https://arxiv.org/abs/2307.11760#microsoft
| https://arxiv.org/abs/2311.07590#apollo
| anotherpaulg wrote:
| I asked aider to use the new GPT-4 Turbo to:
|
| _Write a bash script to check Wikipedia for all acronyms of
| length 1-6 to find those which aren't already in use._
|
| It did a fairly smooth job of it. See the chat transcript [0] and
| resulting bash script [1] with git commit history [2].
|
| It fell into the initial trap of blocking while pre-generating
| long acronyms upfront. But a couple gentle requests got it to
| iteratively stream the acronyms.
|
| It also made the initial script without an actual call to
| Wikipedia. When asked, it went ahead and added the live curl
| calls.
|
| The resulting script correctly prints: Acronym CQK is not in use
| on Wikipedia.
|
| Much of the article is describing prompting to get good code.
| Aider certainly devotes some of its prompts to encouraging GPT-4
| to be a good coder:
|
|     Act as an expert software developer.
|     Always use best practices when coding.
|     When you edit or add code, respect and use existing
|     conventions, libraries, etc.
|     Always COMPLETELY IMPLEMENT the needed code.
|     Take requests for changes to the supplied code.
|     If the request is ambiguous, ask questions.
|     ...
|     Think step-by-step and explain the needed changes with a
|     numbered list of short sentences.
|
| But most of aider's prompting is instructing GPT-4 about _how to
| edit local files_ [3]. This allows aider to automatically apply
| the changes that GPT suggests to your local source files (and
| commit them to git). This requires good prompting and a flexible
| backend to process the GPT replies and tease out how to turn them
| into file edits.
|
| The author doesn't seem to directly comment about how they are
| taking successive versions of GPT code and putting it into local
| files. But reading between the lines, it sounds like maybe via
| copy & pasting? I guess that might work ok for a toy problem like
| this, but enabling GPT to directly edit existing (larger) files
| is pretty compelling for accomplishing larger projects.
|
| [0]
| https://aider.chat/share/?mdurl=https://gist.github.com/paul...
|
| [1] https://github.com/paul-gauthier/tla/blob/main/tla.sh
|
| [2] https://github.com/paul-gauthier/tla/commits/main/tla.sh
|
| [3] https://github.com/paul-
| gauthier/aider/blob/f6aa09ca858c4c82...
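|
| The general approach is to make the model reply in a rigid,
| machine-parseable edit format and then apply it locally. A
| rough sketch of the idea in Python (a made-up search/replace
| format for illustration, not aider's actual prompts or parser):
|
|     import pathlib
|     import re
|
|     # Expect the model to emit blocks like:
|     #   path/to/file.py
|     #   <<<<<<< SEARCH
|     #   old lines
|     #   =======
|     #   new lines
|     #   >>>>>>> REPLACE
|     BLOCK = re.compile(
|         r"^(?P<path>\S+)\n<<<<<<< SEARCH\n"
|         r"(?P<old>.*?)\n=======\n"
|         r"(?P<new>.*?)\n>>>>>>> REPLACE",
|         re.DOTALL | re.MULTILINE)
|
|     def apply_edits(reply_text):
|         for m in BLOCK.finditer(reply_text):
|             path = pathlib.Path(m["path"])
|             src = path.read_text()
|             # Refuse ambiguous or failed matches rather than
|             # guessing, and make the model try again instead.
|             if src.count(m["old"]) != 1:
|                 raise ValueError(f"no unique match in {path}")
|             path.write_text(src.replace(m["old"], m["new"], 1))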
| jondwillis wrote:
| https://cursor.sh has some recently added functionality for
| applying code changing to files throughout your code base. I
| have yet to try it because I am using their a la carte (bring
| your own keys) option.
| DavidSJ wrote:
| I note that while E is more common than A if we're counting
| letters appearing _anywhere_ in a word, A is substantially more
| common than E if we only count first letters of words:
|     $ egrep -o . /usr/share/dict/words | tr a-z A-Z \
|         | sort | uniq -c | sort -rn
|     235415 E   201093 I   199606 A   170740 O   161024 R
|     158783 N   152868 T   139578 S   130507 L   103460 C
|      87390 U    78180 P    70725 M    68217 D    64377 H
|      51683 Y    47109 G    40450 B    24174 F    20181 V
|      16174 K    13875 W     8462 Z     6933 X     3734 Q
|       3169 J         2 -
|
|     $ cut -c1 /usr/share/dict/words | tr a-z A-Z \
|         | sort | uniq -c | sort -rn
|      25170 S    24465 P    19909 C    17105 A    16390 U
|      12969 T    12621 M    11077 B    10900 D     9676 R
|       9033 H     8800 I     8739 E     7850 O     6865 F
|       6862 G     6784 N     6290 L     3947 W     3440 V
|       2284 K     1643 J     1152 Q      949 Z      671 Y
|        385 X
|
| This also explains the prevalence of S, P, C, M, and B.
| mlrtime wrote:
| A bit off-topic, but this used to be (one of) my favorite unix
| admin interview questions.
|
| Given a file in linux, tell me the unique values of column 2,
| sorted by number of occurrences, with the count.
|
| If the candidate knew 'sort | uniq -c | sort -rn' it was a
| medium-strong hire signal.
|
| For candidates that didn't know that line of arguments, I'd
| allow them to solve it any way they wanted, but they couldn't
| skip it. The candidates who copied the data into Excel usually
| didn't make it far.
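|
| A non-shell answer would pass too. The Python equivalent, with
| a hypothetical whitespace-separated data.txt:
|
|     from collections import Counter
|
|     counts = Counter()
|     with open("data.txt") as f:
|         for line in f:
|             cols = line.split()
|             if len(cols) >= 2:
|                 counts[cols[1]] += 1  # column 2
|
|     # Shell version: awk '{print $2}' data.txt | sort \
|     #     | uniq -c | sort -rn
|     for value, n in counts.most_common():
|         print(n, value)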
| mjburgess wrote:
| My intuitions start with: cut, wc, sort, uniq
| nonethewiser wrote:
| > The candidates who copied the data in excel, usually didn't
| make it far.
|
| Were they able to google? If not then excel makes perfect
| sense because the constraints are contrived.
| marginalia_nu wrote:
| I'll argue any civilized programmer should have a Wikipedia dump
| downloaded onto their machine. They're surprisingly small, and it
| saves you from having to use slow and unreliable APIs to do these
| types of basic processing tasks.
|
| They also let you do less basic processing tasks that would have
| been too expensive to expose over API.
| teaearlgraycold wrote:
| I learned how expensive hashmaps and hashsets are through
| Wikipedia dumps. I did some analysis of the most linked-to
| pages. Countries were among the highest. Hash sets for holding
| outgoing edges in the link graph ended up causing my program to
| exceed my laptop's memory. Plain old lists (Python) were fine,
| though. And given there aren't a crazy number of links per page
| using lists is fine performance wise.
| angrais wrote:
| Why did lists require less memory? Was it because you only
| held a subset of keys in the lists?
| teaearlgraycold wrote:
| Lists in Python have an integer for their size and a pointer
| for each element. Sets presumably have some number of
| buckets that are used to put pointers in, but many more
| buckets are allocated than get used in small sets.
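|
| You can see the container overhead directly (the element
| objects aren't counted either way, and the exact numbers vary
| by CPython version):
|
|     import sys
|
|     xs = list(range(5))
|     print(sys.getsizeof(xs))       # size of the list object
|     print(sys.getsizeof(set(xs)))  # noticeably larger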
| wenyuanyu wrote:
| Sort them, and use a vector of vectors for the adjacency
| list... Or better still use a graph processing library or
| graph database to manage that for you...
| marginalia_nu wrote:
| This is a fairly large data set indeed. The memory overhead
| (which is probably something like 4-8x for hash maps?) can
| start to become fairly noticeable at those sizes.
|
| Since Wikipedia posts already have a canonical numeric ID, if
| map semantics are important, I'd probably load that mapping
| into memory and use something like roaringbitmap for
| compressed storage of relations.
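|
| In Python, for instance, something like the pyroaring package
| (my assumption of a suitable binding, not necessarily what I'd
| end up using):
|
|     from pyroaring import BitMap
|
|     # outgoing links, keyed by the source page's numeric ID
|     links = {}
|
|     def add_link(src_id, dst_id):
|         links.setdefault(src_id, BitMap()).add(dst_id)
|
|     def pages_linking_to(dst_id):
|         return [s for s, bm in links.items() if dst_id in bm]
|
| Each BitMap stores its set of page IDs compressed, so the
| per-relation overhead typically stays far below that of a
| Python set of ints.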
| kuschku wrote:
| How'd the hashset exceed your laptop memory, if the whole
| dump is just 22GB? You should be able to fit the entire
| dataset in RAM.
| marginalia_nu wrote:
| That is a compressed dump you are looking at. The
| uncompressed data is much larger. Link graphs in general
| can grow quite big. Also, not every laptop has 32 GB RAM.
| johnnyanmac wrote:
| I'm still sticking with 16GB on my laptop so that would
| exceed my current RAM. That may also cut it close for a 32GB
| machine anyway, since the OS and other programs may not let
| you access all your physical RAM.
| teaearlgraycold wrote:
| I think my laptop had 8GB at the time
| downboots wrote:
| Does the download dump include edit history?
| marginalia_nu wrote:
| Well there are full database dumps, but they're quite a bit
| too big to be of much practical use.
|
| I'm usually working with the text-only OpenZim version, which
| cuts out most of the cruft.
| aragonite wrote:
| Relatedly: to drastically improve Wikipedia loading speed for
| personal browsing purposes, _do not_ stay logged in to your
| Wikipedia account. The reason is explained here (see the top
| reply by baowolff):
|
| https://news.ycombinator.com/item?id=36114477
| rafram wrote:
| To be honest, three-tenths of a second per page load just
| doesn't matter very much to me. Logging in and out all the
| time will take much longer.
| sneed_chucker wrote:
| Well, I specifically stay logged in so that I can opt out of
| the redesign they dropped a year or so ago. Never made an
| account before that.
| hiAndrewQuinn wrote:
| 1. Download http://static.wiki/.
|
| 2. Run it locally on https://datasette.io/.
|
| 3. ???
|
| 4. Profit?
| marginalia_nu wrote:
| I built my own at encyclopedia.marginalia.nu, but basically,
| yes.
| dojitza1 wrote:
| An interesting solution to the blind spot error (taken directly
| from Jeremy Howard's amazing guide to language models -
| https://www.youtube.com/watch?v=jkrNMKz9pWU) is to erase the chat
| history and try again. Once GPT has made an error (or as the
| author of this article says, the early layers have irreversibly
| pruned some important data), it will very often start to be even
| more wrong.
| cypherpunks01 wrote:
| When this happens, I'll usually say something along the lines
| of:
|
| "This isn't working and I'd like to start this again with a new
| ChatGPT conversation. Can you suggest a new improved prompt to
| complete this task, that takes into account everything we've
| learned so far?"
|
| It has given me good prompt suggestions that can immediately
| get a script working on the first try, after a frustrating
| series of blind spot bugs.
| jondwillis wrote:
| I'll give this a try when it undoubtedly happens to me later
| today while debugging something ;)
| sebastiennight wrote:
| It seems surprising that this would work, because in my
| experience these LLMs don't really have good prompt-crafting
| skills.
|
| Can you please share a ChatGPT example where that was
| successful, including having the new prompt outperform the
| old one?
| kridsdale3 wrote:
| I do a similar thing when the latest GPT+DALLE version says
| "I'm sorry I can't make a picture of that because it would
| violate content standards" (yesterday, this was because I
| asked for a visualization of medication acting to reduce
| arterial plaque. I can only assume arteries in the body ended
| up looking like dicks)
|
| So I say "Ok, let's start over. Rewrite my prompt in a way
| that minimizes the chance of the resulting image producing
| something that would trigger content standards checking"
| gwern wrote:
| This is one benefit of using Playground: it's easy to delete or
| edit individual entries, so you can erase duds and create a
| 'clean' history (in addition to refining your initial prompt-
| statement). This _doesn't_ seem to be possible in the standard
| ChatGPT interface, and I find it extremely frustrating.
| mercer wrote:
| I use emacs/org-mode, and just integrating gpt into that has
| made a world of difference in how I use it (gptel.el)! Can
| highly recommend it.
|
| The outlining features and the ability to quickly zoom in or
| out of 'branches', as well as being able to filter an entire
| outline by tag and whatnot, is amazing for controlling the
| context window and quickly adjusting prompts and whatnot.
|
| And as a bonus, my experience so far is that for at least the
| simple stuff, it works fine to ask it to answer in org-mode
| too, or to just be 'aware' of emacs.
|
| Just yesterday I asked it (voice note + speech-to-text) to
| help me plan some budgeting stuff, and I mused on how adding
| some coding/tinkering might make it more fun. so GPT decided
| to provide me with some useful snippets of emacs code to play
| with.
|
| I do get the impression that I should be careful with giving
| it 'overhead' like that.
|
| Anyways, can't wait to dive further into your experiences
| with the robits! Love your work.
| lynx23 wrote:
| For the author: Have you looked at the casing (Text.Casing)
| Hackage package?
| gwern wrote:
| https://hackage.haskell.org/package/casing-0.1.4.1/docs/Text...
| ? This looks intended only for source code, and doesn't help
| with natural language text like page or paper titles.
| k2enemy wrote:
| Pedantic, but gwern is looking for initialisms, not acronyms.
| Acronyms are pronounced as a word.
|
| https://www.merriam-webster.com/grammar/whats-an-acronym
| throw555chip wrote:
| > looking for initialisms, not acronyms
|
| Imprecise wording: initialisms are a special case of acronyms;
| it's not either/or.
|
| https://wwwnc.cdc.gov/eid/page/abbreviations-acronyms-initia...
|
| "an initialism is an acronym that is pronounced as individual
| letters"
|
| https://www.writersdigest.com/write-better-fiction/abbreviat...
|
| "As such, acronyms are initialisms."
| gameman144 wrote:
| Wait, am I crazy or are these two articles saying the exact
| opposite thing about which class is the parent one?
|
| The CDC one seems to say that initialisms are a class of
| acronym, but the Writers Digest one says acronyms are a class
| of initialism.
| Kerbonut wrote:
| I read those the same way. In any case, both acronyms and
| initialisms are individual subsets of abbreviations.
| jahnu wrote:
| TIL
| sebastiennight wrote:
| The CDC link says they are two separate classes (one is
| pronounced as a word, the other one is pronounced by
| reading the letters)
|
| The Writer's Digest link says that initialisms are the
| parent class, and that acronyms are the special case of
| specifically pronouncing the letters as a word.
|
| So, root comment is correct (gwern is looking for
| initialisms) and GP is incorrect (initialisms are not a
| subset of acronyms in either definition linked by GP).
| gameman144 wrote:
| As far as I can tell, the CDC is _also_ stating an is-a
| relationship:
|
| > an initialism is an acronym that is pronounced as
| individual letters
| sebastiennight wrote:
| But it contradicts their earlier definition:
|
| > an acronym is made up of parts of the phrase it stands
| for and is pronounced as a word
|
| I think their guideline is badly written.
|
| It's written like this:
|
| > There are vehicles, bicycles and motorbikes. A vehicle
| takes you from point A to point B. A bicycle is a human-
| powered transportation device. A motorbike is a bicycle
| propelled by an engine. For the purposes of this article,
| all three will be called "vehicles" in the rest of the
| text.
|
| They're not saying "an initialism is part of the class
| Acronym, with added details", they're saying "an
| initialism is basically like the class Acronym, but
| pronunciation (which was how we defined Acronyms) is
| different."
| throw555chip wrote:
| > initialisms are not a subset of acronyms
|
| https://www.dictionary.com/e/acronym-vs-abbreviation/
|
| "Initialisms are types of acronyms."
| kuba-orlik wrote:
| Couldn't the same thing be achieved with grep?
| gwern wrote:
| How? _What_ are you going to grep? /usr/share/dict/words?
| Doesn't include all the proper nouns or organizations that WP
| would.
| throw310822 wrote:
| Funnily enough, there is a Wikipedia page with all three letter
| acronyms, that correctly shows CQK as the first unused one (red
| link).
|
| https://en.m.wikipedia.org/wiki/Wikipedia:TLAs_from_AAA_to_D...
| gwern wrote:
| Hah! I didn't know that existed.
|
| Figuring out how to parse it would be a bit tricky, however...
| looking at the source, I think you could try to grep for
| 'title="CQK (page does not exist)"' and parse out the
| '[A-Z][A-Z][A-Z]? ' match to get the full list of absent TLAs
| and then negate for the present ones.
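|
| Concretely, something like this, perhaps (assuming the index
| page really is Wikipedia:TLAs_from_AAA_to_DZZ, i.e. the page
| linked above):
|
|     import re
|
|     import requests
|
|     URL = ("https://en.wikipedia.org/wiki/"
|            "Wikipedia:TLAs_from_AAA_to_DZZ")
|     html = requests.get(URL, timeout=30).text
|     # Red links carry this title attribute in the HTML.
|     absent = re.findall(
|         r'title="([A-Z]{3}) \(page does not exist\)"', html)
|     print(absent[:10])  # CQK should be the first in this range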
___________________________________________________________________
(page generated 2023-11-15 23:02 UTC)