[HN Gopher] Unlimiformer: Long-Range Transformers with Unlimited...
       ___________________________________________________________________
        
       Unlimiformer: Long-Range Transformers with Unlimited Length Input
        
       Author : shishy
       Score  : 197 points
       Date   : 2023-05-05 17:53 UTC (5 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | sva_ wrote:
       | I think infiniformer would've sounded better. The bench scores
       | seem pretty marginal.
        
         | mirekrusin wrote:
         | Pretty marginal score gains once a week is all you need.
        
           | sdenton4 wrote:
           | Only so long as a) the gains are real, and not overfitting
           | the test dataset, and b) you don't balloon in complexity, so
           | that stacking approaches becomes impossible to manage.
           | 
           | Point (a) is extremely hard to discern, especially when
           | people are chasing third-significant-digit gains on common
           | benchmarks; it's essentially multiple-testing false discovery
           | in action. I've seen whole families of methods fail to
           | transfer to new domains...
           | 
           | Point (b) is also a real issue. As you increase the number of
           | bells and whistles, each with their own hyperparameters with
           | non-linear impacts on model quality, it becomes impossible to
           | say what's working or not.
           | 
            | In practice, I think we see some cycles of baroque
           | incremental improvements, followed by someone spending a year
           | stripping away the bullshit and getting something simple that
           | outperforms the pack, essentially because it's easier to do
           | hyperparam search over simpler models once you figure out the
           | bits that actually matter.
        
       | XorNot wrote:
       | Hang on, how unlimited is unlimited here? Surely the immediate
       | thing you'd do with this is just _never_ delete any prior inputs
        | so it becomes de facto long-term memory for the model?
        
         | shishy wrote:
         | Last paragraph touches on that:
         | 
         | The length of inputs is theoretically bounded by the memory
         | limitations of the computer used. More practically, using a CPU
         | datastore is many times slower than a GPU datastore because of
          | slower search and the need to transfer retrieved embeddings
          | to the GPU... (continues)
        
         | 0xDEF wrote:
         | The limit is RAM but GPU RAM is much faster than computer RAM.
        
           | davrosthedalek wrote:
           | Is that really the limit? There is no real restriction that
           | everything is in memory at the same time, right? You could
           | maybe stream from SSD?
        
             | capableweb wrote:
             | Create a swapfile and you essentially trade disk space for
             | memory space.
        
       | GistNoesis wrote:
        | I've read the paper quickly; the main idea is simple and
        | interesting, but maybe a little dubious (it's kind of an
        | accuracy-for-memory trade-off).
       | 
        | In the transformer architecture one has to compute QK^T.
        | 
        | QK^T = (hd * Wq * Wk^T) * he^T (equation (2), page 3 in the
        | paper).
        | 
        | Here hd is the hidden state of the decoder, he is the hidden
        | state of the encoder, Wq and Wk are parameter matrices, and ^T
        | denotes transposition.
       | 
        | By grouping the calculation this way, in a transformer encoder-
        | decoder architecture, they can build and use only a single index
        | (you index the he vectors using a vector database) for all the
        | decoder layers' queries, instead of having to build 2 * L * H
        | indices (with L the number of decoder layers and H the number of
        | heads in the decoder).
       | 
        | But what makes it a little dubious is that this transformation
        | means you make your nearest-neighbor queries in a space of
        | dimension "dimension of the hidden state", instead of "dimension
        | of a head", which is H times smaller.
        | 
        | So if you had built 2 * L * H indices, each index would have been
        | H times smaller.
       | 
        | So you only gain a factor of 2 * L. But the trade-off is that
        | you are doing a nearest-neighbor search in a higher dimension,
        | where you are then subject to the curse of dimensionality (the
        | higher the dimension, the more similar all points are to each
        | other), whereas the whole point of projections in transformers is
        | to lower the dimension so that the kNN search makes more sense.
        | So to get the same accuracy, your nearest-neighbor search engine
        | will have to work a lot harder.
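        | 
        | As a quick illustration of that concentration effect (my own toy
        | numbers, nothing from the paper): the relative spread of
        | distances from a random query to random points shrinks as the
        | dimension grows, so "nearest" becomes less meaningful:
        | 
        |   import numpy as np
        | 
        |   rng = np.random.default_rng(0)
        |   for dim in (4, 64, 1024):
        |       pts = rng.standard_normal((10_000, dim))  # random points
        |       q = rng.standard_normal(dim)              # random query
        |       d = np.linalg.norm(pts - q, axis=1)
        |       # spread of distances relative to their mean shrinks
        |       print(f"dim {dim:>5}: relative spread {d.std()/d.mean():.3f}")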
       | 
        | Also, as an approximation of the transformer that relies on kNN
        | search, it comes with the problems associated with that (for
        | example, it's harder to train because it's more sparse, and it
        | has a tendency to hyperfocus), but it can be complemented with a
        | low-rank linearization of the attention so the neural net also
        | acts on the gist rather than only the closest neighbors.
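        | 
        | To make the regrouping concrete, here is a tiny numpy sketch (my
        | own illustration, not code from the paper; the sizes and names
        | are made up) checking that the two groupings give identical
        | attention scores, which is what lets you index the raw he
        | vectors once:
        | 
        |   import numpy as np
        | 
        |   rng = np.random.default_rng(0)
        |   d_model, d_head, n_enc = 16, 4, 10
        |   h_e = rng.standard_normal((n_enc, d_model))  # encoder states
        |   h_d = rng.standard_normal(d_model)           # one decoder state
        |   W_q = rng.standard_normal((d_model, d_head))
        |   W_k = rng.standard_normal((d_model, d_head))
        | 
        |   # Standard grouping: project the keys per head, then dot.
        |   standard = (h_d @ W_q) @ (h_e @ W_k).T
        | 
        |   # Regrouped: fold Wq Wk^T into the query, so the index only
        |   # ever has to store the raw h_e vectors.
        |   regrouped = (h_d @ W_q @ W_k.T) @ h_e.T
        | 
        |   assert np.allclose(standard, regrouped)
        | 
        | Note that the regrouped query lives in d_model dimensions rather
        | than d_head, which is exactly the curse-of-dimensionality
        | trade-off above.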
        
         | numeri wrote:
         | This technique can be added on to any encoder-decoder
         | Transformer model post-training, so the added training
         | difficulties you mention don't apply. It honestly is a very
         | interesting approach to me - the main issue I see (which they
         | discuss in the paper) is in pure latency. If you're using a
         | large enough vector database, it will be on the CPU, and
         | transferring hidden states from GPU to CPU and then the
         | embeddings back from CPU to GPU is going to eat up a ton of
         | time.
        
       | szundi wrote:
       | Input should be the Internet then.
        
         | quickthrower2 wrote:
         | Pricing: $0.1 per nano token.
        
       | TeMPOraL wrote:
       | Is this how Kagi's "universal summarizer" works? They wrote a lot
       | of copy about how it's able to summarize websites and documents
       | of arbitrary length, while not revealing how on Earth this
       | actually works. It _does_ seem to work, though.
        
       | adamnemecek wrote:
       | The attention mechanism corresponds to the Hopf algebraic
       | convolution, a generalization of the commonly known convolution.
       | 
       | I'm in the process of implementing a framework based on this
       | idea.
       | 
       | I have written a paper on this recently,
       | https://arxiv.org/abs/2302.01834
       | 
       | I have a discord channel https://discord.cofunctional.ai.
        
         | capableweb wrote:
          | You never just work on something until it's ready to be
          | shared, and then share it once? It has to be shared before it's
          | even a little bit usable, with just some vague words about what
          | it might be?
        
           | adamnemecek wrote:
           | I'm gauging interest and looking for potential users. Steve
           | Blank and all that.
        
             | verdverm wrote:
             | The first step to crossing the chasm is finding those
             | innovators and learning if you are solving a problem!
        
               | adamnemecek wrote:
               | I have and I am. Next.
        
       | j2kun wrote:
       | Appendix C of the PDF has an example book summary of The Brothers
       | Karamazov, with noted minor factual errors
       | 
       | https://arxiv.org/pdf/2305.01625.pdf
       | 
       | > Unlimiformer summary:
       | 
       | > The first part of the novel focuses on the question of whether
       | or not the Russian nobleman, Dmitri Fyodorovitch, has killed his
       | father. In the town of Ivanovna, the lieutenant-colonel of the
       | Mushenkhanovitch is accused of the murder of his brother
       | Ivanovitch. The lieutenant-incommand, Vasilyevitch, takes the
       | form of a dog, and the two men-the two men and the woman who are
       | questioned by the court-martial-murphy. The two men cry out to
       | the God of Russia for help in their quest to save the town. The
       | man, afraid of the wrath of the God, hands the dog a bunch of
       | letters that are supposed to be proof of his love for his
       | brother. The old man-the one who had killed his mother, and then
       | found the letter-arrives. He reads it-asked the old man to
       | forgive him for the murder and then takes the dog away. The other
       | men, all of whom are prisoners, demand that the man confess his
       | crime to the court. The first and most important thing they tell
       | the court is that they love the man. The court acquits the man
       | and sentences the man to death. The second man-an old officer of
       | the town, Alekandrovitch-askes to tell them the same thing. The
       | third man-in the process of confessing his crime-is Vashenka, a
       | drunk man who has been sent to the town to kill his father, for
       | reasons which are not entirely clear to the people. The woman's
       | servant, Evgenyevna, is also the one who has told the court the
       | story of the Medvedevitch's murder, for the good old man's and
       | the young man's love. The three men, who are separated for the
       | first time, are laughing at the man's attempt to seduce Mitya.
       | The young man, in the meantime, is conscripted into the town-
       | side. He tells the court that he loves her, but he has yet to
       | tell her the true story. The men, in this room, demand a man to
       | kill her, and she will not betray them. The women, in their own
       | country, are rebelling against the man who had sent them three
       | thousand roubles, and they will not allow the man of the people
       | to see them. They will not let the man in the town be allowed to
       | see the man-or Dmitriovitch; he will have her husband killed him.
       | He will not tell the people who love him. The next man, named
       | Vashenovitch, arrives, and takes the man away. They all begin to
       | laugh at the fact that he has succeeded in seducing and
       | entrusting his brother Dmitri. He is then taken away to the old
       | woman's house, where the governor-side-of-the-world, and his
       | sister, Arkadin, is being punished. The priestesses and the baron
       | are shocked, for they have been so virtuous and well-suited. The
       | only thing they will be able to do is kill the priest. They
       | threaten to burn the priestess to death, for she has been so
       | wicked and libidinous that she has not yet seen the priest, for
       | her husband. The priests-ostensibly convinced that she is a woman
       | who loves the priest and has been punished for her love and for
       | allowing the priest to marry her. The last man, Yakivitch,
       | arrives at the house, and, after a long day of drinking and then
       | some of the men-is killed. He and the priest are ordered to leave
       | the town so that the priest can finally be reunited with the
       | people of the old lady. The final man, the commander of the St.
       | Petersburg town of Arkadina, is sentenced to death for the crime
       | of having killed and then the lieutenant of the governor, for
       | taking the money. The commander, the former lieutenant-delegation
       | of the People's Army, is summarily executed, and all the men,
       | except for the commander, have been summarily punished for their
       | crime. The entire town is shocked and, in a very dramatic way,
       | the priestesses plead for the forgiveness of the man, for
       | allowing them to kill and imprison Ivan. They plead for their
       | brother to be restored as well, for all the people they have
       | loved, and for the priestor to tell the story
        
         | timy2shoes wrote:
         | Just like the book, that summary was too long; didn't read.
        
           | MacsHeadroom wrote:
           | Sounds like your context window is too short.
        
             | verdverm wrote:
             | because internet?
        
         | edflsafoiewq wrote:
         | That summary hardly inspires confidence, it's awful.
        
       | chrgy wrote:
        | In the age of transformers, let's ask a transformer to
        | summarize this paper:
       | 
       | The Unlimiformer paper is about a new way to make computer
       | programs that can summarize really long pieces of text. Normally,
       | when you ask a computer program to summarize something, it can
       | only handle a certain amount of text at once. But with
       | Unlimiformer, the program can handle as much text as you want!
       | 
       | The way Unlimiformer works is by using a special technique called
       | a "k-nearest-neighbor index" to help the program pay attention to
       | the most important parts of the text. This makes it possible for
       | the program to summarize even really long documents without
       | losing important information.
       | 
       | Overall, Unlimiformer is an exciting new development in natural
       | language processing that could make it easier for computers to
       | understand and summarize large amounts of text.
        
       | bighoki2885000 wrote:
       | [dead]
        
       | space_fountain wrote:
        | As I understand it, the approach here is to use an approximate
        | nearest-neighbor database to retrieve highly relevant tokens from
        | across large documents using the existing attention heads. So
        | each attention head retrieves context from the entire document.
        | They say this can work without fine-tuning, but performance
        | improves with it. This is apparently extending a piece of prior
        | work, but they've managed to rearrange the linear algebra of
        | attention so they only need one database for all attention heads
        | across all layers of the model. I'm a bit confused about how
        | attention would work here for layers below the top, and about how
        | position is encoded for tokens across a long document like this.
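        | 
        | Roughly how I picture one retrieval-augmented cross-attention
        | head (my own toy reconstruction, not the paper's code; exact
        | search over a numpy array stands in for the shared ANN
        | datastore, and the names are made up):
        | 
        |   import numpy as np
        | 
        |   def knn_cross_attention(h_d, W_q, W_k, W_v, datastore, k=4):
        |       # One decoder query attends only to the k most similar
        |       # encoder states, retrieved from a single datastore that
        |       # is shared by every head and layer (folding Wq Wk^T into
        |       # the query is what makes the shared datastore possible).
        |       q = h_d @ W_q @ W_k.T
        |       scores = datastore @ q                  # toy exact search
        |       top = np.argpartition(scores, -k)[-k:]  # top-k positions
        |       s = scores[top] / np.sqrt(W_q.shape[1])
        |       w = np.exp(s - s.max())                 # softmax over the
        |       w = w / w.sum()                         # retrieved keys only
        |       return w @ (datastore[top] @ W_v)       # weighted values
        | 
        | A real implementation would replace the exact search with an
        | approximate k-NN index over the datastore.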
        
         | im3w1l wrote:
          | I don't understand how this could work. Like, if you select a
          | small fixed number of tokens from a large document, won't you
          | necessarily lose a lot of important data?
        
       | nephanth wrote:
        | Btw, why do transformers have a limited input size in the first
        | place? I'm pretty sure the self-attention mechanism scales
        | (although with bad complexity) to arbitrary sizes.
        
         | MacsHeadroom wrote:
         | >(although with bad complexity)
         | 
         | Because of exactly that.
         | 
         | Also the attention mechanism is baked in during pretraining. So
         | whatever max context length you want increases the compute cost
         | of training by at least a function of said "bad complexity."
         | Even just 4096 tokens of max context is much more expensive to
         | train than 2048. So if we want models with 8k, 32k, or more
         | context then the training costs get out of hand quickly.
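          | 
          | A back-of-the-envelope sketch of just the attention-score
          | matrix (my own illustrative numbers, one head, one layer,
          | fp32):
          | 
          |   # Self-attention builds an n x n score matrix, so doubling
          |   # the context quadruples this cost for every head in every
          |   # layer, at every training step.
          |   for n in (2048, 4096, 8192, 32768):
          |       scores = n * n
          |       mib = scores * 4 / 2**20
          |       print(f"{n:>6} tokens -> {scores:>13,} scores, "
          |             f"{mib:>6,.0f} MiB")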
        
       | mxwsn wrote:
       | 1. This is not exact attention, but an approximation of it.
       | Specifically, they use k-nearest neighbors to retrieve the top-k
       | most similar tokens, out of an "unlimited-length input" say of
       | size N, where k << N.
       | 
       | 2. This idea is quite similar to retrieval transformers and
       | Hopfield networks which have been known and published for several
       | years now. It's not really that novel.
       | 
       | 3. Due to the preceding points, the title can easily mislead
       | people. It's not really a conventional transformer, and it's not
       | a breakthrough.
       | 
       | 4. This paper is a preprint and not peer-reviewed.
       | 
       | "I generally don't enjoy seeing preprints like this going to the
       | top of Hacker News. This would be a higher quality submission if
       | the paper was peer-reviewed or put into a greater context, like a
       | blog post discussion or something like that."
       | 
        | Let me retract this and say something a bit nicer :) I personally
        | think that this specific preprint making it to the top of HN is
       | potentially harmful, because of the hype around LLMs, the diverse
       | audience of readers here, and the specific title that implies a
       | claim of "transformer with unlimited context length", when this
       | is misleading. I don't have anything against preprints in general
       | - a lot of work outside of the peer-review process ends up being
       | very impactful.
        
         | joseph_grobbles wrote:
         | [dead]
        
         | dhruvdh wrote:
         | I generally don't enjoy something being diminished on account
         | of being "not really that novel".
         | 
          | Your comment essentially says: this is not a high-quality
          | submission because readers might not actually read it, which is
          | no fault of the work or the submitter.
        
           | whimsicalism wrote:
            | It doesn't have to be someone's fault to not be a well-suited
            | submission.
        
           | MasterScrat wrote:
           | > Your comment essentially says - this is not a high quality
           | submission because readers might not actually read it
           | 
           | I'd argue that on average, most readers won't have a good
           | enough understanding, or read the paper far enough, to
           | understand that the reality is closer to "it's not a
           | breakthrough" rather than "Transformers with Unlimited Length
           | Input".
           | 
           | So, I wholeheartedly welcome this type of hype-breaking
           | leading comment.
        
             | jjoonathan wrote:
             | Agreed 100%. Not only do I appreciate "well actually"
             | comments, I think they are the single most useful aspect of
             | forum discussions.
             | 
             | The headline will always be "BATTERY BREAKTHROUGH PROMISES
             | TO ROCKET ELON MUSK TESLA TO THE MOON!!!" and while it's
              | easy to know that _some_ amount of cold water is necessary,
              | you need to spend a nontrivial amount of attention and have
              | a nontrivial amount of knowledge to figure out just how
              | much cold water. It's a useful thing to outsource. Did a
             | research group see outperformance in an experiment with 1%
             | probability of translating into production? Or is CATL
             | scaling up a production process? The "well actually"
             | comment will contextualize for you. If there's a "well
             | actually" reply to the "well actually" comment, that tells
             | you something too. Upvotes/downvotes dial in the
             | distributed consensus.
             | 
             | It's far from perfect, but I'd challenge detractors to
             | point to a more effective method for large-scale democratic
             | truth seeking.
        
             | swores wrote:
             | It's possible to approve of the "hype-breaking" (aka
             | TLDRing / ELI5ing so that HN comment readers can understand
             | the degree to which it's interesting for those of us not
             | close enough to the field to understand that for ourselves)
             | without agreeing that that same comment should also
             | complain that preprints shouldn't be submitted to / upvoted
             | on HN.
             | 
             | That's how I feel, anyway. I'd rather have seen a comment
             | that has the same explanations in it but just generally
              | less grumpy! Saying stuff like "It's not really that
              | novel." doesn't really contribute much, when it could
              | instead explain why it isn't novel by pointing to the
              | earlier work it resembles, or consider what, if anything,
              | _is_ novel in this research - assuming it isn't being
              | accused of just replicating something already done.
        
         | godelski wrote:
         | Honestly, these complaints (other than 4) apply to the vast
         | majority of papers. #4 is just false. It has already been
          | viewed by other lab members (peers), and open publication is a
          | form of peer review. The "peer review system" (publishing to
         | conferences/journals) is relatively new and I think ML
         | demonstrates all the problems with the system (yay hype).
         | 
         | Novelty is especially a joke. ViTs are "just" NLP encoding
         | transformers. T2I models are "just" NLP models connected to
         | generative models. Diffusion models are "just" whitening
         | models. GPT3 is just GPT2 with more layers and more data which
         | is just GPT with more layers and more data. We can go even
         | deeper if we pull from math and physics works. But that doesn't
         | mean these works haven't been highly fruitful and useful. I'm
         | happy all of these have been published.
         | 
         | > because of the hype around LLMs
         | 
         | I too hate the hype, but it is often bimodal. There are people
         | who are far too critical and people who are far too accepting.
         | The harm is not preprints or people reading papers, the harm is
         | people who have no business/qualifications evaluating works
         | confidently spouting out critiques. It is people not
         | understanding that researchers are just critical of one
         | another's work by default and that doesn't mean it shouldn't
         | have been published.
         | 
          | It is well known that reviewers are good at identifying bad
          | papers but not good at identifying good papers [0,1]. Which,
          | let's be honest, means reviewers just have high reject rates in
          | a noisy system, making publication a highly noisy metric for
          | merit at best.
         | 
         | As for the paper:
         | 
          | Many LLMs and large models already use attention
          | approximations, and the kNN technique isn't particularly new
          | either. My main complaint is the lack of comparisons for
          | Figures 3 and 4, but I'm not an NLP person, so I don't know
          | whether there are other good works to compare against (BART is
          | a common baseline). And generative models are (though this
          | unfortunately isn't widely appreciated) extremely difficult to
          | evaluate. The paper seems fine to me; it is useful to the
          | community. I don't like the name either, since their input is
          | limited by computer memory, not by the model, and I would want
          | to see more on this. Not being an NLP person, all I can say is
          | that this looks like neither a strong reject nor a strong
          | accept. I'll leave it to the community to determine whether
          | they want more experiments for the conference publication, but
          | the work seems useful.
         | 
         | [0] https://inverseprobability.com/talks/notes/the-neurips-
         | exper...
         | 
         | [1] https://arxiv.org/abs/2109.09774
        
         | ShamelessC wrote:
         | > This idea is quite similar to retrieval transformers and
         | Hopfield networks which have been known and published for
         | several years now. It's not really that novel.
         | 
         | Is it? I had thought retrieval transformers "merely" used
         | retrieval as a backend of sorts rather than a substitute for
         | the attention itself?
        
           | mxwsn wrote:
            | Yeah, RETRO [0] embeds an entire question/prompt, and
           | searches for similar text passages with k-NN, then does
           | further processing. This can kind of be understood as
           | attention on paragraphs. This preprint instead does k-NN and
           | calls it attention on single tokens. So not the same. But
           | similar.
           | 
           | [0] https://jalammar.github.io/illustrated-retrieval-
           | transformer...
        
             | make3 wrote:
             | retro doesn't attend itself, which is a big difference
        
             | ShamelessC wrote:
             | Ah, I see - thanks for the clarification.
        
         | chaxor wrote:
         | There's nothing really wrong with a preprint making it to the
         | top - there can be genuinely good work that stays in preprint
         | for quite some time. I believe the original ELMo work that
         | spurred the Sesame street gang is _still_ in preprint despite
         | its importance in NLP (:shocked Pikachu face: not a
         | transformer?!).
         | 
          | But yes, you're correct in this instance that it's not
          | necessarily 'huge news', since it is highly similar to a long
          | list of prior work: the Reformer (LSH-based), Performer
          | (FAVOR**), FNet (Fourier-based), Routing Transformer, Sparse
          | Transformer, Longformer (task-specific sparse), BlockBERT,
          | XLNet/xfmr-xl (slide + relative PE), BP-Transformer (binary
          | partition), BigBird (global and random attention), RWKV which
          | is..., etc.
         | 
         | ** FAVOR actually is innovative and different in this space,
         | but towards _similar_ ends anyway
        
           | visarga wrote:
            | How come you know the efficient-transformers family? When I
            | ask questions about transformers in ML interviews, nobody has
            | heard of them. I can't figure out why it's not common
            | knowledge. For years all the transformer papers were about
            | reducing O(N^2).
        
             | f_devd wrote:
              | To be fair, ML is (used to be?) pretty broad, so unless
              | someone is actively keeping up with the SOTA in the high-
              | data sequence modeling area, it's quite possible to miss. I
              | know ML teams which were entirely made up of OSML
              | practitioners, because that was what was most commonly
              | useful until recently.
        
         | ftxbro wrote:
         | > I generally don't enjoy seeing preprints like this going to
         | the top of Hacker News. This would be a higher quality
         | submission if the paper was peer-reviewed or put into a greater
         | context, like a blog post discussion or something like that.
         | 
          | This opinion seems totally backwards to me. I'm not sure what
          | you think peer-reviewed means? Also, I prefer full preprints
          | to blog posts. But then again, I have no idea why things like
          | the daily blog posts of Seth Godin (to pick on one randomly,
          | sorry, it's not personal) so often go to the top of Hacker
          | News. Maybe opinions like yours explain it?
        
           | MacsHeadroom wrote:
           | > This opinion seems totally backwards to me.
           | 
           | I agree.
           | 
           | > I'm not sure what you think peer-reviewed means?
           | 
           | Posting to HN is a form of peer-review, typically far better
           | than the form of "peer-review" coopted by journal publishers.
        
             | xg15 wrote:
             | That's redefining what "peer-review" is. And I'll take
             | credentialism over some board of anonymous internet people,
             | I'm sorry.
             | 
             | I mean, hypothetically, this whole thread could be stuffed
             | with sock puppet accounts of the author. How would you
             | know?
        
             | pyth0 wrote:
             | > Posting to HN is a form of peer-review, typically far
             | better than the form of "peer-review" coopted by journal
             | publishers.
             | 
             | This is a rather self-aggrandizing view, and I think it
             | speaks to the level of ego that underpins a lot of the
             | discussion on here.
        
               | 19h wrote:
               | There's no need to attack the entire HN community over
               | one person's opinion. Preprints and discussions here both
               | have value, and different forms of review suit different
               | needs.
        
               | pyth0 wrote:
               | This was not an attack against the community or the paper
               | in question. I am only speaking from my experience as
               | (primarily) a lurker.
        
               | 19h wrote:
               | My apologies, I misinterpreted your comment. You make a
               | fair point that HN discussions are not equivalent to
               | formal peer review.
        
               | godelski wrote:
               | There's a lot of junk comments on HN but there's also a
               | lot of junk comments at top conferences like CVPR, ICCV,
               | and NIPS. The system is just noisy. I've had plenty of
               | inane reviews that clearly break reviewer guidelines (ACs
               | do nothing)[0,1].
               | 
               | Also, I want to remind everyone that ML uses conferences
               | as the main publishing mechanism, not journals. While
               | things like JMLR exist, that's not where papers are
               | targeting.
               | 
               | Maybe we just need to let researchers evaluate works
               | based on their merits and not concern ourselves with
               | things like popularity, prestige, and armchair experts'
               | opinions. The latter seems antiscientific to me. We need
               | to recognize that the system is noisy and Goodhart shows
               | us we aren't optimizing merit.
               | 
                | [0] An example: I once got a strong reject with two lines
                | of text, one stating that it wasn't novel (no further
                | justification) and the other noting a broken citation
                | link to the appendix. No comments about the actual
                | content.
               | 
               | [1] As another example, I've had reviewers all complain
               | because I didn't compare one class of model to another
               | and wasn't beating their performance. I beat the
               | performance of my peers, but different models do
               | different things. Image quality is only one metric. You
               | wouldn't compare PixelCNN to StyleGAN.
        
               | xg15 wrote:
               | > _Maybe we just need to let researchers evaluate works
               | based on their merits and not concern ourselves with
               | things like popularity, prestige, and armchair experts '
               | opinions._
               | 
               | Ok, but how would the researchers communicate their
               | evaluation to non-experts? (Or other experts who didn't
               | have the time to validate the paper)
               | 
               | Isn't that exactly what a review is?
               | 
               | My impression is the armchair experts are more likely to
               | be found on HN.
        
               | Grimblewald wrote:
               | >This is a rather self-aggrandizing view, and I think it
               | speaks to the level of ego that underpins a lot of the
               | discussion on here.
               | 
                | I'm not so sure about that. I've read a lot of things
                | that should never have left the peer review or editing
                | stages, while some of the most important papers for my
                | field never left preprint.
                | 
                | Overall I think the most important step of peer review
                | is you, the reader in the field. Peer review should
                | catch the worst offenders, saving us all some time,
                | but it should never be viewed as a seal of approval.
                | Everything you read should be critically evaluated as if
                | it were a preprint anyway.
        
               | pyth0 wrote:
               | I realize some people have taken my comment to be
               | speaking on the efficacy of the peer review process but
               | that was not my intent. I have no experience reading or
               | reviewing papers, or with the journal publication
               | process. My point was more to the fact that HN is a
               | public forum in which anyone can participate and so
               | elevating it above (what I hope are) subject matter
               | experts seemed rather arrogant. To be fair, the OP has
               | since expanded with a more complete comment and it seems
               | to be a similar sentiment to the things you and a couple
               | others have shared.
        
               | freeone3000 wrote:
               | Having been on a paper review board, the selection
               | process is essentially credentialism for credentialism's
               | sake. Anyone who's done a paper or two is deemed to be
               | qualified, and as it's unpaid, uncredited bonus work on
               | top of your day job, the slots aren't competed for very
               | hard.
               | 
               | I would say the primary difference between a conference
               | peer review board and HN is that the author is obliged to
               | respond to the reviewers on the board. I would not say
               | there's any particular difference in qualifications.
        
               | xg15 wrote:
               | > _Anyone who's done a paper or two_
               | 
               | That already narrows it down greatly compared to the
               | general public you find on the internet.
        
               | JoshuaDavid wrote:
               | Do you think it's _factually incorrect_ that the HN
               | comment section is more likely to find problems which
               | invalidate the conclusions of the paper than the journal-
               | driven peer review process?
        
               | mrbungie wrote:
               | I think that it depends on what journal we are talking
               | about. Most of them have some biases in their processes,
               | just as HN commenters also do.
        
               | anonymousDan wrote:
               | Yes?
        
               | JoshuaDavid wrote:
               | On reflection, I probably agree that the answer is "yes"
               | to the question as I phrased it. I think that if you take
               | a random paper, the peer reviewers probably _do_ have
               | much more useful feedback than HN would.
               | 
               | However, if you limit the question to "papers which make
               | bold conclusions of the type that generates lots of
               | discussion on HN", I think HN will be more likely to find
               | methodological flaws in those papers than the peer review
               | process would. I think that's mostly because papers are
               | optimized pretty hard not to have any problems which
               | would cause them to be rejected by the peer review
               | process, but _not_ optimized very hard to not have other
               | problems.
               | 
               | Which means, on average, I expect the HN comment section
               | to have more interesting feedback about a paper, _given
               | that it 's the sort of paper that gets lots of HN
               | discussion_, and also _given that the author put a lot of
               | effort into anticipating and avoiding the concerns that
               | would come up in the peer review process_.
               | 
               | Which, to a reader of HN, looks like "a lot of peer-
               | reviewed papers have obvious flaws that are pointed out
               | by the HN commentariat".
               | 
               | I do think, on the object level, a pre-print which the
               | author intends to publish in a reputable journal will be
               | improved _more_ by fixing any problems pointed out by HN
               | commenters than by fixing any problems pointed out by
               | peer reviewers, and as such I think  "post the pre-print
               | on HN and collect feedback before peer review" is still a
               | good step if the goal is to publish the best paper
               | possible.
        
               | pyth0 wrote:
               | This is a considerably more thoughtful comment and I
               | appreciate your reflection. I also can see how my initial
               | response was a little broad and over-generalizing. I do
                | think there is an interesting conversation in there about
               | whether a group of technically minded people outside the
               | "in group" of the peer reviewer circle (of whatever paper
               | in question) could offer different and potentially
               | important feedback.
               | 
               | Although I should add I have no background in academia
               | and don't feel prepared to have that discussion.
        
         | cs702 wrote:
         | After a very quick read, that's my understanding too: It's just
         | KNN search with some bells and whistles. So I agree on points
         | 1-3.
         | 
         | When something works well, I don't care much about point 4.
         | 
         | Personally, I've had only mixed success with KNN search on long
         | sequences. Maybe I haven't done it right? I don't know. In my
          | experience, nothing seems to work quite as well as explicit
          | token-token interactions via some form of attention, which as
          | we all know is too costly for long sequences (O(n^2)). Lately
          | I've been playing with https://github.com/hazyresearch/safari,
         | which uses a lot less compute and seems promising, though it
         | reminds me of things like FNet. Otherwise, for long sequences
         | I've yet to find something better than
         | https://github.com/HazyResearch/flash-attention for nxn
         | interactions and https://github.com/glassroom/heinsen_routing
         | for nxm interactions. If anyone has other suggestions, I'd love
         | to hear about them.
        
       | ztratar wrote:
        | Given that model performance is thus affected by k-nearest-
        | neighbor search, and those algorithms are proving not great for
        | baseline vector search, how well will this actually work?
       | 
       | It seems mostly like a vertically integrated vector DB + existing
       | LLM call, but correct me if I'm wrong. There are of course some
       | performance gains with that, but the holy grail of
       | "understanding" at unlimited length still seems unsolved.
        
         | mrbungie wrote:
          | Isn't the performance of kNN (as in retrieval quality, not
          | compute/memory usage) mostly determined by the quality of the
          | vectors/embeddings themselves?
          | 
          | Most vector DBs use (at least) some kind of kNN anyway.
        
       | smusamashah wrote:
        | What does it mean for ChatGPT and the like? Can they employ this
        | method to virtually get rid of the context-token limit?
        
         | Kranar wrote:
          | Yes, it looks like they could use this method. It acts as a
          | preprocessor and post-processor that can be applied to an
          | existing GPT model to augment it to handle unlimited tokens.
        
       | ftxbro wrote:
       | Other times this was put on hacker news:
       | 
       | https://news.ycombinator.com/item?id=35823039
       | 
       | https://news.ycombinator.com/item?id=35803470
        
         | swores wrote:
          | While I appreciate your intent and effort, I don't think it's
          | actually useful to link to other submissions unless they either
          | have comments (ideally at least one interesting comment, but at
          | least more than no comments at all) or point to a different
          | source link for the same subject - in which case it's probably
          | more useful to just link the alternative source, if it's worth
          | reading, rather than potentially split the discussion into
          | separate comment threads if the other is empty.
         | 
         | Linking to a different submission of the same link with 0
         | comments doesn't add anything.
        
           | ftxbro wrote:
           | I must have submitted it at the wrong time of day.
        
             | swores wrote:
              | Sure, or just random luck. Maybe this submission just
              | happened to take place when the few people who care about
              | this subject happened to come online, or vice versa for the
              | earlier bad luck.
             | 
             | But unlike sites like Reddit, with the exception of self /
             | ask HN / etc posts, nobody really pays attention to who the
             | submitter is, so enjoy the conversation finally breaking
             | out on it as consolation for not getting karma points, but
             | skip linking to dead submissions :)
             | 
             | FYI, if you ever submit something that fails to get any
             | traction / upvotes, then I've seen mods say before (@dang
             | will hopefully correct me if I'm wrong) that a) it's OK to
             | try submitting a second time maybe after a day or so (but
             | not keep submitting over and over) or b) send the mods an
             | email with a brief reason why it's a link that should
             | interest HN readers for it to be potentially added to a
             | "second chance pool". Though in the case of this link,
              | between the three of you it was posted two days ago, one
              | day ago, and today, which has finally got a bit more
              | notice, so it worked out alright in the end :)
        
       ___________________________________________________________________
       (page generated 2023-05-05 23:00 UTC)