Post #9wNyHP9cqmXa26yaZs by tfidf@fosstodon.org
Post #9wMsEU1C0Zsw2Y8sKm by djoerd@idf.social
2020-06-23T07:45:39Z
0 likes, 0 repeats
Good to read Sakai's reply to Fuhr's Guidelines for Information Retrieval Evaluation: http://sigir.org/wp-content/uploads/2020/06/p14.pdf
Post #9wN1jQJoiElL89gKmG by tfidf@fosstodon.org
2020-06-23T09:32:06Z
0 likes, 0 repeats
@djoerd Sakai made some good points. But. MAP's user model was reverse-engineered decades after its inception. And what user model would actually consider the difference between ranks 1 and 2 and the difference between rank 2 and infinity the same? Maybe we as an IR community should identify classes of user models (e.g. "adhoc", "automated") and identify the best known measure for each class. Otherwise papers will be tempted to cherry-pick the measure that best shows the advantage of the paper's contribution.
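(A minimal Python sketch of average precision under binary relevance, to make the rank-difference point concrete; the function name and toy rankings are illustrative, not from the thread:)

    # Average precision (AP) for one query, binary relevance.
    # 'ranking' lists 0/1 relevance judgments in rank order;
    # 'num_relevant' is the total number of relevant documents.
    def average_precision(ranking, num_relevant):
        hits = 0
        precision_sum = 0.0
        for rank, rel in enumerate(ranking, start=1):
            if rel:
                hits += 1
                precision_sum += hits / rank
        return precision_sum / num_relevant if num_relevant else 0.0

    # With a single relevant document, AP is 1.0 at rank 1, 0.5 at
    # rank 2, and tends to 0 deeper down: the drop from rank 1 to 2
    # (0.5) equals the drop from rank 2 to "infinity" (0.5), which is
    # the equal-difference oddity raised above.
    print(average_precision([1, 0, 0], 1))  # 1.0
    print(average_precision([0, 1, 0], 1))  # 0.5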
Post #9wNONYuOOPAgLYKnBo by arjen@idf.social
2020-06-23T12:39:40Z
0 likes, 0 repeats
@tfidf @djoerd well, most researchers do! We report nDCG@20 as a model of first-result-page quality, MAP as a model averaging over all users, and P@5 as a model of early precision. Each has its pros and cons; that is why you should not report just one. And it should match the use case too!
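(For reference, minimal sketches of two of the measures named here, P@k and nDCG@k; the log2(rank + 1) discount is the common choice but other gain/discount variants exist, and the example ranking is made up:)

    import math

    # Precision at cutoff k over 0/1 relevance judgments in rank order.
    def precision_at_k(ranking, k):
        return sum(ranking[:k]) / k

    # nDCG at cutoff k; 'gains' holds (possibly graded) relevance per rank.
    def ndcg_at_k(gains, k):
        def dcg(g):
            return sum(rel / math.log2(rank + 1)
                       for rank, rel in enumerate(g[:k], start=1))
        ideal = dcg(sorted(gains, reverse=True))
        return dcg(gains) / ideal if ideal else 0.0

    ranking = [1, 0, 1, 1, 0, 0, 1]
    print(precision_at_k(ranking, 5))  # 0.6
    print(ndcg_at_k(ranking, 5))       # ~0.75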
Post #9wNONdNxkPCYEQCSps by djoerd@idf.social
2020-06-23T13:45:49Z
0 likes, 0 repeats
@arjen @tfidf I personally like MAP a lot as a measure (assuming binary relevance). According to Buckley & Voorhees, "Average Precision seems to be a reasonably stable and discriminating choice." http://www.sigir.org/wp-content/uploads/2017/06/p235.pdf
Post #9wNOfqnWcTTt2dqJvM by djoerd@idf.social
2020-06-23T13:49:10Z
0 likes, 0 repeats
@arjen @tfidf I think for MAP, the difference between ranks 1 and 2 is usually smaller than the difference between rank 2 and infinity (assuming there are multiple relevant documents). If there is only one relevant document, MAP equals MRR, and yes, then the difference between ranks 1 and 2 is big.
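(A quick check of that equivalence: with exactly one relevant document, AP reduces to the reciprocal rank of that document, so MAP and MRR coincide; toy example, not from the thread:)

    # Reciprocal rank: 1/rank of the first relevant document.
    def reciprocal_rank(ranking):
        for rank, rel in enumerate(ranking, start=1):
            if rel:
                return 1.0 / rank
        return 0.0

    # One relevant document at rank 3: RR = 1/3, and AP with a single
    # relevant document is (1/3) / 1 = 1/3 as well.
    print(reciprocal_rank([0, 0, 1, 0]))  # 0.333...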
Post #9wNhHveKpMfcGgqp96 by tfidf@fosstodon.org
2020-06-23T17:17:43Z
0 likes, 0 repeats
@djoerd @arjen I'm a bit behind on my paper reading, but there has been at least one paper showing that ad-hoc users can adapt well when retrieval effectiveness deteriorates. So ranks 1 and 2 might not make a big difference if the snippets are good, while ranks 2 and infinity certainly do.
Post #9wNptzoqZZeLpyQld2 by djoerd@idf.social
2020-06-23T18:54:15Z
0 likes, 0 repeats
@tfidf @arjen I see your point. What measure would capture that better?
Post #9wNq2fc14T3ZlyZ7K4 by arjen@idf.social
2020-06-23T18:55:49Z
0 likes, 0 repeats
@djoerd @tfidf Maybe RBP captures it better? P@20 together with MAP is pretty good too, no?
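(A minimal sketch of RBP per Moffat & Zobel's user model: the user inspects rank i with probability p**(i - 1), where the persistence p is a free parameter one has to pick; p = 0.8 below is just a conventional example value:)

    # Rank-biased precision (RBP): after each result the modeled user
    # continues with persistence p, so rank i gets weight (1-p) * p**(i-1).
    def rbp(ranking, p=0.8):
        return (1 - p) * sum(rel * p ** (rank - 1)
                             for rank, rel in enumerate(ranking, start=1))

    # An impatient user (low p) concentrates weight near the top;
    # a patient user (high p) still credits deep relevant documents.
    print(rbp([1, 0, 1, 0, 0], p=0.5))  # 0.5 * (1 + 0.25) = 0.625
    print(rbp([1, 0, 1, 0, 0], p=0.95))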
Post #9wNqeBEiSKUGZHYLw0 by djoerd@idf.social
2020-06-23T19:02:36Z
0 likes, 0 repeats
@arjen @tfidf Rank-Biased Precision by Moffat & Zobel, right? (Now I also need to catch up on my reading)
Post #9wNyHP9cqmXa26yaZs by tfidf@fosstodon.org
2020-06-23T20:28:08Z
0 likes, 0 repeats
@djoerd @arjen how about nDCG@6? We can make it 10. But 20 is way too much in my opinion, if we want to measure just the first result page from a user's perspective.
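(To see how much the cutoff matters, a toy comparison; same nDCG sketch as in the earlier post, repeated so it runs standalone, and the 20-deep ranking is invented: a run with extra relevant documents only below rank 10 scores the same at k = 6 and k = 10 but noticeably higher at k = 20:)

    import math

    # Same nDCG sketch as above, restated for a standalone run.
    def ndcg_at_k(gains, k):
        def dcg(g):
            return sum(rel / math.log2(rank + 1)
                       for rank, rel in enumerate(g[:k], start=1))
        ideal = dcg(sorted(gains, reverse=True))
        return dcg(gains) / ideal if ideal else 0.0

    # Invented 20-deep ranking: relevant at ranks 1, 2, 4, 15, and 18.
    ranking = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
    for k in (6, 10, 20):
        print(k, round(ndcg_at_k(ranking, k), 3))  # the cutoff changes the score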