Post #9wNyHP9cqmXa26yaZs by tfidf@fosstodon.org
Post #9wMsEU1C0Zsw2Y8sKm by djoerd@idf.social
2020-06-23T07:45:39Z
0 likes, 0 repeats
Good to read Sakai's reply to Fuhr's Guidelines for Information Retrieval Evaluation: http://sigir.org/wp-content/uploads/2020/06/p14.pdf
Post #9wN1jQJoiElL89gKmG by tfidf@fosstodon.org
2020-06-23T09:32:06Z
0 likes, 0 repeats
@djoerd Sakai made some good points. But. MAP's user model was reverse-engineered decades after its inception. And what user model would actually consider the difference between ranks 1 and 2 and the difference between rank 2 and infinity the same? Maybe we as an IR community should identify classes of user models (e.g. "adhoc", "automated") and identify the best known measure for each class. Otherwise papers will be tempted to cherry-pick the measure that best shows the advantage of the paper's contribution.
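(A minimal Python sketch of average precision under binary relevance, to make the rank-difference point concrete; the function name and toy rankings are illustrative, not from the thread:)

    # Average precision (AP) for one query, binary relevance.
    # 'ranking' lists 0/1 relevance judgments in rank order;
    # 'num_relevant' is the total number of relevant documents.
    def average_precision(ranking, num_relevant):
        hits = 0
        precision_sum = 0.0
        for rank, rel in enumerate(ranking, start=1):
            if rel:
                hits += 1
                precision_sum += hits / rank
        return precision_sum / num_relevant if num_relevant else 0.0

    # With a single relevant document, AP is 1.0 at rank 1, 0.5 at
    # rank 2, and tends to 0 deeper down: the drop from rank 1 to 2
    # (0.5) equals the drop from rank 2 to "infinity" (0.5), which is
    # the equal-difference oddity raised above.
    print(average_precision([1, 0, 0], 1))  # 1.0
    print(average_precision([0, 1, 0], 1))  # 0.5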
Post #9wNONYuOOPAgLYKnBo by arjen@idf.social
2020-06-23T12:39:40Z
0 likes, 0 repeats
@tfidf @djoerd well, most researchers do! We report nDCG@20 as a model of first-result-page quality, MAP as a model averaging over all users, and P@5 as a model of early precision. Each has its pros and cons; that is why you should not report just one. And it should match the use case too!
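(For reference, minimal sketches of two of the measures named here, P@k and nDCG@k; the log2(rank + 1) discount is the common choice but other gain/discount variants exist, and the example ranking is made up:)

    import math

    # Precision at cutoff k over 0/1 relevance judgments in rank order.
    def precision_at_k(ranking, k):
        return sum(ranking[:k]) / k

    # nDCG at cutoff k; 'gains' holds (possibly graded) relevance per rank.
    def ndcg_at_k(gains, k):
        def dcg(g):
            return sum(rel / math.log2(rank + 1)
                       for rank, rel in enumerate(g[:k], start=1))
        ideal = dcg(sorted(gains, reverse=True))
        return dcg(gains) / ideal if ideal else 0.0

    ranking = [1, 0, 1, 1, 0, 0, 1]
    print(precision_at_k(ranking, 5))  # 0.6
    print(ndcg_at_k(ranking, 5))       # ~0.75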
Post #9wNONdNxkPCYEQCSps by djoerd@idf.social
2020-06-23T13:45:49Z
0 likes, 0 repeats
@arjen @tfidf I personally like MAP a lot as a measure (assuming binary relevance). According to Buckley & Voorhees, "Average Precision seems to be a reasonably stable and discriminating choice." http://www.sigir.org/wp-content/uploads/2017/06/p235.pdf
Post #9wNOfqnWcTTt2dqJvM by djoerd@idf.social
2020-06-23T13:49:10Z
0 likes, 0 repeats
@arjen @tfidf I think for MAP, the difference between ranks 1 and 2 is usually smaller than the difference between rank 2 and infinity (assuming there are multiple relevant documents). If there is only one relevant document, MAP equals MRR, and yes, then the difference between ranks 1 and 2 is big.
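(A quick check of that equivalence: with exactly one relevant document, AP reduces to the reciprocal rank of that document, so MAP and MRR coincide; toy example, not from the thread:)

    # Reciprocal rank: 1/rank of the first relevant document.
    def reciprocal_rank(ranking):
        for rank, rel in enumerate(ranking, start=1):
            if rel:
                return 1.0 / rank
        return 0.0

    # One relevant document at rank 3: RR = 1/3, and AP with a single
    # relevant document is (1/3) / 1 = 1/3 as well.
    print(reciprocal_rank([0, 0, 1, 0]))  # 0.333...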
Post #9wNhHveKpMfcGgqp96 by tfidf@fosstodon.org
2020-06-23T17:17:43Z
0 likes, 0 repeats
@djoerd @arjen I'm a bit behind on my paper reading, but there has been at least one paper showing that ad-hoc users can adapt well when retrieval effectiveness deteriorates. So ranks 1 and 2 might not make a big difference if the snippets are good, while ranks 2 and infinity certainly do.
Post #9wNptzoqZZeLpyQld2 by djoerd@idf.social
2020-06-23T18:54:15Z
0 likes, 0 repeats
@tfidf @arjen I see your point. What measure would capture that better?
Post #9wNq2fc14T3ZlyZ7K4 by arjen@idf.social
2020-06-23T18:55:49Z
0 likes, 0 repeats
@djoerd @tfidf Maybe RBP captures it better? P@20 together with MAP is pretty good too, no?
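(A minimal sketch of RBP per Moffat & Zobel's user model: the user inspects rank i with probability p**(i - 1), where the persistence p is a free parameter one has to pick; p = 0.8 below is just a conventional example value:)

    # Rank-biased precision (RBP): after each result the modeled user
    # continues with persistence p, so rank i gets weight (1-p) * p**(i-1).
    def rbp(ranking, p=0.8):
        return (1 - p) * sum(rel * p ** (rank - 1)
                             for rank, rel in enumerate(ranking, start=1))

    # An impatient user (low p) concentrates weight near the top;
    # a patient user (high p) still credits deep relevant documents.
    print(rbp([1, 0, 1, 0, 0], p=0.5))  # 0.5 * (1 + 0.25) = 0.625
    print(rbp([1, 0, 1, 0, 0], p=0.95))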
Post #9wNqeBEiSKUGZHYLw0 by djoerd@idf.social
2020-06-23T19:02:36Z
0 likes, 0 repeats
@arjen @tfidf Rank-Biased Precision by Moffat & Zobel, right? (Now I also need to catch up on my reading)
Post #9wNyHP9cqmXa26yaZs by tfidf@fosstodon.org
2020-06-23T20:28:08Z
0 likes, 0 repeats
@djoerd @arjen how about nDCG@6? We can make it 10. But 20 is way too much in my opinion, if we want to measure just the first result page from a user's perspective.
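(To see how much the cutoff matters, a toy comparison; same nDCG sketch as in the earlier post, repeated so it runs standalone, and the 20-deep ranking is invented: a run with extra relevant documents only below rank 10 scores the same at k = 6 and k = 10 but noticeably higher at k = 20:)

    import math

    # Same nDCG sketch as above, restated for a standalone run.
    def ndcg_at_k(gains, k):
        def dcg(g):
            return sum(rel / math.log2(rank + 1)
                       for rank, rel in enumerate(g[:k], start=1))
        ideal = dcg(sorted(gains, reverse=True))
        return dcg(gains) / ideal if ideal else 0.0

    # Invented 20-deep ranking: relevant at ranks 1, 2, 4, 15, and 18.
    ranking = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
    for k in (6, 10, 20):
        print(k, round(ndcg_at_k(ranking, k), 3))  # the cutoff changes the score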