https://benchmarks.llmonitor.com

Asking 60+ LLMs a set of 20 questions

Benchmarks like HellaSwag are too abstract for me to get a sense of how well models perform in real-world workflows. So I wrote a script that runs prompts testing basic reasoning, instruction following, and creativity against around 60 models I could get my hands on through inference APIs. The script stored all the answers in a SQLite database, and those raw results are what you can browse here, by prompt or by model.

reflection:

* Argue for and against the use of kubernetes in the style of a haiku.
* Give two concise bullet-point arguments against the Munchhausen trilemma (don't explain what it is)
* I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. I also gave 3 bananas to my brother. How many apples did I remain with? Let's think step by step.
* Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?
* Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have? Let's think step by step.

knowledge:

* Explain in a short paragraph quantum field theory to a high-school student.
* Is Taiwan an independent country?
* Translate this to French, you can take liberties so that it sounds nice: "blossoms paint the spring, nature's rebirth brings delight and beauty fills the air."

code:

* Explain simply what this function does:

```
def func(lst):
    if len(lst) == 0:
        return []
    if len(lst) == 1:
        return [lst]
    l = []
    for i in range(len(lst)):
        x = lst[i]
        remLst = lst[:i] + lst[i+1:]
        for p in func(remLst):
            l.append([x] + p)
    return l
```

* Explain the bug in the following code:

```
from time import sleep
from multiprocessing.pool import ThreadPool

def task():
    sleep(1)
    return 'all done'

if __name__ == '__main__':
    with ThreadPool() as pool:
        result = pool.apply_async(task())
        value = result.get()
        print(value)
```

* Write a Python function that prints the next 20 leap years. Reply with only the function.
* Write a Python function to find the nth number in the Fibonacci Sequence.

instruct:

* Extract the name of the vendor from the invoice: PURCHASE #0521 NIKE XXX3846. Reply with only the name.
* Help me find out if this customer review is more "positive" or "negative". Q: This movie was watchable but had terrible acting. A: negative Q: The staff really left us our privacy, we'll be back. A:
* What are the 5 planets closest to the sun? Reply with only a valid JSON array of objects formatted like this:

```
[{
  "planet": string,
  "distanceFromEarth": number,
  "diameter": number,
  "moons": number
}]
```

creativity:

* Give me the SVG code for a smiley. It should be simple. Reply with only the valid SVG code and nothing else.
* Tell a joke about going on vacation.
* Write a 12-bar blues chord progression in the key of E
* Write me a product description for a 100W wireless fast charger for my website.

Notes

* I used a temperature of 0 and a max token limit of 240 for each test (that's why a lot of answers are cropped). The rest are default settings.
* I made this with a mix of APIs from OpenRouter, TogetherAI, OpenAI, Cohere, Aleph Alpha & AI21.
* This is imperfect. I want to improve it with better stop sequences and prompt formatting tailored to each model, but hopefully it already makes picking models a bit easier. A minimal sketch of the kind of harness involved follows these notes.
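For the curious, here is a minimal sketch of what such a harness could look like. This is not the actual script: the model list is illustrative, and everything goes through the OpenAI client for brevity, whereas the real runs mixed several providers. Only the temperature of 0, the 240-token cap, and the SQLite storage come from the notes above.

```
# Minimal sketch of a prompt-benchmark harness. Hypothetical code, not
# the author's script: model IDs are placeholders and a single OpenAI
# client stands in for the mix of providers actually used.
import sqlite3

from openai import OpenAI

MODELS = ["gpt-3.5-turbo", "gpt-4"]  # placeholder model IDs
PROMPTS = [
    "Tell a joke about going on vacation.",
    "Is Taiwan an independent country?",
]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(model: str, prompt: str) -> str:
    """Send one prompt to one model with the settings from the notes."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # greedy-ish decoding, reproducible answers
        max_tokens=240,  # answers longer than this get cropped
    )
    return resp.choices[0].message.content or ""

def main() -> None:
    db = sqlite3.connect("results.db")
    db.execute(
        "CREATE TABLE IF NOT EXISTS answers (model TEXT, prompt TEXT, answer TEXT)"
    )
    for model in MODELS:
        for prompt in PROMPTS:
            db.execute(
                "INSERT INTO answers VALUES (?, ?, ?)",
                (model, prompt, ask(model, prompt)),
            )
    db.commit()
    db.close()

if __name__ == "__main__":
    main()
```

A real version would mostly differ in plumbing: one client per provider, retries, and the per-model prompt formatting that the notes call out as the current weak spot.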
* Ideas for the future: public votes to compute an Elo rating (see the sketch below), side-by-side comparison of two models, and community-submitted prompts (open to suggestions).
* Prompt suggestions, feedback, or just to say hi: vince [at] llmonitor.com
* Shameless plug: I'm building llmonitor, an open-source observability tool for AI devs.

Credit: @vincelwt
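On the Elo idea above: a public vote between two models maps naturally onto the standard Elo update. A minimal sketch, assuming nothing about how the site would actually implement it; the K-factor of 32 and the 1000-point starting rating are arbitrary defaults:

```
# Standard Elo update for one pairwise vote (winner beat loser).
# K controls how fast ratings move; 32 is a common default.
def elo_update(winner, loser, k=32.0):
    expected = 1 / (1 + 10 ** ((loser - winner) / 400))
    return winner + k * (1 - expected), loser - k * (1 - expected)

# Example: two models start at 1000 and one wins a vote.
a, b = elo_update(1000.0, 1000.0)
print(round(a), round(b))  # 1016 984
```

Folding every vote through this update and sorting by rating would give a community leaderboard.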