Taking the same prompt as input, how do different models (and agent scaffolds) perform on the same tasks? This is a collection of results from prompts I like testing new models with.