GuessingGame: Measuring Informativeness of Open-Ended Questions in Large Language Models
Open-ended questions are the lifeblood of large language models (LLMs) in real-world use, from creative writing to complex reasoning tasks. Yet not all questions are equally informative. Some elicit broad, generic responses, while others draw out concise, content-rich explanations that reveal both the model's understanding and the user's intent. The GuessingGame approach offers a principled way to quantify how informative a question is: it measures how much uncertainty about the expected content is resolved once the model answers.
What is GuessingGame?
At its core, GuessingGame treats an open-ended prompt as a probe into a topic. Before the prompt is answered, you have a prior belief about the likely content that should emerge. After the model responds, you update that belief based on the output. The informativeness of the question is then measured by the information gain—the degree to which the answer narrows the space of plausible content.
Informativeness is not about right or wrong; it's about how much the answer shifts our understanding of what could be true about the topic.
Operationally, GuessingGame combines a prompt, a model-generated answer, and a secondary mechanism (a “guesser”) that tries to infer the intended content or the key facts the prompt sought to evoke. By comparing the prior and posterior distributions over content, we derive a structured metric that captures the prompt’s diagnostic power.
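As a concrete illustration, the sketch below treats the candidate content as a small discrete set of labels and measures informativeness as the drop in Shannon entropy from the prior to the posterior. The labels and probabilities are hypothetical placeholders, not part of any specific GuessingGame implementation.

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a distribution given as {label: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def information_gain(prior, posterior):
    """Information gain = reduction in entropy from prior to posterior beliefs
    over candidate content (both dicts mapping content labels to probabilities)."""
    return entropy(prior) - entropy(posterior)

# Hypothetical example: before the answer, four candidate "key facts" are equally likely;
# after reading the answer, the guesser concentrates its belief on one of them.
prior = {"co2_dissolution": 0.25, "buffering": 0.25, "runoff": 0.25, "warming": 0.25}
posterior = {"co2_dissolution": 0.70, "buffering": 0.20, "runoff": 0.05, "warming": 0.05}
print(f"Information gain: {information_gain(prior, posterior):.2f} bits")  # ~0.74 bits
```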
How to implement GuessingGame
- Curate a diverse prompt set: gather open-ended questions across domains (science, policy, ethics, fiction) with varying expected specificity. Ensure prompts vary in length, scope, and implied ambiguity.
- Run responses: feed prompts to the target LLM and collect the generated answers. Keep metadata that may influence informativeness, such as prompt length or presence of constraints.
- Deploy a guesser model: use a second model (or the same one with different prompting) to generate a set of candidate content components that could plausibly answer the prompt. This creates a posterior distribution over possible content.
- Compute information gain: quantify how much the posterior reduces uncertainty relative to the prior. Common choices include mutual information or expected information gain (EIG) over content categories or fact sets; a minimal pipeline sketch follows this list.
- Calibrate against human judgments: have human evaluators rate the usefulness and accuracy of the inferred content, so the automated metric stays aligned with what people actually find informative.
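Putting the steps together, here is a minimal sketch of a scoring loop. The `target_model` and `guesser` callables are hypothetical stand-ins for whatever model clients you use; the sketch assumes the guesser returns a single content label per call and reuses the `information_gain` helper from the sketch above.

```python
from collections import Counter

def normalize(counts):
    """Turn raw guess counts into a probability distribution over content labels."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def score_prompt(prompt, target_model, guesser, n_samples=20):
    """Estimate a prompt's informativeness.

    target_model(prompt) -> answer text
    guesser(prompt, answer) -> a single content label; pass answer=None for the prior
    Both are hypothetical callables to be wired up to your own model clients.
    """
    answer = target_model(prompt)

    # Prior: what the guesser expects from the prompt alone.
    prior = normalize(Counter(guesser(prompt, answer=None) for _ in range(n_samples)))
    # Posterior: what the guesser infers once it has seen the model's answer.
    posterior = normalize(Counter(guesser(prompt, answer=answer) for _ in range(n_samples)))

    return information_gain(prior, posterior)  # helper defined in the earlier sketch
```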
Metrics you can use
- Mutual information (MI) between the prompt and the inferred content. Higher MI indicates that the prompt elicits content carrying a clearer, more predictable signal.
- Top-k guess accuracy for the content components the model should reveal. This tracks whether the guesser’s leading options include the ground-truth content.
- Information gain per token, which normalizes informativeness by response length and guards against artificially long but vague answers (see the sketch after this list).
- Calibration error comparing predicted confidence with human-rated informativeness. Well-calibrated prompts show alignment between model certainty and the actual utility of the content.
- Content diversity score to ensure prompts don’t overfit to a narrow set of facts; a truly informative prompt often yields varied, multi-faceted information.
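Two of these metrics are simple enough to show directly. The snippet below assumes the guesser posterior is the label-to-probability dict from the earlier sketches, and it uses whitespace splitting as a rough stand-in for a real tokenizer.

```python
def information_gain_per_token(info_gain_bits, answer_text):
    """Normalize information gain by answer length to penalize long but vague responses."""
    n_tokens = len(answer_text.split())  # crude whitespace tokenization, for illustration only
    return info_gain_bits / max(n_tokens, 1)

def top_k_accuracy(posterior, ground_truth_labels, k=3):
    """Fraction of ground-truth content labels found among the guesser's top-k guesses."""
    top_k = sorted(posterior, key=posterior.get, reverse=True)[:k]
    return sum(label in top_k for label in ground_truth_labels) / len(ground_truth_labels)
```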
Practical considerations and pitfalls
There are several caveats to keep in mind. First, informativeness is contextual: a prompt may be highly informative for a domain expert but less so for a lay audience. Second, leakage or prompt-priming effects can inflate apparent informativeness if the model’s response mirrors the guesser’s training data. Third, human evaluation remains essential to guard against metrics that reward style or verbosity over substantive content.
Designing better prompts with GuessingGame
When used in prompt design, GuessingGame helps you identify which prompts reliably extract precise, actionable knowledge. If a prompt consistently yields high information gain with concise responses, it’s a strong candidate for deployment in production workflows. Conversely, prompts with low information gain signal a need for rewording, added constraints, or a shift to a more focused prompt type.
Example scenario
Consider a prompt: “Explain the major factors contributing to ocean acidification in a way a non-expert can understand.” The LLM returns a structured explanation with several factors and a brief mechanism for how CO2 impacts seawater chemistry. A guesser then attempts to predict the key content the prompt aimed to evoke—factors such as CO2 dissolution, bicarbonate buffering, and ecosystem impacts. If the guesser’s top predictions align with the model’s content and the inferred content significantly narrows the possible set of factors, the information gain is high. If the answer is broad and repetitive, the guesser may still capture some factors, but the overall information gain will be lower, signaling a less informative prompt.
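To make that concrete, here is how the top-k check from the metrics sketch might look for this prompt. The posterior probabilities and label names are invented for illustration.

```python
# Hypothetical guesser posterior for the ocean-acidification prompt.
posterior = {
    "co2_dissolution": 0.45,
    "bicarbonate_buffering": 0.30,
    "ecosystem_impacts": 0.15,
    "agricultural_runoff": 0.10,
}
ground_truth = ["co2_dissolution", "bicarbonate_buffering", "ecosystem_impacts"]
print(top_k_accuracy(posterior, ground_truth, k=3))  # 1.0: all key factors recovered
```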
Why GuessingGame matters in practice
For teams building and evaluating LLM-based assistants, GuessingGame provides a transparent, quantitative lens on prompt design. It helps distinguish prompts that drive rich, content-rich outputs from those that merely produce surface-level text. In safety-sensitive or knowledge-critical applications, prioritizing high-informativeness prompts can improve reliability and reduce ambiguity in model behavior.
Final thoughts
In the evolving landscape of large language models, understanding what makes a question informative is as important as the answers themselves. GuessingGame offers a disciplined approach to measuring informativeness, aligning prompt engineering with measurable outcomes. As researchers and practitioners adopt this framework, we’ll gain sharper tools for crafting prompts that consistently unlock useful, trustworthy content from AI systems.