Optimizing In-Context Learning: Golden Examples for LLMs
Large Language Models (LLMs) have become pivotal in various applications, and a key technique for guiding their behavior without extensive retraining is In-Context Learning (ICL). This method involves providing an LLM with examples of input-output pairs, allowing it to infer the desired pattern or format before tackling a new task. Strategies range from “one-shot” (a single example) to “few-shot” (multiple examples) and “chain-of-thought” (demonstrating step-by-step reasoning).
Consider a simple query: asking an LLM, “What animal makes the sound ‘moo’ and what is its type?” Without guidance, an LLM like ChatGPT might provide a verbose answer, including extraneous details about other animal types. However, by prefacing the query with examples like “User: What animal makes the ‘woof’ sound and what is its type? Assistant: Dog, mammal” and “User: What animal makes the ‘meow’ sound and what is its type? Assistant: Cat, mammal,” the LLM learns to produce the concise, desired format: “Cow, mammal.” This demonstrates ICL’s power in steering an LLM’s output without the resource-intensive process of fine-tuning the model itself.
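To make this concrete, here is a minimal sketch of how such a few-shot prompt could be assembled with the OpenAI Python client (the v1 chat-completions interface); the model name and exact wording are illustrative assumptions, not part of the original example.

```python
# A minimal few-shot prompt for the "moo" query, assuming the OpenAI Python
# client (v1 chat.completions API); the model name is an illustrative choice.
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "user", "content": "What animal makes the 'woof' sound and what is its type?"},
    {"role": "assistant", "content": "Dog, mammal"},
    {"role": "user", "content": "What animal makes the 'meow' sound and what is its type?"},
    {"role": "assistant", "content": "Cat, mammal"},
    # The real query: the two exchanges above establish the terse "Animal, type" format.
    {"role": "user", "content": "What animal makes the sound 'moo' and what is its type?"},
]

response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)  # expected: "Cow, mammal"
```

The two in-context exchanges do all the steering: no fine-tuning is needed to obtain the terse “Animal, type” format.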
While ICL is highly effective in boosting LLM performance and accuracy, it suffers from a significant drawback: its fragility. The success of ICL is remarkably sensitive to the specific examples chosen, their order, and even minor formatting changes. This is because ICL operates more through superficial pattern matching than true conceptual understanding. For complex tasks like code repair or converting natural language to SQL queries, a slight alteration in example selection can drastically impact accuracy. The core challenge, then, becomes: how does one systematically select examples that genuinely aid the LLM, rather than just any examples?
Addressing this critical issue, Google DeepMind’s research paper, “AuPair: Golden Example Pairs for Code Repair,” introduces a systematic approach to example selection, specifically for fixing buggy code. Unlike traditional methods that rely on random selection or similarity searches from a pre-curated pool, AuPair focuses on generating and extracting the most effective “golden” example pairs.
AuPair’s methodology unfolds in two distinct phases. The first phase, Example Pair Generation, aims to create a vast collection of candidate “repair pairs.” It begins with a dataset of coding problems that include test cases. An LLM is prompted to generate an initial solution (a “guess”). If this guess is partially correct, it’s used as a starting point. The crucial step involves asking the LLM to fix this broken code, using a few-shot prompt informed by a small set of existing, randomly selected repair pairs. If the LLM-generated fix improves upon the original guess, this “guess → fix” becomes a candidate pair. Ingeniously, if the fix is still imperfect, it is fed back into the process as a new piece of “broken” code, creating chains of incremental improvements. This iterative, self-improving loop is repeated thousands of times, building a rich pool of candidate pairs covering diverse bug types and their solutions.
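In rough Python, the generation loop might look like the sketch below. The helpers `llm_fix` and `score` stand in for an LLM call and a test harness, and `problems` is assumed to already contain the partially correct initial guesses together with their tests; none of these names come from the paper.

```python
import random

def generate_candidate_pairs(problems, llm_fix, score, n_iterations=5000, k_shot=4):
    """Sketch of the pair-generation loop (phase 1).

    `problems` holds (broken_code, tests) items, seeded from the LLM's partially
    correct initial guesses; `llm_fix(code, examples)` asks the LLM for a repair
    given few-shot example pairs; `score(code, tests)` returns the fraction of
    tests passed. These names are assumptions, not the paper's interfaces.
    """
    candidate_pairs = []
    for _ in range(n_iterations):
        broken_code, tests = random.choice(problems)
        # Few-shot prompt informed by a small random sample of existing pairs.
        examples = random.sample(candidate_pairs, min(k_shot, len(candidate_pairs)))
        fixed_code = llm_fix(broken_code, examples)
        if score(fixed_code, tests) > score(broken_code, tests):
            candidate_pairs.append((broken_code, fixed_code))
            # An imperfect fix is recycled as new "broken" code, producing
            # chains of incremental improvements.
            if score(fixed_code, tests) < 1.0:
                problems.append((fixed_code, tests))
    return candidate_pairs
```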
The second phase, Golden Pair Extraction, focuses on identifying the most impactful pairs from this generated pool. This involves a two-step process: measuring effectiveness and then applying a greedy selection algorithm. To measure effectiveness, AuPair creates a validation dataset of broken code problems. Each candidate repair pair is then used as a one-shot example to generate a fix for every problem in the validation set. The resulting fix is run against the problem’s unit tests, yielding a score. This process generates a comprehensive “quality matrix,” mapping how well each candidate pair helps solve various validation problems. With effectiveness quantified, a greedy algorithm selects the “golden” pairs. It picks the candidate pair that yields the highest average score across all validation problems. Crucially, the contribution of this selected pair is then “subtracted” from all remaining pairs in the matrix. This ensures that subsequent selections are complementary, preventing redundancy and prioritizing pairs that address different types of problems or offer unique insights. This iterative selection continues until the marginal improvement falls below a set threshold, resulting in an ordered list of highly effective, non-redundant golden pairs.
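The extraction step can be sketched compactly once the quality matrix is in hand (one row per candidate pair, one column per validation problem). The subtraction and stopping rule below follow the description above; the exact details are an assumption, not the paper’s released implementation.

```python
import numpy as np

def extract_golden_pairs(quality, threshold=0.01):
    """Sketch of the greedy extraction step (phase 2).

    `quality[i, j]` is the score achieved on validation problem j when candidate
    pair i is used as the one-shot repair example. Each round picks the pair with
    the highest mean remaining score, then subtracts its contribution so later
    picks must help on problems the chosen pairs do not already cover.
    """
    remaining = quality.astype(float).copy()
    golden = []
    while True:
        mean_scores = remaining.mean(axis=1)
        best = int(mean_scores.argmax())
        if mean_scores[best] < threshold:  # marginal gain too small: stop
            break
        golden.append(best)
        # Remove the chosen pair's contribution from every remaining candidate.
        remaining = np.maximum(remaining - remaining[best], 0.0)
    return golden  # ordered indices of golden pairs in the candidate pool
```

Because each selected pair’s contribution is removed before the next pick, the resulting list is ordered and complementary: a pair that only helps on problems already covered scores near zero and is never chosen.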
AuPair’s efficacy was rigorously tested across seven coding problem datasets using five different LLMs. It consistently outperformed alternative approaches such as self-reflection and best-of-N sampling on these benchmarks. A notable finding was AuPair’s superior computational efficiency: just 12 golden pairs matched the performance of 32 randomly selected pairs, a roughly 2-3x improvement. Furthermore, golden pairs generated on one dataset (CodeForces) performed strongly on entirely different ones (HackerEarth and AtCoder), underscoring their transferability within the same domain.
Despite its promising results, AuPair has limitations. The initial generation of candidate example pairs, with its iterative repair process and numerous LLM calls, demands substantial computational resources. Moreover, the method heavily relies on quantifiable evaluation metrics, such as unit tests for code, which may not be readily available in all domains. It also assumes that complementary examples will consistently lead to better performance, an assumption that, while valid for coding problems, might not hold universally. Finally, AuPair was benchmarked against structured contest problems, leaving its performance on more complex, real-world codebases an open question.
Nevertheless, AuPair represents a significant leap forward in optimizing in-context learning for specific domains. By moving beyond arbitrary example selection towards a systematic approach for identifying truly effective patterns, it enhances LLM performance and efficiency. While it necessitates a considerable upfront computational investment and domain-specific evaluation metrics, the proven transferability of its “golden pairs” across datasets suggests a worthwhile return. This research paves the way for applying similar intelligent example selection techniques to other complex tasks, such as text-to-SQL generation, where the effectiveness of examples can be systematically measured and curated.