Prompt Engineering vs. Blind Prompting

rw-book-cover

Metadata

  • Author: Mitchell Hashimoto
  • Full Title: Prompt Engineering vs. Blind Prompting
  • Document Note: This document explains the difference between “prompt engineering” and “blind prompting” and how to effectively extract information from language models using prompts to solve real-world problems. A demonstration set is used to measure prompt accuracy by using inputs and expected outputs to determine whether the prompts generated through prompt engineering can reliably solve a specific problem for an application. The document emphasizes that prompt engineering is a real skill that can be developed based on real experimental methodologies and informs decision-making in building prompts based on cost vs. accuracy analysis, tokens required, and accuracy presented. The process allows for trust but verification and continuous improvement to increase accuracy.
  • URL: https://mitchellh.com/writing/prompt-engineering-vs-blind-prompting

Highlights

  • A lot of people who claim to be doing prompt engineering today are actually just blind prompting.1 “Blind Prompting” is a term I am using to describe the method of creating prompts with a crude trial-and-error approach paired with minimal or no testing and a very surface level knowedge of prompting. Blind prompting is not prompt engineering. (View Highlight)
  • This table shows prompt candidates on the Y-axis, and prompt types using those prompts on the X-axis. The value is the accuracy as a percentage of correct answers. (View Highlight)
  • A note for experienced prompters: the few-shot example doesn’t say something like “mimic the examples below.” Experimental research has shown this doesn’t reliably increase accuracy, so I like to test without it first to limit tokens. Second, the few-shot example exemplars don’t ever show the “MM/DD” extraction as an example, which is poor form. In a real few-shot setting, demonstrating all styles of extraction can be important (View Highlight)
  • Finally, you choose a one of the prompt candidates to integrate into your application. This isn’t necessarily the most accurate prompt. This is a cost vs. accuracy analysis based on model used, tokens required, and accuracy presented. (View Highlight)
  • decide to go back and test other approaches to increase accuracy. For (View Highlight)
  • lower cost model to see if that improves the accuracy enough. Sometimes, using more tokens on a lower cost model will save significant money vs. low-token usage on a higher cost model. For example, GPT-4 is ~15x more expensive than GPT-3.5 today. That means that you effectively have 15x more token budget to increase the GPT-3.5 prompt accuracy (caveats around rate limits noted). (View Highlight)
  • Due to the probabilistic nature of generative AI, your prompt likely has some issues. Even if your accuracy on your test set is 100%, there are probably unknown inputs that produce incorrect outputs. (View Highlight)
  • Verification is highly dependent on the problem. (View Highlight)
  • This blog post demonstrates how developing a prompt can — in my opinion — be an engineering practice. It describes a systematic approach to identifying a problem, forming solutions, validating those solutions, and applying continuous improvement to refine those solutions. (View Highlight)
  • Additionally, everyone is rapidly moving to higher-order LLM integrations: prompt chaining, agents, etc. Some people argue that future innovations such as these and more will make human prompting obsolete. (View Highlight)