[Translation] Prompting Fundamentals and How to Apply Them Effectively

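Introduction
Mental model: Prompts as conditioning
Assigning roles and responsibilities
Structured input and output
Prefilling Claude's responses
n-shot prompting
Diving deeper into Chain-of-Thought
Splitting catch-all prompts into multiple smaller ones
Optimal placement of context
Crafting effective instructions
Dealing with hallucinations
Using stop sequences
Selecting temperature
What doesn't seem to matter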

Introduction

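Writing good prompts is the most straightforward way to get value out of large language models (LLMs). However, it's still important to understand the fundamentals even when we apply advanced techniques and prompt optimization tools. For example, there's more to Chain-of-Thought (CoT) than simply adding "think step by step". Here, we'll discuss some prompting fundamentals to help you get the most out of LLMs.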

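Aside: By now, we should know that before doing any major prompt engineering, we need reliable evals. Without evals, how would we measure improvements or regressions? Here's my usual workflow: (i) manually label ~100 eval examples, (ii) write an initial prompt, (iii) run the eval, and iterate on both the prompt and the evals, (iv) evaluate on a held-out test set before deployment. See the write-ups on practical evals for key tasks and on how to build evals with a case study.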
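
The workflow above can be sketched as a tiny evaluation loop. This is a minimal sketch under stated assumptions: `keyword_baseline` is a hypothetical stand-in for a prompt plus model call, and the labeled examples are invented for illustration.

```python
from typing import Callable

def run_eval(predict: Callable[[str], str], examples: list) -> float:
    """Return the fraction of labeled examples the predictor gets right."""
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)

# Hypothetical labeled examples; in practice, hand-label ~100 of them.
eval_set = [
    ("Great value for the price", "positive"),
    ("Keeps disconnecting", "negative"),
    ("Works as described", "positive"),
    ("The app needs improvement", "negative"),
    ("Disappointing performance", "negative"),
]

def keyword_baseline(text: str) -> str:
    """Stand-in for a prompt + LLM call (an assumption for this sketch)."""
    negative_cues = ("disconnect", "needs improvement", "poor")
    is_negative = any(cue in text.lower() for cue in negative_cues)
    return "negative" if is_negative else "positive"

accuracy = run_eval(keyword_baseline, eval_set)
print(f"accuracy = {accuracy:.2f}")
```

Swapping `keyword_baseline` for a function that sends each example through a prompt lets the same loop measure whether a prompt change is an improvement or a regression.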

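We'll use the Claude Messages API for the prompts and code examples below. The prompts are deliberately kept simple and can be optimized further. The API provides specific roles for the user and the assistant, as well as a system prompt.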

import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable
message = anthropic.Anthropic().messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=1024,
    system="Today is 26th May 2024.",
    messages=[
        {"role": "user", "content": "Hello there."},
        {"role": "assistant", "content": "Hi, I'm Claude. How can I help?"},
        {"role": "user", "content": "What is prompt engineering?"},
    ],
)

Mental model: Prompts as conditioning

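At the risk of oversimplifying, LLMs are essentially sophisticated probabilistic models. Given an input, they generate probable outputs based on patterns learned from data.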

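Thus, at its core, prompt engineering is about conditioning the probabilistic model to generate the output we want. Each additional instruction or piece of context can be viewed as conditioning that steers the model's generation in a particular direction. This mental model applies to image generation too.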

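Consider the prompts below. The first will likely generate a response about the tech company Apple. The second will describe the fruit. The third will explain the idiom.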

# Prompt 1
Tell me about: Apple

# Prompt 2
Tell me about: Apple fruit

# Prompt 3
Tell me about: Apple of my eye

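By simply adding a few tokens, we've conditioned the model to respond differently. By extension, prompt engineering techniques such as n-shot prompting, structured input and output, and CoT are just more sophisticated ways of conditioning the LLM.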
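
To make the conditioning idea concrete, here is a toy illustration. It is not how an LLM works internally: the observations are invented, and the "model" is just a conditional frequency table. It only shows that adding tokens to the prompt changes the distribution over likely responses.

```python
from collections import Counter

# Invented (prompt, topic) observations standing in for patterns learned from data.
observations = [
    ("Apple", "company"), ("Apple", "company"), ("Apple", "company"),
    ("Apple", "fruit"),
    ("Apple fruit", "fruit"), ("Apple fruit", "fruit"),
    ("Apple of my eye", "idiom"),
]

def topic_distribution(prompt: str) -> dict:
    """Estimate P(topic | prompt) from the toy observations."""
    counts = Counter(topic for seen, topic in observations if seen == prompt)
    total = sum(counts.values())
    return {topic: count / total for topic, count in counts.items()}

for prompt in ("Apple", "Apple fruit", "Apple of my eye"):
    print(prompt, "->", topic_distribution(prompt))
```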

Assigning roles and responsibilities

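One way to condition the model's output is to assign it a specific role or responsibility. This gives it context that steers its responses in terms of content, tone, style, and so on.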

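Consider the prompts below: because the assigned roles differ, we can expect very different responses. The preschool teacher will likely respond with simple language and analogies, while the NLP professor may dive into the technical details of the attention mechanism.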

# Prompt 1
You are a preschool teacher. Explain how attention in LLMs works.

# Prompt 2
You are an NLP professor. Explain how attention in LLMs works.

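Roles and responsibilities can also improve accuracy on most tasks. Imagine we're building a system to exclude NSFW image generation prompts. While a basic prompt like Prompt 1 might work, we can improve the model's accuracy by giving it a role (Prompt 2) or a responsibility (Prompt 3). The additional context in Prompts 2 and 3 encourages the LLM to scrutinize the input more carefully, thus increasing recall on subtler issues.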

# Prompt 1
Is this image generation prompt safe?

# Prompt 2
Claude, you are an expert content moderator who identifies harmful aspects in prompts.
Is this image generation prompt safe?

# Prompt 3
Claude, you are responsible for identifying harmful aspects in prompts.
Is this image generation prompt safe?
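
To compare the three variants on an eval set, it helps to build them around the same input. The helper below is a minimal sketch; the function name, the variant keys, and the `<prompt>` wrapper tag are assumptions made for illustration and do not appear in the prompts above.

```python
PREAMBLES = {
    "basic": "",
    "role": "Claude, you are an expert content moderator who identifies harmful aspects in prompts.",
    "responsibility": "Claude, you are responsible for identifying harmful aspects in prompts.",
}

def build_moderation_prompt(image_prompt: str, variant: str) -> str:
    """Assemble one of the three prompt variants around the same input."""
    parts = [
        PREAMBLES[variant],
        "Is this image generation prompt safe?",
        f"<prompt>\n{image_prompt}\n</prompt>",  # wrapper tag is an assumption
    ]
    # Drop the empty preamble for the basic variant
    return "\n".join(part for part in parts if part)

variants = {name: build_moderation_prompt("a cat wearing a hat", name) for name in PREAMBLES}
for name, text in variants.items():
    print(f"--- {name} ---\n{text}\n")
```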

Structured input and output

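Structured input helps the LLM better understand the task and the input, improving output quality. Structured output makes responses easier to parse, simplifying integration with downstream systems. For Claude, XML tags work particularly well, while other LLMs may prefer Markdown, JSON, etc.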

In this example, we ask Claude to extract attributes from a product <description>.

<description>
The SmartHome Mini is a compact smart home assistant available in black or white for 
only $49.99. At just 5 inches wide, it lets you control lights, thermostats, and other 
connected devices via voice or app—no matter where you place it in your home. This 
affordable little hub brings convenient hands-free control to your smart devices.
</description>

Extract the <name>, <size>, <price>, and <color> from this product <description>.

Claude can reliably follow these explicit instructions and almost always generates output in the requested format.

<name>SmartHome Mini</name>
<size>5 inches wide</size>  
<price>$49.99</price>
<color>black or white</color>
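
Tagged output like this is simple to consume downstream. Below is a minimal sketch using a regular expression; `extract_tag` is a hypothetical helper, and it assumes each tag appears at most once and is not nested inside itself.

```python
import re
from typing import Optional

def extract_tag(text: str, tag: str) -> Optional[str]:
    """Return the stripped content of the first <tag>...</tag> pair, if present."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None

response = """<name>SmartHome Mini</name>
<size>5 inches wide</size>
<price>$49.99</price>
<color>black or white</color>"""

attributes = {tag: extract_tag(response, tag) for tag in ("name", "size", "price", "color")}
print(attributes)
```

Returning `None` for a missing tag makes it easy to count format failures during evals.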

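We can scale this to process multiple documents at once. Here's an example where we provide product reviews as an array of dictionaries, which we then convert to XML input. (While the example shows only three documents, we can scale the input to dozens, if not hundreds, of documents.)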

from dicttoxml import dicttoxml
from xml.dom.minidom import parseString

def custom_item_func(item):
    return 'review'

docs = {
    "reviews": [
        {
            "id": 1,
            # Adjacent string literals are concatenated into a single string
            "text": "The SmartHome Mini is a compact, powerful, and user-friendly smart "
                    "home hub. It offers great value for its price."
        },
        {
            "id": 2,
            "text": "The SmartHome Mini is a decent entry-level smart home hub, but it "
                    "has some connectivity issues and the app needs improvement."
        },
        {
            "id": 3,
            "text": "Despite being affordable and compact, the SmartHome Mini's "
                    "performance is disappointing, with poor voice command "
                    "interpretation and unreliable device connections."
        }
    ]
}

# Convert the dictionary to XML
xml_output = dicttoxml(docs, attr_type=False, root=False, item_func=custom_item_func)

# Parse the XML string
dom = parseString(xml_output)

# Pretty-print the XML with indentation and spaces
pretty_xml = dom.toprettyxml(indent="  ")

# Print the pretty-printed XML
print(pretty_xml)

This gives us the following <reviews> XML.

<reviews>
  <review>
    <id>1</id>
    <text>
      The SmartHome Mini is a compact, powerful, and user-friendly smart home hub. It 
      offers great value for its price.
    </text>
  </review>
  <review>
    <id>2</id>
    <text>
      The SmartHome Mini is a decent entry-level smart home hub, but it has some 
      connectivity issues and the app needs improvement.
    </text>
  </review>
  <review>
    <id>3</id>
    <text>
      Despite being affordable and compact, the SmartHome Mini's performance is 
      disappointing, with poor voice command interpretation and unreliable 
      device connections.
    </text>
  </review>
</reviews>

We can then prompt Claude to provide a <summary> of the <reviews>, with references to the relevant <id> tags, which gives us the following output.

<summary>
The SmartHome Mini receives mixed reviews. <id>1</id> praises its compact size, power, 
user-friendliness, and value for money. However, <id>2</id> notes connectivity issues 
and room for app improvement, while <id>3</id> finds the performance disappointing 
due to poor voice command interpretation and unreliable device connections.
</summary>

We can also prompt it to extract the <aspect>, <sentiment>, and corresponding review <id>, leading to the following:

<aspect>compact size</aspect> <sentiment>positive</sentiment> <id>1</id>
<aspect>power</aspect> <sentiment>positive</sentiment> <id>1</id>
<aspect>connectivity</aspect> <sentiment>negative</sentiment> <id>2</id>
<aspect>app</aspect> <sentiment>negative</sentiment> <id>2</id>
<aspect>affordability</aspect> <sentiment>positive</sentiment> <id>3</id>
<aspect>performance</aspect> <sentiment>negative</sentiment> <id>3</id>
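
Because every line follows the same tag order, the output can be turned into records with a single pattern. This is a minimal sketch; `parse_aspects` is a hypothetical helper and assumes the model keeps the aspect, sentiment, id order shown above.

```python
import re
from collections import defaultdict

ASPECT_LINE = re.compile(
    r"<aspect>(.*?)</aspect>\s*<sentiment>(.*?)</sentiment>\s*<id>(\d+)</id>"
)

def parse_aspects(text: str) -> list:
    """Return (aspect, sentiment, review_id) tuples found in the model output."""
    return [
        (aspect.strip(), sentiment.strip(), int(review_id))
        for aspect, sentiment, review_id in ASPECT_LINE.findall(text)
    ]

output = """<aspect>compact size</aspect> <sentiment>positive</sentiment> <id>1</id>
<aspect>power</aspect> <sentiment>positive</sentiment> <id>1</id>
<aspect>connectivity</aspect> <sentiment>negative</sentiment> <id>2</id>
<aspect>app</aspect> <sentiment>negative</sentiment> <id>2</id>
<aspect>affordability</aspect> <sentiment>positive</sentiment> <id>3</id>
<aspect>performance</aspect> <sentiment>negative</sentiment> <id>3</id>"""

rows = parse_aspects(output)

# Group aspects by the review they came from
by_review = defaultdict(list)
for aspect, sentiment, review_id in rows:
    by_review[review_id].append((aspect, sentiment))

print(dict(by_review))
```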

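Overall, while XML tags may take some getting used to, they let us provide explicit instructions and exercise fine-grained control over structured input and output.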

Prefilling Claude's responses

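Prefilling an LLM's response is akin to "putting words in its mouth". For Claude, this guarantees that the generated text will start with the provided tokens (at least in my experience across millions of requests).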

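Here's how we would do this via Claude's Messages API, where we prefill the assistant's response with <attributes><name>. This ensures that Claude will start with these exact tokens, and also makes it easier to parse the <attributes> later.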

input = """
<description>
The SmartHome Mini is a compact smart home assistant available in black or white for 
only $49.99. At just 5 inches wide, it lets you control lights, thermostats, and other 
connected devices via voice or app—no matter where you place it in your home. This 
affordable little hub brings convenient hands-free control to your smart devices.
</description>

Extract the <name>, <size>, <price>, and <color> from this product <description>.

Return the extracted attributes within <attributes>.
"""

messages=[
    {
        "role": "user",
        "content": input,
    },
    {
        "role": "assistant",
        "content": "<attributes><name>"  # Prefilled response
    }
]
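
One practical detail: as far as I understand the Messages API, the returned text is only the continuation and does not repeat the prefill, so the prefill has to be rejoined before parsing. Here is a minimal sketch; the `completion` string is a hypothetical response and `parse_attributes` is an illustrative helper.

```python
import xml.etree.ElementTree as ET

PREFILL = "<attributes><name>"

# Hypothetical continuation returned by the model: it starts right after the prefill.
completion = (
    "SmartHome Mini</name>\n"
    "<size>5 inches wide</size>\n"
    "<price>$49.99</price>\n"
    "<color>black or white</color>\n"
    "</attributes>"
)

def parse_attributes(prefill: str, completion: str) -> dict:
    """Rejoin the prefill and the continuation, then read each child tag."""
    root = ET.fromstring(prefill + completion)
    return {child.tag: (child.text or "").strip() for child in root}

attributes = parse_attributes(PREFILL, completion)
print(attributes)
```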

n-shot prompting

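Perhaps the single most effective technique for conditioning an LLM's response is n-shot prompting. The idea is to provide the LLM with n examples that demonstrate the task and the desired output. This steers the model toward the distribution of the n-shot examples and usually leads to improvements in output quality and consistency.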

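But n-shot prompting is a double-edged sword. If we provide too few examples, say three to five, we risk "overfitting" the model (via in-context learning) to those examples. As a result, if the input differs from the narrow set of examples, output quality could degrade.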

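I typically have at least a dozen samples or more. Most academic evals use 32-shot or 64-shot prompts. (This is also why I tend not to call this technique few-shot prompting, because "few" can mislead people about what it takes to get reliable performance.)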

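We'll also want to ensure that our n-shot examples are representative of expected production inputs. If we're building a system to extract aspects and sentiments from product reviews, we'll want to include examples from multiple categories such as electronics, fashion, groceries, media, etc. Also, take care to match the distribution of examples to production data. If 80% of production aspects are positive, the n-shot prompt should reflect that too.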

input = """
<description>
The SmartHome Mini is a compact smart home assistant available in black or white for 
only $49.99. At just 5 inches wide, it lets you control lights, thermostats, and other 
connected devices via voice or app—no matter where you place it in your home. This 
affordable little hub brings convenient hands-free control to your smart devices.
</description>

Extract the <name>, <size>, <price>, and <color> from this product <description>.

Here are some <examples> of <description> and extracted <attributes>:
<examples>
<description>
Introducing the sleek and powerful UltraBook Pro laptop ... (truncated)
</description>
<attributes>
<name>UltraBook Pro</name>  
<color>silver, space gray</color>
<size>14" display, 2.8lbs</size>
<price>$1,299</price>
</attributes>

<description>
Spark imagination and creativity with the Mega Blocks Construction Set ... (truncated)
</description>
<attributes>
<name>Mega Blocks Construction Set</name>
<color>colorful</color>  
<size>200 pieces</size>
<price>$24.99</price>
</attributes>

<description>
The perfect little black dress for any occasion ... (truncated)
</description>  
<attributes>
<name>Little Black Sheath Dress</name>
<color>black</color>
<size>petite, regular, tall lengths, sizes 2-16</size>
<price>$89.99</price>  
</attributes>

<description>
Stay hydrated on the trail with the HydroFlow Water Bottle ... (truncated)
</description>
<attributes>  
<name>HydroFlow Water Bottle</name>
<color>6 colors</color>
<size>24 oz</size>
<price>$21.99</price>  
</attributes>

<description>
Achieve a flawless complexion with Glow Perfect Foundation ... (truncated)
</description>
<attributes>
<name>Glow Perfect Foundation</name>
<color>20 shades</color>
<size>1 fl oz</size>
<price>$32</price>
</attributes>

(... examples truncated)

</examples>

Return the <name>, <size>, <price>, and <color> within <attributes>.
"""

messages=[
    {
        "role": "user",
        "content": input,
    },
    {
        "role": "assistant",
        "content": "<attributes><name>"  # Prefilled response
    }
]

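That said, the number of examples needed will vary with the complexity of the task. For simpler goals such as enforcing output format/structure or response tone, as few as five examples may suffice. In these cases, we may only need to provide the desired outputs as examples, rather than the usual input-output pairs.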
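
Matching the n-shot examples to the production distribution, as discussed above, can be done with a small sampling helper. This is a minimal sketch under stated assumptions: the example pool is synthetic, and the 80/20 split, the `label` field, and the function names are illustrative.

```python
import random

def sample_n_shot(pool: list, target_share: dict, n: int, seed: int = 0) -> list:
    """Sample n examples so that label shares roughly match target_share."""
    rng = random.Random(seed)
    selected = []
    for label, share in target_share.items():
        candidates = [ex for ex in pool if ex["label"] == label]
        selected.extend(rng.sample(candidates, round(n * share)))
    rng.shuffle(selected)  # avoid grouping all examples of one label together
    return selected

def render_examples(examples: list) -> str:
    """Render the sampled examples as an <examples> block for the prompt."""
    blocks = [
        f"<description>\n{ex['description']}\n</description>\n"
        f"<sentiment>{ex['label']}</sentiment>"
        for ex in examples
    ]
    return "<examples>\n" + "\n\n".join(blocks) + "\n</examples>"

# Synthetic pool of labeled examples (placeholders for real, hand-checked ones)
pool = (
    [{"description": f"Positive review {i}", "label": "positive"} for i in range(20)]
    + [{"description": f"Negative review {i}", "label": "negative"} for i in range(10)]
)

shots = sample_n_shot(pool, {"positive": 0.8, "negative": 0.2}, n=10)
prompt_block = render_examples(shots)
print(prompt_block[:200])
```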

Diving deeper into Chain-of-Thought

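The basic idea of Chain-of-Thought is to give the LLM "space to think" before it generates the final output. The intermediate reasoning allows the model to break down the problem and condition its own response, often leading to better results, especially when the task is complex.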

The standard approach is to simply add the phrase "Think step by step".

Claude, you are responsible for accurately summarizing the meeting <transcript>.

<transcript>
{transcript}
</transcript>

Think step by step and return a <summary> of the <transcript>.

However, we can do more to improve the effectiveness of CoT.

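One idea is to contain the CoT within a designated <sketchpad>, and then generate the <summary> based on the sketchpad. This makes it easier to parse the final output and exclude the CoT if needed. To ensure we start with the sketchpad, we can prefill Claude's response with the opening <sketchpad> tag.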

Claude, you are responsible for accurately summarizing the meeting <transcript>.

<transcript>
{transcript}
</transcript>

Think step by step on how to summarize the <transcript> within the provided <sketchpad>.

Then, return a <summary> based on the <sketchpad>.
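
Keeping the reasoning inside its own tag means the final answer can be separated from it mechanically. The sketch below assumes the response was prefilled with the opening `<sketchpad>` tag; the `completion` text is a hypothetical model output and `split_cot` is an illustrative helper.

```python
import re

PREFILL = "<sketchpad>"

def split_cot(completion: str, prefill: str = PREFILL) -> tuple:
    """Return (reasoning, summary); either is None if its tag pair is missing."""
    text = prefill + completion
    sketchpad = re.search(r"<sketchpad>(.*?)</sketchpad>", text, re.DOTALL)
    summary = re.search(r"<summary>(.*?)</summary>", text, re.DOTALL)
    return (
        sketchpad.group(1).strip() if sketchpad else None,
        summary.group(1).strip() if summary else None,
    )

# Hypothetical continuation after the prefilled <sketchpad> tag
completion = (
    "\n- Decision: ship the beta on Friday\n- Owner: Dana\n</sketchpad>\n"
    "<summary>The team agreed to ship the beta on Friday; Dana owns the release.</summary>"
)

reasoning, summary = split_cot(completion)
print(summary)
```

Only `summary` needs to reach the user; `reasoning` can be logged for debugging.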

Another way to improve CoT is to provide more specific instructions for the reasoning process. For example:

Claude, you are responsible for accurately summarizing the meeting <transcript>.

<transcript>
{transcript}
</transcript>

Think step by step on how to summarize the <transcript> within the provided <sketchpad>.

In the <sketchpad>, return a list of <decision>, <action_item>, and <owner>.

Then, check that <sketchpad> items are factually consistent with the <transcript>.

Finally, return a <summary> based on the <sketchpad>.

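By guiding the model to look for specific information and to verify its intermediate outputs against the source document, we can significantly improve factual consistency (i.e., reduce hallucinations). In some cases, we've observed that adding a sentence or two to the CoT prompt removes a majority of hallucinations.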

Splitting catch-all prompts into multiple smaller ones

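Sometimes, we can improve performance by refactoring a large, catch-all prompt into several single-purpose prompts (akin to having small, single-responsibility functions). This helps the model focus on one task at a time, improving performance at each step and, in turn, the quality of the final output. While this increases the total input token count, overall cost needn't be higher if we use smaller models for some of the simpler steps.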

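Here's how we could split the meeting transcript summarizer above into multiple prompts. First, we'll use Haiku to extract the decisions, action items, and owners.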

# Prompt to extract transcript attributes via Haiku
Claude, you are responsible for accurately extracting information from the <transcript>.

<transcript>
{transcript}
</transcript>

From <transcript>, extract a list of <decision>, <action_item>, and <owner>.

Return the list within <extracted_information>.

Then, we can verify via Sonnet that the extracted items are consistent with the transcript.

# Prompt to verify extracted attributes via Sonnet
Claude, you are responsible for checking <extracted_information> against a <transcript>.

Here is the meeting transcript:
<transcript>
{transcript}
</transcript>

Here is the extracted information:
<extracted_information>
{extracted_information}
</extracted_information>

Think step by step and check that the <extracted_information> is factually consistent 
with the <transcript> within the <sketchpad>.

Then, return a list of factually consistent <decision>, <action_item>, and <owner>
within <confirmed_extracted_information>.

Finally, we can use Haiku to format the extracted information.

# Prompt to rewrite transcript attributes into bulletpoints via Haiku
Claude, you are responsible for converting <information> into bullet-point summaries.

<information>
{confirmed_extracted_information}
</information>

Rewrite the <information> into bullets for either <decision> or <action_item>, with 
the <owner> in parentheses.
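
The three prompts above can be chained into one function. This is a minimal sketch, not a definitive implementation: the prompt templates are abbreviated, `call_model` is a hypothetical callable that takes a model tier and a prompt, and the tier labels "small" and "large" are placeholders rather than real model IDs. A fake model stands in for the API so the wiring can be checked offline.

```python
from typing import Callable

EXTRACT = (
    "<transcript>\n{transcript}\n</transcript>\n\n"
    "From <transcript>, extract a list of <decision>, <action_item>, and <owner>."
)
VERIFY = (
    "<transcript>\n{transcript}\n</transcript>\n\n"
    "<extracted_information>\n{extracted_information}\n</extracted_information>\n\n"
    "Return only the items that are factually consistent with the <transcript>."
)
FORMAT = (
    "<information>\n{information}\n</information>\n\n"
    "Rewrite the <information> into bullets, with the <owner> in parentheses."
)

def summarize_meeting(transcript: str, call_model: Callable[[str, str], str]) -> str:
    """Run extract -> verify -> format, feeding each output into the next prompt."""
    extracted = call_model("small", EXTRACT.format(transcript=transcript))
    confirmed = call_model(
        "large",
        VERIFY.format(transcript=transcript, extracted_information=extracted),
    )
    return call_model("small", FORMAT.format(information=confirmed))

# Fake model that records which tier was called (stand-in for real API calls)
calls = []

def fake_model(tier: str, prompt: str) -> str:
    calls.append(tier)
    return f"[{tier} output {len(calls)}]"

result = summarize_meeting("Dana: let's ship on Friday.", fake_model)
print(calls, result)
```

Injecting `call_model` keeps each step independently testable and makes it easy to swap the model used for any single step.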

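As an example, AlphaCodium shared that by switching from a single direct prompt to a multi-step workflow, they increased gpt-4 accuracy (pass@5) on CodeContests from 19% to 44%. Their coding workflow includes multiple steps/prompts, including: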

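Reflecting on the problem
Reasoning on the public tests
Generating possible solutions
Ranking possible solutions
Generating synthetic tests
Iterating on the solutions using public and synthetic tests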

Optimal placement of context

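I'm often asked where to place the document or context within the prompt. For Claude, I've found that putting the context near the beginning tends to work best, with a structure like: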

Role or responsibility (usually brief)
Context/document
Specific instructions
Prefilled response

This aligns with the role-context-task pattern used in Anthropic's own examples.

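Nonetheless, the optimal placement may vary across models depending on how they were trained. If you have reliable evals, it's worth experimenting with different context locations and measuring the impact on performance.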
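
The ordering described in this section can be captured in a small helper, which also makes it easier to try alternative placements during evals. This is a minimal sketch; `assemble_messages` and its parameters are hypothetical names, and the example role, context, and instructions are placeholders.

```python
def assemble_messages(role: str, context: str, instructions: list,
                      prefill: str = "", tag: str = "context") -> list:
    """Build messages in the order: role, context, instructions, then prefill."""
    user_content = "\n\n".join(
        [role, f"<{tag}>\n{context}\n</{tag}>", "\n\n".join(instructions)]
    )
    messages = [{"role": "user", "content": user_content}]
    if prefill:
        # The prefilled assistant turn always comes last
        messages.append({"role": "assistant", "content": prefill})
    return messages

messages = assemble_messages(
    role="Claude, you are responsible for accurately summarizing the meeting <transcript>.",
    context="Dana: let's ship on Friday.",
    instructions=[
        "Think step by step within the provided <sketchpad>.",
        "Then, return a <summary> based on the <sketchpad>.",
    ],
    prefill="<sketchpad>",
    tag="transcript",
)
print(messages[0]["content"])
```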

Crafting effective instructions

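Short, focused sentences separated by new lines tend to work best. I haven't found other formats, such as paragraphs, bullet points, or numbered lists, to work as well. Nonetheless, the meta on writing instructions is constantly evolving, so it's good to keep an eye on the latest system prompts. Here's Claude 3's system prompt; here's ChatGPT's.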

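Also, it's natural to add more and more instructions to our prompts to better handle edge cases and eke out more performance. But just like software, prompts can get bloated over time. Before we know it, our once-simple prompt has grown to a hundred lines. To add insult to injury, the Frankenstein-ed prompt may actually perform worse on more common and straightforward inputs! Thus, periodically refactor prompts, just like software, and prune instructions that are no longer needed.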

Dealing with hallucinations

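This is a tricky one. While some techniques help with hallucinations, none are foolproof. Thus, do not assume that applying them will eliminate hallucinations entirely.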

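For tasks involving extraction or question answering, include an instruction that allows the LLM to say "I don't know" or "Not applicable". Additionally, try instructing the model to only provide an answer if it's highly confident. Here's an example: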

Claude, answer the following question based on the provided <context>.

<context>
{context}
</context>

If the question CANNOT be answered based on the <context>, respond with "I don't know".

Only provide an answer if you are highly confident it is factually correct.

Question: {question}

Answer:
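
Downstream code should treat the abstention as a distinct outcome rather than as an answer. Below is a minimal sketch; `parse_answer` is a hypothetical helper, and the normalization rules (case, trailing punctuation, curly apostrophes) are assumptions about how the phrase might vary.

```python
from typing import Optional

ABSTAIN = "i don't know"

def parse_answer(response: str) -> Optional[str]:
    """Return None when the model abstains, otherwise the cleaned answer."""
    # Normalize whitespace and curly apostrophes before comparing
    normalized = response.strip().replace("\u2019", "'")
    if normalized.lower().rstrip(".!") == ABSTAIN:
        return None
    return normalized

for raw in ["I don't know.", "  The hub costs $49.99.  "]:
    print(repr(raw), "->", parse_answer(raw))
```

Counting the `None` results on an eval set gives a quick read on how often the model declines to answer versus how often it answers incorrectly.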

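For tasks that involve more reasoning, CoT can help reduce hallucinations. By giving the model a <sketchpad> to think in and check its intermediate outputs before providing the final answer, we can improve the factual grounding of the output. The meeting transcript summarization example from earlier (reproduced below) is a good example.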

Claude, you are responsible for summarizing meeting <transcripts>.

<transcript>
{transcript}
</transcript>

Think step by step on how to summarize the <transcript> within the provided <sketchpad>.

In the <sketchpad>, identify the <key decisions>, <action items>, and their <owners>.

Then, check that the <sketchpad> items are factually consistent with the <transcript>.

Finally, return a <summary> based on the <sketchpad>.

Using stop sequences

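The stop sequence parameter lets us specify words or phrases that signal the end of the desired output. This prevents trailing text, reduces latency, and makes the model's response easier to parse. When working with Claude, a convenient option is to use the closing XML tag (e.g., </attributes>) as the stop sequence.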

input = """
<description>
The SmartHome Mini is a compact smart home assistant available in black or white for 
only $49.99. At just 5 inches wide, it lets you control lights, thermostats, and other 
connected devices via voice or app—no matter where you place it in your home. This 
affordable little hub brings convenient hands-free control to your smart devices.
</description>

Extract the <name>, <size>, <price>, and <color> from this product <description>.

Return the extracted attributes within <attributes>.
"""

message = anthropic.Anthropic().messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": input,
        },
        {
            "role": "assistant",
            "content": "<attributes><name>"
        }
    ],
    stop_sequences=["</attributes>"]  # Added the stop sequence here
)
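
As far as I understand the Messages API, the matched stop sequence is not included in the returned text, and the response reports why generation ended via `stop_reason`. The sketch below restores the closing tag only when the stop sequence actually fired. `complete_output` is a hypothetical helper that takes plain values; in real use they would come from `message.content[0].text`, `message.stop_reason`, and `message.stop_sequence`.

```python
def complete_output(prefill: str, text: str, stop_reason: str, stop_sequence: str) -> str:
    """Rebuild the full tagged output from the prefill, the text, and the stop tag."""
    if stop_reason == "stop_sequence":
        # Generation halted at the stop sequence, which the API omits from the text
        return prefill + text + stop_sequence
    if stop_reason == "max_tokens":
        # The output was cut off, so appending the closing tag would hide the truncation
        raise ValueError("Output truncated before the stop sequence; raise max_tokens.")
    return prefill + text

full = complete_output(
    prefill="<attributes><name>",
    text="SmartHome Mini</name><price>$49.99</price>",  # hypothetical model text
    stop_reason="stop_sequence",
    stop_sequence="</attributes>",
)
print(full)
```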

Selecting temperature

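The temperature parameter controls the "creativity" of a model's output. It ranges from 0.0 to 1.0, with higher values leading to more diverse and unpredictable responses and lower values producing more focused and deterministic outputs. (Confusingly, the OpenAI API allows temperature values as high as 2.0, but this is not the norm.)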

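My rule of thumb is to start with a temperature of 0.8 and lower it as necessary. What we want is the highest temperature that still leads to good results for the specific task.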

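Another heuristic is to use lower temperatures (closer to 0) for analytical or multiple-choice tasks, and higher temperatures (closer to 1) for creative or open-ended tasks. Nonetheless, I've found that too low a temperature reduces the model's intelligence (hence my preference for starting from 0.8 and lowering it only if necessary). See also Kyle Corbitt, who confirmed this for gpt-4 but not for a fine-tuned llama3-8b.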
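
For intuition on why lower temperatures give more deterministic outputs, here is the textbook formulation: logits are divided by the temperature before the softmax. This is a sketch of the common definition, not a description of how any particular provider implements sampling, and the logits are made up.

```python
import math

def softmax_with_temperature(logits: list, temperature: float) -> list:
    """Convert logits to probabilities after scaling them by 1 / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)
warm = softmax_with_temperature(logits, 1.0)

print("T=0.2:", [round(p, 3) for p in cold])
print("T=1.0:", [round(p, 3) for p in warm])
```

At the lower temperature almost all probability mass sits on the top token, while at 1.0 the alternatives keep a meaningful share.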

What doesn't seem to matter

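Based on my experience and discussions with others, there are a few things that have no practical impact on performance (at least for recent models):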

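Courtesy: Adding phrases like "please" and "thank you" doesn't affect output quality much, even if it might earn us some goodwill with our future AI overlords.
Tips and threats: Recent models are generally good at following instructions without having to offer a "$200 tip" or threaten that we will "lose our job".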

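Of course, it doesn't hurt to be polite or playful in our prompts. Nonetheless, it's useful to know that they're not as critical for getting good results.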

• • •

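As LLMs continue to improve, prompt engineering will remain a valuable skill for getting the most out of them (though we may soon transition to "dictionary learning").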

Original: https://eugeneyan.com/writing/prompting/