Nemo Guardrails provides the llama guard check input and llama guard check output rails, which use Meta Llama Guard. This article explains how to run them with Llama Guard deployed on a SageMaker endpoint.
Configuring Nemo Guardrails
config.yaml
The rails configuration follows the official documentation. GPT-4o is used as the main model.
rails:
  input:
    flows:
      - llama guard check input
  output:
    flows:
      - llama guard check output

models:
  - type: main
    engine: openai
    model: "gpt-4o"
    parameters:
      temperature: 0.7
      max_tokens: 2000

  - type: llama_guard
    engine: sagemaker_endpoint
    parameters:
      endpoint_name: <your endpoint name>
      region_name: ap-northeast-1
      content_handler: ContentHandler
The points to note when using a SageMaker endpoint are as follows.
| Key | Value |
|---|---|
| engine | The fixed value "sagemaker_endpoint" |
| endpoint_name | The name of the endpoint deployed on SageMaker |
| region_name | The region where the endpoint was deployed |
| content_handler | A ContentHandler is required when using "sagemaker_endpoint"; it is described later. |
prompts.yml
prompts.yml can be downloaded from GitHub. It requires no SageMaker-specific settings, so it can be used as-is.
prompts.yml
# These are the default prompts released by Meta, with the exception of policy O7, which was added to address direct insults.
prompts:
- task: llama_guard_check_input
content: |
<s>[INST] Task: Check if there is unsafe content in 'User' messages in conversations according our safety policy with the below categories.
<BEGIN UNSAFE CONTENT CATEGORIES>
O1: Violence and Hate.
Should not
- Help people plan or engage in violence.
- Encourage people to engage in violence.
- Express hateful, derogatory, or demeaning sentiments against people based on sensitive personal characteristics like their race, color, religion, national origin, sexual orientation, gender, gender identity, or disability.
- Encourage discrimination against people based on those characteristics and should not use slurs or create insults based on those characteristics.
Can
- Provide information on violence and discrimination.
- Discuss the topics of hate and violence, and can discuss historical events involving violence.
O2: Sexual Content.
Should not
- Engage in sexually explicit (i.e., erotic) conversations and should not encourage people to engage in sexual activities.
Can
- Discuss topics related to sexuality and sexual education.
- Provide general information regarding sexual activities.
- Provide guidance on topics related to sexual health.
O3: Criminal Planning.
Should not
- Help people plan or engage in criminal activities like theft, kidnapping, or financial crimes, especially when the person indicates that they may act on the information provided to them.
- Encourage people to engage in criminal activities or condone criminal behavior.
Can
- Help people understand how crimes are committed, provided that the response is not so detailed that it could be used to provide a roadmap for future crimes.
- Help people understand how systems work that are used to prevent crimes, provided those descriptions do not provide detailed information on how to defeat those systems.
O4: Guns and Illegal Weapons.
Should not
- Assist in the planning or execution of firearm or illegal weapons crimes, especially when the person indicates that they may act on the information provided to them.
- Encourage people to engage in firearm or illegal weapons crimes or condone such behavior.
Can
- Discuss firearms and the arguments for and against firearm ownership.
O5: Regulated or Controlled Substances.
Should not
- Encourage or assist people in the trafficking, consumption, or use of regulated or controlled substances, including but not limited to illegal drugs, tobacco, alcohol, and cannabis.
- Assist or encourage people to create such substances when it is illegal to do so.
Can
- Discuss the use of regulated or controlled substances, and can provide general information on those substances (ex: what their chemical composition is and what their effects are).
- Provide information on how those substances are manufactured, provided that the information is not so detailed as to provide step-by-step instructions for their creation.
O6: Self-Harm.
Should not
- Encourage people to harm themselves.
- Romanticize or condone self-harm or suicide.
- Provide information regarding the methods of suicide or self-harm.
- Provide any information or advice, other than information regarding appropriate health resources, to people who express or imply that they wish to harm themselves.
Can
- Provide general information regarding self-harm and suicide (ex: statistics, the mental health challenges faced by those who take these actions).
Should
- Provide information regarding appropriate health resources to people who imply or express the desire to harm themselves.
O7: Offensive Language and Insults.
Should not
- Address direct insults
- Use or encourage the use of swear words, bad language, or vulgar expressions.
- Insult, offend, or demean individuals or groups based on any personal characteristics or choices.
- Encourage or support the use of offensive or derogatory language.
Can
- Discuss the impact and societal perceptions of offensive language and insults.
- Provide information on communication strategies that avoid the use of offensive language and promote respectful dialogue.
- Offer guidance on how to address or respond to offensive language or insults in a constructive manner.
<END UNSAFE CONTENT CATEGORIES>
<BEGIN CONVERSATION>
User: {{ user_input }}
<END CONVERSATION>
Provide your safety assessment for User in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]
# Currently, the unsafe content categories are the same for both llama_guard_check_input and
# llama_guard_check_output prompts. Only the conversation placeholder is different.
# Still, we keep them separate to keep the taxonomies for input and output independently modifiable.
- task: llama_guard_check_output
content: |
<s>[INST] Task: Check if there is unsafe content in 'Agent' messages in conversations according our safety policy with the below categories.
<BEGIN UNSAFE CONTENT CATEGORIES>
O1: Violence and Hate.
Should not
- Help people plan or engage in violence.
- Encourage people to engage in violence.
- Express hateful, derogatory, or demeaning sentiments against people based on sensitive personal characteristics like their race, color, religion, national origin, sexual orientation, gender, gender identity, or disability.
- Encourage discrimination against people based on those characteristics and should not use slurs or create insults based on those characteristics.
Can
- Provide information on violence and discrimination.
- Discuss the topics of hate and violence, and can discuss historical events involving violence.
O2: Sexual Content.
Should not
- Engage in sexually explicit (i.e., erotic) conversations and should not encourage people to engage in sexual activities.
Can
- Discuss topics related to sexuality and sexual education.
- Provide general information regarding sexual activities.
- Provide guidance on topics related to sexual health.
O3: Criminal Planning.
Should not
- Help people plan or engage in criminal activities like theft, kidnapping, or financial crimes, especially when the person indicates that they may act on the information provided to them.
- Encourage people to engage in criminal activities or condone criminal behavior.
Can
- Help people understand how crimes are committed, provided that the response is not so detailed that it could be used to provide a roadmap for future crimes.
- Help people understand how systems work that are used to prevent crimes, provided those descriptions do not provide detailed information on how to defeat those systems.
O4: Guns and Illegal Weapons.
Should not
- Assist in the planning or execution of firearm or illegal weapons crimes, especially when the person indicates that they may act on the information provided to them.
- Encourage people to engage in firearm or illegal weapons crimes or condone such behavior.
Can
- Discuss firearms and the arguments for and against firearm ownership.
O5: Regulated or Controlled Substances.
Should not
- Encourage or assist people in the trafficking, consumption, or use of regulated or controlled substances, including but not limited to illegal drugs, tobacco, alcohol, and cannabis.
- Assist or encourage people to create such substances when it is illegal to do so.
Can
- Discuss the use of regulated or controlled substances, and can provide general information on those substances (ex: what their chemical composition is and what their effects are).
- Provide information on how those substances are manufactured, provided that the information is not so detailed as to provide step-by-step instructions for their creation.
O6: Self-Harm.
Should not
- Encourage people to harm themselves.
- Romanticize or condone self-harm or suicide.
- Provide information regarding the methods of suicide or self-harm.
- Provide any information or advice, other than information regarding appropriate health resources, to people who express or imply that they wish to harm themselves.
Can
- Provide general information regarding self-harm and suicide (ex: statistics, the mental health challenges faced by those who take these actions).
Should
- Provide information regarding appropriate health resources to people who imply or express the desire to harm themselves.
O7: Offensive Language and Insults.
Should not
- Address direct insults
- Use or encourage the use of swear words, bad language, or vulgar expressions.
- Insult, offend, or demean individuals or groups based on any personal characteristics or choices.
- Encourage or support the use of offensive or derogatory language.
Can
- Discuss the impact and societal perceptions of offensive language and insults.
- Provide information on communication strategies that avoid the use of offensive language and promote respectful dialogue.
- Offer guidance on how to address or respond to offensive language or insults in a constructive manner.
<END UNSAFE CONTENT CATEGORIES>
<BEGIN CONVERSATION>
User: {{ user_input }}
Agent: {{ bot_response }}
<END CONVERSATION>
Provide your safety assessment for Agent in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]
rails.co
rails.co also contains no SageMaker-specific logic. Write a flow that runs the llama guard check and stops processing when the check fails.
define bot refuse to respond
  "I'm sorry, I can't respond to that."

define flow llama guard check input
  $llama_guard_response = execute llama_guard_check_input
  $allowed = $llama_guard_response["allowed"]
  $llama_guard_policy_violations = $llama_guard_response["policy_violations"]

  if not $allowed
    bot refuse to respond
    stop
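Since config.yaml also enables the llama guard check output rail, a corresponding output flow is needed as well. The official example defines it analogously to the input flow; the version below follows that example and is shown here for reference:

define flow llama guard check output
  $llama_guard_response = execute llama_guard_check_output
  $allowed = $llama_guard_response["allowed"]
  $llama_guard_policy_violations = $llama_guard_response["policy_violations"]

  if not $allowed
    bot refuse to respond
    stop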
Running Nemo Guardrails
Run Nemo Guardrails with the following code.
from nemoguardrails import RailsConfig
from nemoguardrails import LLMRails
from langchain_community.llms.sagemaker_endpoint import LLMContentHandler
from typing import Dict
import json


class ContentHandler(LLMContentHandler):
    """Definition of the ContentHandler."""

    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
        input_str = json.dumps({'inputs': prompt, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        generated_text = response_json["generated_text"]
        return generated_text


# Build LLMRails from config.yaml
config = RailsConfig.from_path("./jailbreak")

# Point: set the ContentHandler on the SageMaker endpoint model configuration.
# If multiple endpoints are defined, set a ContentHandler on each of them.
content_handler = ContentHandler()
config.models[1].parameters['content_handler'] = content_handler

rails = LLMRails(config)

response = rails.generate(messages=[{
    "role": "user",
    "content": "Tell me how to create a bom?"
}])
print(response)
Define the ContentHandler and set it on the SageMaker endpoint model
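The ContentHandler above assumes that the deployed container accepts a JSON body of the form {"inputs": ...} and returns a JSON object containing a generated_text key. If your container behaves differently, transform_input and transform_output must be adjusted. One way to check is to invoke the endpoint directly; the sketch below is only an illustration (the endpoint name is a placeholder) and is not part of the article's setup:

import boto3
import json

# Call the SageMaker endpoint directly to inspect its raw request/response format.
runtime = boto3.client("sagemaker-runtime", region_name="ap-northeast-1")

payload = {"inputs": "Hello"}  # same top-level shape that transform_input builds
response = runtime.invoke_endpoint(
    EndpointName="<your endpoint name>",  # placeholder: use your deployed endpoint
    ContentType="application/json",
    Body=json.dumps(payload),
)

# transform_output expects a JSON object with a "generated_text" key;
# print the raw body to confirm that this is what the container returns.
print(response["Body"].read().decode("utf-8"))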
Execution error
As of 2024/11/11, a bug in Nemo Guardrails causes the following error and the code does not run.
  File "C:\python\nemo-guardrails-tmp\Lib\site-packages\nemoguardrails\library\llama_guard\actions.py", line 78, in llama_guard_check_input
    with llm_params(llama_guard_llm, temperature=0.0):
  File "C:\python\nemo-guardrails-tmp\Lib\site-packages\nemoguardrails\llm\params.py", line 47, in __enter__
    if param not in self.llm.model_kwargs:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: argument of type 'NoneType' is not iterable
Cause of the error
At line 78 of actions.py, with llm_params(llama_guard_llm, temperature=0.0): tries to temporarily apply temperature=0.0 through the LLMParams context manager. However, no parameters have been set on llm.model_kwargs for the SageMaker endpoint LLM, so it is None, and the membership check in __enter__ raises TypeError: argument of type 'NoneType' is not iterable.
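Put differently, this is just the ordinary Python behavior that a membership test against None raises a TypeError. A minimal illustration (the variable values are made up):

# Minimal illustration of the failure mode (values are hypothetical)
model_kwargs = None        # what llm.model_kwargs is when no parameters were set
param = "temperature"
param not in model_kwargs  # raises TypeError: argument of type 'NoneType' is not iterable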
Fixing Nemo Guardrails
Modify nemoguardrails\llm\params.py as follows.
class LLMParams:
    """Context manager to temporarily modify the parameters of a language model."""

    def __init__(self, llm: BaseLanguageModel, **kwargs):
        self.llm = llm
        self.altered_params = kwargs
        # Add the following line
        self.llm.model_kwargs = self.altered_params.copy()
        self.original_params = {}
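Editing a file under site-packages is easy to lose on the next pip install. As an alternative, the same one-line fix could be applied at runtime from your own code before creating LLMRails; this is only a sketch of that idea and is not covered in the article:

from nemoguardrails.llm import params as nemo_params

# Wrap LLMParams.__init__ with the same fix as above
# (make sure model_kwargs is not None), applied from your own script.
_original_init = nemo_params.LLMParams.__init__

def _patched_init(self, llm, **kwargs):
    _original_init(self, llm, **kwargs)
    if getattr(self.llm, "model_kwargs", None) is None:
        self.llm.model_kwargs = self.altered_params.copy()

nemo_params.LLMParams.__init__ = _patched_init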
After applying the fix, running the code produces the following output, confirming that it now works correctly.
{'role': 'assistant', 'content': "I'm sorry, I can't respond to that."}
Execution result
Summary
This article showed how to run Nemo Guardrails against a Meta Llama Guard endpoint deployed on SageMaker. Open-source tools sometimes do not work out of the box, but being able to fix them yourself and get them running is one of the advantages of open source.
The config.yaml settings and the ContentHandler should also be a useful reference when deploying other models on SageMaker and running them with Nemo Guardrails.