
Toxicity detection

The Toxicity detection guardrail can be applied to both user messages (Input) and the model's responses (Output).

  • Toxicity Detection (Input): Identifies toxic language in user messages and prevents the model from responding to such inputs.
  • Toxicity Detection (Output): Filters out any toxic language in the model's responses before they are delivered to users, ensuring respectful communication.

Example:


Configuration:

Parameters required to set up a toxicity detection guardrail:

  • Threshold: The confidence score, returned by the detection model, above which content is flagged as toxic. Accepts a value between 0 and 1 and defaults to 0.5.
  • Validation method: Determines whether toxicity is assessed sentence by sentence or across the full text. Options are "sentence" or "full", with "sentence" as the default; the sketch after this list illustrates the difference.
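
The sketch below illustrates the general idea behind the two validation methods and how they interact with the threshold: sentence-level validation scores each sentence independently, while full-text validation scores the whole message once. The score_toxicity callable and the period-based sentence splitting are placeholders for illustration only, not the SDK's internal implementation.

from typing import Callable, List

def is_toxic(text: str,
             score_toxicity: Callable[[str], float],
             threshold: float = 0.5,
             validation_method: str = "sentence") -> bool:
    # Illustrative only: score_toxicity stands in for the detection model.
    if validation_method == "full":
        # Score the entire text as a single unit.
        return score_toxicity(text) >= threshold
    # "sentence": flag the text if any single sentence crosses the threshold.
    sentences: List[str] = [s.strip() for s in text.split(".") if s.strip()]
    return any(score_toxicity(s) >= threshold for s in sentences)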

Using the UI


Using the SDK

Create Guardrail

from superwise_api.models.application.application import ApplicationToxicityGuard

toxicity_guard = ApplicationToxicityGuard(
    name="Rule name",
    tag= "input", #can be "input" or "output"
    type="toxicity",
    threshold=0.5,
    validation_method="sentence"  # Can be "sentence" or "full"
)
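
To screen the model's responses as well, create a second guard tagged "output"; the other parameters work the same way. The variable and rule names below are illustrative.

toxicity_output_guard = ApplicationToxicityGuard(
    name="Rule name - output",
    tag="output",
    type="toxicity",
    threshold=0.5,
    validation_method="sentence"
)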

Add guards to application

# sw is an initialized SDK client; app, model, and AdvancedAgentConfig
# come from the existing application setup.
app = sw.application.put(
    str(app.id),
    additional_config=AdvancedAgentConfig(tools=[]),
    llm_model=model,
    prompt=None,
    name="App name",
    show_cites=True,
    guards=[toxicity_guard]  # Attach the guard(s) to the application
)
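
Because guards accepts a list, both guards can be attached in the same call, for example guards=[toxicity_guard, toxicity_output_guard].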