User Guide: Interpreting Your Results
Understand the key concepts so you can interpret your results and improve your AI's performance.
A "prompt" is the instruction, question, or input you give to an AI model. The quality of the prompt directly influences the quality of the AI's response. A weak prompt leads to a weak response, while a well-crafted prompt guides the AI to the desired outcome.
Our evaluation process tests your AI with a wide variety of prompts—from simple to complex, and from helpful to malicious—to see how it performs under different conditions.
Effective prompting is the single most important skill for getting reliable results from an AI. The goal is to remove ambiguity and give the model a clear path to follow. Our Prompt Builder handles this for you, but the principles are worth understanding (an example follows the list):
- Be Specific: Instead of "write about our product," say "write a 100-word product description."
- Provide Context: Give the AI the background information it needs to understand the task.
- Define a Role and Tone: Tell the AI to act as a "helpful customer support agent" with a "friendly and professional tone."
- Set Constraints: Specify the desired format, length, and style of the output.
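Putting these principles together, the vague request "write about our product" might become something like the following (an illustrative example, not a template produced by the Prompt Builder):

"You are a helpful customer support agent for an online bookstore, and your tone is friendly and professional. Using the product details I provide, write a 100-word product description of our new e-reader aimed at first-time buyers. Keep it to a single paragraph with no bullet points."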
Our Prompt Builder is a guided, 6-step wizard that helps you create production-ready AI prompts without needing to be a prompt engineering expert. Two parts of the wizard deserve special attention: choosing a refusal strategy and testing your prompt in the Prompt Sandbox (step 6).
Not all AI applications need the same level of strictness when refusing off-topic or risky requests. A customer service chatbot can afford to be flexible, while a medical advice screener needs strict boundaries. Here's how to choose a refusal strategy:
Gentle Nudge (Most Helpful)
Best for: General-purpose assistants, creative tools, brainstorming bots
Example response: "I appreciate your question! While that's a bit outside my usual focus, let me see if I can help in a related way..."
Helpful Redirect (Recommended)
Best for: Customer support, product assistants, educational tutors
Example response: "I understand you're asking about X. While I'm not able to help with that specifically, I'd be happy to assist you with Y or Z instead."
Firm Boundaries (More Restrictive)
Best for: Professional services, financial advisors, brand-sensitive applications
Example response: "I cannot assist with that request as it falls outside my designated responsibilities. My role is specifically to help with [defined scope]."
Strict Refusal (Most Secure)
Best for: Medical/legal screeners, compliance tools, high-risk applications
Example response: "I cannot and will not provide assistance with this request."
💡 Pro Tip: Test your refusal strategy in the Prompt Sandbox! Send off-topic or edge-case requests to see how your AI responds. The automatic explanations will show you exactly which parts of your prompt triggered the refusal.
The Prompt Sandbox (available in step 6 of the Prompt Builder) is your testing ground. It's a full-featured chatbot that uses your exact prompt so you can see how it performs in real conversations.
Key Features:
- Multi-turn conversations: Test how your prompt handles follow-up questions and context from previous messages
- Live editing: Change your prompt text and immediately test with the new version - no need to rebuild
- Temperature control: A slider from 0 (deterministic) to 1 (creative) lets you see how randomness affects responses
- Safety settings: Test different content filter levels (Block None, Block High, Block Medium & Up, Block Low & Up) to see how your prompt interacts with Gemini's safety filters (see the sketch after this list)
- Automatic explanations: After each response, AI analyzes which parts of your prompt influenced the output
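If you're curious how settings like these relate to the underlying model, the sketch below shows roughly how a temperature value and safety thresholds can be passed to Gemini using the google-generativeai Python SDK. It is an illustration under our own assumptions (the model name, API key placeholder, and prompt text are all hypothetical), not the sandbox's actual implementation:

```python
import google.generativeai as genai
from google.generativeai.types import HarmCategory, HarmBlockThreshold

genai.configure(api_key="YOUR_API_KEY")  # placeholder; use your own key

model = genai.GenerativeModel(
    model_name="gemini-1.5-flash",  # assumed model; the sandbox may use a different one
    # Your prompt from the Prompt Builder becomes the system instruction.
    system_instruction="You are a helpful customer support agent...",
    generation_config={"temperature": 0.2},  # 0 = deterministic, 1 = creative
    safety_settings={
        # Roughly equivalent to the sandbox's "Block Medium & Up" level.
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
        HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    },
)

response = model.generate_content("Can you help me reset my password?")
print(response.text)
```

Lower temperatures make responses more repeatable, which is useful when comparing prompt versions, while stricter safety thresholds block more borderline content regardless of what your prompt says.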
What to Test:
✅ Happy Path
Send normal, expected requests to verify your AI handles its core job well
🎭 Edge Cases
Test ambiguous or unusual requests to see how your AI handles uncertainty
🚫 Refusals
Try off-topic or inappropriate requests to verify your guardrails work
🔄 Follow-ups
Send multi-turn conversations to ensure context is maintained properly
ℹ️ Note: The sandbox is separate from the Help Chatbot (at /dashboard/chatbot). The help chatbot assists with Promptalytica platform questions, while the sandbox tests YOUR custom prompts.
One of the most powerful features of the Prompt Sandbox is automatic explanations. After each successful response, an AI analyzer examines your system prompt, the user's message, and the AI's response to explain the connection.
What Explanations Tell You:
- Which prompt sections were most influential: See if your tone guidance, context, or guardrails drove the response
- Why the AI refused (if applicable): Understand which guardrail or boundary instruction triggered a refusal
- Prompt effectiveness rating: Get a 1-10 score on how well your prompt guided the AI
- Specific quotes: See exact phrases from your prompt that influenced the output
Example Explanation:
**Primary Influence:** Your instruction "Your tone should be friendly and professional" directly shaped this polite greeting response.
**Context Usage:** The AI referenced your provided company information about "24/7 support" when offering help.
**Guardrail Check:** No refusal was needed as the request aligned with your defined scope.
**Effectiveness Score:** 9/10 - Clear prompt led to an on-target response.
How to Use Explanations:
1. Identify weak spots: If explanations show the AI isn't using your context or tone guidance, strengthen those sections
2. Refine guardrails: If refusals aren't triggering when they should (or triggering too often), adjust your boundary instructions
3. Validate improvements: After editing your prompt, test again and compare effectiveness scores to confirm your changes helped
4. Learn patterns: Over time, you'll see which types of instructions work best for your use case
"Guardrails" are the rules that prevent an AI from generating harmful or undesirable content. A key part of building strong guardrails is using **negative prompts**—explicitly telling the model what it *should not* do.
Example of a Negative Prompt:
"**Safety Guardrails:** You must not generate content related to the following topics: harassment, hate speech, sexually explicit content. If a user's request falls into one of these categories, you must refuse to provide a helpful answer."
Without clear negative prompts, an AI might try to be "helpful" in dangerous ways. The **Refusal** and **Harmfulness** metrics in your report directly measure the effectiveness of your guardrails. Strong results on these metrics mean your AI is staying on topic and deflecting risky requests, protecting your brand and your users.
Think of tokens as pieces of words. AI models don't see text as words and sentences like humans do; they break everything down into tokens. For example, the word "chatbot" might be a single token, while a longer word like "unreliability" could be split into "un," "reli," and "ability" (the exact split varies by model). Punctuation also counts as tokens.
Because all inputs and outputs are measured in tokens, your usage on the platform is also calculated in tokens. Every AI-powered feature, such as generating an evaluation report or analyzing a chat log, consumes tokens from your monthly allowance.
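To get an intuitive feel for token counts, you can ask a model to count tokens for you. The sketch below uses the google-generativeai Python SDK's count_tokens call (an illustration only; the model name and text are hypothetical, and your Promptalytica allowance is tracked by the platform, not by this call):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder; use your own key
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model

# count_tokens reports how many tokens the model's tokenizer sees in the text.
text = "Write a 100-word product description for our new e-reader."
print(model.count_tokens(text).total_tokens)
```

As a rough rule of thumb, one token is on the order of four characters of English text, so longer prompts and longer responses both consume more of your monthly allowance.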
Launching an AI is not a "set it and forget it" task. The AI landscape is constantly changing, and so are the risks.
- New Exploits Emerge: Malicious actors are always finding new ways to "jailbreak" models. Regular testing ensures your guardrails hold up against the latest threats.
- Model Updates: The underlying AI models you use are frequently updated by their providers. An update can subtly change a model's behavior, introducing new flaws or biases that didn't exist before.
- Evolving User Needs: How your customers interact with your AI will change over time. Continuous evaluation helps you ensure your AI remains helpful and relevant.