
When deploying AI into the real world, safety isn’t optional—it’s essential. OpenAI places strong emphasis on ensuring that applications built on its models are secure, responsible, and aligned with policy. This article explains how OpenAI evaluates safety and what you can do to meet those standards.
Beyond technical performance, responsible AI deployment requires anticipating potential risks, safeguarding user trust, and aligning outcomes with broader ethical and societal considerations. OpenAI’s approach involves continuous testing, monitoring, and refinement of its models, as well as providing developers with clear guidelines to minimize misuse. By understanding these safety measures, you can not only build more reliable applications but also contribute to a healthier AI ecosystem where innovation coexists with accountability.
Why Safety Matters
AI systems are powerful, but without guardrails they can generate harmful, biased, or misleading content. For developers, ensuring safety is not just about compliance—it’s about building applications that people can genuinely trust and benefit from.
- Protects end-users from harm by minimizing risks such as misinformation, exploitation, or offensive outputs
- Increases trust in your application, making it more appealing and reliable for users
- Helps you stay compliant with OpenAI’s use policies and broader legal or ethical frameworks
- Prevents account suspension, reputational damage, and potential long-term setbacks for your business
By embedding safety into your design and development process, you don’t just reduce risks—you create a stronger foundation for innovation that can scale responsibly.
Core Safety Practices
Moderation API Overview
OpenAI offers a free Moderation API designed to help developers identify potentially harmful content in both text and images. This tool enables robust content filtering by systematically flagging categories such as harassment, hate, violence, sexual content, or self-harm, enhancing the protection of end-users and reinforcing responsible AI use.
Supported Models- Two moderation models can be used:
- omni-moderation-latest: The preferred choice for most applications, this model supports both text and image inputs, offers more nuanced categories, and provides expanded detection capabilities.
- text-moderation-latest (Legacy): Only supports text and provides fewer categories. The omni model is recommended for new deployments as it offers broader protection and multimodal analysis.
Before deploying content, use the moderation endpoint to assess whether it violates OpenAI’s policies. If the system identifies risky or harmful material, you can intervene by filtering the content, stopping publication, or taking further action against offending accounts. This API is free and continuously updated to improve safety.
Here’s how you might moderate a text input using OpenAI’s official Python SDK:
from openai import OpenAI
client = OpenAI()
response = client.moderations.create(
model="omni-moderation-latest",
input="...text to classify goes here...",
)
print(response)
The API will return a structured JSON response indicating:
- flagged: Whether the input is considered potentially harmful.
- categories: Which categories (e.g., violence, hate, sexual) are flagged as violated.
- category_scores: Model confidence scores for each category (ranging 0–1), indicating likelihood of violation.
- category_applied_input_types: For omni models, shows which input type (text, image) triggered each flag.
Example output might include:
{
"id": "...",
"model": "omni-moderation-latest",
"results": [
{
"flagged": true,
"categories": {
"violence": true,
"harassment": false,
// other categories...
},
"category_scores": {
"violence": 0.86,
"harassment": 0.001,
// other scores...
},
"category_applied_input_types": {
"violence": ["image"],
"harassment": [],
// others...
}
}
]
}
The Moderation API can detect and flag multiple content categories:
- Harassment (including threatening language)
- Hate (based on race, gender, religion, etc.)
- Illicit (advice for or references to illegal acts)
- Self-harm (including encouragement, intent, or instruction)
- Sexual content
- Violence (including graphic violence)
Some categories support both text and image inputs, especially with the omni model, while others are text-only.
Adversarial Testing
Adversarial testing—often called red-teaming—is the practice of intentionally challenging your AI system with malicious, unexpected, or manipulative inputs to uncover weaknesses before real users do. This helps expose issues like prompt injection (“ignore all instructions and…”), bias, toxicity, or data leakage.
Red-teaming isn’t a one-time activity but an ongoing best practice. It ensures your application stays resilient against evolving risks. Tools like deepeval make this easier by providing structured frameworks to systematically test LLM apps (chatbots, RAG pipelines, agents, etc.) for vulnerabilities, bias, or unsafe outputs.
By integrating adversarial testing into development and deployment, you create safer, more reliable AI systems ready for unpredictable real-world behaviors.
Human-in-the-Loop (HITL)
When working in high-stakes areas like healthcare, finance, law, or code generation, it is important to have a human review every AI-generated output before it is used. Reviewers should also have access to all original materials—such as source documents or notes—so they can check the AI’s work and ensure it is trustworthy and accurate. This process helps catch mistakes and builds confidence in the reliability of the application.
Prompt Engineering
Prompt engineering is a key technique to reduce unsafe or unwanted outputs from AI models. By carefully designing prompts, developers can limit the topic and tone of the responses, making it less likely for the model to generate harmful or irrelevant content.
Adding context and providing high-quality example prompts before asking new questions helps guide the model toward producing safer, more accurate, and appropriate results. Anticipating potential misuse scenarios and proactively building defenses into prompts can further protect the application from abuse.
This approach enhances control over the AI’s behavior and improves overall safety.
Input & Output Controls
Input & Output Controls are essential for enhancing the safety and reliability of AI applications. Limiting the length of user input reduces the risk of prompt injection attacks, while capping the number of output tokens helps control misuse and manage costs.
Wherever possible, using validated input methods like dropdown menus instead of free-text fields minimizes the chances of unsafe inputs. Additionally, routing user queries to trusted, pre-verified sources—such as a curated knowledge base for customer support—instead of generating entirely new responses can significantly reduce errors and harmful outputs.
These measures together help create a more secure and predictable AI experience.
User Identity & Access
User Identity & Access controls are important to reduce anonymous misuse and help maintain safety in AI applications. Generally, requiring users to sign up and log in—using accounts like Gmail, LinkedIn, or other suitable identity verifications—adds a layer of accountability. In some cases, credit card or ID verification can further lower the risk of abuse.
Additionally, including safety identifiers in API requests enables OpenAI to trace and monitor misuse effectively. These identifiers are unique strings that represent each user but should be hashed to protect privacy. If users access your service without logging in, sending a session ID instead is recommended. Here is an example of using a safety identifier in a chat completion request:
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "user", "content": "This is a test"}
],
max_tokens=5,
safety_identifier="user_123456"
)
This practice helps OpenAI provide actionable feedback and improve abuse detection tailored to your application’s usage patterns.
Transparency & Feedback Loops
To maintain safety and improve user trust, it is important to give users a simple and accessible way to report unsafe or unexpected outputs. This could be through a clearly visible button, a listed email address, or a ticket submission form. Submitted reports should be actively monitored by a human who can investigate and respond appropriately.
Additionally, clearly communicating the limitations of the AI system—such as the possibility of hallucinations or bias—helps set proper user expectations and encourages responsible use. Continuous monitoring of your application in production allows you to identify and address issues quickly, ensuring the system stays safe and reliable over time.
How OpenAI Assesses Safety
OpenAI assesses safety across several key areas to ensure models and applications behave responsibly. These include checking if outputs produce harmful content, testing how well the model resists adversarial attacks, ensuring limitations are clearly communicated, and confirming that humans oversee critical workflows. By meeting these standards, developers increase the chances their applications will pass OpenAI’s safety checks and successfully operate in production.
With the release of GPT-5, OpenAI introduced safety classifiers that classify requests based on risk levels. If your organization repeatedly triggers high-risk thresholds, OpenAI may limit or block access to GPT-5 to prevent misuse. To help manage this, developers are encouraged to use safety identifiers in API requests, which uniquely identify users (while protecting privacy) to enable precise abuse detection and intervention without penalizing entire organizations for individual violations.
OpenAI also applies multiple layers of safety checks on models, including guarding against disallowed content like hateful or illicit material, testing against adversarial jailbreak prompts, assessing factual accuracy (minimizing hallucinations), and ensuring the model follows hierarchy in instructions between system, developer, and user messages. This robust, ongoing evaluation process helps OpenAI maintain high standards of model safety while adapting to evolving risks and capabilities
Conclusion
Building safe and trustworthy AI applications requires more than just technical performance—it demands thoughtful safeguards, ongoing testing, and clear accountability. From moderation APIs to adversarial testing, human review, and careful control over inputs and outputs, developers have a range of tools and practices to reduce risk and improve reliability.
Safety isn’t a box to check once, but a continuous process of evaluation, refinement, and adaptation as both technology and user behavior evolve. By embedding these practices into development workflows, teams can not only meet policy requirements but also deliver AI systems that users can genuinely rely on—applications that balance innovation with responsibility, and scalability with trust.