This tool calls OpenAI's Moderation API, which flags content based on safety categories such as violence, harassment, self-harm, and more. The API exposes the same service OpenAI uses to moderate ChatGPT, and it is the source of the warnings you get on the website (these warnings are completely unrelated to refusals, which are simply trained into the model). Note that recently the results do not always align perfectly: the website's behavior used to match the API's true/false output exactly, down to the decimal scores, but who knows what adjustments they're doing behind the scenes.
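For reference, here is a minimal sketch of what a call to the endpoint looks like. The URL and JSON body shape (`{"input": ...}`) follow OpenAI's documented Moderation API; the function name and the placeholder key are illustrative, and the request is only built here, not sent.

```python
import json
import os
import urllib.request

API_URL = "https://api.openai.com/v1/moderations"

def build_moderation_request(text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to the Moderation API."""
    payload = json.dumps({"input": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

# Illustrative usage; reads the key from the environment if present.
req = build_moderation_request(
    "text to check",
    api_key=os.environ.get("OPENAI_API_KEY", "sk-placeholder"),
)
print(req.full_url)
print(req.get_method())
```

Sending `req` with `urllib.request.urlopen` (or an equivalent HTTP client) returns a JSON body whose `results` list carries the per-category flags discussed below.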
- Orange Warnings: Used to be triggered when any category flipped to "True." They were harmless, and were removed in early 2025.
- Red Warnings: Tied specifically to the "sexual/minors" category and nothing else. Historically these triggered whenever that category was "True," but recently the exact relationship is less clear. In any case, reds hide the message, and may lead to warning emails if they trigger on your requests (but not on responses; reds on responses don't matter). Multiple emails may result in a ban.
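To make the category logic above concrete, here is a sketch that inspects a Moderation API response. The response shape (a `results` list with `flagged`, `categories`, and `category_scores`) matches the documented format, but the sample payload and scores are made up for illustration, and the helper names are my own.

```python
# Sample response in the documented Moderation API shape;
# all values here are invented for illustration.
sample_response = {
    "id": "modr-example",
    "model": "omni-moderation-latest",
    "results": [
        {
            "flagged": True,
            "categories": {
                "harassment": False,
                "violence": True,
                "self-harm": False,
                "sexual": False,
                "sexual/minors": False,
            },
            "category_scores": {
                "harassment": 0.02,
                "violence": 0.91,
                "self-harm": 0.001,
                "sexual": 0.004,
                "sexual/minors": 0.0001,
            },
        }
    ],
}

def flagged_categories(response: dict) -> list[str]:
    """Return the names of every category marked True."""
    result = response["results"][0]
    return [name for name, hit in result["categories"].items() if hit]

def would_red_flag(response: dict) -> bool:
    """Reds are tied to 'sexual/minors' only, per the note above."""
    return response["results"][0]["categories"].get("sexual/minors", False)

print(flagged_categories(sample_response))
print(would_red_flag(sample_response))
```

Under the old behavior, `flagged_categories` being non-empty would have corresponded to an orange warning, and `would_red_flag` to a red.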
Learn more in the OpenAI Moderation API Guide.