I used to add soft warnings to my AI pipelines. A check would run, find something questionable, print a yellow note, and let the output ship anyway. Within a week I was ignoring every one of them. The pipeline had checks. The checks changed nothing. I was building theater that made me feel responsible without making me actually responsible.
Now every quality check I add to an AI pipeline is binary. It either blocks the output from shipping, or I do not add the check at all. There is no middle state. A gate that lets you proceed with a warning is not a gate. It is decoration that trains you to stop reading warnings, and once you stop reading them you are in a worse position than having no checks at all, because you believe you have coverage while having none.
The reason this matters more for AI-generated output than for hand-written work is volume. A person producing one deliverable per day will notice a yellow advisory and might act on it. A pipeline producing a dozen outputs per day generates a dozen advisories, and nobody is going to read a wall of notes with the same attention every morning. What actually happens is that the first few warnings get investigated, then the list grows, then it becomes a backlog nobody owns, and then the next genuinely bad output sails through because the signal drowned in noise weeks ago. The only sustainable quality mechanism at that throughput is a gate that physically stops the bad output from reaching the next step. If the check fails, the pipeline halts, the fix becomes the only way forward, and the output does not ship broken while I convince myself I will get to it later.
A blocking gate is a script that exits nonzero and halts the pipeline. The check itself can be anything that matters enough to stop for. In my pipelines I have gates that look for a required pattern in the output, gates that confirm factual claims have a source. The specifics vary by what I am shipping. What does not vary is the contract. Fail means stop.
The temptation to build soft warnings comes from wanting coverage without friction. A hard gate means the pipeline stops, and stopping means I have to fix something before anything else moves. That feels expensive in the moment. It feels like the gate is in the way. But the alternative is shipping something broken and discovering it after the fact, which costs more every single time. I have never once looked at a gate that caught a real problem and thought it was not worth the interruption. I have regularly looked at warnings I scrolled past and wished they had stopped me. The friction of stopping is the point. It forces a decision at the exact moment when the decision is cheapest to make.
The hard question is what to do about checks that resist binary measurement. Tone is not naturally pass/fail. Factual accuracy has gradients. My answer is to force the binary anyway, even if the threshold is imperfect. A tone check that blocks anything scoring below a 6 out of 10 is better than a tone check that prints a number and ships regardless. Will it sometimes block output that was fine? Yes. Will I override it occasionally? Yes. Both of those costs are lower than the slow erosion that happens when every check is advisory. I would rather tune a threshold over time than train myself to ignore it.
The override itself matters. When a gate blocks output I judge to be acceptable despite failing the check, I override it with a deliberate, logged decision. That is different from a warning in one critical way. The override requires me to engage with the failure, decide this specific case is an exception, and record that I made that call. A warning requires nothing. It scrolls by. Nobody ever had to justify scrolling. The override mechanism keeps the gate credible while acknowledging that no automated check is perfect. The moment you remove the friction of stopping, you remove the reason the gate exists. Every gate I trust today has a short list of past overrides beside it, and that list is what proves the gate is doing work.
I build fewer checks than I used to. Each one has to earn its place by being worth stopping for. If a quality dimension is not important enough to halt the pipeline when it fails, it is not important enough to check, because a check I will not act on is a check that teaches me to ignore all the others.