Governance at Scale
As artificial intelligence reshapes healthcare operations, from member outreach to risk adjustment, health plans face a pivotal question: how can they harness large language models (LLMs) safely and strategically? The answer lies in robust governance that tiers each model by its capabilities, accuracy, bias risk, and regulatory exposure.
The Landscape: Capabilities Without Guarantees
Large language models have emerged as versatile tools capable of generating fluent, contextually rich content and responding to queries across a wide spectrum of domains. Some models excel at conversational fluency, while others focus on delivering traceable, source-backed answers. However, fluency and technical metrics like perplexity, which measures how well a model predicts the next word in a sequence, do not guarantee factual reliability, safety, or fairness.
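To make that distinction concrete, here is a minimal sketch, in plain Python with illustrative numbers, of how perplexity is derived from a model's per-token log-probabilities. The metric rewards predictability alone: a fluent but factually wrong sentence can score better than a clumsy but accurate one.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    Lower means the model found the text more predictable;
    it says nothing about whether the text is true.
    """
    avg_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_logprob)

# Illustrative values only.
print(perplexity([-0.2, -0.1, -0.3]))  # ~1.22: highly "fluent"
print(perplexity([-2.5, -3.1, -2.8]))  # ~16.4: less predictable
```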
While these models demonstrate strong baseline performance, they can generate incomplete, outdated, or hallucinated content. A Stanford benchmarking study found that even purpose-built legal research models hallucinated in one out of six (or more) queries [1], and evaluations of foundation models in medicine document comparable failure modes [2], particularly in summarization and treatment recommendations. These shortcomings make rigorous evaluation and governance essential when applying LLMs in healthcare, where the stakes include patient safety, regulatory compliance, and operational integrity.
Strengths and Limitations in a Healthcare Setting
Clinical studies and operational evaluations suggest that general-purpose LLMs show promising results in areas like patient communication, decision support, and knowledge synthesis. However, assessments also reveal inconsistent accuracy, response variability, and hallucinated data or references. A Mayo Clinic review found that only 59% of model-generated clinical advice aligned with established medical guidelines when left unchecked [2]. Models often miss nuance in medical context and decision-making logic, and may align poorly with real-world clinical practice.
These limitations reinforce a critical truth: even the most sophisticated LLMs must be carefully validated and monitored, particularly when integrated into healthcare workflows that impact diagnoses, treatments, or member experiences.
A Regulatory Horizon: LLMs as High-Risk Medical Tools
Governance is becoming non-negotiable. The FDA’s AI/ML Action Plan calls for lifecycle monitoring, model versioning, and real-world performance auditing. The European Union’s AI Act classifies many healthcare-related AI systems, including AI embedded in medical devices, as "high-risk," and evolving interpretations of HIPAA increasingly touch on algorithmic transparency and data traceability.
A Deloitte report from 2023 found that 71% of healthcare executives believe AI regulations will significantly affect future digital strategies, particularly around LLM use [3]. For health plans, this means implementing a rigorous framework that risk-tiers LLMs based on their application, capability, and potential for harm.
A Four-Tier Risk Framework for Health Plan LLMs
Mizzeto proposes a structured tiering model aligned with payer priorities in compliance, automation, and member impact.
Tier 1: Advisory or Information Retrieval
Tier 1 includes models used for non-clinical functions such as internal knowledge bases, FAQ bots, and general education. These applications typically present minimal risk, as they do not influence care decisions or involve sensitive data handling. The primary concerns here are outdated content and potential inaccuracies, which can usually be mitigated with well-defined content review cycles.
Governance strategies at this level should focus on basic controls: logging user interactions, conducting periodic accuracy audits, and performing privacy impact assessments (PIAs) to ensure no protected health information (PHI) is inadvertently introduced. These models are well suited for provider self-service portals, employee onboarding, and low-risk internal search applications.
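As a hedged illustration of the kind of automated control that fits Tier 1, the sketch below runs a lightweight PHI spot-check over a model output before it is logged. The pattern names and regexes are simplified assumptions for this example; a production deployment would rely on a vetted de-identification library or service, not ad-hoc patterns.

```python
import re

# Illustrative patterns only -- far from exhaustive PHI coverage.
PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "mrn": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
}

def scan_for_phi(text: str) -> list[str]:
    """Return the names of any PHI-like patterns found in the text."""
    return [name for name, pat in PHI_PATTERNS.items() if pat.search(text)]

flags = scan_for_phi("Member MRN: 00123456 called about coverage.")
if flags:
    print(f"Hold for privacy review before logging: {flags}")  # ['mrn']
```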
Tier 2: Administrative Automation
Tier 2 applies to models assisting with operational workflows such as claims triage, prior authorization support, and provider communications. These models play a more active role in administrative decision-making, which introduces a higher risk of downstream impact. Errors at this level could lead to incorrect approvals, delays in processing, or provider dissatisfaction.
Due to this elevated risk, governance must include human-in-the-loop oversight for high-stakes outputs. Logs should capture both prompts and model responses, and performance monitoring should track error rates, bias, and hallucination frequency. Following guidance such as the NIST AI Risk Management Framework, health plans should incorporate calibration tests that measure overconfidence in outputs and reduce automation bias.
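To make "calibration testing" concrete, the sketch below computes a standard Expected Calibration Error (ECE) over a sample of audited outputs: the gap between how confident the model claims to be and how often it is actually right. This is a generic illustration rather than a procedure prescribed by NIST, and the confidence values and audit labels are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: average gap between stated confidence and observed accuracy,
    weighted by how many samples fall into each confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    n, ece = len(confidences), 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# A model claiming ~90% confidence while right only 60% of the time
# is overconfident -- a known driver of automation bias.
print(expected_calibration_error([0.90, 0.92, 0.88, 0.91, 0.90],
                                 [1, 0, 1, 0, 1]))  # ~0.35
```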
Tier 3: Clinical-Support Applications
This tier includes use cases that directly assist clinical staff or members in understanding care options, interpreting medical information, or identifying risk factors. These models often influence—but do not finalize—care decisions. Because they operate in a high-stakes domain, even small inaccuracies or biases can disproportionately affect health outcomes or erode trust.
Effective governance in Tier 3 requires multiple layers of human review, ideally involving clinicians who can assess content accuracy and relevance. Models should be stress-tested using adversarial techniques to detect vulnerabilities such as data poisoning or performance degradation over time. Additionally, governance must track model provenance, enforce version control, and implement audit trails aligned with FDA and NIST guidelines.
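As one illustration of what provenance tracking and audit trails might capture at this tier, here is a minimal record schema. The field names are assumptions made for this sketch, not a mandated FDA or NIST format; the point is that every clinical-support response should be reconstructable after the fact.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class AuditRecord:
    model_name: str         # which model produced the output
    model_version: str      # pinned version, so results are reproducible
    prompt_hash: str        # hash of the exact prompt (avoids storing raw PHI)
    response_hash: str      # hash of the exact response
    reviewer_id: str        # the clinician who reviewed the output
    reviewer_decision: str  # "approved", "edited", or "rejected"
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# Hypothetical entry; hashes are truncated placeholders.
record = AuditRecord("care-summary-assistant", "2.4.1",
                     "sha256:1f3a...", "sha256:9bc0...",
                     "RN-4821", "edited")
print(record)
```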
Tier 4: Regulated Diagnostic or Therapeutic Support
The highest tier is reserved for models that directly assist with diagnosis, treatment planning, or other regulated medical functions. These systems are considered Software as a Medical Device (SaMD) and must follow FDA premarket pathways such as 510(k) clearance or De Novo classification. They are subject to the highest scrutiny due to their potential to directly impact patient care.
Governance in Tier 4 must be rigorous and comprehensive. This includes validated performance benchmarks, adherence to GxP practices, explainability standards, and the ability to override model recommendations in real time. These systems also require continuous real-world monitoring to ensure safety and effectiveness, along with extensive bias testing to confirm equitable performance across diverse populations. Only models that meet these stringent requirements should be deployed in high-impact diagnostic or therapeutic environments.
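Bringing the four tiers together, the sketch below encodes the framework as a cumulative-controls lookup in which each tier inherits the obligations of the tiers beneath it. The control labels are shorthand for the recommendations above, not a formal standard.

```python
from enum import IntEnum

class RiskTier(IntEnum):
    ADVISORY = 1          # information retrieval, FAQ bots
    ADMINISTRATIVE = 2    # claims triage, prior-auth support
    CLINICAL_SUPPORT = 3  # care-option interpretation, risk flagging
    REGULATED_SAMD = 4    # diagnostic or therapeutic support

# Minimum controls introduced at each tier.
BASELINE_CONTROLS = {
    RiskTier.ADVISORY: ["interaction logging", "content review cycles", "PHI screening"],
    RiskTier.ADMINISTRATIVE: ["human-in-the-loop review", "calibration testing"],
    RiskTier.CLINICAL_SUPPORT: ["clinician review", "adversarial stress tests", "versioned audit trail"],
    RiskTier.REGULATED_SAMD: ["FDA premarket pathway", "real-world monitoring", "bias testing"],
}

def required_controls(tier: RiskTier) -> list[str]:
    """Cumulative controls: a Tier 3 use case carries Tier 1-3 obligations."""
    return [c for t in RiskTier if t <= tier for c in BASELINE_CONTROLS[t]]

print(required_controls(RiskTier.CLINICAL_SUPPORT))
```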
Why Tiering Matters for Health Plans
A tiered governance model offers multiple strategic advantages. It enables fast rollout of low-risk tools while dedicating due diligence to high-risk applications. It ensures compliance with regulatory bodies like the FDA and aligns with global standards such as the EU AI Act. Most importantly, it focuses oversight where it matters most—on applications where errors can cause harm.
Health plans can operationalize this framework by cataloging LLM use cases and mapping them to the appropriate tier. Governance committees—spanning compliance, clinical, and IT—can establish playbooks, monitoring protocols, and update cadences. Dashboards tracking hallucination rates, bias drift, and PHI leakage support transparency and continuous improvement. This governance strategy dovetails with Mizzeto’s core philosophy: Protect People, Prioritize Equity, and Promote Health Value.
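As a sketch of how such a dashboard might raise alerts, the snippet below compares weekly metrics against thresholds. The threshold values and metric names are placeholders; in practice each plan would set them per tier and per use case.

```python
# Placeholder thresholds -- to be set per tier and use case.
ALERT_THRESHOLDS = {
    "hallucination_rate": 0.02,  # share of audited outputs with fabricated content
    "bias_drift": 0.05,          # change in subgroup error-rate gap vs. baseline
    "phi_leakage_rate": 0.0,     # any PHI in logs or outputs is reportable
}

def governance_alerts(metrics: dict[str, float]) -> list[str]:
    """Flag any observed metric that exceeds its alert threshold."""
    return [
        f"{name} = {value:.3f} exceeds threshold {ALERT_THRESHOLDS[name]}"
        for name, value in metrics.items()
        if name in ALERT_THRESHOLDS and value > ALERT_THRESHOLDS[name]
    ]

weekly = {"hallucination_rate": 0.031, "bias_drift": 0.012, "phi_leakage_rate": 0.0}
for alert in governance_alerts(weekly):
    print(alert)  # hallucination_rate = 0.031 exceeds threshold 0.02
```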
Additionally, implementing this model encourages a culture of responsible innovation. It gives organizations a structured way to experiment with new LLM applications while minimizing exposure to risk. Teams across legal, compliance, product, and data science can speak a common governance language, ensuring that development velocity doesn’t outpace safety and trust requirements.
Mizzeto has already begun implementing this governance model at scale for a Fortune 500 healthcare company, supporting LLM deployment across multiple departments including claims operations, care coordination, and digital member services. By embedding tiered oversight into AI adoption, Mizzeto has helped this client reduce operational risk, meet regulatory expectations, and confidently scale their use of generative AI while keeping patient safety and data integrity at the forefront.
The Road Ahead
As LLM adoption accelerates, governance frameworks must evolve. Explainable AI is essential for clinician trust. Bias detection mechanisms are critical for fair outcomes. Guardrails against data poisoning and alignment with NIST/WHO guidelines will future-proof these systems.
Notably, a McKinsey report found that 60% of healthcare leaders plan to expand generative AI initiatives in 2024, but only 21% have implemented formal governance structures to manage associated risks [4]. These gaps underscore the need for structured oversight like the tiering approach outlined here.
Health plans are at a turning point. Poorly governed AI can result in clinical missteps, regulatory fines, or reputational harm. Smart governance, on the other hand, transforms risk into strategic advantage. By stratifying LLMs into risk-aligned tiers, Mizzeto empowers health plans to deploy AI responsibly, drive innovation, and safeguard patient trust. Governance isn’t just compliance—it’s the infrastructure for sustainable, scalable AI success in healthcare.
If your organization is navigating the complexities of LLM deployment and seeking a structured, proven approach to governance, Mizzeto is here to help. With deep experience implementing tiered risk models for Fortune 500 healthcare clients, we understand how to balance innovation with compliance, safety, and ROI. Whether you're exploring administrative use cases or deploying LLMs in clinical environments, our team can guide you through every step of responsible integration. Please reach out to Mizzeto to learn how we can help you properly risk-tier your LLMs and deploy them with confidence.
References
[1] AI on Trial: Legal Models Hallucinate in 1 out of 6 (or More) Benchmarking Queries
[2] Medical Hallucinations in Foundation Models and Their Impact on Healthcare
[3] About 40% of health execs say generative AI pays off, Deloitte finds
[4] Generative AI in healthcare: Current trends and future outlook