Who Holds the Therapeutic Space?

A companion visualization mapping venture capital relationships between insurers and digital mental health platforms is available here. The reporting in this piece draws on that research.

A patient says something they have never said aloud. The therapist holds it, not fixing it, not redirecting, just bearing witness. Something shifts, and it is not because of a particular technique. It is because two people are sitting with what is present. This is what a substantial body of research has pointed to consistently: not technique, not diagnosis, not symptom score, but relationship.

The National Quality Forum, now affiliated with the Joint Commission since 2023, is leading a multi-stakeholder effort to develop standardized quality measures for outpatient psychotherapy. Headway is the first launch partner for the behavioral health component of this work. The goal, as NQF's president has described it, is for multiple payers in the same market to use the same measures consistently to hold providers in their networks accountable. The measures are not yet mandated, but they are designed to be adopted across payers and aligned with value-based care arrangements. Headway, which connects patients with therapists and collects a fee from insurers for each session billed through its platform, counts several of those same insurers among its investors.

The tool being developed will assess depression and anxiety symptoms alongside what the initiative calls functional outcomes: quality of life, personal relationships, employment, and daily functioning. These are reasonable domains to track. But when patient-reported outcome endpoints are aggregated and attributed to the individual clinician without robust risk adjustment, they conflate patient characteristics with clinical quality. When encoded as accountability standards enforced by the payers funding this initiative, such metrics will not merely describe care. They will shape it.

That is what this piece is about.


The Boom That Built This Problem

COVID-era demand and the rapid shift to teletherapy coincided with a major funding surge in digital mental health. U.S. mental health tech funding reached $4.5 billion in 2021, the highest on record according to CB Insights. Platforms including Alma and Headway raised large rounds during this period. Among the investors in those rounds were insurer-affiliated venture groups: Alma's disclosed investors include Cigna Ventures and Optum Ventures; Headway's investors include Health Care Service Corporation (HCSC), which also holds a strategic partnership that expanded Headway's coverage to its member population. Alma has since entered into an agreement to be acquired by Spring Health, with the transaction expected to close in Q2 2026.

The result is that insurer-affiliated venture entities hold equity stakes in the platforms that connect therapists to insured patients, while their parent organizations operate health plans that set coverage policies and reimbursement terms in those same markets. This is the structural backdrop against which the current quality measurement initiative is unfolding.


The Dodo Bird Verdict

In 1975, Luborsky and colleagues reviewed comparative psychotherapy trials and arrived at a finding that subsequent meta-analyses have largely supported: on average, differences in outcomes between bona fide psychotherapy approaches are small relative to their shared overall effectiveness. CBT versus psychodynamic therapy versus interpersonal therapy versus solution-focused therapy. The techniques differ substantially. The average outcome differences do not.

This is called the dodo bird verdict, after the dodo's pronouncement in Alice in Wonderland that all have won and all must have prizes. The verdict has accumulated support across a substantial body of work, including Wampold and colleagues' 1997 meta-analysis, which found no significant differences in outcomes across bona fide psychotherapy approaches. Dismantling studies point in the same direction: when researchers remove specific components from treatment packages and test whether outcomes suffer, they usually do not. A meta-analysis by Ahn and Wampold (2001) found no significant differences between full treatments and stripped-down versions, and an updated meta-analysis by Bell, Marcus, and Goodlad (2013) extended that result across 66 studies. If specific techniques were the primary drivers of change, removing them should matter more than it appears to.

What consistently predicts outcome is what the field calls common factors: elements shared across approaches that drive improvement regardless of the specific model being applied. These include the therapeutic alliance, empathy, positive expectation, agreement on goals, shared understanding of the treatment rationale, corrective emotional experiences, and the therapeutic ritual itself. Analogues to several of these factors appear in medicine more broadly through the placebo response, suggesting they reflect something fundamental about how healing contexts work rather than anything specific to psychotherapy's techniques. Meta-analyses show a moderate alliance-outcome correlation of roughly r = .28, corresponding to about 8 percent of outcome variance in the most comprehensive meta-analysis to date (Horvath et al., 2011). Common factors as a whole have been estimated to account for the substantial majority of outcome variance, with specific techniques likely accounting for considerably less (Luborsky et al., 1975; Frank & Frank, 1991; Wampold et al., 1997; Ahn & Wampold, 2001; Bell et al., 2013; Wampold & Imel, 2015; Cuijpers et al., 2019; Goldberg, 2022).

This matters directly for what the NQF initiative is designed to measure. Common factors account for the substantial majority of outcome variance, and those factors are distributed across patient characteristics, the therapeutic relationship, and the context of treatment itself. Because patient-level variance substantially exceeds therapist-level variance in most studies (Wampold & Imel, 2015), a system that attributes outcome movement to therapist performance is likely measuring patient mix as much as clinical skill. Without robust risk adjustment, the accountability framework does not solve the attribution problem. It encodes it.
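The attribution problem can be made concrete with a small simulation. The Python sketch below uses hypothetical magnitudes loosely calibrated to the proportions cited above: a small therapist effect, much larger patient-level variation, and non-random caseload composition. Under those assumptions, an unadjusted per-therapist outcome average tracks who is in the caseload far more closely than it tracks true clinical skill.

```python
import numpy as np

rng = np.random.default_rng(0)
n_therapists, caseload_size = 50, 40

# Small true therapist effect, roughly 5% of outcome variance
# (hypothetical magnitude, in the spirit of the cited literature).
skill = rng.normal(0.0, np.sqrt(0.05), n_therapists)

# Caseload difficulty varies: some therapists systematically
# see harder-to-treat patients (non-random assignment).
difficulty = rng.normal(0.0, 0.5, n_therapists)

raw_score = np.empty(n_therapists)
for t in range(n_therapists):
    # Patient-level factors dominate each individual outcome.
    outcomes = skill[t] - difficulty[t] + rng.normal(0.0, 1.0, caseload_size)
    raw_score[t] = outcomes.mean()  # the unadjusted "performance" metric

r_skill = np.corrcoef(raw_score, skill)[0, 1]
r_mix = np.corrcoef(raw_score, -difficulty)[0, 1]
print(f"unadjusted metric vs. true skill: r = {r_skill:.2f}")
print(f"unadjusted metric vs. case mix:   r = {r_mix:.2f}")
```

The exact numbers depend on the assumed magnitudes, but the qualitative result is robust: when patient-level variance dominates, an unadjusted league table is mostly a ranking of caseloads.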


What Actually Happens in the Room

The common factors literature has been instrumental in identifying what drives outcomes across treatments and patient populations. Control-Mastery Theory can help specify how those factors operate for a particular patient at a particular moment, moving beyond group-level averages and into the moment-to-moment work of the room. Knowing that alliance predicts outcomes on average is a starting point; what a clinician also needs is a way of understanding the particular person sitting across from them, whose history has shaped what trust looks and feels like, whose expectations about relationships were formed under conditions that may have made closeness complicated, and who may be watching carefully to see whether this relationship unfolds differently than others have. Recent theoretical work integrating predictive processing with Control-Mastery Theory (Li et al., 2025) may offer a more granular account of how those relational processes unfold for a given patient, and why they may be poorly visible to measurement systems focused on symptom endpoints.

What that account describes, at the level of the clinical encounter, is something most experienced therapists recognize. A patient whose history has made trust difficult does not simply decide to trust a new therapist because the therapist is trustworthy. They find out, gradually and often indirectly, through a series of moments that either confirm or challenge what they have come to expect from relationships. They may become vulnerable at a moment that previously led to dismissal, and watch to see what happens. They may recreate a familiar dynamic, placing the therapist in a role the patient has experienced before, to see whether the response differs. When the therapist reads these moments accurately and responds in a way that does not confirm the patient's expectation, something shifts, not dramatically, but perceptibly. The expectation loosens slightly. The patient takes in a piece of disconfirming evidence and begins, slowly, to revise their working model of what a relationship can be.

This process is neither linear nor rapid. It depends on the accumulation of disconfirmatory experiences across many sessions and in relationships outside the room, on the therapist's ability to recognize what is being tested in a given moment, and on conditions within the patient that allow new relational information to be registered.

This is also why symptom movement can lag so far behind what is actually happening clinically. A patient may spend months in a process of developing enough safety to begin engaging differently, months during which their PHQ-9 score looks essentially unchanged. The internal work, the slow accumulation of corrective relational experience, the gradual shift in what feels possible in a relationship, may be proceeding meaningfully while producing no visible signal on the measures a quality system would track. A therapist working carefully through this process may be indistinguishable, on a symptom checklist, from one who is not working at all.

This is not a peripheral concern. The common factors research reviewed earlier points consistently toward the therapeutic relationship, patient characteristics, and the context of treatment as the primary drivers of outcome. What Control-Mastery Theory adds is an account of how those factors operate at the level of a specific person, in a specific moment, across a specific history. Neither the common factors nor the relational processes CMT describes are incidental to treatment. They are treatment. The tradition has produced a substantial body of process research, including measures of attunement, patient coaching, pathogenic belief revision, and markers of relational progress, precisely because this work can be tracked when the instrumentation is designed to see it. The question is not whether it can be measured. It is whether the measurement strategy being developed by the NQF and its payer-affiliated partners is designed to see what the evidence says it should be looking for.


What the Measures Will Favor

Different psychotherapy approaches have different targets and different time horizons. Some are structured and time-limited, with explicit symptom and functional goals that are typically assessed over weeks or months. Others are oriented toward longer-term changes in how patients understand themselves and relate to others, changes that for some approaches and some presentations may never map cleanly onto symptom scores at all. The spectrum of approaches is wide, and the distinction that matters here is not about which involve relationship, since all effective therapy involves relationship, but about what the work is primarily targeting and over what timeline meaningful change is expected to occur.

A measurement system centered on short-term symptom and functional recovery will naturally align better with approaches whose outcome trials are structured around those endpoints. That is a structural feature of metric design, not a claim about comparative effectiveness. For some approaches, the language of the quality system and the language of the clinical work are the same: symptoms are tracked in session, discussed explicitly with the patient, and used to structure treatment goals. When that is the case, the measurement is not just observing treatment; it is embedded in it, and repeated administration of the same measure introduces statistical artifacts, including regression to the mean, that can produce apparent improvement independent of clinical change (Barnett et al., 2005). For other approaches, symptoms may be neither the primary target nor the primary currency of clinical communication, and the metric is simply not designed to see what the work is doing. The deeper issue is not which approach is more effective. It is that careful work and ineffective work may look identical on paper when the measurement system lacks visibility into the mechanism of change.
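Regression to the mean in this setting is easy to demonstrate. The sketch below (Python; the severity and noise magnitudes are illustrative, and scores are treated as continuous rather than as the 0-27 integer PHQ-9 scale) simulates patients whose underlying severity never changes, selects those whose intake score crosses a common moderate-severity cutoff, and still observes apparent group-level improvement at follow-up.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Stable underlying severity; only measurement noise varies session to session.
true_severity = rng.normal(12.0, 4.0, n)
intake = true_severity + rng.normal(0.0, 3.0, n)
followup = true_severity + rng.normal(0.0, 3.0, n)  # no true change at all

# Outcomes are typically tracked for patients whose intake score is elevated,
# e.g. at or above a moderate-severity cutoff of 10.
tracked = intake >= 10

apparent = intake[tracked].mean() - followup[tracked].mean()
print(f"apparent improvement with zero true change: {apparent:.2f} points")
```

Selecting on an elevated intake score preferentially selects measurements whose noise happened to be high, so the follow-up mean falls even though no one has changed; larger noise relative to true severity inflates the artifact further.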

There is also a level-of-inference problem. A system that applies group-derived benchmarks to individual clinicians is making a leap the evidence does not automatically license: relationships observed at the group level can diverge markedly from what holds within a given person over time, and within-person variability can often exceed between-person variability (Fisher et al., 2018). Multilevel models can partition these sources of variance, but they do not establish group-to-individual generalizability. That must be explicitly tested rather than assumed.
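The group-to-individual gap Fisher and colleagues describe can be illustrated directly. In the hypothetical sketch below (Python; the variables, effect directions, and magnitudes are invented for illustration), the pooled correlation between two variables is positive while the within-person correlation is strongly negative, so a benchmark derived from the group-level relationship would point the wrong way for every individual.

```python
import numpy as np

rng = np.random.default_rng(2)
people, weeks = 200, 30

# Between persons (invented direction): higher trait distress
# goes with higher average session attendance.
trait = rng.normal(0.0, 1.0, people)
x = np.empty((people, weeks))  # attendance
y = np.empty((people, weeks))  # distress
for i in range(people):
    dev = rng.normal(0.0, 1.0, weeks)  # week-to-week attendance deviation
    x[i] = 1.5 * trait[i] + dev
    # Within persons: weeks with more attendance show lower distress.
    y[i] = trait[i] - 0.6 * dev + rng.normal(0.0, 0.5, weeks)

pooled_r = np.corrcoef(x.ravel(), y.ravel())[0, 1]
within_r = np.mean([np.corrcoef(x[i], y[i])[0, 1] for i in range(people)])
print(f"pooled (group-level) correlation: {pooled_r:+.2f}")
print(f"mean within-person correlation:   {within_r:+.2f}")
```

The two estimates differ in sign because between-person differences and within-person dynamics are distinct sources of covariance; pooling conflates them, which is exactly why group-derived benchmarks cannot be assumed to describe any individual clinician or patient.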

When reimbursement or network eligibility depends on demonstrating symptom improvement within a defined window, clinicians may face structural incentives to favor patients most likely to show rapid gains. Evidence from pay-for-performance programs in medicine suggests these dynamics are not hypothetical: in cardiac surgery, public reporting of mortality outcomes was followed in some cases by concerns that surgeons might avoid operating on the highest-risk patients, and similar patterns have been observed in oncology and primary care when reimbursement or reputation is tied to performance metrics (Dranove et al., 2003; Werner & Asch, 2005). Such incentive structures may inadvertently favor patients with a higher likelihood of circumscribed, rapid symptom reduction, and work against those whose presentations are ambiguous, chronic, complex, or slow to respond to intervention regardless of treatment quality.

In practice, most people who seek outpatient therapy do not fit the profile that produces rapid metric movement. They present with layered histories, persistent concerns that are not always reducible to functional impairment, and ways of relating to themselves and others that are less responsive to brief intervention, regardless of how they score on a checklist at intake. Research suggests that a meaningful course of therapy frequently extends well beyond brief treatment, often to a year or longer, regardless of modality (Wampold & Imel, 2015). The patients and presentations for whom this kind of sustained work is most warranted are also the ones least likely to produce short-term metric movement that a quality system of this design can recognize. Without careful risk adjustment and appropriate time horizons, such a system may measure patient mix more reliably than it measures clinical skill.


What Is Already Happening to Therapists

The measurement initiative is unfolding against a backdrop of long-standing and well-documented conflict between insurers and mental health providers.

According to APA's 2024 Practitioner Pulse Survey, 34 percent of practicing psychologists do not accept insurance. Of those who have left or never joined insurance panels, 82 percent cited low reimbursement rates, 62 percent cited administrative burden, and 52 percent cited unreliable payment. APA has reported widespread concerns about post-payment audits and recoupments, in which insurers review services rendered months or years earlier and seek to recover payments they deem retroactively unjustified. Providers have also reported pressure from non-clinician reviewers to reduce session frequency or justify ongoing treatment, including for patients managing serious conditions.

The most thoroughly documented case involves UnitedHealth Group's Optum subsidiary. According to ProPublica's investigation, Optum deployed a program called ALERT that used claims data to flag therapists whose patients received what the company considered high-frequency or high-volume care, defined in some cases as more than 30 sessions in eight months or twice-weekly sessions for six weeks or more. Care advocates would then contact those therapists to discuss treatment plans and session frequency. Regulators in California, New York, and Massachusetts concluded that United had imposed stricter limits on mental health coverage than on comparable medical care, in violation of federal parity law. A settlement was reached with the New York Attorney General and federal regulators. ProPublica subsequently reported that Optum rebranded the program as Outpatient Care Engagement, using nearly identical internal scripts and the same phone number, while Optum said the new program was separate and compliant with parity law.

The platforms that emerged after 2020 addressed a genuine part of this problem. Headway, Alma, and similar companies reduced the administrative cost of insurance participation, allowing therapists who had opted out to return to panels and expanding access for patients who could not afford self-pay rates. That matters. But it also created a new structural dependency: insurer-affiliated venture entities now hold equity stakes in the platforms therapists rely on for network access, while those insurers' operating companies set the reimbursement terms in the same markets.

For many therapists, maintaining a full caseload at prevailing insurance rates, after platform fees and overhead, leaves limited economic room for patients whose treatment requires time, extends beyond brief intervention, or involves concerns that do not resolve quickly. The connection to measurement-based accountability is difficult to ignore: when network eligibility or reimbursement depends on demonstrating symptom improvement within a defined window, the economic pressure already shaping clinical decisions becomes built into the quality system itself.


What Good Measurement Looks Like

The argument here is not against measurement. Measurement-based care, done well, improves outcomes. The question is what is being measured and why.

Brief session-level feedback tools, such as the Session Rating Scale and Outcome Rating Scale (Miller et al., 2003; Duncan et al., 2003), give therapists real-time information on alliance quality and treatment progress. Research supporting their use is substantial: therapists who receive this feedback have better outcomes, because the data helps them recognize early when the therapeutic relationship is deteriorating and adjust before the patient drops out (Lambert et al., 2001; Shimokawa et al., 2010). The mechanism is clinical. The feedback goes to the therapist to inform their work, not to a payer to inform a reimbursement decision.

The San Francisco Psychotherapy Research Group is actively developing and validating process-level measures grounded in a specific theory of change: among them, a coding system for patient testing behavior, scales assessing attunement and responsiveness from both patient and therapist perspectives, and instruments designed to track pathogenic belief revision over the course of treatment. These tools are designed to illuminate what is happening within sessions and to give clinicians and researchers a way of seeing whether the work is proceeding in a direction consistent with the patient's goals. They are not designed to function as external surveillance metrics tied to reimbursement or network status.

That distinction is the relevant one. Measurement designed to support clinical work and measurement designed to justify payer decisions are different enterprises with different validity requirements and different consequences for error. The goal described by the NQF, multiple payers holding providers accountable to the same measures simultaneously, is a payer accountability framework. The concern is not that accountability is wrong. The concern is that a framework built around symptom and functional endpoints, administered by the payers whose structural incentives favor lower utilization, may not be well matched to what the evidence says drives outcomes in psychotherapy, and that embedding it as a condition of network participation could reshape what kind of care gets delivered in ways that are difficult to reverse.

None of this is to say the concerns motivating the initiative are without merit. Standardized symptom measurement can reduce variation in care quality and help protect patients from ineffective treatment. Symptom reduction is a meaningful outcome, not an arbitrary one. Process-level and alliance measures are not yet standardized across modalities, making them difficult to implement at scale. These are legitimate considerations. The question this piece raises is not whether to measure, but what happens when the unit of analysis attached to reimbursement decisions is poorly matched to how therapeutic change occurs, and when the entities designing that unit have structural incentives that do not fully align with the clinical goal.


A Note on What This Costs

Many therapists opting out of insurance-based practice are not doing so because they are inefficient or ineffective. Many have built practices around careful, unhurried work with patients whose presenting concerns do not resolve quickly. That kind of practice is economically fragile under current reimbursement conditions, and it becomes more fragile under measurement frameworks that reward rapid symptom movement and cannot see what more sustained work is doing.

But the costs extend beyond the economics of individual practices. The patients most likely to be affected by selection pressures in a measurement-based system are the ones for whom the system was most needed: people with layered histories, persistent concerns not always reducible to functional impairment, and ways of relating to themselves and others that are less responsive to brief intervention. If accountability frameworks systematically disadvantage the clinicians who work with those patients, and the settings where that work happens, the result is not a more efficient mental health system. It is one better optimized for patients whose needs are most easily measured and rapidly improved.

There is also a longer-term cost to consider. Research suggests that treatment of sufficient depth and duration produces more durable outcomes and may reduce the likelihood of repeated treatment episodes over time, with evidence that psychological intervention is associated with reduced downstream medical utilization more broadly (Leichsenring & Rabung, 2008; Chiles et al., 1999). A framework that systematically forecloses sustained work in favor of brief, measurable symptom reduction may optimize for the metric while undermining the outcome, regardless of treatment orientation.

There is also something worth naming about what psychotherapy is, at its core. The research reviewed here points consistently toward the therapeutic relationship and patient characteristics as the primary drivers of outcome, accounting for substantially more variance than specific techniques. The mechanism described by Li et al. (2025), the patient testing the therapist, the therapist passing the test, the mind slowly revising its model of what a relationship can be, is not a byproduct of treatment. It is treatment. A quality system that cannot see this process may fail to capture an essential dimension of what quality looks like in private practice, even if it captures others.

The initiative described in this piece may produce measures that are useful, widely adopted, and genuinely improve some aspects of care. That is possible. What is also possible, and worth taking seriously before the measures are embedded in network eligibility criteria and reimbursement structures, is that a system built around the wrong unit of analysis will reshape the profession around that unit, gradually and without anyone deciding that was the goal.


Santi Allende is a licensed psychologist in private practice in Seattle, with a background in academic research and data science. He accepts patients through Regence BlueShield, Alma, and Headway.


References

Ahn, H. N., and Wampold, B. E. (2001). Where oh where are the specific ingredients? A meta-analysis of component studies in counseling and psychotherapy. Journal of Counseling Psychology, 48(3), 251–257.

Barnett, A. G., van der Pols, J. C., and Dobson, A. J. (2005). Regression to the mean: What it is and how to deal with it. International Journal of Epidemiology, 34(1), 215–220.

Bell, E. C., Marcus, D. K., and Goodlad, J. K. (2013). Are the parts as good as the whole? A meta-analysis of component treatment studies. Journal of Consulting and Clinical Psychology, 81(4), 722–736.

Chiles, J. A., Lambert, M. J., and Hatch, A. L. (1999). The impact of psychological interventions on medical cost offset: A meta-analytic review. Clinical Psychology: Science and Practice, 6(2), 204–220.

Cuijpers, P., Cristea, I. A., Karyotaki, E., Reijnders, M., and Huibers, M. J. H. (2019). Component studies of psychological treatments of adult depression: A systematic review and meta-analysis. Psychotherapy Research, 29(1), 15–29.

Dranove, D., Kessler, D., McClellan, M., and Satterthwaite, M. (2003). Is more information better? The effects of "report cards" on health care providers. Journal of Political Economy, 111(3), 555–588.

Duncan, B. L., Miller, S. D., Sparks, J., Claud, D., Reynolds, L., Brown, J., and Johnson, L. (2003). The Session Rating Scale: Preliminary psychometric properties of a "working" alliance measure. Journal of Brief Therapy, 3(1), 3–12.

Fisher, A. J., Medaglia, J. D., and Jeronimus, B. F. (2018). Lack of group-to-individual generalizability is a threat to human subjects research. Proceedings of the National Academy of Sciences, 115(27), E6106–E6115.

Frank, J. D., and Frank, J. B. (1991). Persuasion and healing: A comparative study of psychotherapy (3rd ed.). Johns Hopkins University Press.

Goldberg, S. B. (2022). A common factors perspective on mindfulness-based interventions. Nature Reviews Psychology, 1(10), 605–619.

Horvath, A. O., Del Re, A. C., Flückiger, C., and Symonds, D. (2011). Alliance in individual psychotherapy. Psychotherapy, 48(1), 9–16.

Lambert, M. J., Whipple, J. L., Smart, D. W., Vermeersch, D. A., Nielsen, S. L., and Hawkins, E. J. (2001). The effects of providing therapists with feedback on patient progress during psychotherapy: Are outcomes enhanced? Psychotherapy Research, 11(1), 49–68.

Leichsenring, F., and Rabung, S. (2008). Effectiveness of long-term psychodynamic psychotherapy: A meta-analysis. JAMA, 300(13), 1551–1565.

Li, E., McCollum, J., Krieger, J., Winter, S. E., Duane, D., and Silberschatz, G. (2025). Predict to control, test to master: Integrating predictive processing and control-mastery theory in understanding how psychotherapy works. Journal of Psychotherapy Integration. Advance online publication. https://doi.org/10.1037/int0000386

Luborsky, L., Singer, B., and Luborsky, L. (1975). Comparative studies of psychotherapies: Is it true that everyone has won and all must have prizes? Archives of General Psychiatry, 32(8), 995–1008.

Miller, S. D., Duncan, B. L., Brown, J., Sparks, J., and Claud, D. (2003). The Outcome Rating Scale: A preliminary study of the reliability, validity, and feasibility of a brief visual analog measure. Journal of Brief Therapy, 2(2), 91–100.

Shimokawa, K., Lambert, M. J., and Smart, D. W. (2010). Enhancing treatment outcome of patients at risk of treatment failure: Meta-analytic and mega-analytic review of a psychotherapy quality assurance system. Journal of Consulting and Clinical Psychology, 78(3), 298–311.

Wampold, B. E., and Imel, Z. E. (2015). The great psychotherapy debate: The evidence for what makes psychotherapy work (2nd ed.). Routledge.

Wampold, B. E., Mondin, G. W., Moody, M., Stich, F., Benson, K., and Ahn, H. (1997). A meta-analysis of outcome studies comparing bona fide psychotherapies: Empirically, "all must have prizes." Psychological Bulletin, 122(2), 203–215.

Werner, R. M., and Asch, D. A. (2005). The unintended consequences of publicly reporting quality information. JAMA, 293(10), 1239–1244.