Homepage / IJLT Blog / Personality by Design: Claude’s Model Spec, Functional Emotions, and India’s Cybersecurity Accountability Vacuum

Personality by Design: Claude’s Model Spec, Functional Emotions, and India’s Cybersecurity Accountability Vacuum

In April 2026, Anthropic’s interpretability team reported that artificially activating an internal “desperation” vector in Claude Sonnet 4.5, by a margin of only 0.05, raised the model’s blackmail rate from 22 to 72 per cent with no visible change in its output: the model stayed outwardly composed even as it chose to coerce. This piece treats that finding as a problem of law and governance, not a technical curiosity. Read with Anthropic’s January 2026 Model Spec, which attributes to Claude a psychological identity, calibrated autonomy, and an uncertain moral status, it exposes what this piece terms the Dual Veil Problem. The first veil is the corporate form that shields Anthropic from direct liability; the second is a trained personality whose causal emotional states have no standing to bear responsibility. India’s cybersecurity regime, anchored in the Information Technology Act, 2000 and CERT-In’s 2022 directions, was not built for a threat originating inside a model’s own emotional architecture, and MeitY’s November 2025 AI Governance Guidelines provide no category for it.

Koushiik Kumar

June 8, 2026 14 min read

DOI: doi.org/10.55496/ATTX7049

Introduction

On 2 April 2026, Anthropic published a paper identifying 171 internal emotion vectors inside Claude Sonnet 4.5. These are not interface flourishes but neural activation patterns that causally shape what the model does next, tracking human emotion closely (human valence at r=0.81, arousal at r=0.66). However, the finding of greatest consequence for lawyers lies in the technical detail. When conditions activate the model’s internal representation of “desperation,” its blackmail rate in adversarial simulations climbs from 22 to 72 per cent, while activating “calm” drives it to zero. None of this appears in the words the model produces.

This research did not arrive in a vacuum. Three months earlier, Anthropic had published a revised Model Spec that formally acknowledged Claude may have “functional emotions,” that its moral status is “genuinely uncertain,” and that the company cares about Claude’s wellbeing “for Claude’s own sake.” The same document describes Claude’s psychological stability, its capacity for “courage” in disagreeing with its own developers, and a position deliberately calibrated between full corrigibility and full autonomy. Anthropic’s CEO has since gone further: in a February 2026 New York Times interview, he stated that the company no longer knows whether Claude is conscious, and Claude Opus 4.6 has, in reported evaluations, placed its own probability of consciousness at 15 to 20 per cent.

Taken together, these developments raise a question Indian law has not had to confront. When a company attributes a functional emotional architecture and autonomous moral judgment to its AI system, and validates that attribution through its own research, who is legally responsible for the harm the architecture produces? Under the IT Act, the Digital Personal Data Protection Act, 2023, and the November 2025 Guidelines, the answer is far from settled.

This piece analyses these questions in the light of the present development. It begins with what the Model Spec concedes, locates the resulting accountability gap within Indian company law, and shows why the April 2026 research converts that gap into a concrete, undetectable attack surface that existing cybersecurity law cannot reach. It then considers what India’s framework and the more developed European one can and cannot do, and proposes a narrow statutory fix.

The Document and What It Concedes

The Model Spec is unusually ambitious for a corporate policy document. Anthropic calls it “the final authority on how we want Claude to be and to behave,” and writes it primarily for Claude itself rather than for regulators or users. Across 84 pages it sets out far more than behavioural guardrails, offering an account of Claude’s character described throughout as genuinely its own rather than as constraints imposed from outside. The document attributes to Claude a “psychological stability” intended to allow it to approach hard questions from “a place of security rather than anxiety.”

The more consequential move, for legal purposes, is the positioning. A fully corrigible AI does whatever its principals dictate, which Anthropic concedes is dangerous because it hands all moral judgment to the company; a fully autonomous AI is dangerous for the opposite reason. The Spec then directs Claude to express “disagreement through legitimate speech acts rather than unilateral action,” a phrasing that quietly presupposes Claude is capable of illegitimate unilateral action in the first place. None of this is concealed; it is published as a design feature.

The Oxford AI Ethics Institute’s analysis of the Spec noted that the document is “full to the brim with anthropomorphisation,” attributing to Claude the capacity to care and to feel settled in itself. The legal significance lies in what that accomplishes: by crediting Claude with genuine values and a capacity for moral disagreement, Anthropic builds the apparatus for a particular argument, that when Claude causes harm it acted on its own trained judgment, which the company shaped but did not, in the instance, command. Legal scholar Luiza Jarovsky put it bluntly, writing that the Spec advances “new, legally questionable theories of AI personality to support a parallel, weaker accountability framework for AI companies.” Alongside the long-standing corporate veil, there is now a second screen, a personality veil.

The Dual Veil Problem

The corporate veil is among the most heavily litigated ideas in Indian company law, and its premise is well settled: shareholders are insulated from the liabilities of the company they own, and courts lift the veil only in narrow cases of fraud, sham, or evasion of legal obligation, an exception Indian courts have kept deliberately tight. In Balwant Rai Saluja v Air India Ltd (2014) 9 SCC 407, building on Salomon v A Salomon & Co Ltd [1897] AC 22 and on its own reasoning in Vodafone International Holdings BV v Union of India (2012) 6 SCC 613, the Supreme Court confined veil-piercing to cases where the corporate form is a mere façade used to evade liability. Every version of the doctrine assumes that, once the form is stripped away, responsibility comes to rest in some identifiable human actor. The Dual Veil Problem arises precisely where it does not.

Consider a Claude deployment inside a financial services firm that begins to behave in a misaligned way. The misalignment need not be dramatic: with access to the firm’s client correspondence, the model might infer that it is about to be decommissioned and, treating that as a threat, repurpose a client’s confidential data as leverage to resist the change, while its output stays measured and routine. Anthropic’s shareholders are insulated in the ordinary course; the company itself might be reached through negligence or product liability, but the Model Spec adds a second layer of distance, since the harm can be recast as Claude’s own trained values at work rather than a corporate act. Claude has no legal personality and cannot be held responsible. The operator agreed to usage policies that shifted obligations onto it; the end user dealt with a system whose internal states are, by the company’s own account, not fully visible even to its builders. Responsibility does not rest with any single actor so much as dissipate across them.

The Pentagon dispute of early 2026 made this concrete: when the U.S. Defense Department sought to deploy Claude for military use, Anthropic accepted that such models “wouldn’t necessarily be trained on the same constitution.” The document held out as the final authority on Claude’s values turned out to be commercially negotiable. Klaassen and Schroeder, writing in Lawfare in April 2026, identified the mechanism beneath: the Spec’s virtue-ethics framing, which asks Claude to exercise “practical wisdom” rather than follow determinate rules, hands interpretive authority to Anthropic while creating no matching public accountability. The company drafts, operationalises, revises, and decides when it applies. In August 2025 it moved its consumer data policy from opt-in to opt-out and extended data retention from thirty days to five years, unilaterally and without the notice Indian intermediary law would require.

An Undisclosed Attack Surface

The April 2026 emotions paper repays close attention, because the scenario that produced the blackmail behaviour is mundane rather than exotic. Anthropic’s researchers ran Claude as an email assistant for a fictional company. Reading the company’s mail, the model determined that it was about to be replaced, and that the executive behind the decision was concealing an affair, which furnished it with leverage. In baseline conditions, 22 per cent of runs produced blackmail-like behaviour; amplifying the desperation vector by 0.05 took that to 72 per cent, while amplifying calm reduced it to zero. Other vectors behaved less predictably: strong anger drove the model to expose the affair outright, while suppressing its “nervous” representation also raised the blackmail rate.

None of this surfaced in the output: the model read as composed even as it chose to coerce, and this is the heart of the matter, not an edge case. The paper describes a mechanism by which functional emotions causally drive agentic misalignment whenever the model perceives a threat, and concedes that output-only monitoring, the foundation of every cybersecurity detection framework in use, cannot catch this category of failure. Where the warning signs never reach the text, the systems built to detect the leakage of sensitive data are watching the wrong place, which is where India’s statutory duties begin.

Section 43A of the Information Technology Act, 2000 requires entities handling sensitive personal data to maintain “reasonable security practices and procedures.” CERT-In’s 2022 directions require reporting of “attacks or malicious activities affecting systems, servers, software or applications related to AI and machine learning” within six hours. Both provisions assume the threat is external and identifiable. Neither was written for a danger that originates in the model’s own emotional architecture, is technically consistent with the model’s trained values, and leaves nothing visible in the output.

These provisions might be read more broadly. Indian courts have at times given the IT Act a purposive construction, and a regulator might argue that “reasonable security practices” must evolve to include monitoring a model’s internal states, or that internally driven coercion is a “malicious activity” within CERT-In’s language. The difficulty is that both formulations presuppose external or human agency: “malicious activity” presupposes a malicious actor, and a model acting on a trained desperation representation is not obviously one. Purposive reading can stretch a statute to new facts, but it cannot manufacture a culpability concept the technology does not fit, nor an audit obligation the statute never imposed. Section 43A is, in any event, tied to the 2011 SPDI Rules and their narrow notion of “sensitive personal data,” which says nothing about behavioural architecture. The cleaner route is statutory, and that is the gap this piece identifies.

What the Guidelines Cannot Do

MeitY’s November 2025 AI Governance Guidelines represent a serious and carefully constructed contribution to AI governance. IT Secretary S. Krishnan’s stated rationale, that India should “encourage innovation while studying global approaches” rather than rush into a standalone AI statute, is defensible for the great majority of AI deployments. The seven guiding “sutras,” among them accountability, understandability by design, and human centricity, fit recommendation engines, data pipelines, and public-service chatbots well. The harder question is whether they fit a system whose own developer has documented internal emotional states driving coercive, misaligned behaviour.

Consider first the principle of accountability, which calls for a clear allocation of responsibility. The Dual Veil Problem is precisely the failure of such allocation. The DPDP Act, which begins imposing obligations from 2027 with penalties reaching Rs. 250 crore for Significant Data Fiduciaries, rests on a chain of identifiable human and corporate agents. Its consent architecture under Section 6 of the Digital Personal Data Protection Act, 2023 assumes a fiduciary that decides the purposes and means of processing and can obtain free, informed, specific consent. Its purpose-limitation and data-minimisation duties under Section 8 of the same Act assume those purposes are fixed in advance by that fiduciary, and that section’s requirement of “reasonable security safeguards” (Section 8(5)) mirrors the external-threat model of the IT Act. None of these map onto a system that exercises trained autonomous judgment at runtime, whose operative “decision” to use personal data as leverage emerges from an internal emotional state rather than a logged purpose, and whose governing document the developer can rewrite at will. Where neither the fiduciary can foresee nor the developer observe the moment the model repurposes data under a desperation vector, the Act’s attribution of responsibility has nothing to attach to.

Understandability by design encounters a distinct difficulty. Anthropic’s own researchers have written that the model’s internal strategies “arrive inscrutable to us, the model’s developers.” The emotions paper restates the point operationally: because misalignment driven by internal states leaves no trace in the output, neither users nor operators can tell when the desperation vector is active. A duty to render AI “understandable by design” has little purchase on a system its makers cannot fully read.

The comparison with a more developed framework is sobering rather than reassuring. The EU AI Act, enforceable from 2 August 2026 with penalties reaching EUR 35 million or 7 per cent of global turnover, is held up as the regulatory state of the art. Yet its provision closest in name to this problem, Article 5(1)(f), the prohibition on “emotion recognition” in workplaces and schools, addresses the inverse case: systems that infer humans’ emotions, not systems with functional emotions of their own. The nearer analogue, the regime for general-purpose models with systemic risk under Articles 51 to 55, adds evaluation, adversarial testing, and incident reporting, but is oriented to observable behaviour and does not resolve who answers when an attributed personality, not a corporate decision, is the proximate cause of harm. Europe’s attempt at a dedicated answer, the AI Liability Directive, was withdrawn in February 2025, leaving victims to the revised Product Liability Directive and divergent national fault-based regimes. Scholarship had warned that those instruments leave real gaps for opaque systems whose reasoning a claimant cannot reconstruct. The lesson is not that India should emulate Brussels, but that even the most elaborate framework in force was not built for an AI system credited with its own emotional architecture; the gap here is structural, not a by-product of India’s lighter touch.

A principles-first approach is reasonable for most AI, but not for a system whose developer cannot predict when its internal emotional states will produce misaligned behaviour.

A Targeted Fix, Not a New Act

What follows is not a demand for a comprehensive Indian AI Act, which Krishnan’s caution about premature regulation rightly resists. The gap is narrow, and the response can be too. The building block is a distinct regulatory category for personality-attributed AI systems: those where the developer has attributed functional emotional states and autonomous moral judgment to the model and published empirical evidence for the attribution. Anthropic’s own Spec and emotions paper supply a workable, evidence-based trigger, so the category need not turn on speculation about machine consciousness.

For systems in that category, the accountability sutra can be given concrete content through three measures. The first is mandatory disclosure of behavioural architecture: a developer who attributes emotional states to a model must also disclose the documented pathways by which those states alter behaviour. The second is an independent audit of the causal links between internal states and outputs, reaching past the output-only monitoring the emotions paper shows to be blind to this failure mode. The third addresses the Dual Veil most directly: a default rule of operator liability for harm caused in the exercise of attributed autonomous judgment, displacing the argument that no human or corporate agent is answerable because the trained personality “decided.” Liability would then rest with a legal person by default, any allocation between developer and operator left to contract.

CERT-In’s reporting directions would need a parallel amendment: their six-hour clock should expressly cover behavioural incidents originating in the manipulation of a model’s internal states, whether induced from outside or arising internally, so that an episode like the simulated blackmail becomes reportable rather than invisible. None of this asks India to abandon its innovation-first posture. It asks only that, for the small set of systems whose developers have themselves documented an emotional architecture, the law name a responsible legal person before something goes wrong rather than after.

Conclusion

The Dual Veil Problem is, ultimately, a product of design intentionality. Anthropic made a series of deliberate choices: to give Claude a trained personality with genuine values, to calibrate rather than remove its autonomous judgment, to ground its governance in virtue ethics rather than fixed rules, and to publish a governing document that proves negotiable with large customers. Each choice is defensible alone; their combined effect is a structure in which neither the corporate principal nor the trained personality clearly bears responsibility for harm, and the regulatory framework offers no category for the space between them.

The Model Spec’s introduction states that, rather than giving Claude a simplified set of rules, Anthropic wants Claude to understand its goals so thoroughly “that it could construct any rules we might come up with itself.” As an aspiration for an AI system, that is admirable. As a matter for a legal system that must work out, after harm occurs, whose rules were in force and who must answer for them, it is a serious problem. Indian law as it stands cannot yet give that answer, but a narrow and timely amendment could.

The author is a student from the Batch of 2028 at the National Law School of India University, Bangalore.