[🇩🇪 - Beyond Observation]

😰 The Evaluation Gap: When Good Leadership Breaks the Corporate Matrix

Markus is currently wiping cold sweat off his forehead. A very specific sweat. It’s the cold sweat that breaks out when the corporate performance system demands a mandatory rating for an employee you haven’t watched working in six months.

For half a year, Markus played the enlightened servant leader flawlessly. He backed out of the Dailies. He stopped hovering in Refinements. He supported his engineers individually and let Coach Lisa build psychological safety, enabling the team to wrestle the new load-balancing microservices into submission in peace. And they delivered. On time. It was a masterpiece of modern management.

Right up until the moment company policy required a concrete rating on “Core Competencies” by Friday noon. Markus realizes with sudden, bureaucratic horror that by successfully treating an 80K developer like an adult instead of an assembly line worker, he nuked his own ability to complete the mandatory evaluation. He is staring directly into the Evaluation Gap.

🎭 The Physics of Observation: Why Visibility Guarantees Lying

The standard corporate mandate for “observable behavior” in complex knowledge work is a scientifically illiterate joke. It ignores a psychological effect documented back in 1939: the Hawthorne effect. As soon as people know they are being watched by authority, they alter their behavior to please the observer.

If you are familiar with Bjarte Bogsnes and Beyond Budgeting, you already understand the underlying mechanics. Bogsnes doesn’t just limit his framework to finance. He continuously proves in his talks that tying performance evaluation to rigid annual targets guarantees institutionalized sandbagging and lying across the entire organization. We are translating Bogsnes’ enterprise blueprint into actionable survival tactics for the operational machine room. Think of it as Beyond Observation. Tying evaluation to physical visibility triggers the exact same dynamic at the developer’s desk.

Deep cognitive work looks identical to slacking off. You absolutely do not stare blankly out the window for twenty minutes conceptualizing a complex database migration when the guy grading you on a one-to-five scale is hovering nearby. You type aggressively. You stare intently at your IDE. You optimize for the visible appearance of suffering.

A badger in a hoodie working intensely at two monitors while a beaver in a suit hovers directly behind him, clutching a red folder and observing.

The Impression Management Gap: When disciplinary power enters the room, deep cognitive work is immediately replaced by the visible theater of “looking busy”.

The moment disciplinary power enters the room, human biology instantly triggers what psychologists Wayne and Liden documented as “Impression Management”. Their longitudinal study showed that employees subjected to observational performance ratings are forced to waste massive amounts of cognitive energy on looking busy for the boss instead of doing the work.

If he pulls up a figurative chair behind Tom to “observe” his performance, Markus effectively acts as a highly paid theater critic reviewing an amateur actor.

If he pulls up a literal chair, he should be fired for gross managerial incompetence. Given European labor laws, he would likely be reassigned to lead the company’s new Return-to-Office task force.

Trying to measure visible behavior inevitably triggers Goodhart’s Law. When a measure becomes a target, it ceases to be a good measure. If management measures lines of code written to observe productivity, the system is immediately gamed. Developers will write bloated, verbose code to hit their target. They will effectively destroy the codebase to secure their bonus.
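As a toy illustration of that gaming dynamic (the snippets and the metric are invented for this sketch), here are two implementations with identical behavior that score very differently on a naive lines-of-code metric:

```python
# Two hypothetical implementations of the same function: one honest,
# one padded purely to inflate a lines-of-code "productivity" metric.
CONCISE = """def total(numbers):
    return sum(numbers)
"""

BLOATED = """def total(numbers):
    result = 0
    index = 0
    while index < len(numbers):
        result = result + numbers[index]
        index = index + 1
    return result
"""

def loc(source: str) -> int:
    """The naive metric under attack: count non-blank source lines."""
    return sum(1 for line in source.splitlines() if line.strip())

# Both versions behave identically...
scope_a, scope_b = {}, {}
exec(CONCISE, scope_a)
exec(BLOATED, scope_b)
assert scope_a["total"]([1, 2, 3]) == scope_b["total"]([1, 2, 3]) == 6

# ...but the gamed version "wins" the metric by a wide margin.
assert loc(BLOATED) > 3 * loc(CONCISE)
```

The moment the metric becomes the target, the bloated variant is the rational choice. That is Goodhart's point in two assertions.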

Observation inherently destroys the very data it is trying to collect. There is some Heisenberg joke hidden in here, but I can’t pinpoint the location.

🐌 The Delivery Paradox: Daily Deployments, 52-Week Feedback

Modern engineering teams operate in a continuous flow. They deploy code multiple times a day. They run continuous integration pipelines that provide feedback in milliseconds.

Yet, the corporate performance review operates on an agricultural timeline. It assumes that performance can be harvested once a year in November.

This temporal mismatch is fatal. When organizational psychologists like Seymour Adler analyzed the modern performance review, their conclusion was devastatingly simple. The annual cycle is functionally dead.

If an engineer makes a poor architectural decision in March, waiting until December to address it in an annual appraisal means the company spent nine months paying the compounding interest on that mistake. By the time Markus opens the evaluation form, the code is already in production, it has likely been refactored twice, and the administrative feedback is useless for operational learning. The agricultural rhythm simply forces managers to fall back on recency bias. They grade the entire year based on whatever the employee did in the three weeks leading up to the review.

⚖️ The 94/6 Reality: Why You Are Grading Your Own System

The classic performance evaluation assumes that a delayed feature or a broken deployment is the direct result of Tom’s personal effort, skill, or attitude.

W. Edwards Deming, the father of modern quality management, dismantled this assumption decades ago. Through rigorous statistical analysis, Deming proved what is now known as the 94/6 rule. In complex, managed systems, roughly 94 percent of performance issues belong to the system itself. They are caused by the architecture, the management structure, the legacy technical debt, or the approval processes. Only 6 percent of performance variation belongs to the individual worker.

A comic illustration of an office. On the right, a quokka holds a large pie chart divided into 94% and 6%. On the left, a badger, an owl, and three dogs collaborate intensely around a dual-monitor engineering workstation.

Deming’s 94/6 Rule: The engineering team collaboratively navigates the 94 percent of systemic constraints and shared successes, while the HR form demands you grade Tom solely on his 6 percent.

Suppose Tom takes three weeks to deliver a seemingly simple API update. A manager relying on observable behavior sits down in December and gives Tom a standard 3/5 rating for Goal Attainment, officially branding him an average performer.

The manager cannot see the invisible system constraints. They do not see that Tom wrote the code in two hours, but then spent fourteen days waiting for the security team to approve a firewall rule. When Markus sits down to fill out an individual review, he is statistically just grading the organization’s broken CI/CD pipeline and blaming it on the developer.
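Put the anecdote into numbers (the durations here are invented for illustration): grading Tom on observed delivery time means grading a quantity that is over 99 percent systemic wait.

```python
from datetime import timedelta

# Hypothetical breakdown of Tom's three-week API update:
# two hours of individual work, fourteen days of systemic waiting.
individual_work = timedelta(hours=2)
system_wait = timedelta(days=14)
delivery_time = individual_work + system_wait

# Share of the observed delivery time that Tom actually controlled.
individual_share = individual_work / delivery_time
assert individual_share < 0.01  # less than one percent
```

Any 1-to-5 rating derived from `delivery_time` is, to three decimal places, a rating of the firewall approval queue.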

The Moral Hazard of Systemic Excuses

Applying a manufacturing statistic rigidly to software engineering creates a massive moral hazard. Unlike an assembly line worker, a software engineer builds the technical factory. They wrote the legacy code. Teaching employees that 94 percent of their failures are the fault of the architecture provides a statistically backed excuse for sloppy coding. An employment contract is an agreement of individual accountability. Deming’s rule cannot be used as a shield for gross negligence.

A broken CI/CD pipeline is a systemic failure. Merging untested code into main at 4 PM on a Friday is a personal choice.

A senior developer’s 6% individual effort is measured precisely by how they interact with the 94% systemic technical debt. Do they leave the codebase cleaner than they found it, or do they use the broken pipeline as a permanent excuse for missed deadlines?

The traditional annual review fails spectacularly because it mixes the 94% technical system noise and the 6% individual effort into one useless, arbitrary number. To hold Tom truly accountable for his actual contribution, the organization must abandon the illusion of observation and drop three structural anchors to isolate the data.

⚓ Anchor 1: The Artifact Pivot

If you cannot observe the human, you must observe the work. Insisting on watching your people type and interact with peers to measure productivity is an admission that you lack the technical depth to evaluate the engineering. Make no mistake: A People Lead does not need to be a domain expert to evaluate fairly. But if you cannot read the code, you cannot compensate by watching the clock.

Complex knowledge work leaves a massive, indelible digital trail. Every pull request, every architectural design document, every incident post-mortem is a tangible artifact. When you pivot to evaluating these artifacts, charisma doesn’t compile. An asynchronous pull request is immune to meeting rhetoric or corporate politics. It validates pure logic. Markus shouldn’t care if Tom was staring blankly out the window before his lunch break. He needs to evaluate the professional grade of the load-balancing logic Tom merged the following afternoon.

The Macro vs. Micro Trap

What about “glue work”? Mentoring juniors, unblocking peers, answering questions? If management demands a digital timestamp for every single collaborative act, they trigger Documentation Theater. If a senior knows his HR rating depends on formal artifacts, he will refuse to answer a quick five-minute Slack question. Instead, he will demand the junior open a formal incident ticket so he can write a post-mortem to secure his metrics. He might even let a deployment fail just to be the documented hero who fixed it.

To prevent this weaponization of agile processes, management must explicitly separate macro from micro glue work. Macro glue work (architectural redesigns, formal mentoring arcs) leaves a natural artifact trail like Architectural Decision Records (ADRs) or co-authored pull requests. This is evaluated. Micro glue work (a quick pairing session, a Slack answer) is structurally unquantifiable. Attempting to track it destroys it. If a senior stops doing micro glue work, it will not show up in the artifact trail; it will show up in the qualitative peer feedback, where the manager addresses it as a behavioral issue, completely divorced from the compensation matrix.

GDPR and the Dismissal Reality

Evaluating artifacts does not mean installing a tool to track Tom’s commits or measure his individual Jira touch time. Under the GDPR and local labor laws, automated repository surveillance can constitute illegal, continuous performance monitoring. The pivot requires qualitative code reading by peers, avoiding a new automated panopticon.

If Tom is an actual low-performer who needs to be let go, European labor laws require a heavily documented trail of individual failures to justify a termination (Abmahnung in Germany). Labor courts require proof of a persistent breach of duty. A manager cannot weaponize a peer feedback loop to build this legal case. That approach turns colleagues into corporate informants and destroys all psychological safety.

How does a manager gather objective proof without attending daily meetings, running automated surveillance, or using informants? The evidence lives in the formal escalation paths. The Abmahnung must be built on objective data. It requires a Security Lead formally escalating a bypassed protocol, an architect documenting a direct refusal to follow compliance standards, or formal HR grievances from colleagues regarding toxic behavior. The manager relies on official procedural guardrails, completely isolated from the developmental peer feedback loop.

Which – if I may voice my personal opinion – is how it should be. If you are uncertain about a worker’s positive intent, don’t hire them. If you already did, develop them. And if you fail at that, help them find work elsewhere. Employees are already in a weak position by design, even though they are the ones creating the value. Treating them like humans is the least you can do.

The Pair Programming and Junior Exception

Evaluating artifacts relies on attribution. If Sarah and Tom pair program a complex feature, but only Tom pushes the commit, Sarah’s contribution is invisible. Mitigating this does not require documentation theater. It requires basic engineering hygiene. Modern version control systems natively support Co-authored-by tags. Adding a tag during a commit is a technical standard, not a bureaucratic HR form. If a team practices permanent pair or mob programming, the artifact belongs to the collective, and the manager credits all authors for the systemic impact.
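The Co-authored-by trailer is a plain-text convention in the commit message body; platforms like GitHub render every listed co-author as a shared author of the commit. A minimal sketch of how attribution can be recovered from such a message (the names and emails are invented):

```python
import re

# A commit message following the standard Git trailer convention.
# The platform credits every listed co-author, not just the pusher.
commit_message = """Implement retry logic for the load balancer

Co-authored-by: Sarah Example <sarah@example.com>
Co-authored-by: Tom Example <tom@example.com>
"""

def co_authors(message: str) -> list[str]:
    """Extract all co-author names from Co-authored-by trailers."""
    return re.findall(r"^Co-authored-by:\s*(.+?)\s*<", message, re.MULTILINE)

assert co_authors(commit_message) == ["Sarah Example", "Tom Example"]
```

Two lines in a commit message, and Sarah’s pairing contribution survives in the artifact trail without a single HR form.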

This absolute decoupling only works for seniors. If Tom is a Junior Developer, grading purely on artifacts is cruel and negligent. A Junior’s initial artifacts are often flawed by default. They need pair programming, shadowing, and direct guidance laterally provided by Senior Developers. The People Lead’s job is to hold the Seniors accountable for coaching the Juniors, while strictly quarantining that learning process from the HR system.

⚓ Anchor 2: The 360-Degree Reality Check

To evaluate the operational reality and skill development, Markus must ask the only people who suffer the consequences of Tom’s work. He must ask Tom’s peers. The traditional corporate hierarchy operates on the stubborn assumption that the manager has the best vantage point to judge an employee’s performance. In interdependent knowledge work, the manager is the blindest person in the room.

Tom will never act like a toxic prima donna during a scheduled one-on-one with Markus. Once Markus leaves the Slack channel, Tom might be aggressively blocking pull requests and leaving the junior developers to clean up his undocumented mess. A manager trying to grade teamwork from a vertical distance is essentially guessing who hides their dysfunction best. Objective reality requires a structured, asynchronous peer feedback system.

Works Councils and the Duty of Care

Building this system introduces immediate friction. In many European companies, establishing a peer feedback tool requires active negotiation with the Works Council (Betriebsrat). A middle manager like Markus cannot negotiate this. Executive leadership must formalize it explicitly as a tool strictly for skill development, effectively preventing its use for horizontal mobbing or disciplinary action. If the C-suite avoids this negotiation, they force their middle managers to fly blind.

Even with a system in place, a People Lead cannot simply abdicate responsibility and let horizontal conflict resolution govern the team. Management retains a strict Duty of Care (Fürsorgepflicht). You cannot effectively monitor an employee’s psychosocial health if you completely vanish. Burnout does not happen in a vacuum. It happens in the work.

Markus stays out of the Dailies and Refinements. His mere presence as a disciplinary power alters the team’s natural dynamics. Coach Lisa owns the systemic safety of those rituals. Markus frees up the capacity to observe the human where it is actually safe. Freed from discussing Jira tickets and system lead times, his 1:1s shift entirely to monitoring workload, psychosocial health, and preventing burnout. This is exactly where Markus detects the invisible glue work or the toxic knowledge hoarder. He does not need to read the peer feedback or force an artifact trail. He sees the impact directly in the exhaustion levels of the junior developers during their 1:1s. He acts as the ultimate escalation layer to protect the individual employee, intervening only if the lateral system fails.

Even if management successfully establishes this peer feedback loop, there is one final hazard that will instantly destroy it. The mechanism relies entirely on technical honesty. But when an engineer knows their peer review will directly determine a colleague’s financial reality for the next twelve months, they will lie. If Sarah knows her critique will cost Tom his bonus, she will protect him. The engineering department will form a protective cartel and rate everyone as a five-star top performer.

⚓ Anchor 3: The Salary Decoupling

If you want Sarah and Tom to critique each other’s code without forming a protective cartel, you have to remove the financial hostage situation.

Most engineers genuinely want to build good software, and they are capable of dissecting a botched deployment together to learn from it. That technical honesty requires a safe environment. If a peer review dictates a colleague’s financial reality, the technical truth takes a back seat to basic human solidarity. To keep the feedback system from turning into a polite fiction, executive leadership must sever its connection to the payroll.

The corporate world documented this exact failure mode decades before the first agile manifesto was drafted. In a landmark 1965 study at General Electric, researchers proved that combining performance appraisals with salary discussions actively destroys the feedback process. When compensation is on the table, the employee’s brain perceives a threat to its security and immediately shifts into a defensive posture. Any constructive criticism offered in that same meeting is effectively ignored, because the listener is focused on justifying their financial worth.

Equipped with 60 years of empirical proof that this meeting format destroys value, modern enterprises naturally made it mandatory for everyone. It is the corporate equivalent of touching a hot stove every December just to make sure it still burns.

An owl in a suit crosses her arms, refusing to look at a red appraisal folder held by an anxious beaver.

The Feedback Cartel: The moment an engineer knows their critique will directly affect a colleague’s financial reality, technical honesty dies.

This conflict multiplies when an organization introduces individual bonuses into highly interdependent systems. If management financially incentivizes individual attribution, they penalize teamwork. Suppose Tom’s annual payout depends purely on the items his name is stuck to. He now has a massive financial incentive to ignore Sarah when she asks for help debugging a critical production issue. To protect his bonus, he will focus on pushing his assigned tickets to the finish line while the surrounding architecture degrades.

The fix requires a strict quarantine. Continuous artifact reviews and developmental peer feedback serve purely as operational alignment and skill development. The compensation discussion happens on an entirely different schedule. Compensation must be decoupled from individual operational missteps and peer opinions.

🌊 The Broader Blueprint: Measuring Flow, Not People

Once you stop observing individuals and grading them for system constraints, the required shift in metrics is brutally obvious. Stop measuring the resources, and start measuring the pipeline. In Accelerate, Nicole Forsgren and her team analyzed the largest empirical dataset on software delivery. They proved that elite engineering organizations achieve their status by optimizing team-based flow, rendering individual utilization metrics obsolete.

They track Lead Time, Deployment Frequency, and Mean Time to Restore. These are systemic metrics. They cannot be achieved by a single rockstar developer hoarding knowledge, and they cannot be faked through Impression Management. They require a healthy, collaborative environment to function.
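As a rough sketch of how these system-level metrics fall out of pipeline data alone (the event timestamps are invented; real DORA tooling reads them from the CI/CD and incident systems), note that no individual attribution is needed anywhere:

```python
from datetime import datetime, timedelta

# Hypothetical pipeline events: (commit time, deploy time) per change,
# plus (outage start, service restored) pairs from the incident system.
deployments = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 11, 0)),
    (datetime(2024, 3, 1, 13, 0), datetime(2024, 3, 2, 9, 0)),
    (datetime(2024, 3, 2, 10, 0), datetime(2024, 3, 2, 12, 0)),
]
incidents = [
    (datetime(2024, 3, 1, 14, 0), datetime(2024, 3, 1, 14, 30)),
]

def lead_time(deploys):
    """Mean time from commit to running in production."""
    deltas = [deployed - committed for committed, deployed in deploys]
    return sum(deltas, timedelta()) / len(deltas)

def deploy_frequency(deploys, days):
    """Deployments per day over the observed window."""
    return len(deploys) / days

def mean_time_to_restore(outages):
    """Mean duration from outage start to service restored."""
    deltas = [restored - start for start, restored in outages]
    return sum(deltas, timedelta()) / len(deltas)

assert lead_time(deployments) == timedelta(hours=8)
assert deploy_frequency(deployments, days=2) == 1.5
assert mean_time_to_restore(incidents) == timedelta(minutes=30)
```

Every number above describes the pipeline, not a person. There is nothing here for Impression Management to game.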

These three anchors form the foundation for getting there. Combined with a systemic understanding of Deming’s 94/6 rule and the dangers of Goodhart’s Law, they provide the blueprint for performance management.

How you execute this blueprint depends entirely on the gravity of your organization.

Path A: The Structural Reset (SMEs and Startups)

If you operate in an environment free from rigid legacy structures, the solution is structural. Executive leadership decouples individual performance from variable pay entirely. They base compensation on transparent market adjustments, capability bands, and the collective profit-sharing of the entire organization. They empower their People Leads to abandon the annual review and focus purely on facilitating the continuous, horizontal feedback loop.

Path B: The Forced Curve Trap (The Enterprise Reality)

What if decoupling is legally or structurally impossible right now? In many European enterprises, millions of employees operate under collective bargaining agreements (Tarifverträge) or legacy corporate mandates that explicitly force management to tie variable compensation to individual performance appraisals.

A middle manager cannot unilaterally rewrite a union contract. They also cannot illegally manipulate the ratings to subvert the system. Falsifying financial HR documents is a massive breach of duty, inviting discrimination lawsuits, Works Council interventions, and immediate dismissal.

Markus needs a solution for Friday noon. He opens the HR system to rank his team.

Corporate budgets enforce a zero-sum game through a forced distribution curve. This guarantees a casualty. What happens when Markus has a high-performing team where nobody failed their operational mandate? The curve dictates that a certain percentage of the team must be rated at the bottom, regardless of absolute performance.

Markus is trapped. He evaluates both the system builders and the pure executors against the explicit expectations of their specific roles based on objective macro artifacts. But because the Bell Curve is a zero-sum budget, rewarding one employee means taking that reward from someone else. He has to financially penalize a successful employee to fund the bonus of another.

Sounds healthy, eh?

Any attempt by middle management to make this process fair is an illusion. Markus pays the HR tax, hits submit, and inflicts the mandated financial damage on his team.

👑 The Ultimate Anchor: Executive Accountability

Markus is a victim of the organizational matrix, exactly like his developers. Middle managers cannot fix a broken enterprise HR system from the bottom up.

If an organization forces its managers into Path B, the executive leadership has failed. You cannot demand continuous delivery, agile collaboration, and elite DevOps metrics while defending an agricultural HR process built on forced distribution curves.

While the engineering team owns the technical 94 percent, the C-Level owns the organizational 94 percent. The executive team owns the HR matrix, the Bell Curve budgets, and the legacy compensation models. If managers have to penalize flawless execution just to satisfy a rigid budget curve, that is not a middle management failure. That is organizational pathology orchestrated by the C-Suite.

The ultimate fix is not a clever management hack. Executive leadership must officially abolish forced distribution curves. This doesn’t require a multi-year battle with the Works Council. Smart employee representatives already know these legacy systems are fundamentally unfair to the worker. Until compensation aligns with actual engineering flow, any agile transformation is just corporate theater, paid for by the burnout of middle managers and the silent departure of core technical executors.

📋 TL;DR: The 30-Second Reality Check

For the executives wondering why their mandatory performance reviews are generating friction instead of value, here is the structural reality:

  • 🎭 Observation Guarantees Theater: Measuring “visible effort” replaces deep cognitive work with the theater of looking busy.
  • 🐌 The Agricultural Timeline: You cannot manage a continuous daily delivery pipeline with a 52-week delayed annual review.
  • ⚖️ You Are Grading The System: 94% of performance issues are systemic. Grade individuals by how they improve the architecture, not by how they suffer under it.
  • ⚓ Evaluate Artifacts, Not Humans: Evaluate the digital trail (PRs, ADRs). Warning: Automated tracking violates GDPR, and demanding artifacts for informal help triggers Documentation Theater.
  • 🔄 Managers Are Blind: Top-down evaluation just measures corporate politics. You need peer feedback. Warning: Managers must stay out of this feedback loop. Using it for HR ratings destroys psychological safety and guarantees lying. Use 1:1s purely for your Duty of Care: uncovering invisible glue work, detecting toxic behavior, and preventing burnout.
  • 💰 Separate Feedback from Salary: If a peer review decides a bonus, the team forms a protective cartel and lies. Decouple learning from earning.
  • 👑 The C-Level Mandate: Middle managers cannot hack a zero-sum Bell Curve without penalizing success. Executives must officially abolish forced distribution models.

🧾 The Receipts: The Psychology and the Data

Psychology does not negotiate with HR policies. If your executive board demands proof that their observational metrics are actively destroying your engineering culture, put these references on the table.

  • The Artifact Pivot (ROWE): Kelly, E. L., Moen, P., & Tranby, E. (2011). “Changing Workplaces to Reduce Work-Family Conflict: Schedule Control in a White-Collar Organization.” American Sociological Review. To prove that evaluating results instead of visible office time improves both productivity and well-being, cite this study.
  • The Necessity of Peer Feedback: Hoffman, B. J., & Woehr, D. J. (2009). “Disentangling the meaning of multisource performance rating source and dimension factors.” Personnel Psychology. To explain why top-down managerial evaluation is fundamentally flawed, cite this construct validity study. It proves that multisource (360-degree) ratings provide a significantly more accurate picture of actual performance than single-source managerial reviews.
  • The Split of Evaluation and Salary: Meyer, H. H., Kay, E., & French, J. R. P. (1965). “Split Roles in Performance Appraisal.” Harvard Business Review. To back up the mandate that you must decouple feedback from the payroll, cite this landmark General Electric study. They proved that combining salary discussions with performance evaluations creates defensive behavior and kills intrinsic motivation.
  • Why Individual Bonuses Destroy Tech Teams: Garbers, Y., & Konradt, U. (2014). “The effect of financial incentives on performance: A quantitative review of individual and team-based financial incentives.” Journal of Occupational and Organizational Psychology. To prove that individual financial incentives actively harm interdependent work, cite this quantitative review. They established that team-based financial incentives drastically outperform individual bonuses in collaborative environments.
  • The Death of Intrinsic Motivation: Deci, E. L., Koestner, R., & Ryan, R. M. (1999). “A meta-analytic review of experiments examining the effects of extrinsic rewards on intrinsic motivation.” Psychological Bulletin. To explain the psychological damage of the corporate bonus structure, cite this meta-analytic review. They proved that tying extrinsic rewards to complex, interesting tasks actively destroys intrinsic motivation.
  • The 94/6 Rule of System Performance: Deming, W. E. (1986). Out of the Crisis (ISBN: 978-0262541152). To prove that managers are usually just grading broken CI/CD pipelines, cite Dr. Deming’s foundational text. He provided the statistical proof that 94 percent of performance issues belong to the system, and only 6 percent to the individual worker.
  • Goodhart’s Law (The Metric Trap): Goodhart, C. A. E. (1975). “Problems of Monetary Management: The U.K. Experience.” Monetary Theory and Practice. To explain why tracking lines of code just results in bloated codebases, reference this foundational paper. It established the absolute rule that when a measure becomes a target, it ceases to be a good measure.
  • Measuring Team Flow over Individual Output: Forsgren, N., Humble, J., & Kim, G. (2018). Accelerate: The Science of Lean Software and DevOps (ISBN: 978-1942788331). To justify shifting metrics away from individual utilization toward system flow, cite this definitive statistical proof based on DORA metrics.
  • The Biology of Impression Management: Wayne, S. J., & Liden, R. C. (1995). “Effects of impression management on performance ratings: A longitudinal study.” Academy of Management Journal. To explain why observation forces developers to act busy rather than think deeply, cite this longitudinal study. They documented how employees alter their behavior purely to manage the impressions of observing supervisors.
  • The Death of the Annual Review: Adler, S., Campion, M., Colquitt, A., Grubb, A., Murphy, K., Ollander-Krane, R., & Pulakos, E. D. (2016). “Getting Rid of Performance Ratings: Genius or Folly?” Industrial and Organizational Psychology. To back up the claim that the 52-week evaluation cycle is functionally dead, cite this paper. They dismantle the traditional annual performance review as fundamentally unsuited for modern, agile work environments.
  • Beyond Budgeting & The Annual Trap: Bogsnes, B. (2008). Implementing Beyond Budgeting (ISBN: 978-1119152477). To prove that target setting, performance evaluation, and resource allocation must be decoupled into separate, continuous processes to prevent institutionalized lying, cite this definitive guide. It proves that traditional annual cycles mathematically force managers to inflate ratings and requests to survive.