Phishing for clicks is easy.
Phishing for credentials is a little harder.
Phishing for shells is the money shot.
But phishing defence has always been done in half measures. Let's break down how to do it properly, and the delta between the status quo and proper.
What control participation looks like, and where it exists.
Controls don't deploy themselves. They land through workstreams owned by named teams, with security architecture in the room. Each workstream below names the teams that own it and what they deliver. The recurring check on each (done? when? changed?) sits with the methodology phases and the SOC loop further down, where the operational cadence lives.
DMARC and DNS hygiene (Security architecture · Network and DNS · Mail platform)
SPF and DKIM enforced. DMARC moved from monitor to quarantine to reject, with the path actually walked. Subdomain delegation hardened. Reports actively monitored, not piped to a forgotten mailbox.
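The "monitor to quarantine to reject" path can be checked mechanically. A minimal sketch, parsing a DMARC TXT record string and grading the posture; a real check would fetch `_dmarc.<domain>` over DNS and validate SPF/DKIM alignment too, and the example record is invented:

```python
# Parse a DMARC TXT record and grade its enforcement posture.
# Sketch only: assumes the record string is already fetched; real
# tooling would resolve _dmarc.<domain> and check alignment as well.

def parse_dmarc(record: str) -> dict:
    """Split 'v=DMARC1; p=quarantine; rua=...' into tag/value pairs."""
    tags = {}
    for part in record.split(";"):
        if "=" in part:
            k, _, v = part.strip().partition("=")
            tags[k.lower()] = v.strip()
    return tags

def grade_posture(tags: dict) -> str:
    """Monitor-only is a starting point, not a destination."""
    policy = tags.get("p", "none")
    if policy == "reject":
        return "enforced"
    if policy == "quarantine":
        return "partial: walk the path to p=reject"
    return "monitor-only: reports observed, nothing blocked"

record = "v=DMARC1; p=quarantine; rua=mailto:dmarc@example.com; pct=100"
tags = parse_dmarc(record)
print(grade_posture(tags))   # partial: walk the path to p=reject
print("rua" in tags)         # True: reports have somewhere to land
```

The `rua` check matters as much as the policy tag: it is the difference between reports actively monitored and reports piped to a forgotten mailbox.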
Mail gateway and content controls (Security architecture · Mail platform · Endpoint security)
Attachment sandboxing with dynamic analysis. Link rewriting with time-of-click rescan. Impersonation protection. BEC anomalous-send detection. External sender flagging. Autoforward to external blocked.
Network defence layer (Network engineering · Security architecture · SOC)
Egress filtering deny-by-default. DNS sinkholing for known indicators. Web filtering with category enforcement. Segmentation tested against assumed-breach. IDS and IPS rule-base reviewed, not left on factory defaults.
Endpoint hardening (Endpoint engineering · Security architecture · IT operations)
Filetype blocking driven by the endpoint review of how each type behaves under duress, not a fixed list (currently ISO, LNK, HTA, OneNote, XLL on standard Windows builds; the list moves as adversary tooling moves). ASR rules deployed and audited. AppLocker or WDAC in enforced mode. Office macros blocked from internet zone. Local admin removed by default. Mark-of-the-Web propagating to archives.
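One reason the list has to keep moving: blocked filetypes arrive wrapped in archives, where Mark-of-the-Web often fails to propagate. A minimal sketch of a block-list check that also looks inside a zip; the extension list is the illustrative one from the text above, not a recommendation:

```python
import io
import zipfile

# Extensions blocked on the standard build (illustrative list from the
# text above; the real list moves as adversary tooling moves).
BLOCKED = {".iso", ".lnk", ".hta", ".one", ".xll"}

def blocked_names(filename: str, data: bytes) -> list[str]:
    """Return offending names: the file itself, or members inside a zip.
    Archives matter because MOTW often fails to propagate into
    extracted contents."""
    def hit(name: str) -> bool:
        return any(name.lower().endswith(ext) for ext in BLOCKED)

    if filename.lower().endswith(".zip"):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return [n for n in zf.namelist() if hit(n)]
    return [filename] if hit(filename) else []

# A zip smuggling a .lnk: a classic initial-access chain.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("invoice.pdf.lnk", b"not really a shortcut")
    zf.writestr("readme.txt", b"harmless")
print(blocked_names("invoice.zip", buf.getvalue()))  # ['invoice.pdf.lnk']
```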
OS and application hardening (Platform engineering · Security architecture · Application security)
CIS benchmarks adopted and audited. Patching SLAs documented and met. Container image hardening with a refresh cadence. Runtime protection where the workload warrants. Application allow-listing where the threat model demands.
Identity and access (Identity · Security architecture · IT operations)
MFA on every account, no exceptions. FIDO2 phishing-resistant for privileged users. Legacy authentication disabled. Conditional access with named locations. OAuth app consent governed. Privileged access workstations for tier zero. Just-in-time elevation.
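The policy logic above can be expressed as a small decision function. A toy sketch; the sign-in fields and outcome names are illustrative, not any vendor's conditional access schema:

```python
# Toy conditional-access evaluator for the posture described above.
# All field names (sign_in dict, outcome strings) are illustrative.

def evaluate(sign_in: dict) -> str:
    # Legacy protocols can't do MFA at all: block outright.
    if sign_in.get("legacy_auth"):
        return "block"
    # Privileged accounts must present a phishing-resistant factor.
    if sign_in.get("privileged") and sign_in.get("mfa_method") != "fido2":
        return "block"
    # Everyone else still needs some MFA. No exceptions.
    if not sign_in.get("mfa_method"):
        return "block"
    # Unfamiliar locations get a constrained session, not a free pass.
    if sign_in.get("location") not in {"hq", "branch"}:
        return "allow_limited_session"
    return "allow"

print(evaluate({"legacy_auth": True}))                         # block
print(evaluate({"privileged": True, "mfa_method": "totp",
                "location": "hq"}))                            # block
print(evaluate({"mfa_method": "fido2", "location": "cafe"}))   # allow_limited_session
```

The ordering is the point: legacy auth is checked first because it bypasses everything downstream, which is why the assumed-compromise phase later in this page probes it first too.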
Detection engineering and the SOC loop (SOC · Detection engineering · Security architecture · Security comms)
Logging coverage audited against MITRE ATT&CK. SIEM correlation rules tested, not just deployed. One-button reporter in the mail client. Triage SLA documented and met. Loop closure on every report, with feedback to the reporter and IOCs feeding the next assessment scope.
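"Logging coverage audited against MITRE ATT&CK" reduces to a set comparison: for each technique, are all the data sources it needs actually ingested? A sketch with an abbreviated, illustrative technique-to-source mapping; a real audit would use the full ATT&CK dataset:

```python
# Sketch of a logging-coverage audit: which ATT&CK techniques can the
# SIEM even see? The mapping below is abbreviated and illustrative.

REQUIRES = {
    "T1566 Phishing":          {"email_gateway", "endpoint_process"},
    "T1078 Valid Accounts":    {"auth_logs"},
    "T1558.003 Kerberoasting": {"auth_logs", "dc_audit"},
    "T1048 Exfiltration":      {"proxy_logs", "netflow"},
}

def coverage(ingested: set[str]) -> dict[str, bool]:
    """A technique is visible only if every required source is ingested."""
    return {t: needs <= ingested for t, needs in REQUIRES.items()}

seen = coverage({"email_gateway", "endpoint_process", "auth_logs"})
gaps = [t for t, ok in seen.items() if not ok]
print(gaps)  # ['T1558.003 Kerberoasting', 'T1048 Exfiltration']
```

The gap list is the work order: each entry names a log source to onboard before the correlation rule that depends on it can fire.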
Workforce education and reporting culture (Security comms · SOC as intel source · People and HR · Awareness team)
Education drawn from real reported lures, not vendor templates. Monthly comms that name the techniques landing against the sector. Reporting acknowledged within hours, every time. Just-in-time micro-prompts at the decision point, not quarterly modules. Success measured by reporting rate and time-to-report, not click rate. People are part of the defence, once the defence exists.
If you can't answer all three questions for each workstream above, there's room to move before the user is the right person to blame. The rest of the page is the case for why, and what to do instead.
Traditional simulated phishing.
Send a fake lure, count who clicks, send the clickers to training. That's been the default for twenty years. The market got to standard practice before anyone checked whether it works. Here's the case both ways.
- Vendor or in-house generates a phishing template, lands it in a defined population of inboxes.
- Tracking pixels and link-rewrites measure opens, clicks, and credential submissions.
- Users who click receive an immediate "you have been phished" interstitial and a training module.
- Aggregate metrics roll up to a quarterly click-rate dashboard for leadership.
The ETH Zurich CCS 2024 follow-up pinned whatever effect simulations have to the nudge itself, the periodic reminder that phishing is real, not to the training content. Most employees don't read the training.
Mature vendor platforms automate templating, scheduling, reporting, and even role-based targeting. Marginal cost per campaign is small. Capital outlay falls on the licence.
PCI-DSS, ISO 27001 Annex A, NIST CSF, HIPAA. All reference awareness training and require evidence that it happened. A click-rate dashboard answers the question with a number, even if the number doesn't predict anything real.
"Click rate fell from 14% to 6% this year" reads as progress to a non-technical audience. The number is concrete, comparable across quarters, easy to chart. Whether it predicts a real-world outcome is a separate question. Boards rarely ask it.
A live simulation is a recurring touchpoint between the security team and the workforce. Run well, it normalises security comms. Run badly, it builds resentment.
The 2025 large-scale empirical assessment grounded in the NIST Phish Scale, with 12,511 participants, found no statistically significant impact of training modality on either click rates or reporting behaviour across the conditions tested.
The 14,000-employee, 15-month ETH Zurich study at IEEE S&P 2022 reported that embedded training as commonly deployed in industry does not make employees more resilient and, in places, made them more susceptible. This is now corroborated across multiple field studies.
Harder lures produce higher click rates, easier lures lower. No industry-standard methodology exists for calculating click rates. A team that wants a better number can simply pick easier lures next quarter and report improvement.
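The methodology gap is easy to demonstrate: the same campaign yields materially different "click rates" depending on which denominator the team picks. The figures below are invented for illustration:

```python
# Same campaign, three defensible "click rates". With no standard
# methodology, the denominator is a choice, and the choice moves the
# number the dashboard reports. Figures invented for illustration.

sent, delivered, opened, clicked = 1000, 940, 610, 84

rates = {
    "clicked / sent":      clicked / sent,
    "clicked / delivered": clicked / delivered,
    "clicked / opened":    clicked / opened,
}
for label, r in rates.items():
    print(f"{label}: {r:.1%}")
# clicked / sent:      8.4%
# clicked / delivered: 8.9%
# clicked / opened:    13.8%
```

Add lure difficulty as a second free variable and a quarter-over-quarter "improvement" can be manufactured without any control changing at all.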
The UK NCSC has stated in its public guidance for over half a decade that punishment-oriented programmes suppress reporting: users who fear reprisals will not report mistakes promptly, if at all. The behaviour the security team needs most is the one the programme discourages.
A hospital case study found employee workload was the dominant predictor of phishing vulnerability. Staff intended to detect attacks but could not under load. Any programme that frames clicking as a knowledge or competence failure is testing the wrong thing.
Real operators use lures that work against the specific organisation: invoice fraud against finance, CV attachments against HR, internal-looking calendar invites against engineering. Vendor template libraries test against a threat that no longer exists in current operations.
"Sarah clicked again" is not a work order. It identifies no control gap, no configuration drift, no policy correction. The hours spent on the simulation programme are hours not spent on DMARC enforcement, attachment policy, FIDO2 rollout, or conditional access, each of which would produce one.
Technical assessment of the email threat.
Three phases. Look at the endpoint's filetype handling. Look at the mail gateway and the defence layers behind it. Then take a working credential as given and work the success paths without bothering with social engineering. Solicitation is the awareness team's problem, not the assessor's.
What payloads can actually fire on this endpoint? Walk through the filetypes that show up in current adversary tooling and document how each behaves under duress on the standard build. The output is the filetype block list, refreshed as the threat moves.
- Filetype execution policy for ISO, LNK, HTA, SVG, CHM, XLL, OneNote
- Office macro policy, MOTW propagation, protected view
- ASR rules, SmartScreen, browser download handling
- AppLocker or WDAC posture, LOLBin reachability
- EDR detection coverage for known initial access techniques
- Local privilege boundaries, sudoers, UAC, autoelevation paths
The configuration review the vendor won't run on themselves. What actually lands in the inbox, what gets sandboxed, what gets stripped at the gateway.
- SPF, DKIM, DMARC enforcement posture (quarantine vs reject)
- Attachment sandboxing depth, dynamic analysis, file unwrap
- Link rewriting, time-of-click rescan, browser isolation handoff
- External sender flagging, impersonation protection, anti-spoof
- BEC detection, anomalous-send patterns, autoforwarding controls
- Transport rules, attachment block lists, executable handling
Credential in hand on day one. No social engineering, no waiting on a click. The question is what an attacker with creds can actually do once they're inside.
- MFA bypass viability: legacy auth, app passwords, device gaps
- Conditional access posture, named locations, session controls
- Token theft, AiTM proxy viability, primary refresh token reach
- OAuth consent attacks, app registration permissions
- Internal phish viability, mailbox rule abuse, SharePoint reach
- Data exfil paths, autoforward to external, eDiscovery export
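Mailbox rule abuse, one of the checklist items above, is a good example of how concrete these checks are. A sketch that flags the two rule patterns an attacker with a working credential typically plants; the rule schema and keyword list are illustrative, not any mail platform's API:

```python
# Sketch flagging suspicious inbox rules: silent external forwarding,
# and rules that hide security- or finance-themed mail. The rule dict
# schema and keyword list are illustrative.

SUSPECT_KEYWORDS = {"invoice", "password", "phish", "security alert"}

def suspicious(rule: dict, internal_domain: str) -> list[str]:
    reasons = []
    fwd = rule.get("forward_to", "")
    if fwd and not fwd.endswith("@" + internal_domain):
        reasons.append("forwards mail to an external address")
    if rule.get("delete") and any(
        k in rule.get("subject_contains", "").lower()
        for k in SUSPECT_KEYWORDS
    ):
        reasons.append("silently deletes security/finance-themed mail")
    return reasons

rule = {"forward_to": "drop@attacker.example",
        "subject_contains": "Invoice overdue", "delete": True}
print(suspicious(rule, "corp.example"))
# ['forwards mail to an external address',
#  'silently deletes security/finance-themed mail']
```

Each hit maps straight to a finding with a remediation ("autoforward to external blocked" in the gateway workstream above) rather than to a user to blame.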
"Block ISO and LNK at the gateway." "Move DMARC from quarantine to reject." "Disable legacy auth in conditional access." Every finding lands on someone with the authority and tooling to fix it. The remediation path is concrete and verifiable.
Credentials get stolen. Eventually. The defensible question is not whether a user will be phished, it is what an attacker with a working credential can do next. Assumed-compromise testing maps that surface directly to MITRE ATT&CK techniques the team can detect and contain.
Whether a user clicks is a function of workload, fatigue, and pretext quality, not of control posture. Decoupling the assessment from solicitation tests the controls cleanly. Solicitation is what the awareness team is for.
The controls a technical assessment exercises (filetype handling, gateway posture, MFA enforcement, conditional access) defend against phishing, smishing, drive-by download, malicious removable media, and supply-chain compromise. Awareness training defends only against the user-decision moment.
No employee is named, scored, or trained as a result of this assessment. Reporting culture is unaffected. The NCSC's stated concern about punitive simulation programmes is structurally avoided.
Every finding has a configuration artefact attached, every retest is a diff against a known state. Reports survive auditor scrutiny in a way that a click-rate trend graph does not.
The remediation backlog naturally lands on identity, mail gateway, and endpoint engineering, which is where the evidence base says investment produces resilience. Hours that would have gone on quarterly campaigns move to FIDO2 rollout, attachment policy, and conditional access.
Endpoint build review, gateway configuration audit, and assumed-compromise testing each require skilled hands. Mid-market organisations may need to engage an assessor. CHECK or CREST-aligned engagement is sensible for the assumed-compromise phase.
Skilled assessor days cost more than a SaaS simulation licence. The cost lands in a single procurement decision rather than spread across operations, which is harder to justify on a budget line even where total annual cost is comparable.
Some compliance frameworks ask whether a phishing simulation has been conducted in the prior period. A technical assessment is a stronger answer, but the box still has a specific name on it. The mitigation is to keep a lightweight awareness programme and align it with the assessment findings.
A workforce still needs to know what to report and how. The technical assessment removes the assessment from the workforce, but the reporting culture, the report button, and the response loop still need to be designed. NCSC's layered defence still applies.
Phishing as red team initial access.
Same logic for commissioned red team engagements. Phishing as the entry vector is the most expensive and least useful way to start. Assume-breach is the better default. Even the frameworks that used to insist on full-chain testing now say so.
- Threat intelligence phase identifies plausible attacker pretexts and likely targets within the workforce.
- Infrastructure setup: registered domains, SSL, mail relay, payload development, evasion testing against the target's email security stack.
- Lures delivered to selected targets. The team waits for clicks, credential submissions, or sandboxed execution.
- Successful initial access feeds the post-compromise phase, with whatever time remains.
For regulated scenarios where the threat model explicitly includes external initial access, end-to-end testing answers a specific question. TIBER-EU's standard variant covers this. The trade-off is that the answer is rarely surprising.
The gateway, sandboxing, and impersonation controls get an end-to-end workout that a configuration review cannot fully replicate. The catch is that this is also achievable with a much narrower mail security assessment for a fraction of the cost.
CBEST and DORA both reference threat-intelligence-led testing that includes initial access scenarios. For financial entities subject to those regimes, full-chain testing is sometimes expected. The expectation is itself softening: TIBER variants now explicitly allow assume-breach.
Every credible threat report from Mandiant, CrowdStrike, and Verizon over the last decade shows that determined adversaries achieve initial access. Spending a quarter of a multi-week engagement re-establishing this point surfaces no actionable new finding. The defensible question is what happens after.
If the SOC catches the phish, which is the entire point of the awareness and email security programme, the red team has nowhere to go. The client paid for a multi-week engagement and got one finding about their mail filter.
Registered domains, SSL certificates, mail relay configuration, payload development, anti-detection tuning. None of it carries to the next engagement. One to two weeks of senior engineer time goes on building the entry vehicle rather than on testing the client's controls.
"We phished a finance contractor" tells you very little about whether the same approach would work against a different role next quarter. "Lateral movement from a low-priv user succeeded via SeImpersonatePrivilege abuse" tells you everything, because the technique applies to every user with that privilege.
Whether a user clicks depends on their workload, fatigue, pretext quality, and the time of day the lure lands. The same engagement run twice produces different results. Assumed-breach starts from the same position every time, which makes year-over-year improvement measurable.
A red team phish lands in the same inboxes that the awareness team is trying to measure separately. The signals contaminate each other. Real phishing reports become indistinguishable from red team simulations, and trend data gets noisy.
In a six-week engagement, post-compromise work might get two to three weeks. The same two to three weeks against the same internal estate, with a foothold provided on day one, is the assumed-breach engagement. The client pays roughly a third more for the privilege of seeing the email security stack tested in the same engagement, which a much cheaper assessment could have covered.
- The client provides a low-privilege foothold. A standard user account, an unprivileged workstation in the target environment, or both.
- The engagement starts on day one in the post-compromise position. No infrastructure setup, no waiting for clicks.
- Lateral movement, privilege escalation, persistence, and impact testing proceed against the real internal control plane.
- Findings map directly to MITRE ATT&CK techniques the SOC can detect and contain. The blue team gets actionable improvement work.
No infrastructure setup, no lure development, no waiting on clicks. The full window goes to lateral movement, privilege escalation, persistence, and impact. The deliverable density is higher per pound spent.
Same starting position, same scope, same measurement framework. Improvement actually shows up. Two consecutive phishing-led engagements with different lure performance tell you very little about whether the internal controls got better.
TIBER variants now formally allow the initial entry phase to be skipped. Operational reality has caught up with the methodology. CBEST has long permitted intelligence-led scenarios that begin from documented footholds. The framework consensus is moving in this direction.
In phishing-led testing, detection is the failure condition. In assumed-breach, detection is a finding to celebrate and refine. The blue team's catches become rule-tuning material in the closing purple-team session.
"T1078.002 Domain Accounts," "T1558.003 Kerberoasting," "T1484.001 Group Policy Modification." Each finding lands in a framework the SOC already uses for detection engineering. The remediation backlog writes itself.
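Because the tags follow a fixed format (T plus four digits, with an optional three-digit sub-technique suffix), findings can be machine-validated before they're routed into the detection-engineering backlog. A small sketch with invented finding text:

```python
import re

# ATT&CK technique IDs follow a fixed shape, so findings can be
# validated before routing into the SOC backlog. Finding text invented.
TECHNIQUE = re.compile(r"^T\d{4}(\.\d{3})?$")

findings = {
    "T1078.002": "Domain account abuse reachable from foothold",
    "T1558.003": "Kerberoastable service accounts with weak passwords",
    "T9999999":  "malformed tag: won't route anywhere",
}
valid = [t for t in findings if TECHNIQUE.fullmatch(t)]
print(valid)  # ['T1078.002', 'T1558.003']
```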
No infrastructure cost, shorter engagement window, no failure-mode rework. The savings come out of the part of the budget that was producing the least new information anyway.
The workforce is untouched. The phishing-report channel keeps its signal-to-noise ratio. The two programmes (red team and awareness) can run on independent schedules without interfering with each other's metrics.
The email security stack, perimeter web exposure, and edge identity controls go untested in a pure assumed-breach engagement. The mitigation is to commission separate, narrower assessments for those (which is also cheaper than rolling them into the red team scope).
Some procurement and risk functions struggle with this. The provided credential needs scoping, time-boxing, and revocation procedures. This is a process problem, not a methodology problem, and it solves itself once a single engagement has been run through it.
Senior stakeholders who think "red team" means "we tested whether attackers could get in" may need re-education. The honest answer (they will, eventually, and what matters is what happens next) is harder to put on a slide than a click-rate trend graph.
Assumed-breach is most valuable when the SOC and detection engineering function are mature enough to absorb the findings and improve. For organisations without that capability, the engagement still produces a remediation backlog, but the deeper purple-team value is harder to realise.
The SOC feedback loop.
A good phish that gets past the gateway, the sandbox, the link rewriter, and the EDR, and lands in someone's inbox, isn't a control failure. It's free, current, targeted threat intel about what's working against you right now. Most programmes have nowhere to put it. Without a loop, everything else is guesswork.
The lure landed. The user now has a sample of what just bypassed everything you've got.
SOC picks it up, decides benign, suspicious, or malicious. Real adversary stuff jumps the queue.
IOCs, TTPs, sender infrastructure, attachment behaviour, link patterns, who got targeted. Pulled out in a structured format that the rest of the stack can ingest.
Block what landed. Block what looks like it. Hunt the inbox set and the SIEM history for earlier related deliveries that nobody reported.
Tell the reporter what their report did. Update the threat model. Share with peers through an ISAC, sector group, or NCSC where you can. Feed the IOC and TTP set into the next assumed-breach engagement so it's testing against the real adversary.
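The extraction step above has a recognisable shape, whatever tooling runs it. A minimal sketch pulling URLs and domains out of a reported lure into a structure the blocklist and the SIEM hunt can ingest; real pipelines parse full MIME, headers, and attachments, and all names here are invented:

```python
import re

# Sketch of the IOC-extraction step: pull URLs and domains out of a
# reported lure. Real pipelines parse full MIME, headers, attachments;
# this shows the shape. All example values are invented.

URL = re.compile(r"https?://[^\s\"'>]+")

def extract_iocs(body: str, sender: str) -> dict:
    urls = URL.findall(body)
    # Domain of each linked URL, plus the sender's domain.
    domains = {u.split("/")[2].lower() for u in urls}
    domains.add(sender.split("@")[-1].lower())
    return {"urls": urls, "domains": sorted(domains)}

report = extract_iocs(
    "Your mailbox is full, verify at "
    "https://mail-verify.attacker.example/login",
    "it-support@attacker.example",
)
print(report["domains"])
# ['attacker.example', 'mail-verify.attacker.example']
```

The output feeds three of the loop's steps at once: block what landed, hunt the SIEM history for earlier related deliveries, and scope the next assumed-breach engagement against the real adversary's infrastructure.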
Side by side on the dimensions that matter.
Where does your programme sit on the spectrum?
Nine questions, scored 0 to 3 each. Find out where the programme sits between simulation theatre and assessment-led resilience with a closed loop.
Answer straight. The score is for you, not an auditor.
References
- Phishing in Organizations: Findings from a Large-Scale and Long-Term Study. IEEE Symposium on Security and Privacy (S&P), 2022. 14,000 employees, 15 months. arxiv.org/pdf/2112.07498
- Content, Nudges and Incentives: A Study on the Effectiveness and Perception of Embedded Phishing Training. ACM CCS 2024. Distinguished Paper Award. 4,554 participants. arxiv.org/abs/2409.01378
- A Large-Scale Empirical Assessment of Multi-Modal Training Grounded in the NIST Phish Scale. arXiv preprint, 2025. 12,511 participants. arxiv.org/html/2506.19899v1
- Phishing simulation exercise in a large hospital: a case study. PubMed Central. Workload-driven susceptibility analysis. ncbi.nlm.nih.gov/pmc/articles/PMC8935590
- Phishing attacks: defending your organisation. NCSC official guidance, multi-layer mitigation framework, updated 2025. ncsc.gov.uk/guidance/phishing
- TIBER-EU: Threat Intelligence-based Ethical Red Teaming. Framework for controlled, intelligence-led red team testing of financial entities, supporting DORA compliance. ecb.europa.eu/paym/cyber-resilience/tiber-eu
- TIBER-NL and TIBER-Rijk assume-breach variants. Explicit framework provision for skipping the initial entry phase where operationally or legally appropriate. tiber.info/documentation
- Threat-Informed Defense. Continuous feedback loop between adversary behaviour, control visibility, and operational readiness. ATT&CK is the underlying taxonomy. mitre.org/focus-areas/cybersecurity/threat-informed-defense