Operations · Advanced · 50 min read

SOC Operations Framework

Build and mature a 24/7 security operations center with tiered analyst structure, runbooks, metrics, and continuous improvement processes.

SBK Security Team
Security Operations Practice
Updated December 2024

SOC Mission and Value Proposition

A Security Operations Center (SOC) serves as the centralized function responsible for continuous monitoring, detection, analysis, and response to cybersecurity threats. An effective SOC transforms security from a reactive checkbox exercise into a proactive, intelligence-driven capability that protects the organization's most critical assets.

A centralized team and technology platform responsible for 24/7 monitoring, detection, analysis, investigation, and response to cybersecurity incidents. The SOC combines people, processes, and technology to provide continuous security oversight.

Core SOC Missions

Continuous Monitoring

Maintain 24/7/365 visibility across the enterprise technology stack, including networks, endpoints, cloud infrastructure, applications, and identity systems. Aggregate and correlate security events to identify anomalies and potential threats.

Threat Detection & Analysis

Develop and deploy detection logic to identify malicious activity, policy violations, and security control failures. Analyze alerts to distinguish true positives from false positives and determine threat severity.

Incident Response

Execute coordinated response procedures to contain, eradicate, and recover from security incidents. Minimize business impact through rapid triage, escalation, and remediation.

Threat Intelligence

Consume, analyze, and operationalize threat intelligence to improve detection capabilities and inform proactive defense strategies. Share intelligence across the organization and with external partners.

Security Posture Management

Monitor compliance with security policies, identify control gaps, track vulnerability remediation, and provide metrics that demonstrate security effectiveness to leadership.

Business Value Realization: A mature SOC reduces Mean Time to Detect (MTTD) from months to hours, Mean Time to Respond (MTTR) from weeks to minutes, and provides measurable risk reduction through continuous improvement cycles. Organizations with effective SOCs experience 60-80% reduction in breach costs compared to those without dedicated security operations.

When to Build a SOC

Not every organization needs a full-scale SOC immediately. Consider these indicators:

  • Regulatory requirements mandate continuous monitoring (PCI-DSS, HIPAA, GDPR)
  • Organization size exceeds 500 employees or handles sensitive customer data
  • Technology complexity includes multi-cloud, hybrid infrastructure, or critical OT/ICS systems
  • Threat landscape includes targeted attacks, nation-state threats, or high-value intellectual property
  • Incident history shows recurring breaches or slow incident response

SOC Operating Models

Organizations must choose a SOC operating model that aligns with their budget, staffing capabilities, regulatory requirements, and risk tolerance. Each model offers distinct advantages and trade-offs in cost, control, expertise, and scalability.


In-House SOC (Fully Internal)

Advantages

  • Complete control over people, processes, and technology
  • Deep organizational knowledge and context-aware analysis
  • No third-party access to sensitive security data
  • Rapid communication with internal stakeholders
  • Custom detection logic tailored to unique risks

Challenges

  • High upfront capital and ongoing operational costs
  • Difficulty recruiting and retaining skilled analysts (25-30% annual turnover)
  • 24/7 coverage requires 6-8 FTEs minimum
  • Technology sprawl and integration complexity
  • Limited exposure to diverse threat patterns across industries
⚠️ Recommended For: Large enterprises (5,000+ employees), highly regulated industries (financial services, healthcare), organizations with unique security requirements or high-value intellectual property, or those with sufficient budget ($1.5M+ annually).

Organizational Structure and Roles

Effective SOCs employ a tiered analyst model that balances efficiency, expertise development, and career progression. This structure ensures rapid triage of high-volume alerts while reserving senior talent for complex investigations and strategic initiatives.

A hierarchical structure where analysts are organized by skill level and responsibility: Tier 1 (alert triage), Tier 2 (incident investigation), Tier 3 (threat hunting and detection engineering). This model optimizes resource allocation and provides clear career progression paths.

Tier 1: Security Analyst (Alert Triage)

Entry Level

Primary Responsibilities

  • Monitor security alerts from SIEM, EDR, IDS/IPS, email security
  • Perform initial triage to classify alerts as true positive, false positive, or benign
  • Execute predefined playbooks for common scenarios (phishing, malware, failed logins)
  • Document findings in ticketing system with supporting evidence
  • Escalate confirmed threats to Tier 2 with context and initial analysis
  • Assist with vulnerability scanning and basic remediation tracking

Required Skills & Qualifications

  • Education: Bachelor's in IT, Cybersecurity, or equivalent experience
  • Certifications: Security+, CySA+, or GIAC GSEC recommended
  • Technical: Basic networking (TCP/IP, DNS, HTTP), Windows/Linux fundamentals, log analysis
  • Tools: SIEM query languages (SPL/KQL/Lucene), ticketing systems, EDR consoles

Performance Metrics

  • Alert closure rate: 20-30 alerts/shift (varies by environment)
  • False positive identification accuracy: >90%
  • Escalation quality: <10% escalations returned due to insufficient context
  • Time to triage: <15 minutes for medium severity, <5 minutes for critical

Tier 2: Incident Responder (Investigation)

Intermediate

Primary Responsibilities

  • Conduct in-depth investigations of escalated incidents
  • Perform forensic analysis (memory, disk, network captures)
  • Coordinate containment and remediation activities with IT teams
  • Develop indicators of compromise (IOCs) for detection improvements
  • Author post-incident reports with root cause analysis and recommendations
  • Mentor Tier 1 analysts and improve playbook quality

Required Skills & Qualifications

  • Experience: 2-4 years in SOC or security operations role
  • Certifications: GCIH, GCIA, CEH, or equivalent incident response credentials
  • Technical: Advanced log analysis, malware analysis basics, network forensics, scripting (Python/PowerShell)
  • Frameworks: MITRE ATT&CK, Cyber Kill Chain, NIST IR lifecycle

Performance Metrics

  • Mean Time to Respond (MTTR): <4 hours for high severity, <1 hour for critical
  • Investigation quality: Peer review score >85%
  • Containment effectiveness: <5% re-infection rate within 30 days
  • Knowledge contribution: 2+ playbook improvements or new detections per quarter

Tier 3: Threat Hunter / Detection Engineer

Expert

Primary Responsibilities

  • Conduct proactive threat hunting campaigns to identify undetected threats
  • Design and implement advanced detection logic (correlation rules, behavioral analytics)
  • Perform adversary emulation (purple team exercises)
  • Research emerging threats and translate to actionable defenses
  • Optimize SIEM performance and reduce false positives
  • Lead major incident response for APT or ransomware campaigns

Required Skills & Qualifications

  • Experience: 5+ years in security operations, incident response, or threat intelligence
  • Certifications: GIAC GCFA/GREM, OSCP, CISSP, or SANS FOR508/SEC504
  • Technical: Advanced malware analysis, reverse engineering, threat modeling, data science (UEBA/ML)
  • Programming: Python, PowerShell, Sigma, YARA, KQL/SPL mastery

Performance Metrics

  • Threat hunting yield: 1+ confirmed compromise per quarter from proactive hunts
  • Detection development: 5+ high-fidelity detections per quarter
  • False positive reduction: 20% improvement annually through tuning
  • Purple team outcomes: Detection coverage increase by 15% per exercise
Staffing Ratio Guidelines: For 24/7 coverage with redundancy, plan for 6-8 FTEs per tier level (accounting for vacation, training, turnover). A balanced SOC typically maintains a 4:2:1 ratio (Tier 1:Tier 2:Tier 3). Example: 24 Tier 1, 12 Tier 2, 6 Tier 3 analysts for enterprise-scale operations.

SOC Technology Stack

A modern SOC requires integrated technologies across detection, investigation, response, and intelligence domains. Platform selection should prioritize integration capabilities, scalability, and analyst-friendly workflows over feature checklists.


Core Technology Components

SIEM (Security Information and Event Management)

Key Capabilities:
  • Ingest 500GB-50TB+ logs daily from diverse sources (Windows, Linux, network, cloud)
  • Real-time correlation engine with sub-second latency for critical rules
  • Advanced analytics (UEBA, ML-based anomaly detection)
  • Pre-built content libraries (use cases, dashboards, reports)
  • Investigation workflows with case management
  • Long-term retention (1-2 years hot, 3-7 years cold/archival)
Platform Examples:
  • Splunk Enterprise Security: Market leader, extensive ecosystem, high cost
  • Microsoft Sentinel: Cloud-native, tight Azure integration, consumption-based pricing
  • Elastic Security (ELK): Open-source core, flexible, requires more in-house expertise
  • IBM QRadar: Strong compliance features, traditional enterprise focus
  • Chronicle (Google): Massive scale, unique architecture, newer to market

EDR (Endpoint Detection and Response)

Key Capabilities:
  • Behavioral analytics to detect fileless malware, ransomware, and living-off-the-land attacks
  • Process-level telemetry with full command-line visibility
  • Automated response actions (isolate host, kill process, quarantine file)
  • Threat hunting interface with timeline reconstruction
  • Integration with threat intelligence feeds for real-time IOC matching
Platform Examples:
  • CrowdStrike Falcon: Cloud-native, lightweight agent, strong threat intelligence
  • Microsoft Defender for Endpoint: Deep Windows integration, included in M365 E5
  • SentinelOne: Autonomous response, AI-driven, strong Mac/Linux support
  • Carbon Black (VMware): Extensive telemetry capture, strong forensic capabilities

NDR (Network Detection and Response)

Key Capabilities:
  • Deep packet inspection (DPI) with protocol analysis
  • Machine learning-based anomaly detection (unusual traffic volumes, new protocols)
  • Encrypted traffic analysis (TLS fingerprinting, certificate inspection)
  • Asset discovery and network mapping
  • PCAP capture for forensic investigation
Platform Examples:
  • Darktrace: AI-driven, autonomous response, self-learning
  • Vectra AI: Focus on hybrid/cloud environments, prioritized threat scoring
  • ExtraHop Reveal(x): Wire-data analytics, strong forensics
  • Corelight (Zeek-based): Open-source foundation, flexible deployment
⚠️ Integration is Critical: SIEM, EDR, and NDR must share telemetry bidirectionally. EDR/NDR alerts should trigger SIEM correlation; SIEM detections should enable automated EDR containment. Plan for API integrations, standardized log formats, and unified case management from day one.

Detection Engineering

Detection engineering is the discipline of designing, implementing, testing, and maintaining detection logic that identifies malicious activity with high fidelity. Effective detection engineering balances coverage (detecting threats) with precision (minimizing false positives).

Step 1: Use Case Development

Identify detection opportunities from threat intelligence, compliance requirements, and organizational risks

Use Case Identification Sources

  • Threat Intelligence: Emerging campaigns, TTPs from threat reports (MITRE ATT&CK mapping)
  • Regulatory Requirements: PCI-DSS 10.6.1 (review logs daily), HIPAA audit controls
  • Incident History: Past breaches or near-misses in your organization
  • Red Team Findings: Techniques used in penetration tests or purple team exercises
  • Asset Criticality: High-value systems requiring enhanced monitoring (CFO laptop, domain controllers)

Use Case Template

Name: Detect Kerberoasting Activity
Objective: Identify attempts to extract service account credentials via Kerberos TGS requests
MITRE ATT&CK: T1558.003 (Steal or Forge Kerberos Tickets: Kerberoasting)
Data Sources: Windows Event ID 4769 (TGS request), filtering for RC4 encryption and unusual SPNs
Detection Logic: Alert when single user requests 5+ TGS tickets within 10 minutes using RC4 encryption
False Positive Scenarios: Legitimate service accounts, automated inventory scans
Response Actions: Investigate user activity, check for lateral movement, reset service account passwords
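
The detection logic above can be prototyped before the production SIEM rule is written. The following is a minimal Python sketch, assuming Event ID 4769 records have already been exported and parsed into dictionaries with illustrative `user`, `timestamp`, and `encryption_type` fields; the real rule would live in your SIEM's query language (SPL/KQL).

```python
from collections import defaultdict
from datetime import datetime, timedelta

THRESHOLD = 5                   # 5+ TGS requests...
WINDOW = timedelta(minutes=10)  # ...within a 10-minute window
RC4 = "0x17"                    # RC4-HMAC ticket encryption type in Event ID 4769

def detect_kerberoasting(events):
    """Return users whose RC4 TGS request count meets the threshold in any window."""
    per_user = defaultdict(list)
    for e in events:
        if e["encryption_type"] == RC4:
            per_user[e["user"]].append(datetime.fromisoformat(e["timestamp"]))
    hits = []
    for user, times in per_user.items():
        times.sort()
        # Sliding window: compare each request to the one THRESHOLD-1 positions earlier.
        for i in range(len(times) - THRESHOLD + 1):
            if times[i + THRESHOLD - 1] - times[i] <= WINDOW:
                hits.append(user)
                break
    return hits

# Illustrative exported events: six RC4 TGS requests by one user within a few minutes.
events = [{"user": "jdoe", "timestamp": f"2024-12-01T09:0{m}:00", "encryption_type": "0x17"}
          for m in range(6)]
print(detect_kerberoasting(events))  # ['jdoe'] -> investigate per the response actions above
```

Exclusions for the false positive scenarios listed above (legitimate service accounts, inventory scans) would be added as a filter before counting.
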
Step 2: Detection Development

Translate use cases into executable detection logic using appropriate methodologies

Detection Methodologies

Signature-Based Detection

Match known patterns: file hashes, IP addresses, domains, regex patterns. Fast and precise but requires prior knowledge of threat.

Example: Alert on execution of file with SHA256 hash matching known Emotet variant.

Behavioral Analytics

Detect anomalies in user/system behavior: unusual login times, abnormal process execution, baseline deviations. Catches novel threats but higher false positive rate.

Example: Alert when user accesses 10x more files than their 30-day average within 1 hour.

Correlation Rules

Combine multiple low-fidelity signals into high-fidelity alert: failed login + successful login + privilege escalation = credential compromise.

Example: Alert when user has 5+ failed RDP logins followed by successful login within 5 minutes.
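
As a concrete illustration of the correlation example above, the sketch below chains failed and successful logins per user. The field names (`user`, `outcome`, `timestamp`) are assumptions about an already-normalized event feed, not a specific product schema; in production this logic lives in a SIEM correlation rule.

```python
from datetime import datetime, timedelta

FAILED_THRESHOLD = 5
WINDOW = timedelta(minutes=5)

def failed_then_success(auth_events):
    """Flag users with 5+ failed logins followed by a successful login within 5 minutes."""
    alerts, failures = [], {}
    for e in sorted(auth_events, key=lambda e: e["timestamp"]):
        ts = datetime.fromisoformat(e["timestamp"])
        recent = [f for f in failures.get(e["user"], []) if ts - f <= WINDOW]
        if e["outcome"] == "failure":
            failures[e["user"]] = recent + [ts]
        elif e["outcome"] == "success" and len(recent) >= FAILED_THRESHOLD:
            alerts.append({"user": e["user"], "success_at": e["timestamp"]})
    return alerts

events = ([{"user": "jdoe", "outcome": "failure", "timestamp": f"2024-12-01T10:00:{s:02d}"}
           for s in range(5, 55, 10)]
          + [{"user": "jdoe", "outcome": "success", "timestamp": "2024-12-01T10:01:30"}])
print(failed_then_success(events))  # one alert for jdoe -> possible credential compromise
```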

Threat Intelligence Matching

Compare events against external threat feeds: STIX/TAXII IOCs, reputation databases. Relies on intelligence quality and timeliness.

Example: Alert on DNS query to domain in APT28 C2 infrastructure list.

Detection-as-Code Practices

  • Store detection rules in version control (Git)
  • Use platform-agnostic formats (Sigma, YARA) when possible
  • Implement CI/CD pipelines for testing and deployment
  • Document detection logic with inline comments and README files
  • Tag rules with metadata (MITRE ATT&CK, severity, author)
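
The CI/CD step above can include a lightweight metadata lint before deployment. The sketch below is a minimal illustration, assuming rules are stored as Sigma-style YAML under a hypothetical rules/ directory and that PyYAML is installed (pip install pyyaml); adjust the required fields to your own rule schema.

```python
import sys
from pathlib import Path

import yaml  # PyYAML

REQUIRED_FIELDS = {"title", "description", "author", "level", "tags"}  # illustrative policy

def lint_rules(rules_dir="rules"):
    """Fail the pipeline when a rule is missing required metadata or an ATT&CK tag."""
    problems = []
    for path in sorted(Path(rules_dir).glob("**/*.yml")):
        rule = yaml.safe_load(path.read_text()) or {}
        missing = REQUIRED_FIELDS - rule.keys()
        if missing:
            problems.append(f"{path}: missing fields {sorted(missing)}")
        # Sigma convention: technique tags look like 'attack.t1558.003'.
        elif not any(str(t).lower().startswith("attack.") for t in rule.get("tags", [])):
            problems.append(f"{path}: no MITRE ATT&CK tag")
    return problems

if __name__ == "__main__":
    issues = lint_rules()
    print("\n".join(issues) or "All rules passed metadata lint")
    sys.exit(1 if issues else 0)  # non-zero exit blocks the deployment stage
```
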
Step 3: Testing & Validation

Validate detection logic against benign activity and known attack patterns before production deployment

Testing Methodologies

Unit Testing (Benign Baseline)

Run detection against historical "clean" data to measure false positive rate. Target: <5 false positives per 1,000 events for high-severity rules.

Method: Query last 30 days of logs matching detection criteria. Manually review results to identify legitimate activity triggering alert.
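
The false positive target above reduces to a simple ratio once the matched events from the 30-day query have been labeled. A minimal sketch, assuming each matched event carries a manual verdict:

```python
def false_positives_per_thousand(verdicts):
    """verdicts: one of 'tp', 'fp', or 'benign' per event matched by the candidate rule."""
    fp = sum(1 for v in verdicts if v == "fp")
    return 1000 * fp / len(verdicts) if verdicts else 0.0

# Example: 3 false positives across 1,200 matched events -> 2.5 per 1,000 (within the <5 target)
print(false_positives_per_thousand(["fp"] * 3 + ["tp"] * 7 + ["benign"] * 1190))
```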

Attack Simulation (True Positive Validation)

Execute actual attack technique in controlled environment to confirm detection fires. Use frameworks like Atomic Red Team or CALDERA.

Example: Run Invoke-Kerberoast.ps1 in lab domain to validate Kerberoasting detection.

Purple Team Exercises

Coordinate with red team to execute TTPs. Blue team validates detection, measures MTTD, and improves response playbooks.

Outcome: Detection gap analysis showing which techniques were/weren't detected.

Validation Checklist

  • Detection fires during controlled attack simulation (true positive confirmed)
  • False positive rate <5% on historical data
  • Alert contains sufficient context for triage
  • Performance impact: <5% increase in query latency
  • Peer review completed by senior detection engineer
  • Documentation updated (playbook, runbook, MITRE mapping)

Step 4: Deployment & Tuning

Deploy validated detections to production and continuously tune based on operational feedback

Deployment Strategy

  • Staged Rollout: Deploy to subset of assets (e.g., 10% of endpoints) for 7 days
  • Monitor Mode: Generate alerts but don't trigger automated response actions initially
  • Analyst Training: Brief SOC team on new detection, expected alerts, response procedures
  • Change Control: Document deployment in change management system with rollback plan

Continuous Tuning Process

  1. Weekly Review: Analyze all alerts from the detection, classify as TP/FP/benign
  2. False Positive Analysis: Identify root cause (legitimate software, misconfigured threshold)
  3. Rule Refinement: Add exclusions, adjust thresholds, or improve correlation logic
  4. Validation: Re-test tuned rule against both benign and malicious datasets
  5. Documentation: Update rule changelog with tuning rationale
⚠️ Tuning Anti-Pattern: Avoid "tuning by suppression"—blindly adding exclusions without understanding root cause. This creates detection blind spots. Every exclusion should be documented with business justification and expiration date.

Incident Response Procedures

Incident response procedures define how the SOC triages, escalates, investigates, contains, and remediates security incidents. Effective procedures balance speed (minimizing dwell time) with thoroughness (preserving forensic evidence and preventing recurrence).

Step 1: Alert Triage & Classification

Rapid assessment to determine if alert represents genuine security incident

Triage Decision Tree

True Positive (Confirmed Threat)

Alert represents actual malicious activity. Examples: Known malware hash, confirmed C2 communication, unauthorized privilege escalation.

Action: Escalate to Tier 2 for investigation. Create incident ticket. Begin containment if critical severity.

Suspicious (Requires Investigation)

Alert shows anomalous behavior but lacks definitive indicators of compromise. Examples: Unusual login location, new process execution, unexpected network connection.

Action: Perform initial enrichment (user context, asset criticality, recent activity). Escalate if risk indicators present.

Benign (Authorized Activity)

Alert triggered by legitimate business activity. Examples: Authorized penetration test, approved maintenance, known software behavior.

Action: Document as benign with justification. Consider tuning detection to exclude this scenario.

False Positive (Detection Error)

Alert incorrectly classified benign activity as malicious. Examples: Misconfigured threshold, overly broad signature, data parsing error.

Action: Document as false positive. Submit to detection engineering for tuning. Track FP rate for quality metrics.

Triage SLA Targets

  • Critical: Initial triage within 5 minutes, classification within 15 minutes
  • High: Initial triage within 15 minutes, classification within 30 minutes
  • Medium: Initial triage within 1 hour, classification within 4 hours
  • Low: Initial triage within 24 hours, classification within 72 hours
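
These SLA targets are easy to encode in the ticketing or SOAR layer so that breaches are flagged automatically. A minimal sketch using the classification targets above (the alert field names are illustrative):

```python
from datetime import datetime, timedelta

CLASSIFICATION_SLA = {
    "critical": timedelta(minutes=15),
    "high": timedelta(minutes=30),
    "medium": timedelta(hours=4),
    "low": timedelta(hours=72),
}

def sla_breached(alert):
    """Return True when classification took longer than the target for the alert's severity."""
    elapsed = (datetime.fromisoformat(alert["classified_at"])
               - datetime.fromisoformat(alert["created_at"]))
    return elapsed > CLASSIFICATION_SLA[alert["severity"]]

alert = {"severity": "high",
         "created_at": "2024-12-01T10:00:00",
         "classified_at": "2024-12-01T10:45:00"}
print(sla_breached(alert))  # True -> 45 minutes exceeds the 30-minute high-severity target
```
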
Step 2: Incident Investigation

In-depth analysis to determine scope, impact, and root cause of confirmed incidents

Investigation Framework (5 W's)

What Happened?

Identify specific malicious actions: malware executed, data accessed, credentials compromised, systems affected.

Data Sources: EDR telemetry, SIEM correlation, network flow data, authentication logs

Who Is Affected?

Determine impacted users, systems, and data. Assess asset criticality and data sensitivity.

Data Sources: CMDB, asset inventory, data classification database, identity management system

When Did It Occur?

Establish timeline: initial access, persistence establishment, privilege escalation, lateral movement, data exfiltration.

Method: Correlate event timestamps across data sources. Identify earliest indicator (initial compromise date).

Where Did It Spread?

Map lateral movement paths, identify compromised systems, assess blast radius.

Indicators: Shared credentials across systems, unusual network connections, similar malware artifacts on multiple hosts

Why Did Defenses Fail?

Root cause analysis: control gap (no EDR on server), detection gap (missed technique), response gap (delayed containment).

Outcome: Findings inform improvement roadmap (new detections, additional controls, process changes)

Evidence Collection Best Practices

  • Preserve volatile data first (memory dumps, running processes)
  • Maintain chain of custody for forensic integrity
  • Use write-blockers for disk imaging to prevent modification
  • Hash all evidence files (SHA256) for integrity verification
  • Store evidence in secure repository with access controls
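
Hashing evidence for integrity verification is simple to script. A minimal sketch, assuming collected artifacts sit under a hypothetical evidence/ directory and the manifest is kept alongside the case record:

```python
import csv
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def write_evidence_manifest(evidence_dir="evidence", manifest="evidence_manifest.csv"):
    """Record a SHA-256 digest, size, and timestamp for every collected artifact."""
    with open(manifest, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["file", "sha256", "size_bytes", "hashed_at_utc"])
        for path in sorted(Path(evidence_dir).rglob("*")):
            if not path.is_file():
                continue
            # read_bytes() is fine for small artifacts; hash large disk/memory images in chunks.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            writer.writerow([str(path), digest, path.stat().st_size,
                             datetime.now(timezone.utc).isoformat()])

if __name__ == "__main__":
    write_evidence_manifest()
```
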
Step 3: Containment & Eradication

Isolate affected systems and remove attacker access to prevent further damage

Containment Strategies

Network Isolation

Disconnect compromised systems from network via EDR isolation, VLAN change, or physical disconnection. Preserves evidence while preventing lateral movement.

Use When: Active attacker presence, suspected C2 communication, or rapid spread observed

Account Suspension

Disable compromised user accounts, reset passwords, revoke access tokens. Prevents credential-based lateral movement.

Use When: Credential theft confirmed, unauthorized access via valid credentials, or insider threat suspected

IP/Domain Blocking

Add malicious IPs/domains to firewall deny lists, DNS sinkhole, or web proxy blocks. Disrupts C2 communication and prevents reinfection.

Use When: Known C2 infrastructure identified, malware download sites discovered, or phishing domains detected

Surgical Remediation

Remove specific malware artifacts, kill malicious processes, delete persistence mechanisms without full system rebuild. Faster but higher reinfection risk.

Use When: Business-critical system can't be offline, malware is well-understood, or full forensic imaging is complete

Eradication Checklist

  • All malware binaries removed from affected systems
  • Persistence mechanisms eliminated (scheduled tasks, registry keys, services)
  • Compromised credentials reset (passwords, SSH keys, API tokens)
  • Lateral movement paths closed (firewall rules, network segmentation)
  • Vulnerability patched or mitigated (if exploited for initial access)
  • Monitoring enhanced for reinfection indicators (IOCs, behavioral patterns)
⚠️ Containment Trade-offs: Aggressive containment (full network isolation) minimizes damage but may disrupt business operations. Coordinate with business stakeholders to balance risk reduction with operational impact. For ransomware or data exfiltration, prioritize rapid containment over business continuity.

Step 4: Recovery & Validation

Restore systems to known-good state and verify attacker has been completely removed

Recovery Procedures

System Rebuild (Gold Standard)

Reimage from clean baseline, reinstall applications, restore data from pre-infection backups. Eliminates all attacker artifacts with high confidence.

Best For: Critical systems, confirmed rootkit/firmware compromise, or when eradication confidence is low

In-Place Remediation

Remove malware, patch vulnerabilities, harden configuration without rebuilding. Faster but requires thorough validation.

Best For: Non-critical systems, well-understood threats, or when downtime is unacceptable

Validation Testing

  • IOC Sweep: Scan all systems for known indicators of compromise (hashes, IPs, domains)
  • Behavioral Monitoring: Watch for reinfection patterns (unusual processes, network connections) for 72+ hours
  • Credential Verification: Confirm password resets completed, MFA enforced, access tokens revoked
  • Vulnerability Re-scan: Verify exploited vulnerabilities are patched across all affected systems
  • Business Function Testing: Validate critical applications and services are operational
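
The IOC sweep can be prototyped against exported telemetry while the EDR and SIEM searches run. The sketch below is a minimal illustration, assuming indicators and a JSON-lines log export have been saved to local files (paths and field names are hypothetical):

```python
import json
from pathlib import Path

def load_iocs(path="iocs.txt"):
    """One indicator per line: SHA-256 hashes, IP addresses, or domains."""
    return {line.strip().lower() for line in Path(path).read_text().splitlines() if line.strip()}

def sweep(log_path="export.jsonl", ioc_path="iocs.txt"):
    """Yield exported log records containing any known-bad indicator."""
    iocs = load_iocs(ioc_path)
    with open(log_path) as logs:
        for line in logs:
            record = json.loads(line)
            haystack = json.dumps(record).lower()
            # Substring matching is crude (e.g., partial IP overlaps); good enough for a first pass.
            if any(ioc in haystack for ioc in iocs):
                yield record

if __name__ == "__main__":
    for hit in sweep():
        print("IOC match:", hit.get("host", "?"), hit.get("event_id", "?"))
```
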
Recovery Sign-off: Require formal approval from SOC Manager, IT Operations, and business stakeholder before declaring incident fully resolved. Document validation testing results and residual risks in post-incident report.

Step 5: Post-Incident Activities

Document lessons learned and implement improvements to prevent recurrence

Post-Incident Report Contents

  • Executive Summary: High-level overview for non-technical stakeholders (what, when, impact, resolution)
  • Timeline: Chronological sequence of events from initial compromise to resolution
  • Root Cause Analysis: How attacker gained access, why defenses failed, contributing factors
  • Impact Assessment: Systems affected, data compromised, business disruption, financial cost
  • Response Effectiveness: MTTD, MTTR, what worked well, what needs improvement
  • Recommendations: Prioritized action items to prevent recurrence (technical controls, process changes, training)

Lessons Learned Session

Conduct blameless retrospective within 7 days of incident closure with all responders:

  • What went well during response?
  • What could have been done better/faster?
  • Were playbooks accurate and helpful? What's missing?
  • Did we have the right tools and access? What was lacking?
  • How effective was communication with stakeholders?

Improvement Tracking

Convert recommendations into actionable tasks with ownership and deadlines:

[CRITICAL] Deploy EDR to all servers (Owner: IT Ops, Due: 14 days)
[HIGH] Implement detection for lateral movement technique used (Owner: Detection Eng, Due: 30 days)
[MEDIUM] Update phishing playbook with new procedures learned (Owner: SOC Manager, Due: 7 days)
[LOW] Conduct tabletop exercise for ransomware scenario (Owner: SOC Manager, Due: 90 days)

Playbooks and Runbooks

Playbooks and runbooks standardize incident response procedures, reduce analyst decision fatigue, and ensure consistent handling of common scenarios. Playbooks provide strategic guidance ("what to do and why"), while runbooks offer tactical step-by-step instructions ("how to do it").


Essential Playbooks (Core 5)

1. Phishing Response Playbook

Trigger: User reports suspicious email, email security alert, credential harvesting attempt detected
Objectives: Validate phishing attempt, identify affected users, prevent credential compromise, remove malicious emails
Key Actions:
  1. Analyze email headers, links, attachments for malicious indicators
  2. Query email gateway for similar messages delivered to other users
  3. Purge malicious emails from all mailboxes
  4. Block sender domain/IP at email gateway and firewall
  5. Force password reset for users who clicked links or entered credentials
  6. Monitor for account compromise indicators (unusual logins, mailbox rules)
Escalation Criteria: Executive targeting (CEO/CFO), confirmed credential entry, malware execution detected

2. Malware Incident Playbook

Trigger: EDR malware alert, antivirus detection, suspicious process execution, file hash match to known malware
Objectives: Contain malware spread, determine malware family and capabilities, eradicate from all systems, identify initial infection vector
Key Actions:
  1. Isolate infected system via EDR or network disconnection
  2. Collect malware sample and submit to sandbox for analysis
  3. Extract IOCs (file hashes, registry keys, network indicators)
  4. Hunt for IOCs across enterprise (EDR, SIEM, NDR)
  5. Block C2 domains/IPs at firewall and DNS
  6. Remediate all infected systems (reimage or remove malware)
  7. Investigate initial access vector (email, download, exploit, removable media)
Escalation Criteria: Ransomware indicators, widespread infection (>10 systems), critical system affected, data exfiltration detected

3. Account Compromise Playbook

Trigger: Impossible travel alert, unusual login location, multiple failed logins followed by success, privileged access abuse
Objectives: Terminate unauthorized sessions, secure compromised account, assess attacker actions, prevent lateral movement
Key Actions:
  1. Immediately disable compromised account
  2. Terminate all active sessions for the account
  3. Revoke access tokens, API keys, and MFA enrollments
  4. Review account activity logs (file access, email sent, privilege changes)
  5. Check for persistence mechanisms (mailbox rules, delegations, OAuth grants)
  6. Hunt for lateral movement using compromised credentials
  7. Coordinate password reset with user (out-of-band verification)
Escalation Criteria: Privileged account (admin, service account), sensitive data accessed, evidence of lateral movement

4. Data Exfiltration Playbook

Trigger: Large volume data transfer, upload to cloud storage, unusual file access patterns, DLP policy violation
Objectives: Identify exfiltrated data, determine attacker destination, assess business impact, prevent additional exfiltration
Key Actions:
  1. Block destination IP/domain at firewall and web proxy
  2. Isolate source system to prevent continued exfiltration
  3. Identify files/data transferred (file names, size, classification)
  4. Review file access logs to determine scope (what was accessed vs. transferred)
  5. Assess data sensitivity (PII, IP, financials) for breach notification requirements
  6. Notify legal/compliance for regulatory obligations (GDPR, CCPA, HIPAA)
  7. Hunt for similar exfiltration attempts across environment
Escalation Criteria: Confirmed exfiltration of classified/regulated data, >100GB transferred, customer data involved

5. Ransomware Response Playbook

Trigger: Mass file encryption, ransom note detected, backup deletion, shadow copy deletion
Objectives: Immediate containment to prevent spread, assess impact and recovery options, restore from backups, notify stakeholders
Key Actions:
  1. IMMEDIATELY isolate all infected systems (network disconnect preferred over EDR isolation)
  2. Disable compromised accounts used for lateral movement
  3. Protect backups (isolate backup infrastructure, verify integrity)
  4. Identify ransomware variant (from ransom note, file extensions, behavior)
  5. Assess encryption scope (how many systems, what data)
  6. Notify executive leadership, legal, PR, cyber insurance
  7. Determine recovery strategy (restore from backups vs. rebuild)
  8. Hunt for initial access vector and persistence to prevent reinfection
Critical Decision Point:

Do NOT pay ransom without executive approval, legal consultation, and cyber insurance coordination. Payment does not guarantee decryption and may fund future attacks. Prioritize backup restoration.

SOC Metrics and KPIs

SOC metrics provide data-driven insights into operational performance, threat landscape trends, and program maturity. Effective metrics balance operational efficiency (speed, volume) with security effectiveness (detection quality, impact reduction).

Metrics Philosophy: Measure what matters, not just what's easy to measure. Focus on metrics that drive decision-making and continuous improvement. Avoid vanity metrics (alert volume) in favor of outcome metrics (MTTD, MTTR, prevention rate).

Operational Efficiency Metrics

Mean Time to Detect (MTTD)

Average time between when an attack begins and when it is detected by security controls.

Calculation: Sum(Detection Timestamp - Attack Start Timestamp) / Number of Incidents
Industry Benchmark: 21 days (2023 Mandiant M-Trends)
Target: Mature SOC: <24 hours, Advanced: <4 hours
Improvement Drivers: Better detection coverage, behavioral analytics, threat hunting

Mean Time to Respond (MTTR)

Average time between detection and successful containment/remediation.

Calculation: Sum(Containment Timestamp - Detection Timestamp) / Number of Incidents
Targets (by severity):
  • Critical: <1 hour
  • High: <4 hours
  • Medium: <24 hours
  • Low: <72 hours
Improvement Drivers: Automation (SOAR), playbook optimization, analyst training
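
Both formulas are straightforward timestamp arithmetic once incidents record consistent fields. A minimal sketch, assuming each closed incident stores attack start, detection, and containment times (field names are illustrative):

```python
from datetime import datetime

def _hours(start, end):
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 3600

def mttd(incidents):
    """MTTD = mean(detection - attack start), in hours."""
    return sum(_hours(i["attack_start"], i["detected"]) for i in incidents) / len(incidents)

def mttr(incidents):
    """MTTR = mean(containment - detection), in hours."""
    return sum(_hours(i["detected"], i["contained"]) for i in incidents) / len(incidents)

incidents = [
    {"attack_start": "2024-11-02T03:10:00", "detected": "2024-11-02T09:40:00", "contained": "2024-11-02T11:10:00"},
    {"attack_start": "2024-11-15T14:00:00", "detected": "2024-11-16T08:00:00", "contained": "2024-11-16T10:30:00"},
]
print(f"MTTD: {mttd(incidents):.1f} h, MTTR: {mttr(incidents):.1f} h")  # MTTD: 12.2 h, MTTR: 2.0 h
```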

Mean Time to Triage (MTTT)

Average time between alert generation and initial triage (TP/FP classification).

Calculation: Sum(Triage Timestamp - Alert Timestamp) / Number of Alerts
Targets (by severity):
  • Critical: <15 minutes
  • High: <30 minutes
  • Medium: <4 hours
Improvement Drivers: Alert enrichment, context automation, analyst staffing

Alert Volume & Closure Rate

Daily/weekly alert volume and percentage of alerts closed within SLA.

Calculation: (Alerts Closed Within SLA / Total Alerts) × 100
Target: >95% closure rate, stable or decreasing volume over time
Warning Signs: Increasing alert backlog, declining closure rate, excessive overtime

Detection Quality Metrics

False Positive Rate

Percentage of alerts that are not genuine threats.

Calculation: (False Positive Alerts / Total Alerts) × 100
Target: <30% overall, <10% for critical severity detections
Impact: High FP rate causes analyst burnout, alert fatigue, and missed real threats
Improvement Drivers: Detection tuning, machine learning, contextual enrichment

True Positive Rate (Detection Sensitivity)

Percentage of actual attacks that are detected.

Calculation: (Detected Attacks / Total Attacks) × 100
Target: >90% for known attack techniques, >70% for novel techniques
Measurement Method: Purple team exercises, red team engagements, attack simulations

Detection Coverage (MITRE ATT&CK)

Percentage of applicable ATT&CK techniques with documented detection logic.

Calculation: (Techniques with Detection / Applicable Techniques) × 100
Target: >70% coverage for applicable techniques
Visualization: ATT&CK Navigator heatmap showing coverage gaps
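
Coverage tracking is set arithmetic over technique IDs. A minimal sketch, assuming you maintain the list of applicable techniques from threat modeling and tag each deployed detection with the technique it covers (the example IDs are illustrative):

```python
# Techniques judged applicable to the environment (threat modeling / threat intelligence).
applicable = {"T1003", "T1021.001", "T1059.001", "T1486", "T1558.003", "T1566.001"}

# Technique tags extracted from deployed, validated detection rules.
detected = {"T1003", "T1059.001", "T1558.003", "T1566.001"}

coverage_pct = 100 * len(detected & applicable) / len(applicable)
gaps = sorted(applicable - detected)

print(f"ATT&CK coverage: {coverage_pct:.0f}%")   # 67%
print("Gaps to prioritize:", gaps)               # feed into the Navigator heatmap and hunting backlog
```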

Alert Fidelity Score

Percentage of alerts that lead to actionable investigations.

Calculation: (True Positives + Suspicious Requiring Investigation) / Total Alerts × 100
Target: >50% for mature detection programs
Quality Indicator: High fidelity = analysts spend time on real threats, not noise

Analyst Performance Metrics

Alerts Handled per Analyst per Shift

Average number of alerts triaged or investigated per analyst per 8-hour shift.

Calculation: Total Alerts Handled / (Number of Analysts × Number of Shifts)
Typical Range:
  • Tier 1 (Triage): 20-30 alerts/shift
  • Tier 2 (Investigation): 5-10 incidents/shift
  • Tier 3 (Hunting): 1-2 campaigns/week
Use Cases: Capacity planning, workload balancing, identifying burnout risk

Escalation Quality

Percentage of Tier 1 escalations accepted by Tier 2 (accurate vs. returned for more triage).

Calculation: (Accepted Escalations / Total Escalations) × 100
Target: >90% acceptance rate
Quality Indicator: Low score indicates insufficient Tier 1 analysis or unclear escalation criteria

Documentation Quality Score

Percentage of incident tickets with complete documentation (findings, actions, evidence).

Calculation: (Tickets with Complete Docs / Total Tickets) × 100
Target: >95% for closed incidents
Assessment Method: Random sampling with peer review checklist

Business Impact Metrics

Prevented Loss (Cost Avoidance)

Estimated financial loss prevented through early detection and response.

Calculation: Sum of (Incident Severity Score × Average Breach Cost for Severity)
Example Framework:
  • Critical incident prevented: $500K - $2M
  • High incident prevented: $100K - $500K
  • Medium incident prevented: $25K - $100K
Use Case: Demonstrate SOC ROI to leadership, justify budget requests

Incident Impact Scope

Average number of systems/users affected per incident.

Calculation: Sum(Affected Assets) / Number of Incidents
Target: Decreasing trend over time (faster containment)
Quality Indicator: Effective lateral movement prevention reduces blast radius

Compliance Adherence

Percentage of regulatory requirements met by SOC operations.

Examples:
  • PCI-DSS 10.6.1: Review logs daily (100% adherence = no missed days)
  • HIPAA: Report breaches within 60 days (100% adherence = all reported on time)
Target: 100% adherence to critical regulatory requirements

Shift Operations and Coverage Models

Effective 24/7 SOC operations require careful shift scheduling, robust handoff procedures, and proactive fatigue management. The goal is consistent coverage with minimal analyst burnout and maximum operational continuity.

24/7 Coverage Models

Model 1: 8-Hour Shifts (Traditional)

Schedule: Three 8-hour shifts: Day (7am-3pm), Swing (3pm-11pm), Night (11pm-7am)
Staffing: 6-8 FTEs per shift to cover vacation/sick leave (18-24 total analysts)
Pros: Standard work hours, easier scheduling, shorter daily commitment
Cons: Three handoffs daily (higher risk of information loss), weekend coverage requires rotation
Best For: Organizations with sufficient headcount and predictable alert volume

Model 2: 12-Hour Shifts (Panama/DuPont)

Schedule: Two 12-hour shifts: Day (7am-7pm), Night (7pm-7am). 2-2-3 rotation (2 days on, 2 off, 3 on, 2 off, 2 on, 3 off)
Staffing: 4-5 FTEs per shift (8-10 total analysts)
Pros: Fewer handoffs (only 2/day), every other weekend off, compressed work weeks
Cons: Longer shifts can cause fatigue, less overlap between shifts for knowledge transfer
Best For: Smaller teams needing 24/7 coverage with limited headcount

Model 3: Follow-the-Sun (Hybrid with MSSP)

Schedule: Internal team covers business hours (7am-7pm), MSSP covers after-hours/weekends (7pm-7am)
Staffing: 3-4 internal analysts (day shift only) + MSSP contract
Pros: No night/weekend burden on internal team, lower total headcount needed
Cons: Dependency on vendor, potential knowledge gaps between teams, handoff complexity
Best For: Mid-sized organizations building SOC maturity incrementally
Staffing Calculation: For true 24/7 coverage with no gaps, calculate: (Hours per week × Positions per shift) / 40 hours = FTEs needed. Example: 168 hours/week × 2 analysts/shift / 40 = 8.4 FTEs minimum. Add 20-30% buffer for PTO, sick leave, training.
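
The staffing formula can be encoded so scheduling assumptions are easy to vary. A minimal sketch:

```python
def soc_ftes(positions_per_shift=2, coverage_hours_per_week=168, work_week_hours=40, buffer=0.25):
    """FTEs = (coverage hours x seats per shift) / work week, plus a PTO/training/turnover buffer."""
    base = coverage_hours_per_week * positions_per_shift / work_week_hours
    return base, base * (1 + buffer)

base, buffered = soc_ftes()
print(f"Minimum FTEs: {base:.1f}, with 25% buffer: {buffered:.1f}")  # 8.4 and 10.5
```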

Shift Handoff Procedures

Effective Handoff Components

  1. Overlap Period: 15-30 minutes before shift change for real-time communication
  2. Written Handoff Log: Standardized template capturing active incidents, pending tasks, environment changes
  3. Incident Status Review: Walk through all open incidents with current state and next actions
  4. Environmental Context: Planned maintenance, known system issues, elevated threat posture
  5. Key Metrics Review: Alert volume, critical alerts pending, SLA compliance
  6. Acknowledgment: Incoming shift lead confirms understanding and accepts responsibility

Handoff Log Template

Shift: [Outgoing Shift] → [Incoming Shift]
Date/Time: [YYYY-MM-DD HH:MM]
Outgoing Lead: [Name]
Incoming Lead: [Name]
Active Incidents (High/Critical):
• INC-12345: Ransomware on FILESERV01 - Contained, awaiting forensic imaging
• INC-12346: Account compromise (jdoe@) - Password reset pending user callback
Pending Tasks:
• Follow up with IT Ops on AD replication issue (affecting auth logs)
• Monitor for IOC spread from INC-12345 (see threat intel feed)
Environmental Changes:
• Firewall maintenance window 2am-4am (expect connectivity alerts)
• New EDR deployment to marketing department (baseline noise expected)
Shift Metrics:
• Alert Volume: 347 (avg: 320)
• Critical Alerts Pending: 2 (within SLA)
• False Positive Rate: 28%
Notes:
Elevated phishing campaign targeting finance department - monitor email gateway closely
Acknowledgment: [Incoming Lead Signature] [Timestamp]

Fatigue Management & Wellness

Fatigue Risk Factors in SOC Operations

  • Night shift work: Disrupts circadian rhythm, increases error rate by 15-20%
  • High alert volume: Decision fatigue after 3-4 hours of continuous triage
  • Alert fatigue: Desensitization to warnings due to high false positive rates
  • Major incident stress: Extended response efforts (ransomware, data breach) without recovery time
  • Weekend/holiday work: Disrupts work-life balance, increases burnout risk

Fatigue Mitigation Strategies

Operational Controls
  • Mandatory breaks every 90-120 minutes during shifts
  • Rotate high-stress tasks (incident response) with lower-intensity work (documentation)
  • Limit consecutive night shifts to 3-4 maximum before rotation
  • Enforce maximum overtime limits (no more than 10 hours/week sustained)
  • Provide "relief analyst" to cover breaks and prevent uninterrupted 8-12 hour stints
Environmental Optimization
  • Blue-light filtering on displays to reduce eye strain and sleep disruption
  • Adjustable lighting to match shift (bright for day, dimmer for night)
  • Ergonomic workstations (standing desks, dual monitors, quality chairs)
  • Quiet areas for focused work away from SOC floor noise
  • Healthy snacks and beverages available (avoid excessive caffeine dependency)
Wellness Programs
  • Mental health support (EAP access, stress management training)
  • Sleep hygiene education for night shift workers
  • Regular check-ins with managers to identify burnout warning signs
  • Flexible PTO policies with mandatory minimum vacation (e.g., 10 consecutive days annually)
  • Post-major-incident recovery time (comp days after ransomware response)
⚠️ Burnout Warning Signs: Watch for declining alert closure rates, increased errors, withdrawal from team communication, excessive sick leave, or expressed frustration with "pointless" alerts. Address immediately with workload adjustment, time off, or role rotation.

Continuous Improvement Programs

SOC maturity is not a destination but a continuous journey. Effective SOCs implement structured programs for purple teaming, threat hunting, retrospectives, and knowledge sharing to systematically identify and close security gaps.

Purple Team Exercises


Purple Team Exercise Phases

Phase 1: Scope Definition
  • Select target techniques (MITRE ATT&CK) to validate (e.g., T1003 Credential Dumping, T1021 Remote Services)
  • Define test environment (production, staging, or isolated lab)
  • Agree on success criteria (detection fires within X minutes, playbook executed correctly)
  • Establish communication protocols (Slack channel, real-time collaboration)
Phase 2: Execution
  • Red team executes technique using realistic tools (Cobalt Strike, Metasploit, Atomic Red Team)
  • Blue team monitors for alerts and executes response procedures
  • Both teams document timestamps: attack start, detection, triage, containment
  • Red team provides IOCs and attack artifacts for blue team analysis
Phase 3: Debrief & Analysis
  • Review detection performance: What fired? What didn't? Why?
  • Evaluate response effectiveness: Correct playbook used? Timely escalation?
  • Identify gaps: Blind spots in visibility, missing detections, inadequate response procedures
  • Document lessons learned and improvement actions
Phase 4: Remediation
  • Develop new detections for missed techniques
  • Tune existing detections to reduce false negatives
  • Update playbooks with new procedures or context
  • Re-test after remediation to validate improvements

Purple Team Cadence & Scope

  • Quarterly exercises: Comprehensive campaign simulating full attack lifecycle (initial access → exfiltration)
  • Monthly technique tests: Focused validation of 2-3 specific ATT&CK techniques
  • Ad-hoc tests: After deploying new detections or investigating novel threats
  • Coverage goal: Test all high-priority ATT&CK techniques (based on threat intel) annually
Purple Team ROI: Organizations conducting regular purple team exercises report 40-60% improvement in detection coverage, 30% reduction in MTTD, and significant analyst skill development. The collaborative approach builds trust between red/blue teams and creates actionable improvements vs. traditional adversarial testing.

Threat Hunting Programs


Hunting Maturity Levels

Level 0: Ad-Hoc (Initial)

Reactive hunts triggered by specific events (major breach in industry, new threat intelligence). Unstructured, no formal program.

Level 1: Hypothesis-Driven (Structured)

Scheduled hunts based on threat intelligence or risk assessments. Hunters develop hypotheses and search for supporting evidence. Example: "Are we vulnerable to PrintNightmare exploitation?"

Level 2: Data-Driven (Analytics)

Automated data collection, baseline profiling, and statistical anomaly detection guide hunts. UEBA platforms flag outliers for investigation.

Level 3: Self-Improving (Automated)

Machine learning models continuously hunt for anomalies, hunters validate findings and feed back into detection engineering. Closed-loop improvement cycle.

Hunting Methodology (Intelligence-Driven)

Step 1: Hypothesis Formation

Develop testable hypothesis based on threat intelligence or risk

Example: "Threat actor APT29 uses WMI for lateral movement. Are there unusual WMI process executions in our environment that could indicate compromise?"

Step 2: Data Collection

Gather relevant telemetry from SIEM, EDR, network logs

Query SIEM for all WMI-related events (Event ID 5857, 5858, 5859) over past 90 days. Filter for remote executions and unusual parent processes.
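
When the SIEM results are exported for offline analysis, the first-pass filtering and baselining can be scripted. A minimal sketch, assuming the export is JSON lines with illustrative event_id, user, and parent_process fields rather than any specific product schema:

```python
import json
from collections import Counter
from pathlib import Path

WMI_EVENT_IDS = {5857, 5858, 5859}                   # Microsoft-Windows-WMI-Activity/Operational
EXPECTED_PARENTS = {"svchost.exe", "wmiprvse.exe"}   # illustrative baseline; tune per environment

def wmi_outliers(export_path="wmi_events.jsonl"):
    """Count WMI activity per user and surface events with unexpected parent processes."""
    by_user, suspicious = Counter(), []
    for line in Path(export_path).read_text().splitlines():
        event = json.loads(line)
        if event.get("event_id") not in WMI_EVENT_IDS:
            continue
        by_user[event.get("user", "unknown")] += 1
        parent = event.get("parent_process", "").lower()
        if parent and parent not in EXPECTED_PARENTS:
            suspicious.append(event)
    return by_user, suspicious

if __name__ == "__main__":
    counts, suspicious = wmi_outliers()
    print("WMI activity by user:", counts.most_common(10))
    print("Unexpected parent processes:", len(suspicious))
```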

Step 3: Analysis & Pattern Detection

Search for anomalies, outliers, or malicious patterns

Analyze command-line parameters, identify non-admin users executing WMI, compare against baseline behavior, correlate with authentication logs.

Step 4: Investigation

Deep-dive into suspicious findings to confirm or refute hypothesis

Pivot to EDR for process tree, network connections, file modifications. Determine if behavior is malicious, benign automation, or misconfiguration.

Step 5: Documentation & Improvement

Document findings and translate to detections or mitigations

If threat confirmed: Create incident, develop detection rule, share IOCs. If benign: Document baseline, consider tuning to reduce future noise. Update threat hunting playbook.

Hunting Success Metrics

  • Hunt Yield: Number of confirmed compromises discovered per quarter (target: 1+ per quarter for mature programs)
  • Detection Coverage Improvement: New detection rules created from hunting findings (target: 5+ per quarter)
  • Hunting Efficiency: Hours invested per hunt vs. value of findings (ROI calculation)
  • Hypothesis Quality: % of hunts that yield actionable findings (target: >60%)

Retrospectives & Knowledge Sharing

Post-Incident Retrospectives

Conduct blameless retrospectives within 7 days of major incidents to extract systemic improvements:

Key Questions
  • What went well during response?
  • What could have been done faster or better?
  • Were playbooks accurate? What was missing or confusing?
  • Did we have the right tools and access?
  • How effective was communication with stakeholders?
  • What would prevent this type of incident in the future?
Action Items Framework

Categorize improvements by impact and effort (Eisenhower matrix):

  • Quick Wins: Low effort, high impact (e.g., add missing SIEM field to alert template)
  • Major Projects: High effort, high impact (e.g., deploy EDR to all servers)
  • Fill-Ins: Low effort, low impact (e.g., update documentation typos)
  • Thankless Tasks: High effort, low impact (deprioritize or eliminate)

Knowledge Sharing Mechanisms

Internal Wiki/Knowledge Base

Centralized repository for playbooks, runbooks, detection logic, lessons learned, environment documentation. Searchable, version-controlled, regularly updated.

Weekly Team Meetings

Review significant incidents, new threats, detection improvements, and upcoming changes. Rotate presentation duties to develop communication skills.

Lunch & Learn Sessions

Monthly 30-minute training on specific topics: new tool features, emerging attack techniques, forensic analysis methods, compliance requirements.

Mentorship Program

Pair junior analysts with senior analysts for structured skill development. Quarterly goal setting, monthly check-ins, shadowing opportunities.

External Community Engagement

Encourage conference attendance (BSides, SANS), participation in threat sharing communities (FS-ISAC, H-ISAC), and contribution to open-source projects (Sigma rules, MISP).

SOC Maturity Assessment

SOC maturity models provide a structured framework for assessing current capabilities and charting improvement roadmaps. Regular maturity assessments help prioritize investments, demonstrate progress to leadership, and benchmark against industry standards.

SOC-CMM (Capability Maturity Model)

Level 1: Initial (Ad-Hoc)

Characteristics:
  • Reactive security operations with no formal processes
  • Limited or no centralized logging/monitoring
  • Incident response is chaotic and inconsistent
  • Heavy reliance on individual heroics and tribal knowledge
  • No defined roles or responsibilities for security operations
Typical MTTD/MTTR: Months to detect, weeks to respond
Key Improvement: Establish basic logging, deploy initial SIEM/EDR, create incident response plan

Level 2: Managed (Repeatable)

Characteristics:
  • Basic monitoring infrastructure in place (SIEM, EDR)
  • Documented incident response procedures
  • Defined SOC team with roles (Tier 1/2 analysts)
  • Some detection use cases deployed (10-20 rules)
  • Inconsistent documentation and knowledge transfer
Typical MTTD/MTTR: Weeks to detect, days to respond
Key Improvement: Standardize playbooks, increase detection coverage, implement metrics

Level 3: Defined (Standardized)

Characteristics:
  • Standardized processes across detection, investigation, response
  • Comprehensive detection library (50+ use cases) mapped to MITRE ATT&CK
  • Full SOC staffing (Tier 1/2/3) with defined career paths
  • Documented playbooks for common scenarios
  • Regular metrics reporting and performance tracking
  • Initial automation with SOAR for simple tasks
Typical MTTD/MTTR: Days to detect, hours to respond
Key Improvement: Implement continuous improvement programs (purple team, threat hunting)

Level 4: Quantitatively Managed (Measured)

Characteristics:
  • Data-driven decision making with comprehensive KPI tracking
  • High detection fidelity (>70% true positive rate)
  • Extensive automation reducing analyst toil by 40%+
  • Regular purple team exercises validating detection coverage
  • Mature threat hunting program with quarterly campaigns
  • Integration with threat intelligence for proactive defense
Typical MTTD/MTTR: Hours to detect, minutes to respond (for automated scenarios)
Key Improvement: Predictive analytics, behavioral modeling, advanced automation

Level 5: Optimizing (Continuous Improvement)

Characteristics:
  • Industry-leading capabilities with innovation focus
  • AI/ML-driven detection and response automation
  • Proactive threat hunting discovers advanced threats before impact
  • Continuous validation and optimization of all processes
  • Cross-industry threat sharing and collaboration
  • SOC serves as center of excellence for organization
Typical MTTD/MTTR: Minutes to detect (real-time), automated containment for known threats
Key Focus: Maintain excellence, share knowledge with community, innovate new capabilities

NIST CSF SOC Alignment

Identify (Asset & Risk Management)

  • Maintain comprehensive asset inventory with criticality ratings
  • Map data flows and classify sensitive information
  • Conduct threat modeling to inform detection priorities

Protect (Preventive Controls)

  • Deploy protective technologies (EDR, firewall, DLP)
  • Implement access controls and least privilege principles
  • Conduct security awareness training to reduce user risk

Detect (Monitoring & Analysis)

  • Continuous monitoring across endpoints, network, cloud
  • Detection engineering with MITRE ATT&CK coverage
  • Behavioral analytics and anomaly detection for zero-days

Respond (Incident Management)

  • Documented incident response procedures and playbooks
  • Rapid containment and eradication capabilities
  • Communication protocols for internal/external stakeholders

Recover (Business Continuity)

  • System recovery and restoration procedures
  • Post-incident lessons learned and improvement tracking
  • Business continuity planning integration