Domain 4

IS Operations & Business Resilience

Keep the lights on — and know what to do when they go out. The largest CISA domain, covering everything from daily operations to disaster recovery.

Domain 4: IS Operations & Business Resilience 23%
👩‍💼

Alex's Week 4 — The Payday Outage

Alex survived three weeks of planning, governance reviews, and a runaway development project. This week was supposed to be quieter — an audit of IT operations and business continuity. Then at 9:47am on a Friday, the payroll system went down. Alex closed her laptop and walked to the server room. This was going to be the most educational Friday of her career.

D4-01 — Command Center

Domain 4 Overview

A bustling vintage command center with operators managing schedules, help desks, and tape reels

Part A: Operations (4.1–4.6)

  • 4.1 IT Operations
  • 4.2 Hardware & Infrastructure
  • 4.3 Network Management
  • 4.4 IT Service Management
  • 4.5 Database Management
  • 4.6 Performance Monitoring

Part B: Resilience (4.7–4.10)

  • 4.7 Business Continuity
  • 4.8 Disaster Recovery
  • 4.9 Backup & Recovery
  • 4.10 Incident Management

Part C: Emerging (4.11–4.12)

  • 4.11 End-User Computing
  • 4.12 Cloud Computing

Domain Weight: 23% (~34 questions) — Largest CISA Domain

This is the single biggest domain on the exam. Master BCP/DRP concepts, understand recovery objectives, and know your backup strategies inside-out.

Key Exam Tip

Domain 4 is the largest at 23% — expect roughly 34 questions. BCP/DRP and incident management are the most heavily tested areas. Know RTO vs. RPO, hot/warm/cold sites, and the incident response lifecycle by heart.

Part A Section 4.1

IT Operations

D4-01 — Operations Command Center

IT Operations Management

👩‍💼

The payroll batch job ran at 2am. Output was generated. But the file never reached the portal — and no alert fired. Alex checks the monitoring configuration: there is none. No completion check. No failure notification. She writes her first finding at 9:52am: “Basic operations monitoring: absent.”
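The control Alex found missing is simple to express: after each batch job, check that its output actually arrived, and alert if it did not. A minimal sketch in Python (the job name, times, and 30-minute grace window are invented for illustration; a real shop would wire this into its scheduler or monitoring tool):

```python
from datetime import datetime, timedelta

def check_job_output(job_name, scheduled_end, output_arrived_at,
                     grace_minutes=30):
    """Return an alert string if a batch job's output is missing or late,
    or None if it arrived within the grace window."""
    deadline = scheduled_end + timedelta(minutes=grace_minutes)
    if output_arrived_at is None:
        return f"ALERT: {job_name} produced no output by {deadline:%H:%M}"
    if output_arrived_at > deadline:
        return f"ALERT: {job_name} output arrived late ({output_arrived_at:%H:%M})"
    return None  # completed normally, nothing to report

# The payroll scenario: the job ran at 2am, but the file never arrived.
alert = check_job_output("payroll_batch",
                         scheduled_end=datetime(2024, 5, 3, 2, 0),
                         output_arrived_at=None)
```

Either missing check in the sketch — completion or timeliness — corresponds to a finding an auditor would raise.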

📋

Job Scheduling

  • Automated batch job execution
  • Dependency management
  • SLA-driven scheduling
  • Auditor checks: authorization, logs, error handling
📞

Help Desk / Service Desk

  • Single point of contact (SPOC)
  • Ticket tracking & escalation
  • First-call resolution rate
  • Knowledge base management
🖨️

Operations Controls

  • Print/output management
  • Tape/media library controls
  • Lights-out operations
  • Segregation of duties in IT ops

Key Auditor Concern: Segregation of Duties (SoD)

Computer operators should not have access to modify programs, data, or system documentation. The auditor should verify that operations staff cannot make unauthorized changes to production.

Key Exam Tip

The most critical control in IT operations is segregation of duties. Operators should never have access to change programs or data. The exam loves questions about what operators should and should not be able to do.

📰 Real World

British Airways Bank Holiday 2017 outage — engineer accidentally powered down a data centre UPS. No automated monitoring detected it. 75,000 passengers stranded.

✏️ TEST YOURSELF
Q1. An IS auditor discovers that a batch job failed overnight but no alert was generated. What is the MOST significant risk?
A. Critical outputs may not be delivered, and failures go undetected indefinitely
B. The batch job will need to be rerun manually
C. Operators will have less work to do during the shift
D. The help desk will receive more calls than normal
Reveal Answer

✓ Correct: A

Without automated monitoring and alerting, job failures can go undetected for extended periods, leading to missing outputs and cascading impacts. B describes a consequence but not the primary risk. C and D are minor operational effects.

Q2. Which control BEST ensures that computer operators cannot make unauthorized changes to production programs?
A. Regular review of operator activity logs
B. Segregation of duties between operations and programming
C. Requiring operators to sign a code of conduct
D. Installing antivirus on operator workstations
Reveal Answer

✓ Correct: B

Segregation of duties is the primary preventive control. Operators should not have access to source code or production program libraries. A is detective, not preventive. C is administrative but not enforceable. D is unrelated.

Q3. An organization runs lights-out (unattended) operations overnight. What is the FIRST control an IS auditor should verify?
A. That physical access to the data centre is restricted
B. That automated job scheduling and monitoring tools are properly configured
C. That operators are available by phone
D. That backup tapes are rotated nightly
Reveal Answer

✓ Correct: B

In lights-out operations, automated monitoring is the primary control because no humans are present. Without it, failures go undetected. A is important but secondary. C partially defeats the purpose of lights-out. D is a separate concern.

The batch job ran fine — the infrastructure underneath it is the next question.

Part A Section 4.2

Hardware & Infrastructure

D4-02 — Data Center City

Hardware & Infrastructure Components

👩‍💼

Alex pulls up the infrastructure documentation. Payroll runs on a single physical server. No clustering. No failover. The infrastructure diagram is dated 2019. “Some things may have changed,” the infrastructure manager admits. Alex notes finding number two: single point of failure, no redundancy, stale documentation.

A futuristic data center city with server towers, SAN storage, NAS filing cabinet, and IaaS PaaS SaaS clouds
🗄️

SAN (Storage Area Network)

  • Dedicated high-speed storage network
  • Block-level access
  • Used for databases, email servers
  • Expensive but high-performance
📁

NAS (Network Attached Storage)

  • File-level access over LAN
  • Shared storage for file servers
  • Simpler and cheaper than SAN
  • Uses TCP/IP protocols
Server Types & Configurations

Mainframe

Centralized, high-reliability processing for critical batch jobs and transaction processing.

Client-Server

Distributed processing: thin clients (server-dependent) vs. thick clients (local processing).

Virtualization

Multiple virtual servers on one physical host. Key risk: hypervisor compromise.

Key Exam Tip

Know SAN vs. NAS: SAN = block-level, dedicated network, high performance. NAS = file-level, over existing LAN, simpler. Virtualization’s biggest risk is a compromised hypervisor, which could affect all hosted VMs.

📰 Real World

2021 Facebook 6-hour outage — BGP configuration change took down DNS servers. The team couldn’t remotely fix it because access tools also ran on affected infrastructure.

✏️ TEST YOURSELF
Q1. An IS auditor finds that a critical payroll application runs on a single physical server with no failover. What should be the auditor's PRIMARY recommendation?
A. Implement server clustering or redundancy for critical systems
B. Increase the server's processing power
C. Schedule more frequent backups
D. Document the current configuration
Reveal Answer

✓ Correct: A

A single point of failure for a critical system is an unacceptable risk. Clustering provides redundancy. B improves performance but not availability. C helps recovery but not prevention. D is useful but doesn't address the risk.

Q2. What is the GREATEST risk of server virtualization from an audit perspective?
A. Virtual machines use more storage than physical servers
B. A compromised hypervisor could affect all hosted virtual machines
C. Virtualization software requires frequent updates
D. Virtual servers cannot be backed up easily
Reveal Answer

✓ Correct: B

The hypervisor is the single point of control for all VMs. If compromised, every VM on that host is at risk. A is incorrect. C is a maintenance concern, not the greatest risk. D is false — virtual servers can be backed up.

Q3. Which storage technology provides block-level access over a dedicated high-speed network?
A. NAS (Network Attached Storage)
B. SAN (Storage Area Network)
C. DAS (Direct Attached Storage)
D. Cloud object storage
Reveal Answer

✓ Correct: B

SAN provides block-level access over a dedicated network, ideal for high-performance databases. NAS provides file-level access over LAN. DAS is directly connected to one server. Cloud object storage is accessed via APIs.

The server is there. But something between it and the users is broken — the network.

Part A Section 4.3

Network Management

D4-03 — Seven-Story Building

The OSI Model — 7 Layers

👩‍💼

A firewall rule was changed at 2am — blocking the payroll server’s outbound connection to the portal. Nobody reviewed the change. Nobody approved it. The logs showed the block clearly, but nobody was watching. Alex writes finding number three: uncontrolled change to a network device with no review process.

A seven-story building cutaway with each floor representing an OSI layer, workers passing data packages between floors
7 Application — HTTP, FTP, SMTP, DNS — what users interact with
6 Presentation — Encryption, compression, data formatting (SSL/TLS)
5 Session — Manages connections/sessions between applications
4 Transport — TCP/UDP, flow control, error recovery, segmentation
3 Network — IP addressing, routing (routers operate here)
2 Data Link — MAC addresses, frames (switches operate here)
1 Physical — Cables, hubs, electrical signals, bits on the wire
Mnemonic — Remember from Layer 7 down

Use this phrase to remember all 7 layers top-to-bottom:

“All People Seem To Need Data Processing”

Or bottom-up: “Please Do Not Throw Sausage Pizza Away” (Physical → Application)

Key Network Devices

  • Hub — Layer 1, broadcasts to all
  • Switch — Layer 2, uses MAC addresses
  • Router — Layer 3, uses IP addresses
  • Firewall — Layers 3-7, filters traffic
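For self-quizzing, the device-to-layer mapping above fits in a small lookup table. This is exam shorthand only (modern multilayer switches and next-generation firewalls blur these boundaries):

```python
# Layer at which each device primarily operates, per CISA exam shorthand.
DEVICE_LAYER = {
    "hub": 1,         # Physical: repeats bits to every port
    "switch": 2,      # Data Link: forwards frames by MAC address
    "router": 3,      # Network: routes packets by IP address
    "firewall": "3-7" # filters from Network up through Application
}

OSI_LAYERS = {
    7: "Application", 6: "Presentation", 5: "Session", 4: "Transport",
    3: "Network", 2: "Data Link", 1: "Physical",
}

print(f"A router operates at layer {DEVICE_LAYER['router']} "
      f"({OSI_LAYERS[DEVICE_LAYER['router']]})")
```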

Network Types

  • LAN — Local, same building/campus
  • WAN — Wide, connects distant LANs
  • VPN — Encrypted tunnel over public network
  • VLAN — Logical segmentation of a LAN
Key Exam Tip

Know which devices operate at which OSI layers: hubs at Layer 1, switches at Layer 2, routers at Layer 3, firewalls at Layers 3-7. The exam frequently tests which layer a specific technology or attack targets.

📰 Real World

Knight Capital 2012 — a faulty software deployment left obsolete code active on 1 of 8 servers, causing a $440M loss in 45 minutes. Unreviewed production changes are among the highest-risk IT operations.

✏️ TEST YOURSELF
Q1. A firewall rule change at 2am blocks a critical server. No change approval exists. What control failure does this BEST illustrate?
A. Lack of intrusion detection
B. Absence of change management for network devices
C. Inadequate firewall technology
D. Poor network segmentation
Reveal Answer

✓ Correct: B

The core issue is an unapproved, unreviewed change. Change management requires approval, testing, and rollback plans before any production change. A is about detection, not prevention. C and D are unrelated to the approval process.

Q2. An IS auditor reviews network documentation and finds that the organisation uses a flat network with all servers, workstations, and IoT devices on the same subnet. What is the GREATEST risk?
A. Network performance degradation due to broadcast traffic
B. Lateral movement — a compromised device can directly reach all other systems
C. Difficulty in assigning static IP addresses
D. Inability to implement wireless networking
Reveal Answer

✓ Correct: B

Without network segmentation, an attacker who compromises any device (e.g., an IoT sensor) can reach critical servers directly. This is the greatest security risk of a flat network. Performance issues (A) are operational, not the greatest risk. IP addressing (C) and wireless (D) are unrelated to the flat network topology.

Q3. An IS auditor reviews firewall logs and finds that blocked traffic is logged but never reviewed. What is the PRIMARY risk?
A. The firewall will run out of storage
B. Security incidents may go undetected despite being logged
C. The firewall performance will degrade
D. Compliance reports will be incomplete
Reveal Answer

✓ Correct: B

Logging without review provides no security value. The logs captured the blocked payroll connection, but nobody was watching. A and C are operational concerns. D is secondary to the detection gap.

The firewall blocked the connection. But how is the team even managing this incident?

Part A Section 4.4

IT Service Management (ITIL)

D4-04 — ITIL Workshop

ITIL Service Management Processes

👩‍💼

One hour into the outage. Alex asks for the incident record. There isn’t one. The team is coordinating via WhatsApp. No ticket number. No severity classification. No escalation timeline. No formal ITIL process whatsoever. “We just fix things,” says the team lead. Alex adds finding number four.

A hospital emergency room with four clearly labeled stations: Incident Desk, Problem Clinic, Change Committee, Configuration Pharmacy
🚨

Incident Management

“Restore service ASAP”

  • Focus: restore normal operations quickly
  • Does NOT find root cause
  • Workarounds are acceptable
  • Measured by: time to restore
🔍

Problem Management

“Find and fix the root cause”

  • Focus: identify underlying cause
  • Prevents recurrence
  • Creates known error database (KEDB)
  • Proactive & reactive modes

Change Management

“Control all changes”

  • Submit RFC (Request for Change)
  • Impact analysis & approval
  • CAB (Change Advisory Board)
  • Rollback plan required
📦

Configuration Management

“Know what you have”

  • CMDB — Configuration Management DB
  • Tracks all IT assets (CIs)
  • Baseline configurations
  • Supports all other ITIL processes
Mnemonic — Incident vs. Problem

Think of it like a hospital: Incident = Emergency Room (stop the bleeding now), Problem = Diagnostic Lab (find out why the patient keeps getting sick). Emergency first, diagnosis second.

Key Exam Tip

The #1 tested ITIL concept: Incident management restores service (workaround OK), problem management finds root cause (permanent fix). Change management requires a rollback plan. Every change must go through the CAB or emergency CAB (ECAB).

📰 Real World

2003 Northeast blackout — utilities without formal incident management took days to restore power. Those with structured escalation had significantly faster recovery.

✏️ TEST YOURSELF
Q1. An IS auditor finds the IT team manages incidents via a WhatsApp group with no formal logging. What is the PRIMARY risk?
A. Slow resolution times
B. Loss of audit trail and inability to analyse patterns
C. Poor team communication
D. Escalation delays
Reveal Answer

✓ Correct: B

Without formal incident logging, there is no audit trail, no data for trend analysis, and no evidence of compliance. A, C, and D are possible effects but the fundamental risk is the loss of documentation and pattern analysis capability.

Q2. After a major outage is resolved, what ITIL process should be initiated to prevent recurrence?
A. Incident management
B. Change management
C. Problem management
D. Configuration management
Reveal Answer

✓ Correct: C

Problem management investigates root causes and prevents recurrence. Incident management only restores service. Change management handles modifications. Configuration management tracks assets.

Q3. A change to a firewall rule caused a payroll outage. Which control would MOST likely have prevented this?
A. More frequent backups
B. Change Advisory Board (CAB) review before implementation
C. Stronger firewall hardware
D. Additional firewall rules
Reveal Answer

✓ Correct: B

CAB review ensures changes are assessed for impact, approved, and have rollback plans before implementation. A is about recovery, not prevention. C and D don't address the process failure.

The process is broken. But what about the data sitting inside the payroll database?

Part A Section 4.5

Database Management

D4-05 — Data Cathedral

Database Management Systems

👩‍💼

The payroll data is intact — the database wasn’t the problem. But querying it requires restarting the DBMS, which takes 40 minutes. Alex asks when the last database health check was performed. “We’d know if something was wrong,” says the DBA. Alex writes: reactive posture, no proactive monitoring or integrity checks.

A cathedral-like library with monks normalizing records, concurrency control gates, and a glowing DBMS engine
Normalization Levels

1NF — Eliminate repeating groups: atomic values only

2NF — Remove partial dependencies on composite keys

3NF — Remove transitive dependencies: every non-key attribute depends only on the primary key

Concurrency Controls

  • Locking — Prevents simultaneous updates
  • Deadlock — Two processes waiting for each other
  • Timestamping — Orders transactions by time
  • Optimistic — Allow all, check at commit
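Optimistic concurrency is the least intuitive of these, so here is a toy sketch of the idea (the `Record` class and salary values are invented for illustration, not a real DBMS mechanism): every writer notes the version it read, and the commit is rejected if another writer got there first.

```python
class VersionConflict(Exception):
    """Raised when a writer's snapshot is stale at commit time."""

class Record:
    """Toy record using optimistic concurrency control: no locks are
    taken; conflicts are detected only at commit."""
    def __init__(self, value):
        self.value = value
        self.version = 0

    def commit(self, new_value, read_version):
        if read_version != self.version:   # someone else committed first
            raise VersionConflict("stale read; re-read and retry")
        self.value = new_value
        self.version += 1

salary = Record(50_000)
v = salary.version             # transactions A and B both read version 0
salary.commit(52_000, v)       # A commits; version is now 1
try:
    salary.commit(51_000, v)   # B's commit is rejected at check time
except VersionConflict:
    conflict_detected = True
```

Contrast with locking, which would have blocked B up front instead of rejecting it at commit.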

Data Integrity

  • Referential — Foreign keys match primary keys
  • Entity — Primary keys are unique, not null
  • Domain — Values within valid range
  • ACID — Atomicity, Consistency, Isolation, Durability
Mnemonic — ACID Properties

Atomicity — all or nothing. Consistency — valid state before and after. Isolation — transactions don’t interfere. Durability — once committed, it’s permanent. Think of a bank transfer: it must be all four or your money vanishes.
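The "all or nothing" of atomicity is easy to see in code. A minimal sketch using Python's built-in sqlite3 module (the accounts table and the simulated crash are invented for the example):

```python
import sqlite3

# Toy bank in memory: a transfer must debit and credit together.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO accounts VALUES (?, ?)",
                [("alice", 100), ("bob", 50)])
con.commit()

def transfer(con, src, dst, amount, fail_midway=False):
    """Debit src and credit dst inside one transaction; 'with con'
    commits on success and rolls back if anything raises."""
    with con:
        con.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                    (amount, src))
        if fail_midway:
            raise RuntimeError("simulated crash between debit and credit")
        con.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                    (amount, dst))

try:
    transfer(con, "alice", "bob", 30, fail_midway=True)
except RuntimeError:
    pass  # the rollback has already undone the debit

# Atomicity held: the half-finished transfer left both balances untouched
balances = dict(con.execute("SELECT name, balance FROM accounts"))
```

Without the transaction, the crash would have left alice debited and bob never credited — exactly the vanishing money the mnemonic warns about.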

Key Exam Tip

Know ACID properties cold — especially Atomicity (all or nothing). For normalization, 3NF is the exam standard: “every non-key attribute depends on the key, the whole key, and nothing but the key.” Referential integrity = foreign key validity.

📰 Real World

Equifax 2017 — database activity monitoring was misconfigured, generating no alerts for 78 days while 147M records were exfiltrated.

✏️ TEST YOURSELF
Q1. A DBA states “We’d know if something was wrong” regarding database health. What does this attitude MOST indicate?
A. Confidence in the DBMS product
B. A reactive rather than proactive monitoring posture
C. Adequate knowledge of database administration
D. That automated alerts are functioning properly
Reveal Answer

✓ Correct: B

The statement reveals reliance on reactive detection rather than proactive health checks, integrity validation, and monitoring. Without scheduled checks, problems may go undetected until they cause outages.

Q2. An IS auditor discovers that a database administrator has direct write access to production financial tables and no audit trail exists for DBA activities. What is the MOST significant risk?
A. Database performance may be degraded by DBA queries
B. The DBA could modify financial data without detection
C. The database schema may become inconsistent
D. Backup procedures may not capture DBA changes
Reveal Answer

✓ Correct: B

Direct write access combined with no audit trail creates a fraud risk — the DBA could alter financial records with no way to detect the changes. This is a critical segregation-of-duties and detective control failure. Performance (A) is operational. Schema issues (C) are a technical concern. Backup gaps (D) are secondary to the data integrity risk.

Q3 (TRAP). An IS auditor finds that a customer database stores the same customer address in five different tables, and discrepancies exist between them. An analyst suggests normalising the database to fix this. What should the auditor's PRIMARY concern be?
A. Normalisation will slow down query performance
B. The data integrity issues caused by the current redundancy
C. The cost of redesigning the database
D. Whether the development team has normalisation skills
Reveal Answer

✓ Correct: B

The auditor's primary concern is the data integrity risk from redundant, inconsistent data — this is what drives the need for action. The trap is A — while normalisation can affect performance, the auditor should focus on the data integrity risk first. Cost (C) and team skills (D) are implementation considerations, not the auditor's primary concern.

The database is fine. But is anyone actually watching the system’s vital signs?

Part A Section 4.6

System Performance Monitoring

D4-06 — Observatory Tower

Performance Monitoring & Capacity Planning

👩‍💼

Alex asks to see the performance dashboard. There isn’t one. The team uses command-line tools when something feels slow. What’s the payroll system’s SLA for availability? Nobody knows. It was never defined. No baseline metrics, no capacity planning, no thresholds. Finding number six: no performance monitoring framework.

An observatory tower with operators watching performance dashboards, capacity meters, and a telescope pointed at future needs
📊

Key Performance Metrics

  • CPU utilization — processor workload %
  • Memory usage — RAM consumption
  • I/O throughput — disk read/write speed
  • Network latency — response delay
  • Transaction rate — TPS (transactions/sec)
🔭

Capacity Planning

  • Forecast future resource needs
  • Trend analysis on historical data
  • Model “what-if” scenarios
  • Balance cost vs. performance
  • Prevent bottlenecks proactively
Service Level Agreements (SLAs)

Uptime / Availability

99.9% = 8.77 hours downtime/year. 99.99% = 52.56 minutes/year. Know the “nines.”

Response Time

Maximum acceptable delay for user transactions. Measured at the user interface level.

Throughput

Volume of work completed per unit time. Must handle peak loads without degradation.
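The "nines" arithmetic above is worth being able to reproduce from scratch. A small sketch (8,766 hours assumes a 365.25-day year, which is why 99.9% is often quoted as 8.77 hours rather than 8.76):

```python
def max_downtime_hours(availability_pct, hours_per_year=8766):
    """Maximum allowed downtime per year for a given availability
    percentage, assuming a 365.25-day year."""
    return hours_per_year * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99):
    hours = max_downtime_hours(pct)
    print(f"{pct}% uptime -> {hours:.2f} h/year ({hours * 60:.1f} minutes)")
```

Running this reproduces the exam figures: 99% allows about 3.65 days, 99.9% about 8.77 hours, 99.99% about 52.6 minutes per year.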

Key Exam Tip

SLAs define the minimum acceptable service levels. An auditor should verify that SLAs are documented, measurable, monitored, and that penalties for non-compliance are defined. Capacity planning should be proactive, not reactive.

📰 Real World

Delta Air Lines 2016 — power control module failure cascaded for hours because no real-time monitoring thresholds triggered an alert. Cost: $150M+.

✏️ TEST YOURSELF
Q1. An IS auditor finds that no SLA exists for a critical payroll system. What is the MOST significant consequence?
A. Users cannot complain about downtime
B. There is no agreed baseline to measure performance or availability against
C. The system will perform poorly
D. IT staff have no motivation to fix issues
Reveal Answer

✓ Correct: B

Without an SLA, there is no defined acceptable level of service, making it impossible to measure compliance, trigger escalations, or hold anyone accountable. A, C, and D are secondary effects.

Q2. Which capacity planning approach is MOST appropriate for preventing future performance issues?
A. Adding resources after users complain
B. Proactive trend analysis of historical performance data
C. Purchasing the most powerful hardware available
D. Limiting the number of users on the system
Reveal Answer

✓ Correct: B

Proactive trend analysis uses historical data to predict future needs before problems occur. A is reactive. C is wasteful without data-driven needs. D restricts business operations.

Q3. A system has a 99.9% uptime SLA. What is the maximum acceptable annual downtime?
A. 52.56 minutes
B. 5.26 hours
C. 8.77 hours
D. 3.65 days
Reveal Answer

✓ Correct: C

99.9% uptime allows 0.1% downtime. Using a 365.25-day year: 8,766 hours × 0.001 ≈ 8.77 hours per year. 99.99% ≈ 52.56 minutes. 99% ≈ 3.65 days.

Operations are unmonitored. Now Alex turns to the bigger question: what if the whole building goes down?

Part B Section 4.7

Business Continuity Planning (BCP)

D4-07 — Medieval Town Prepares

BCP Lifecycle & Business Impact Analysis

👩‍💼

Alex asks to speak with the BCP coordinator. He’s on leave. The deputy doesn’t know where the BCP document is stored. When it’s finally located on a shared drive, it references a hot site whose contract was not renewed eighteen months ago. The BCP was last tested two years ago. Alex’s findings are multiplying.

A medieval town preparing for a storm with scribes analyzing critical assets, a plan-test-maintain-improve wheel, and scouts on watchtowers
BCP Lifecycle

1. Project Initiation — senior management sponsorship, scope definition, team formation
2. BIA — Business Impact Analysis: identify critical functions & maximum tolerable downtime
3. Strategy Development — select recovery strategies for each critical function
4. Plan Development — document procedures, roles, responsibilities, contacts
5. Testing & Training — exercise the plan, train staff, validate recovery
6. Maintenance — regular updates, reviews, and improvements

Business Impact Analysis (BIA) — The Foundation

The BIA is the first major step after project initiation. It identifies:

  • Critical business processes
  • Maximum Tolerable Downtime (MTD)
  • RTO & RPO for each process
  • Resource dependencies
  • Financial impact of downtime
  • Operational impact over time
BCP Test Types (Least to Most Disruptive)

1. Checklist — review the plan on paper
2. Walkthrough — tabletop discussion of the plan
3. Simulation — rehearse specific scenarios
4. Parallel — activate recovery site alongside primary
5. Full Interruption — shut down primary, switch to backup (highest risk)

Key Exam Tip

BIA must be completed BEFORE developing recovery strategies. Senior management must sponsor the BCP — this is non-negotiable. A full interruption test is the most realistic overall, but a parallel test is the most realistic option that does not disrupt primary operations.

📰 Real World

Hurricane Sandy 2012 — several financial institutions found their BCP alternate sites were in the same flood zone as primary sites.

✏️ TEST YOURSELF
Q1. An IS auditor discovers that a BCP references a hot site whose contract expired 18 months ago. What does this BEST indicate?
A. The hot site is no longer needed
B. The BCP maintenance process has failed
C. The BCP was never properly approved
D. The hot site vendor went out of business
Reveal Answer

✓ Correct: B

An expired recovery site contract in the BCP indicates that the plan is not being maintained and updated. Regular review should catch such gaps. A, C, and D are assumptions not supported by the facts.

Q2. Which BCP test type provides the MOST realistic validation without disrupting primary operations?
A. Checklist review
B. Tabletop walkthrough
C. Parallel test
D. Full interruption test
Reveal Answer

✓ Correct: C

A parallel test activates the recovery site while primary operations continue, providing realistic validation without risk. Full interruption is most realistic but disrupts operations. Checklist and walkthrough are theoretical exercises.

Q3. What must be completed BEFORE developing BCP recovery strategies?
A. Selection of a hot site vendor
B. Business Impact Analysis (BIA)
C. Full interruption test
D. Staff training on recovery procedures
Reveal Answer

✓ Correct: B

The BIA identifies critical functions, MTD, RTO, and RPO — all of which drive recovery strategy selection. You cannot choose a strategy without knowing what you're protecting and how quickly it must recover.

The BCP is a paper exercise. What about the actual disaster recovery specifics?

Part B Section 4.8

Disaster Recovery Planning (DRP)

D4-08 — The Hourglass

RTO, RPO & Recovery Sites

👩‍💼

Alex asks the critical questions. RTO for payroll? Nobody knows. RPO? Nobody knows either. Backups run daily at 3am. It’s now 11am. If they had to restore, that’s 8 hours of data loss. For a payroll system processing 2,400 employees, that’s unacceptable — but nobody ever defined what “acceptable” means.

A dramatic split hourglass with RPO showing data falling backwards toward last backup, RTO showing recovery teams racing forward, hot/warm/cold sites in background
The Recovery Equation
RPO ← ⚡ Disaster ⚡ → RTO

RPO — Recovery Point Objective

“How much data can we lose?”

  • Looks BACKWARD from disaster
  • Maximum tolerable data loss
  • Determines backup frequency
  • RPO = 4 hrs → backup every 4 hrs

RTO — Recovery Time Objective

“How fast must we recover?”

  • Looks FORWARD from disaster
  • Maximum tolerable downtime
  • Determines recovery site type
  • RTO = 0 → need a hot site
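The RPO arithmetic in Alex's scenario can be sketched directly (the dates are invented; only the 3am/11am times come from the story):

```python
from datetime import datetime

def data_loss_exposure_hours(last_backup, failure_time):
    """Hours of data at risk if the system fails now: everything since
    the last good backup would be lost or need re-entry."""
    return (failure_time - last_backup).total_seconds() / 3600

# Alex's payroll scenario: backup at 3am, failure at 11am, RPO of 4 hours
exposure = data_loss_exposure_hours(datetime(2024, 5, 3, 3, 0),
                                    datetime(2024, 5, 3, 11, 0))
rpo_hours = 4
rpo_breached = exposure > rpo_hours   # 8 hours exposed vs 4 allowed
```

The worst-case exposure equals the backup interval, which is why a 4-hour RPO demands backups at least every 4 hours.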
Recovery Site Types

  • Hardware — Hot: fully equipped; Warm: partially equipped; Cold: empty shell
  • Data — Hot: near real-time; Warm: recent backup; Cold: none
  • Ready in — Hot: minutes to hours; Warm: hours to days; Cold: days to weeks
  • Cost — Hot: highest; Warm: moderate; Cold: lowest
  • Best for — Hot: zero downtime tolerance; Warm: hours-level RTO; Cold: days/weeks RTO acceptable
Mnemonic — RPO vs. RTO

RPO = Recovery Point = the point in the past you rewind to (data loss). RTO = Recovery Time = time to get back up (downtime). Point looks back, Time looks forward.

Key Exam Tip

This is the most heavily tested concept in Domain 4. RPO drives backup strategy, RTO drives recovery site selection. A hot site is ready immediately but is the most expensive. Also know: reciprocal agreements are least reliable because they depend on another organization’s capacity.

📰 Real World

2011 Thailand floods destroyed hard drive plants. Companies without documented RTO/RPO had no criteria for activating alternate suppliers.

✏️ TEST YOURSELF
Q1. A company’s RPO for its payroll system is 4 hours. Daily backups run at 3am. A failure occurs at 11am. What is the data loss exposure?
A. 4 hours
B. 8 hours
C. 11 hours
D. 24 hours
Reveal Answer

✓ Correct: B

The last backup was at 3am. Failure at 11am means 8 hours of data since the last backup. This exceeds the 4-hour RPO, indicating the backup frequency is inadequate for the defined RPO.

Q2. An organization requires near-zero downtime for its trading platform. Which recovery site type is MOST appropriate?
A. Cold site
B. Warm site
C. Hot site
D. Reciprocal agreement
Reveal Answer

✓ Correct: C

A hot site has fully configured hardware with near real-time data replication, enabling recovery in minutes to hours. Cold sites take days/weeks. Warm sites take hours/days. Reciprocal agreements are the least reliable option.

Q3 (TRAP). A company's payroll system has an RPO of 4 hours and performs daily backups at midnight. A system failure occurs at 3pm. What is the MOST significant finding?
A. The backup schedule meets the RPO requirement
B. Up to 15 hours of data could be lost, far exceeding the 4-hour RPO
C. The RTO has been exceeded
D. Daily backups are industry standard and therefore acceptable
Reveal Answer

✓ Correct: B

With a midnight backup and a 3pm failure, up to 15 hours of data could be lost. The RPO of 4 hours means the organisation can only tolerate losing 4 hours of data — the backup frequency is grossly inadequate. The trap is A — daily backups do NOT meet a 4-hour RPO. C is wrong because RTO measures recovery time, not data loss. D appeals to a false sense of "industry standard."

Recovery objectives are undefined. But what about the actual backups — do they even work?

Part B Section 4.9

Backup & Recovery

D4-09 — The Library Copyists

Backup Types & Strategies

👩‍💼

Last backup: last night at 3am. Last backup test: 14 months ago. The team can confirm backups complete, but nobody has verified that a restore actually produces a working system. Alex notes this separately — it’s a finding that stands on its own. A backup you’ve never tested is a hope, not a control.

Three filing cabinet drawers labeled Full, Incremental, and Differential with a technician deciding which drawer to open during recovery
📚

Full Backup

  • Copies ALL data every time
  • Slowest to create
  • Fastest to restore
  • Resets archive bit
📗

Incremental

  • Only data changed since LAST backup (any type)
  • Fastest to create
  • Slowest to restore (need all incrementals)
  • Resets archive bit
📙

Differential

  • Data changed since last FULL backup
  • Grows larger each day
  • Restore = full + latest differential
  • Does NOT reset archive bit
Restore Speed Comparison

  • Full — backup: slowest (copies everything); restore: fastest (one media set)
  • Incremental — backup: fastest (only changes); restore: slowest (full + all incrementals)
  • Differential — backup: medium (grows daily); restore: medium (full + last differential)
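The restore-set arithmetic behind this comparison can be sketched in a few lines. This is an illustrative model only (days numbered 0 onward, one non-full backup kind per cycle, as in the comparison above):

```python
def restore_chain(backups):
    """Given a date-ordered list of (day, kind) backups, where kind is
    'full', 'incr', or 'diff', return the media sets needed to restore
    to the latest point."""
    last_full = max(i for i, (_, kind) in enumerate(backups) if kind == "full")
    if backups[-1][1] == "full":
        return [backups[-1]]                      # one set
    if backups[-1][1] == "diff":
        return [backups[last_full], backups[-1]]  # full + latest differential
    return backups[last_full:]                    # full + every incremental

week_incr = [(0, "full"), (1, "incr"), (2, "incr"), (3, "incr")]
week_diff = [(0, "full"), (1, "diff"), (2, "diff"), (3, "diff")]
# After day 3: the incremental scheme needs 4 sets, the differential only 2
```

This is the exam trade-off in miniature: incrementals are cheap to take but every one of them is needed at restore time, while a differential scheme never needs more than two sets.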

Off-site Storage & Media Rotation

  • Grandfather-Father-Son (GFS) — daily/weekly/monthly rotation
  • Off-site vaulting — store backups at a separate location
  • Electronic vaulting — batch transfer to off-site
  • Remote journaling — real-time transaction log transfer
Key Exam Tip

Key distinction: Incremental backs up since the LAST backup (any type), Differential backs up since the last FULL backup. For recovery, differential is faster than incremental (need only 2 sets vs. potentially many). Off-site backups are essential — same building = same disaster.

📰 Real World

GitLab 2017 — the primary database was accidentally deleted. The backup system had been failing silently for months, and only 5GB of 300GB was recoverable. Regular restore testing, not just successful completion, is what matters.

✏️ TEST YOURSELF
Q1. An IS auditor verifies that backups run nightly. What additional evidence BEST confirms backup adequacy?
A. Backup logs showing completion
B. Evidence of successful restoration tests
C. Offsite storage receipts
D. Backup software vendor certification
Reveal Answer

✓ Correct: B

Successful restoration tests prove that backups actually work and can produce a usable system. Completion logs only confirm the backup ran, not that data is recoverable. Offsite storage and vendor certification don't verify recoverability.

Q2. Which backup type requires the MOST media sets for a full restore?
A. Full backup
B. Differential backup
C. Incremental backup
D. Mirror backup
Reveal Answer

✓ Correct: C

Incremental restore requires the last full backup plus every incremental since then (potentially many sets). Differential needs only full + last differential (2 sets). Full needs only one set.

Q3. A backup system has been completing nightly for 14 months without a restoration test. What is the PRIMARY risk?
A. Backup media may have degraded
B. Backups may not be restorable, and the organization won’t know until a real disaster
C. Storage costs are being wasted
D. The backup schedule may be inefficient
Reveal Answer

✓ Correct: B

Without restoration testing, there is no assurance that backups produce working systems. GitLab's 2017 incident proved that running backups and having working backups are very different things.

Backups exist but are untested. Three hours in — the team finally finds the fix.

Part B Section 4.10

Incident Management

D4-10 — Fire Station Response

Incident Response Lifecycle

👩‍💼

Three hours in, the misconfigured firewall rule is identified and reversed. The payroll portal comes back online. The team relaxes. Alex does not. She has been watching the entire response unfold without a single documented step. Her finding count keeps climbing, and the incident itself just became finding material for how not to run incident response.

A fire station with five sequential bays showing Detection, Containment, Eradication, Recovery, and Lessons Learned
5-Phase Incident Response Process
1
Detection

Identify the incident through monitoring, alerts, or reports

2
Containment

Limit the damage — isolate affected systems

3
Eradication

Remove the root cause — clean malware, patch vulnerability

4
Recovery

Restore systems to normal operations, verify integrity

5
Lessons Learned

Post-mortem review — what went wrong, what to improve

Mnemonic — DCERL

Remember the order with: “Don’t Cry, Every Recovery Leads” to better security.

“Detect → Contain → Eradicate → Recover → Learn”

Incident Response Team (IRT)

  • Pre-designated team with clear roles
  • Must have management authority
  • Cross-functional (IT, legal, HR, PR)
  • 24/7 contact information maintained
  • Regular training and exercises

Key Documentation

  • Incident log with timestamps
  • Chain of custody for evidence
  • Classification & prioritization
  • Communication plan (internal & external)
  • Post-incident report
Key Exam Tip

After detection, the FIRST priority is always containment — stop the bleeding before investigating. “Lessons learned” is the most frequently skipped step in practice — the exam considers it mandatory. Preserve evidence chain of custody for potential legal proceedings.

📰 Real World

Target 2013 — FireEye alerts were dismissed. Breach persisted 16 days. Detection without response is not security.

✏️ TEST YOURSELF
Q1. After detecting a security incident, what should be the IS auditor’s FIRST recommended action?
A. Begin forensic analysis
B. Contain the incident to prevent further damage
C. Notify law enforcement
D. Document the root cause
Reveal Answer

✓ Correct: B

Containment is always the first priority after detection — limit the blast radius. Forensics and root cause come after containment. Law enforcement notification depends on the incident type and may not be the first step.

Q2. Which phase of incident response is MOST commonly skipped in practice but considered mandatory by ISACA?
A. Detection
B. Containment
C. Eradication
D. Lessons learned
Reveal Answer

✓ Correct: D

Organizations often rush back to normal operations and skip the post-incident review. ISACA considers lessons learned essential for improving the incident response process and preventing recurrence.

Q3. An incident response team resolves a payroll outage in 3 hours but documents nothing during the process. What is the GREATEST risk?
A. The team cannot be rewarded for their work
B. No audit trail exists for analysis, pattern detection, or regulatory compliance
C. The team will forget the technical steps
D. Management will not know the outage occurred
Reveal Answer

✓ Correct: B

Without documentation, there is no evidence for audit, no data for trend analysis, no compliance evidence, and no basis for lessons learned. The other options are secondary consequences.

The system is back. But Alex has noticed something troubling on a Finance laptop.

Part C Section 4.11

End-User Computing & Shadow IT

D4-11 — Light vs. Shadow

End-User Computing Risks & Controls

👩‍💼

While the payroll portal was down, Finance pulled out their backup plan: a parallel payroll spreadsheet maintained “just in case.” It’s on a personal laptop. Unsecured. Unencrypted. It contains salary data for all 2,400 employees. The Finance manager doesn’t see the problem. Alex sees finding number eleven — and possibly the worst one yet.

Split office scene — bright side with approved IT systems and guards, shadowy side with unauthorized devices and personal cloud apps, auditor with flashlight in the middle
⚠️

Shadow IT Risks

  • Unpatched, unmanaged devices
  • Data stored outside corporate control
  • No backup or recovery
  • Compliance violations (GDPR, HIPAA)
  • Unauthorized cloud services (SaaS)
  • Spreadsheet-based “applications”
🛡️

Recommended Controls

  • End-user computing policy
  • Application whitelisting
  • CASB (Cloud Access Security Broker)
  • DLP (Data Loss Prevention)
  • Regular discovery and inventory
  • User awareness training

EUC Application Risks

End-user developed applications (spreadsheets, Access databases, macros) are high-risk because they typically lack:

  • Change management controls
  • Version control
  • Input validation
  • Documentation
  • Access controls
  • Audit trails
Key Exam Tip

The greatest risk of end-user computing is lack of IT controls — no change management, no backups, no audit trails. Shadow IT bypasses all corporate governance. The auditor’s first recommendation should be a comprehensive EUC policy, not banning personal devices outright.

📰 Real World

UK government 2012 — fined £150,000 after an employee stored 24,000 people’s personal data on an unencrypted personal laptop that was stolen.

✏️ TEST YOURSELF
Q1. An employee maintains a payroll spreadsheet with salary data for 2,400 employees on a personal laptop. What is the GREATEST risk?
A. The spreadsheet may contain calculation errors
B. Sensitive data is outside corporate security controls and subject to loss or theft
C. The employee is doing unnecessary work
D. The IT department does not know about the spreadsheet
Reveal Answer

✓ Correct: B

Sensitive personal data on an unmanaged, unencrypted personal device is a critical data protection and compliance risk. Loss or theft could result in a reportable breach. A and C are minor. D is a symptom of the larger control gap.

Q2. What is the BEST first step an IS auditor should recommend to address shadow IT?
A. Block all personal devices immediately
B. Implement a comprehensive end-user computing policy
C. Install monitoring software on all personal devices
D. Terminate employees using unauthorized tools
Reveal Answer

✓ Correct: B

A comprehensive EUC policy establishes governance without being punitive. Blocking all devices may disrupt operations. Monitoring personal devices raises privacy issues. Termination is disproportionate.

Q3. Which control would MOST effectively prevent sensitive data from being stored on unauthorized devices?
A. User awareness training
B. Data Loss Prevention (DLP) technology
C. Stronger passwords on corporate systems
D. More frequent IT audits
Reveal Answer

✓ Correct: B

DLP technology can detect and prevent sensitive data from being copied to unauthorized locations, including personal devices and cloud services. Training is helpful but not enforceable. Passwords and audits don't directly address the issue.

Shadow IT is everywhere. Could moving to the cloud fix all of this?

Part C Section 4.12

Cloud Computing

D4-12 — Cloud Kingdom

Cloud Service & Deployment Models

👩‍💼

Meridian is considering moving payroll to the cloud. The CFO corners Alex: “Would this have happened in the cloud?” Alex pauses. “Possibly. Just differently. The firewall misconfiguration? That becomes a security group misconfiguration. The missing monitoring? That’s still your responsibility. The cloud changes where the problems live, not whether they exist.”

A three-layered cloud kingdom — IaaS at bottom with servers, PaaS in middle with development tools, SaaS on top with happy users, deployment model castles labeled Public Private Hybrid Community
Service Models — Shared Responsibility
🏗️

IaaS

Infrastructure as a Service

  • Provider manages: hardware, networking, storage
  • Customer manages: OS, apps, data, middleware
  • Most customer responsibility
  • Example: AWS EC2, Azure VMs
🔧

PaaS

Platform as a Service

  • Provider manages: infra + OS + runtime
  • Customer manages: apps and data
  • Shared responsibility
  • Example: Heroku, Google App Engine

SaaS

Software as a Service

  • Provider manages: everything
  • Customer manages: data and access
  • Least customer responsibility
  • Example: Gmail, Salesforce, Office 365
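The responsibility split across the three models can be summarized as a lookup table. This is a simplification for study purposes; the exact boundary is defined by each provider's contract and documentation:

```python
# Customer-managed layers by service model (simplified sketch; real
# provider contracts define the exact boundary).
CUSTOMER_MANAGES = {
    "IaaS": {"OS", "middleware", "applications", "data", "access"},
    "PaaS": {"applications", "data", "access"},
    "SaaS": {"data", "access"},
}

def customer_responsible(model: str, layer: str) -> bool:
    return layer in CUSTOMER_MANAGES[model]

print(customer_responsible("IaaS", "OS"))    # True: customer patches the OS
print(customer_responsible("SaaS", "OS"))    # False: the provider's job
print(customer_responsible("SaaS", "data"))  # True: data is always the customer's
```

Note that "data" and "access" appear in every row: no service model ever removes the customer's responsibility for its own data and identities, which is the core exam point.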
Deployment Models

☁️ Public Cloud

Shared infrastructure, multi-tenant. Lowest cost, least control. Managed by third party.

🏰 Private Cloud

Dedicated to one organization. Most control, highest cost. Can be on-premises or hosted.

🔗 Hybrid Cloud

Combination of public and private. Sensitive data on private, burst to public for capacity.

🏘️ Community Cloud

Shared by organizations with common concerns (e.g., healthcare, government). Cost shared.

Key Cloud Audit Concerns

  • Data sovereignty — where is data stored?
  • Right to audit — contractual access
  • Vendor lock-in — portability risks
  • Multi-tenancy risks — data isolation
  • SOC 2 reports — assurance from provider
  • Incident notification — SLA obligations
Mnemonic — Cloud Responsibility Ladder

Think of a pizza analogy: IaaS = you buy ingredients and cook (most work). PaaS = you get a ready kitchen (just cook your recipe). SaaS = you order delivery (just eat). The more “as a Service,” the less you manage.

Key Exam Tip

Even when using cloud services, the ORGANIZATION remains responsible for its data. The cloud provider is responsible for infrastructure security, but the customer must still protect data and manage access. An auditor should request SOC 2 Type II reports from the cloud provider as primary assurance evidence.

📰 Real World

Capital One 2019 — roughly 100M customer records exposed from AWS-hosted storage. The misconfigured WAF was Capital One’s responsibility under the shared responsibility model, not AWS’s.

✏️ TEST YOURSELF
Q1. A CFO asks whether moving payroll to the cloud would eliminate operational risks. What is the BEST response?
A. Yes, cloud providers handle all security
B. Cloud transfers some risks, but significant responsibilities remain with the organization under the shared responsibility model
C. No, cloud is less secure than on-premises
D. Only if they choose a private cloud
Reveal Answer

✓ Correct: B

The shared responsibility model means the organization always retains responsibility for data, access management, and application-level security. Cloud changes where risks live, not whether they exist.

Q2. Which cloud service model places the MOST security responsibility on the customer?
A. SaaS
B. PaaS
C. IaaS
D. All models share equal responsibility
Reveal Answer

✓ Correct: C

In IaaS, the customer manages OS, applications, data, and middleware. In PaaS, the customer manages apps and data. In SaaS, only data and access. IaaS = most customer responsibility.

Q3. An IS auditor needs assurance about a cloud provider’s controls. What is the MOST appropriate evidence to request?
A. The provider’s marketing materials
B. A SOC 2 Type II report
C. An email confirmation from the sales team
D. The provider’s ISO certification number
Reveal Answer

✓ Correct: B

SOC 2 Type II reports provide independent assurance over a period of time about the effectiveness of the provider’s controls. Marketing materials and emails are not audit evidence. ISO certification confirms a framework exists but not effectiveness over time.

🖥️

The payroll system is back. Alex has 11 findings.

By 1pm, everything is working again. The ops team is relieved. Alex is still writing. Three hours of live incident taught her more about Meridian’s IT operations than three weeks of documentation review. The systems work — mostly. The controls around them are full of gaps. And somewhere in Finance, a laptop with 2,400 salary records is sitting unlocked on a desk. She adds it as finding number 12.

✓ IT Operations ✓ Infrastructure ✓ Network Management ✓ ITIL Service Management ✓ Database Management ✓ Performance Monitoring ✓ Business Continuity ✓ Disaster Recovery ✓ Backup & Recovery ✓ Incident Response ✓ End-User Computing ✓ Cloud Computing
Continue to Domain 5

⚠️ Top 10 Exam Traps — Domain 4

1
❌ “Hot site = instant recovery”
✓ Hot site has pre-configured equipment but still requires current backups to be loaded — not instantaneous
2
❌ “RPO = how long recovery takes”
✓ RPO = maximum acceptable DATA LOSS (age of restored data). RTO = how long RECOVERY takes
3
❌ “Full backup is always best”
✓ Full + incremental is often more practical. Best strategy depends on RPO
4
❌ “Incident management fixes root cause”
✓ Incident management restores SERVICE. Problem management finds ROOT CAUSE
5
❌ “Cloud eliminates operational risk”
✓ Cloud transfers some risks; under shared responsibility, significant risks remain with the organization
6
❌ “Tested BCP = good BCP”
✓ Only if tested against realistic scenarios, kept current, includes all critical systems
7
❌ “Normalisation improves performance”
✓ Normalisation improves DATA INTEGRITY — it can hurt performance (more joins). Denormalisation is used for performance
8
❌ “Operations team owns BCP”
✓ BCP ownership = SENIOR MANAGEMENT. IT operations supports recovery
9
❌ “Encryption makes backups secure”
✓ Encryption + physical security + off-site storage + access control. All required
10
❌ “IDS and IPS both prevent intrusions”
✓ IDS detects and alerts (detective). IPS can block (preventive + detective)