Domain 4

IS Operations & Business Resilience

Keep the lights on — and know what to do when they go out. The largest CISA domain, covering everything from daily operations to disaster recovery.

Domain 4: IS Operations & Business Resilience 23%
👩‍💼

Alex's Week 4 — The Payday Outage

Alex survived three weeks of planning, governance reviews, and a runaway development project. This week was supposed to be quieter — an audit of IT operations and business continuity. Then at 9:47am on a Friday, the payroll system went down. Alex closed her laptop and walked to the server room. This was going to be the most educational Friday of her career.

D4-01 — Command Center

Domain 4 Overview

A bustling vintage command center with operators managing schedules, help desks, and tape reels

Part A: Operations (4.1–4.6)

  • 4.1 IT Operations
  • 4.2 Hardware & Infrastructure
  • 4.3 Network Management
  • 4.4 IT Service Management
  • 4.5 Database Management
  • 4.6 Performance Monitoring

Part B: Resilience (4.7–4.10)

  • 4.7 Business Continuity
  • 4.8 Disaster Recovery
  • 4.9 Backup & Recovery
  • 4.10 Incident Management

Part C: Emerging (4.11–4.12)

  • 4.11 End-User Computing
  • 4.12 Cloud Computing

Domain Weight: 23% (~34 questions) — Largest CISA Domain

This is the single biggest domain on the exam. Master BCP/DRP concepts, understand recovery objectives, and know your backup strategies inside-out.

Key Exam Tip

Domain 4 is the largest at 23% — expect roughly 34 questions. BCP/DRP and incident management are the most heavily tested areas. Know RTO vs. RPO, hot/warm/cold sites, and the incident response lifecycle by heart.

Part A Section 4.1

IT Operations

D4-01 — Operations Command Center

IT Operations Management

👩‍💼

The payroll batch job ran at 2am. Output was generated. But the file never reached the portal — and no alert fired. Alex checks the monitoring configuration: there is none. No completion check. No failure notification. She writes her first finding at 9:52am: “Basic operations monitoring: absent.”
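The control Alex found missing is simple to express: after each batch job, check that its output actually arrived, and alert if it did not. A minimal sketch in Python (the job name, times, and 30-minute grace window are invented for illustration; a real shop would wire this into its scheduler or monitoring tool):

```python
from datetime import datetime, timedelta

def check_job_output(job_name, scheduled_end, output_arrived_at,
                     grace_minutes=30):
    """Return an alert string if a batch job's output is missing or late,
    or None if it arrived within the grace window."""
    deadline = scheduled_end + timedelta(minutes=grace_minutes)
    if output_arrived_at is None:
        return f"ALERT: {job_name} produced no output by {deadline:%H:%M}"
    if output_arrived_at > deadline:
        return f"ALERT: {job_name} output arrived late ({output_arrived_at:%H:%M})"
    return None  # completed normally, nothing to report

# The payroll scenario: the job ran at 2am, but the file never arrived.
alert = check_job_output("payroll_batch",
                         scheduled_end=datetime(2024, 5, 3, 2, 0),
                         output_arrived_at=None)
```

Either missing check in the sketch — completion or timeliness — corresponds to a finding an auditor would raise.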

📋

Job Scheduling

  • Automated batch job execution
  • Dependency management
  • SLA-driven scheduling
  • Auditor checks: authorization, logs, error handling
📞

Help Desk / Service Desk

  • Single point of contact (SPOC)
  • Ticket tracking & escalation
  • First-call resolution rate
  • Knowledge base management
🖨️

Operations Controls

  • Print/output management
  • Tape/media library controls
  • Lights-out operations
  • Segregation of duties in IT ops

Key Auditor Concern: Segregation of Duties (SoD)

Computer operators should not have access to modify programs, data, or system documentation. The auditor should verify that operations staff cannot make unauthorized changes to production.

Key Exam Tip

The most critical control in IT operations is segregation of duties. Operators should never have access to change programs or data. The exam loves questions about what operators should and should not be able to do.

📰 Real World

British Airways Bank Holiday 2017 outage — engineer accidentally powered down a data centre UPS. No automated monitoring detected it. 75,000 passengers stranded.

✏️ TEST YOURSELF
Q1. An IS auditor discovers that a batch job failed overnight but no alert was generated. What is the MOST significant risk?
A. Critical outputs may not be delivered, and failures go undetected indefinitely
B. The batch job will need to be rerun manually
C. Operators will have less work to do during the shift
D. The help desk will receive more calls than normal
Reveal Answer

✓ Correct: A

Without automated monitoring and alerting, job failures can go undetected for extended periods, leading to missing outputs and cascading impacts. B describes a consequence but not the primary risk. C and D are minor operational effects.

Q2. Which control BEST ensures that computer operators cannot make unauthorized changes to production programs?
A. Regular review of operator activity logs
B. Segregation of duties between operations and programming
C. Requiring operators to sign a code of conduct
D. Installing antivirus on operator workstations
Reveal Answer

✓ Correct: B

Segregation of duties is the primary preventive control. Operators should not have access to source code or production program libraries. A is detective, not preventive. C is administrative but not enforceable. D is unrelated.

Q3. An organization runs lights-out (unattended) operations overnight. What is the FIRST control an IS auditor should verify?
A. That physical access to the data centre is restricted
B. That automated job scheduling and monitoring tools are properly configured
C. That operators are available by phone
D. That backup tapes are rotated nightly
Reveal Answer

✓ Correct: B

In lights-out operations, automated monitoring is the primary control because no humans are present. Without it, failures go undetected. A is important but secondary. C partially defeats the purpose of lights-out. D is a separate concern.

The batch job ran fine — the infrastructure underneath it is the next question.

Part A Section 4.2

Hardware & Infrastructure

D4-02 — Data Center City

Hardware & Infrastructure Components

👩‍💼

Alex pulls up the infrastructure documentation. Payroll runs on a single physical server. No clustering. No failover. The infrastructure diagram is dated 2019. “Some things may have changed,” the infrastructure manager admits. Alex notes finding number two: single point of failure, no redundancy, stale documentation.

A futuristic data center city with server towers, SAN storage, NAS filing cabinet, and IaaS PaaS SaaS clouds
🗄️

SAN (Storage Area Network)

  • Dedicated high-speed storage network
  • Block-level access
  • Used for databases, email servers
  • Expensive but high-performance
📁

NAS (Network Attached Storage)

  • File-level access over LAN
  • Shared storage for file servers
  • Simpler and cheaper than SAN
  • Uses TCP/IP protocols
Server Types & Configurations

Mainframe

Centralized, high-reliability processing for critical batch jobs and transaction processing.

Client-Server

Distributed processing: thin clients (server-dependent) vs. thick clients (local processing).

Virtualization

Multiple virtual servers on one physical host. Key risk: hypervisor compromise.

Key Exam Tip

Know SAN vs. NAS: SAN = block-level, dedicated network, high performance. NAS = file-level, over existing LAN, simpler. Virtualization’s biggest risk is a compromised hypervisor, which could affect all hosted VMs.

📰 Real World

2021 Facebook 6-hour outage — BGP configuration change took down DNS servers. The team couldn’t remotely fix it because access tools also ran on affected infrastructure.

✏️ TEST YOURSELF
Q1. An IS auditor finds that a critical payroll application runs on a single physical server with no failover. What should be the auditor's PRIMARY recommendation?
A. Implement server clustering or redundancy for critical systems
B. Increase the server's processing power
C. Schedule more frequent backups
D. Document the current configuration
Reveal Answer

✓ Correct: A

A single point of failure for a critical system is an unacceptable risk. Clustering provides redundancy. B improves performance but not availability. C helps recovery but not prevention. D is useful but doesn't address the risk.

Q2. What is the GREATEST risk of server virtualization from an audit perspective?
A. Virtual machines use more storage than physical servers
B. A compromised hypervisor could affect all hosted virtual machines
C. Virtualization software requires frequent updates
D. Virtual servers cannot be backed up easily
Reveal Answer

✓ Correct: B

The hypervisor is the single point of control for all VMs. If compromised, every VM on that host is at risk. A is incorrect. C is a maintenance concern, not the greatest risk. D is false — virtual servers can be backed up.

Q3. Which storage technology provides block-level access over a dedicated high-speed network?
A. NAS (Network Attached Storage)
B. SAN (Storage Area Network)
C. DAS (Direct Attached Storage)
D. Cloud object storage
Reveal Answer

✓ Correct: B

SAN provides block-level access over a dedicated network, ideal for high-performance databases. NAS provides file-level access over LAN. DAS is directly connected to one server. Cloud object storage is accessed via APIs.

The server is there. But something between it and the users is broken — the network.

Part A Section 4.3

Network Management

D4-03 — Seven-Story Building

The OSI Model — 7 Layers

👩‍💼

A firewall rule was changed at 2am — blocking the payroll server’s outbound connection to the portal. Nobody reviewed the change. Nobody approved it. The logs showed the block clearly, but nobody was watching. Alex writes finding number three: uncontrolled change to a network device with no review process.

A seven-story building cutaway with each floor representing an OSI layer, workers passing data packages between floors
7 Application — HTTP, FTP, SMTP, DNS — what users interact with
6 Presentation — Encryption, compression, data formatting (SSL/TLS)
5 Session — Manages connections/sessions between applications
4 Transport — TCP/UDP, flow control, error recovery, segmentation
3 Network — IP addressing, routing (routers operate here)
2 Data Link — MAC addresses, frames (switches operate here)
1 Physical — Cables, hubs, electrical signals, bits on the wire
Mnemonic — Remember from Layer 7 down

Use this phrase to remember all 7 layers top-to-bottom:

“All People Seem To Need Data Processing”

Or bottom-up: “Please Do Not Throw Sausage Pizza Away” (Physical → Application)

Key Network Devices

  • Hub — Layer 1, broadcasts to all
  • Switch — Layer 2, uses MAC addresses
  • Router — Layer 3, uses IP addresses
  • Firewall — Layers 3-7, filters traffic
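For self-quizzing, the device-to-layer mapping above fits in a small lookup table. This is exam shorthand only (modern multilayer switches and next-generation firewalls blur these boundaries):

```python
# Layer at which each device primarily operates, per CISA exam shorthand.
DEVICE_LAYER = {
    "hub": 1,         # Physical: repeats bits to every port
    "switch": 2,      # Data Link: forwards frames by MAC address
    "router": 3,      # Network: routes packets by IP address
    "firewall": "3-7" # filters from Network up through Application
}

OSI_LAYERS = {
    7: "Application", 6: "Presentation", 5: "Session", 4: "Transport",
    3: "Network", 2: "Data Link", 1: "Physical",
}

print(f"A router operates at layer {DEVICE_LAYER['router']} "
      f"({OSI_LAYERS[DEVICE_LAYER['router']]})")
```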

Network Types

  • LAN — Local, same building/campus
  • WAN — Wide, connects distant LANs
  • VPN — Encrypted tunnel over public network
  • VLAN — Logical segmentation of a LAN
Key Exam Tip

Know which devices operate at which OSI layers: hubs at Layer 1, switches at Layer 2, routers at Layer 3, firewalls at Layers 3-7. The exam frequently tests which layer a specific technology or attack targets.

📰 Real World

Knight Capital 2012 — a faulty software deployment left obsolete code active on 1 of 8 servers, causing a $440M loss in 45 minutes. Unreviewed production changes are among the highest-risk IT operations.

✏️ TEST YOURSELF
Q1. A firewall rule change at 2am blocks a critical server. No change approval exists. What control failure does this BEST illustrate?
A. Lack of intrusion detection
B. Absence of change management for network devices
C. Inadequate firewall technology
D. Poor network segmentation
Reveal Answer

✓ Correct: B

The core issue is an unapproved, unreviewed change. Change management requires approval, testing, and rollback plans before any production change. A is about detection, not prevention. C and D are unrelated to the approval process.

Q2. An IS auditor reviews network documentation and finds that the organisation uses a flat network with all servers, workstations, and IoT devices on the same subnet. What is the GREATEST risk?
A. Network performance degradation due to broadcast traffic
B. Lateral movement — a compromised device can directly reach all other systems
C. Difficulty in assigning static IP addresses
D. Inability to implement wireless networking
Reveal Answer

✓ Correct: B

Without network segmentation, an attacker who compromises any device (e.g., an IoT sensor) can reach critical servers directly. This is the greatest security risk of a flat network. Performance issues (A) are operational, not the greatest risk. IP addressing (C) and wireless (D) are unrelated to the flat network topology.

Q3. An IS auditor reviews firewall logs and finds that blocked traffic is logged but never reviewed. What is the PRIMARY risk?
A. The firewall will run out of storage
B. Security incidents may go undetected despite being logged
C. The firewall performance will degrade
D. Compliance reports will be incomplete
Reveal Answer

✓ Correct: B

Logging without review provides no security value. The logs captured the blocked payroll connection, but nobody was watching. A and C are operational concerns. D is secondary to the detection gap.

The firewall blocked the connection. But how is the team even managing this incident?

Part A Section 4.4

IT Service Management (ITIL)

D4-04 — ITIL Workshop

ITIL Service Management Processes

👩‍💼

One hour into the outage. Alex asks for the incident record. There isn’t one. The team is coordinating via WhatsApp. No ticket number. No severity classification. No escalation timeline. No formal ITIL process whatsoever. “We just fix things,” says the team lead. Alex adds finding number four.

A hospital emergency room with four clearly labeled stations: Incident Desk, Problem Clinic, Change Committee, Configuration Pharmacy
🚨

Incident Management

“Restore service ASAP”

  • Focus: restore normal operations quickly
  • Does NOT find root cause
  • Workarounds are acceptable
  • Measured by: time to restore
🔍

Problem Management

“Find and fix the root cause”

  • Focus: identify underlying cause
  • Prevents recurrence
  • Creates known error database (KEDB)
  • Proactive & reactive modes

Change Management

“Control all changes”

  • Submit RFC (Request for Change)
  • Impact analysis & approval
  • CAB (Change Advisory Board)
  • Rollback plan required
📦

Configuration Management

“Know what you have”

  • CMDB — Configuration Management DB
  • Tracks all IT assets (CIs)
  • Baseline configurations
  • Supports all other ITIL processes
Mnemonic — Incident vs. Problem

Think of it like a hospital: Incident = Emergency Room (stop the bleeding now), Problem = Diagnostic Lab (find out why the patient keeps getting sick). Emergency first, diagnosis second.

Key Exam Tip

The #1 tested ITIL concept: Incident management restores service (workaround OK), problem management finds root cause (permanent fix). Change management requires a rollback plan. Every change must go through the CAB or emergency CAB (ECAB).

📰 Real World

2003 Northeast blackout — utilities without formal incident management took days to restore power. Those with structured escalation had significantly faster recovery.

✏️ TEST YOURSELF
Q1. An IS auditor finds the IT team manages incidents via a WhatsApp group with no formal logging. What is the PRIMARY risk?
A. Slow resolution times
B. Loss of audit trail and inability to analyse patterns
C. Poor team communication
D. Escalation delays
Reveal Answer

✓ Correct: B

Without formal incident logging, there is no audit trail, no data for trend analysis, and no evidence of compliance. A, C, and D are possible effects but the fundamental risk is the loss of documentation and pattern analysis capability.

Q2. After a major outage is resolved, what ITIL process should be initiated to prevent recurrence?
A. Incident management
B. Change management
C. Problem management
D. Configuration management
Reveal Answer

✓ Correct: C

Problem management investigates root causes and prevents recurrence. Incident management only restores service. Change management handles modifications. Configuration management tracks assets.

Q3. A change to a firewall rule caused a payroll outage. Which control would MOST likely have prevented this?
A. More frequent backups
B. Change Advisory Board (CAB) review before implementation
C. Stronger firewall hardware
D. Additional firewall rules
Reveal Answer

✓ Correct: B

CAB review ensures changes are assessed for impact, approved, and have rollback plans before implementation. A is about recovery, not prevention. C and D don't address the process failure.

The process is broken. But what about the data sitting inside the payroll database?

Part A Section 4.5

Database Management

D4-05 — Data Cathedral

Database Management Systems

👩‍💼

The payroll data is intact — the database wasn’t the problem. But querying it requires restarting the DBMS, which takes 40 minutes. Alex asks when the last database health check was performed. “We’d know if something was wrong,” says the DBA. Alex writes: reactive posture, no proactive monitoring or integrity checks.

A cathedral-like library with monks normalizing records, concurrency control gates, and a glowing DBMS engine
Normalization Levels

1NF — Eliminate repeating groups: atomic values only

2NF — Remove partial dependencies on composite keys

3NF — Remove transitive dependencies: every non-key attribute depends only on the primary key

Concurrency Controls

  • Locking — Prevents simultaneous updates
  • Deadlock — Two processes waiting for each other
  • Timestamping — Orders transactions by time
  • Optimistic — Allow all, check at commit
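Optimistic concurrency is the least intuitive of these, so here is a toy sketch of the idea (the `Record` class and salary values are invented for illustration, not a real DBMS mechanism): every writer notes the version it read, and the commit is rejected if another writer got there first.

```python
class VersionConflict(Exception):
    """Raised when a writer's snapshot is stale at commit time."""

class Record:
    """Toy record using optimistic concurrency control: no locks are
    taken; conflicts are detected only at commit."""
    def __init__(self, value):
        self.value = value
        self.version = 0

    def commit(self, new_value, read_version):
        if read_version != self.version:   # someone else committed first
            raise VersionConflict("stale read; re-read and retry")
        self.value = new_value
        self.version += 1

salary = Record(50_000)
v = salary.version             # transactions A and B both read version 0
salary.commit(52_000, v)       # A commits; version is now 1
try:
    salary.commit(51_000, v)   # B's commit is rejected at check time
except VersionConflict:
    conflict_detected = True
```

Contrast with locking, which would have blocked B up front instead of rejecting it at commit.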

Data Integrity

  • Referential — Foreign keys match primary keys
  • Entity — Primary keys are unique, not null
  • Domain — Values within valid range
  • ACID — Atomicity, Consistency, Isolation, Durability
Mnemonic — ACID Properties

Atomicity — all or nothing. Consistency — valid state before and after. Isolation — transactions don’t interfere. Durability — once committed, it’s permanent. Think of a bank transfer: it must be all four or your money vanishes.
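The "all or nothing" of atomicity is easy to see in code. A minimal sketch using Python's built-in sqlite3 module (the accounts table and the simulated crash are invented for the example):

```python
import sqlite3

# Toy bank in memory: a transfer must debit and credit together.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO accounts VALUES (?, ?)",
                [("alice", 100), ("bob", 50)])
con.commit()

def transfer(con, src, dst, amount, fail_midway=False):
    """Debit src and credit dst inside one transaction; 'with con'
    commits on success and rolls back if anything raises."""
    with con:
        con.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                    (amount, src))
        if fail_midway:
            raise RuntimeError("simulated crash between debit and credit")
        con.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                    (amount, dst))

try:
    transfer(con, "alice", "bob", 30, fail_midway=True)
except RuntimeError:
    pass  # the rollback has already undone the debit

# Atomicity held: the half-finished transfer left both balances untouched
balances = dict(con.execute("SELECT name, balance FROM accounts"))
```

Without the transaction, the crash would have left alice debited and bob never credited — exactly the vanishing money the mnemonic warns about.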

Key Exam Tip

Know ACID properties cold — especially Atomicity (all or nothing). For normalization, 3NF is the exam standard: “every non-key attribute depends on the key, the whole key, and nothing but the key.” Referential integrity = foreign key validity.

📰 Real World

Equifax 2017 — database activity monitoring was misconfigured, generating no alerts for 78 days while 147M records were exfiltrated.

✏️ TEST YOURSELF
Q1. A DBA states “We’d know if something was wrong” regarding database health. What does this attitude MOST indicate?
A. Confidence in the DBMS product
B. A reactive rather than proactive monitoring posture
C. Adequate knowledge of database administration
D. That automated alerts are functioning properly
Reveal Answer

✓ Correct: B

The statement reveals reliance on reactive detection rather than proactive health checks, integrity validation, and monitoring. Without scheduled checks, problems may go undetected until they cause outages.

Q2. An IS auditor discovers that a database administrator has direct write access to production financial tables and no audit trail exists for DBA activities. What is the MOST significant risk?
A. Database performance may be degraded by DBA queries
B. The DBA could modify financial data without detection
C. The database schema may become inconsistent
D. Backup procedures may not capture DBA changes
Reveal Answer

✓ Correct: B

Direct write access combined with no audit trail creates a fraud risk — the DBA could alter financial records with no way to detect the changes. This is a critical segregation-of-duties and detective control failure. Performance (A) is operational. Schema issues (C) are a technical concern. Backup gaps (D) are secondary to the data integrity risk.

Q3 (TRAP). An IS auditor finds that a customer database stores the same customer address in five different tables, and discrepancies exist between them. An analyst suggests normalising the database to fix this. What should the auditor's PRIMARY concern be?
A. Normalisation will slow down query performance
B. The data integrity issues caused by the current redundancy
C. The cost of redesigning the database
D. Whether the development team has normalisation skills
Reveal Answer

✓ Correct: B

The auditor's primary concern is the data integrity risk from redundant, inconsistent data — this is what drives the need for action. The trap is A — while normalisation can affect performance, the auditor should focus on the data integrity risk first. Cost (C) and team skills (D) are implementation considerations, not the auditor's primary concern.

The database is fine. But is anyone actually watching the system’s vital signs?

Part A Section 4.6

System Performance Monitoring

D4-06 — Observatory Tower

Performance Monitoring & Capacity Planning

👩‍💼

Alex asks to see the performance dashboard. There isn’t one. The team uses command-line tools when something feels slow. What’s the payroll system’s SLA for availability? Nobody knows. It was never defined. No baseline metrics, no capacity planning, no thresholds. Finding number six: no performance monitoring framework.

An observatory tower with operators watching performance dashboards, capacity meters, and a telescope pointed at future needs
📊

Key Performance Metrics

  • CPU utilization — processor workload %
  • Memory usage — RAM consumption
  • I/O throughput — disk read/write speed
  • Network latency — response delay
  • Transaction rate — TPS (transactions/sec)
🔭

Capacity Planning

  • Forecast future resource needs
  • Trend analysis on historical data
  • Model “what-if” scenarios
  • Balance cost vs. performance
  • Prevent bottlenecks proactively
Service Level Agreements (SLAs)

Uptime / Availability

99.9% = 8.77 hours downtime/year. 99.99% = 52.56 minutes/year. Know the “nines.”

Response Time

Maximum acceptable delay for user transactions. Measured at the user interface level.

Throughput

Volume of work completed per unit time. Must handle peak loads without degradation.
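The "nines" arithmetic above is worth being able to reproduce from scratch. A small sketch (8,766 hours assumes a 365.25-day year, which is why 99.9% is often quoted as 8.77 hours rather than 8.76):

```python
def max_downtime_hours(availability_pct, hours_per_year=8766):
    """Maximum allowed downtime per year for a given availability
    percentage, assuming a 365.25-day year."""
    return hours_per_year * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99):
    hours = max_downtime_hours(pct)
    print(f"{pct}% uptime -> {hours:.2f} h/year ({hours * 60:.1f} minutes)")
```

Running this reproduces the exam figures: 99% allows about 3.65 days, 99.9% about 8.77 hours, 99.99% about 52.6 minutes per year.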

Key Exam Tip

SLAs define the minimum acceptable service levels. An auditor should verify that SLAs are documented, measurable, monitored, and that penalties for non-compliance are defined. Capacity planning should be proactive, not reactive.

📰 Real World

Delta Air Lines 2016 — power control module failure cascaded for hours because no real-time monitoring thresholds triggered an alert. Cost: $150M+.

✏️ TEST YOURSELF
Q1. An IS auditor finds that no SLA exists for a critical payroll system. What is the MOST significant consequence?
A. Users cannot complain about downtime
B. There is no agreed baseline to measure performance or availability against
C. The system will perform poorly
D. IT staff have no motivation to fix issues
Reveal Answer

✓ Correct: B

Without an SLA, there is no defined acceptable level of service, making it impossible to measure compliance, trigger escalations, or hold anyone accountable. A, C, and D are secondary effects.

Q2. Which capacity planning approach is MOST appropriate for preventing future performance issues?
A. Adding resources after users complain
B. Proactive trend analysis of historical performance data
C. Purchasing the most powerful hardware available
D. Limiting the number of users on the system
Reveal Answer

✓ Correct: B

Proactive trend analysis uses historical data to predict future needs before problems occur. A is reactive. C is wasteful without data-driven needs. D restricts business operations.

Q3. A system has a 99.9% uptime SLA. What is the maximum acceptable annual downtime?
A. 52.56 minutes
B. 5.26 hours
C. 8.77 hours
D. 3.65 days
Reveal Answer

✓ Correct: C

99.9% uptime allows 0.1% downtime. Using a 365.25-day year: 8,766 hours × 0.001 ≈ 8.77 hours per year. 99.99% ≈ 52.56 minutes. 99% ≈ 3.65 days.

Operations are unmonitored. Now Alex turns to the bigger question: what if the whole building goes down?

Part B Section 4.7

Business Continuity Planning (BCP)

D4-07 — Medieval Town Prepares

BCP Lifecycle & Business Impact Analysis

👩‍💼

Alex asks to speak with the BCP coordinator. He’s on leave. The deputy doesn’t know where the BCP document is stored. When it’s finally located on a shared drive, it references a hot site whose contract was not renewed eighteen months ago. The BCP was last tested two years ago. Alex’s findings are multiplying.

A medieval town preparing for a storm with scribes analyzing critical assets, a plan-test-maintain-improve wheel, and scouts on watchtowers
BCP Lifecycle

1. Project Initiation — senior management sponsorship, scope definition, team formation
2. BIA — Business Impact Analysis: identify critical functions & maximum tolerable downtime
3. Strategy Development — select recovery strategies for each critical function
4. Plan Development — document procedures, roles, responsibilities, contacts
5. Testing & Training — exercise the plan, train staff, validate recovery
6. Maintenance — regular updates, reviews, and improvements

Business Impact Analysis (BIA) — The Foundation

The BIA is the first major step after project initiation. It identifies:

  • Critical business processes
  • Maximum Tolerable Downtime (MTD)
  • RTO & RPO for each process
  • Resource dependencies
  • Financial impact of downtime
  • Operational impact over time
BCP Test Types (Least to Most Disruptive)

1. Checklist — review the plan on paper
2. Walkthrough — tabletop discussion of the plan
3. Simulation — rehearse specific scenarios
4. Parallel — activate recovery site alongside primary
5. Full Interruption — shut down primary, switch to backup (highest risk)

Key Exam Tip

BIA must be completed BEFORE developing recovery strategies. Senior management must sponsor the BCP — this is non-negotiable. A full interruption test is the most realistic overall, but a parallel test is the most realistic option that does not disrupt primary operations.

📰 Real World

Hurricane Sandy 2012 — several financial institutions found their BCP alternate sites were in the same flood zone as primary sites.

✏️ TEST YOURSELF
Q1. An IS auditor discovers that a BCP references a hot site whose contract expired 18 months ago. What does this BEST indicate?
A. The hot site is no longer needed
B. The BCP maintenance process has failed
C. The BCP was never properly approved
D. The hot site vendor went out of business
Reveal Answer

✓ Correct: B

An expired recovery site contract in the BCP indicates that the plan is not being maintained and updated. Regular review should catch such gaps. A, C, and D are assumptions not supported by the facts.

Q2. Which BCP test type provides the MOST realistic validation without disrupting primary operations?
A. Checklist review
B. Tabletop walkthrough
C. Parallel test
D. Full interruption test
Reveal Answer

✓ Correct: C

A parallel test activates the recovery site while primary operations continue, providing realistic validation without risk. Full interruption is most realistic but disrupts operations. Checklist and walkthrough are theoretical exercises.

Q3. What must be completed BEFORE developing BCP recovery strategies?
A. Selection of a hot site vendor
B. Business Impact Analysis (BIA)
C. Full interruption test
D. Staff training on recovery procedures
Reveal Answer

✓ Correct: B

The BIA identifies critical functions, MTD, RTO, and RPO — all of which drive recovery strategy selection. You cannot choose a strategy without knowing what you're protecting and how quickly it must recover.

The BCP is a paper exercise. What about the actual disaster recovery specifics?

Part B Section 4.8

Disaster Recovery Planning (DRP)

D4-08 — The Hourglass

RTO, RPO & Recovery Sites

👩‍💼

Alex asks the critical questions. RTO for payroll? Nobody knows. RPO? Nobody knows either. Backups run daily at 3am. It’s now 11am. If they had to restore, that’s 8 hours of data loss. For a payroll system processing 2,400 employees, that’s unacceptable — but nobody ever defined what “acceptable” means.

A dramatic split hourglass with RPO showing data falling backwards toward last backup, RTO showing recovery teams racing forward, hot/warm/cold sites in background
The Recovery Equation
RPO ← ⚡ Disaster ⚡ → RTO

RPO — Recovery Point Objective

“How much data can we lose?”

  • Looks BACKWARD from disaster
  • Maximum tolerable data loss
  • Determines backup frequency
  • RPO = 4 hrs → backup every 4 hrs

RTO — Recovery Time Objective

“How fast must we recover?”

  • Looks FORWARD from disaster
  • Maximum tolerable downtime
  • Determines recovery site type
  • RTO = 0 → need a hot site
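The RPO arithmetic in Alex's scenario can be sketched directly (the dates are invented; only the 3am/11am times come from the story):

```python
from datetime import datetime

def data_loss_exposure_hours(last_backup, failure_time):
    """Hours of data at risk if the system fails now: everything since
    the last good backup would be lost or need re-entry."""
    return (failure_time - last_backup).total_seconds() / 3600

# Alex's payroll scenario: backup at 3am, failure at 11am, RPO of 4 hours
exposure = data_loss_exposure_hours(datetime(2024, 5, 3, 3, 0),
                                    datetime(2024, 5, 3, 11, 0))
rpo_hours = 4
rpo_breached = exposure > rpo_hours   # 8 hours exposed vs 4 allowed
```

The worst-case exposure equals the backup interval, which is why a 4-hour RPO demands backups at least every 4 hours.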
Recovery Site Types

  • Hardware — Hot: fully equipped; Warm: partially equipped; Cold: empty shell
  • Data — Hot: near real-time; Warm: recent backup; Cold: none
  • Ready in — Hot: minutes to hours; Warm: hours to days; Cold: days to weeks
  • Cost — Hot: highest; Warm: moderate; Cold: lowest
  • Best for — Hot: zero downtime tolerance; Warm: hours-level RTO; Cold: days/weeks RTO acceptable
Mnemonic — RPO vs. RTO

RPO = Recovery Point = the point in the past you rewind to (data loss). RTO = Recovery Time = time to get back up (downtime). Point looks back, Time looks forward.

Key Exam Tip

This is the most heavily tested concept in Domain 4. RPO drives backup strategy, RTO drives recovery site selection. A hot site is ready immediately but is the most expensive. Also know: reciprocal agreements are least reliable because they depend on another organization’s capacity.

📰 Real World

2011 Thailand floods destroyed hard drive plants. Companies without documented RTO/RPO had no criteria for activating alternate suppliers.

✏️ TEST YOURSELF
Q1. A company’s RPO for its payroll system is 4 hours. Daily backups run at 3am. A failure occurs at 11am. What is the data loss exposure?
A. 4 hours
B. 8 hours
C. 11 hours
D. 24 hours
Reveal Answer

✓ Correct: B

The last backup was at 3am. Failure at 11am means 8 hours of data since the last backup. This exceeds the 4-hour RPO, indicating the backup frequency is inadequate for the defined RPO.

Q2. An organization requires near-zero downtime for its trading platform. Which recovery site type is MOST appropriate?
A. Cold site
B. Warm site
C. Hot site
D. Reciprocal agreement
Reveal Answer

✓ Correct: C

A hot site has fully configured hardware with near real-time data replication, enabling recovery in minutes to hours. Cold sites take days/weeks. Warm sites take hours/days. Reciprocal agreements are the least reliable option.

Q3 (TRAP). A company's payroll system has an RPO of 4 hours and performs daily backups at midnight. A system failure occurs at 3pm. What is the MOST significant finding?
A. The backup schedule meets the RPO requirement
B. Up to 15 hours of data could be lost, far exceeding the 4-hour RPO
C. The RTO has been exceeded
D. Daily backups are industry standard and therefore acceptable
Reveal Answer

✓ Correct: B

With a midnight backup and a 3pm failure, up to 15 hours of data could be lost. The RPO of 4 hours means the organisation can only tolerate losing 4 hours of data — the backup frequency is grossly inadequate. The trap is A — daily backups do NOT meet a 4-hour RPO. C is wrong because RTO measures recovery time, not data loss. D appeals to a false sense of "industry standard."

Recovery objectives are undefined. But what about the actual backups — do they even work?

Part B Section 4.9

Backup & Recovery

D4-09 — The Library Copyists

Backup Types & Strategies

👩‍💼

Last backup: last night at 3am. Last backup test: 14 months ago. The team can confirm backups complete, but nobody has verified that a restore actually produces a working system. Alex notes this separately — it’s a finding that stands on its own. A backup you’ve never tested is a hope, not a control.

Three filing cabinet drawers labeled Full, Incremental, and Differential with a technician deciding which drawer to open during recovery
📚

Full Backup

  • Copies ALL data every time
  • Slowest to create
  • Fastest to restore
  • Resets archive bit
📗

Incremental

  • Only data changed since LAST backup (any type)
  • Fastest to create
  • Slowest to restore (need all incrementals)
  • Resets archive bit
📙

Differential

  • Data changed since last FULL backup
  • Grows larger each day
  • Restore = full + latest differential
  • Does NOT reset archive bit
Restore Speed Comparison

  • Full — backup: slowest (copies everything); restore: fastest (one media set)
  • Incremental — backup: fastest (only changes); restore: slowest (full + all incrementals)
  • Differential — backup: medium (grows daily); restore: medium (full + last differential)
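The restore-set arithmetic behind this comparison can be sketched in a few lines. This is an illustrative model only (days numbered 0 onward, one non-full backup kind per cycle, as in the comparison above):

```python
def restore_chain(backups):
    """Given a date-ordered list of (day, kind) backups, where kind is
    'full', 'incr', or 'diff', return the media sets needed to restore
    to the latest point."""
    last_full = max(i for i, (_, kind) in enumerate(backups) if kind == "full")
    if backups[-1][1] == "full":
        return [backups[-1]]                      # one set
    if backups[-1][1] == "diff":
        return [backups[last_full], backups[-1]]  # full + latest differential
    return backups[last_full:]                    # full + every incremental

week_incr = [(0, "full"), (1, "incr"), (2, "incr"), (3, "incr")]
week_diff = [(0, "full"), (1, "diff"), (2, "diff"), (3, "diff")]
# After day 3: the incremental scheme needs 4 sets, the differential only 2
```

This is the exam trade-off in miniature: incrementals are cheap to take but every one of them is needed at restore time, while a differential scheme never needs more than two sets.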

Off-site Storage & Media Rotation

  • Grandfather-Father-Son (GFS) — daily/weekly/monthly rotation
  • Off-site vaulting — store backups at a separate location
  • Electronic vaulting — batch transfer to off-site
  • Remote journaling — real-time transaction log transfer
Key Exam Tip

Key distinction: Incremental backs up since the LAST backup (any type), Differential backs up since the last FULL backup. For recovery, differential is faster than incremental (need only 2 sets vs. potentially many). Off-site backups are essential — same building = same disaster.

📰 Real World

GitLab 2017 — the primary database was accidentally deleted. The backup system had been failing silently for months, and only 5GB of 300GB was recoverable. Regular restore testing, not just successful completion, is what matters.

✏️ TEST YOURSELF
Q1. An IS auditor verifies that backups run nightly. What additional evidence BEST confirms backup adequacy?
A. Backup logs showing completion
B. Evidence of successful restoration tests
C. Offsite storage receipts
D. Backup software vendor certification
Reveal Answer

✓ Correct: B

Successful restoration tests prove that backups actually work and can produce a usable system. Completion logs only confirm the backup ran, not that data is recoverable. Offsite storage and vendor certification don't verify recoverability.

Q2. Which backup type requires the MOST media sets for a full restore?
A. Full backup
B. Differential backup
C. Incremental backup
D. Mirror backup
Reveal Answer

✓ Correct: C

Incremental restore requires the last full backup plus every incremental since then (potentially many sets). Differential needs only full + last differential (2 sets). Full needs only one set.

Q3. A backup system has been completing nightly for 14 months without a restoration test. What is the PRIMARY risk?
A. Backup media may have degraded
B. Backups may not be restorable, and the organization won’t know until a real disaster
C. Storage costs are being wasted
D. The backup schedule may be inefficient
Reveal Answer

✓ Correct: B

Without restoration testing, there is no assurance that backups produce working systems. GitLab's 2017 incident proved that running backups and having working backups are very different things.

Backups exist but are untested. Three hours in — the team finally finds the fix.

Part B Section 4.10

Incident Management

D4-10 — Fire Station Response

Incident Response Lifecycle

👩‍💼

Three hours in, the misconfigured firewall rule is identified and reversed. The payroll portal comes back online. The team relaxes. Alex does not. She has been watching the entire response unfold without a single documented step. Her finding count keeps climbing, and the incident itself just became finding material for how not to run incident response.

A fire station with five sequential bays showing Detection, Containment, Eradication, Recovery, and Lessons Learned
5-Phase Incident Response Process
1
Detection

Identify the incident through monitoring, alerts, or reports

2
Containment

Limit the damage — isolate affected systems

3
Eradication

Remove the root cause — clean malware, patch vulnerability

4
Recovery

Restore systems to normal operations, verify integrity

5
Lessons Learned

Post-mortem review — what went wrong, what to improve

Mnemonic — DCERL

Remember the order with: “Don’t Cry, Every Recovery Leads” to better security.

“Detect → Contain → Eradicate → Recover → Learn”

Incident Response Team (IRT)

  • Pre-designated team with clear roles
  • Must have management authority
  • Cross-functional (IT, legal, HR, PR)
  • 24/7 contact information maintained
  • Regular training and exercises

Key Documentation

  • Incident log with timestamps
  • Chain of custody for evidence
  • Classification & prioritization
  • Communication plan (internal & external)
  • Post-incident report
Key Exam Tip

After detection, the FIRST priority is always containment — stop the bleeding before investigating. “Lessons learned” is the most frequently skipped step in practice — the exam considers it mandatory. Preserve evidence chain of custody for potential legal proceedings.

📰 Real World

Target 2013 — FireEye alerts were dismissed. Breach persisted 16 days. Detection without response is not security.

✏️ TEST YOURSELF
Q1. After detecting a security incident, what should be the IS auditor’s FIRST recommended action?
A. Begin forensic analysis
B. Contain the incident to prevent further damage
C. Notify law enforcement
D. Document the root cause
Reveal Answer

✓ Correct: B

Containment is always the first priority after detection — limit the blast radius. Forensics and root cause come after containment. Law enforcement notification depends on the incident type and may not be the first step.

Q2. Which phase of incident response is MOST commonly skipped in practice but considered mandatory by ISACA?
A. Detection
B. Containment
C. Eradication
D. Lessons learned
Reveal Answer

✓ Correct: D

Organizations often rush back to normal operations and skip the post-incident review. ISACA considers lessons learned essential for improving the incident response process and preventing recurrence.

Q3. An incident response team resolves a payroll outage in 3 hours but documents nothing during the process. What is the GREATEST risk?
A. The team cannot be rewarded for their work
B. No audit trail exists for analysis, pattern detection, or regulatory compliance
C. The team will forget the technical steps
D. Management will not know the outage occurred
Reveal Answer

✓ Correct: B

Without documentation, there is no evidence for audit, no data for trend analysis, no compliance evidence, and no basis for lessons learned. The other options are secondary consequences.

The system is back. But Alex has noticed something troubling on a Finance laptop.

Part C Section 4.11

End-User Computing & Shadow IT

D4-11 — Light vs. Shadow

End-User Computing Risks & Controls

👩‍💼

While the payroll portal was down, Finance pulled out their backup plan: a parallel payroll spreadsheet maintained “just in case.” It’s on a personal laptop. Unsecured. Unencrypted. It contains salary data for all 2,400 employees. The Finance manager doesn’t see the problem. Alex sees finding number eleven — and possibly the worst one yet.

Split office scene — bright side with approved IT systems and guards, shadowy side with unauthorized devices and personal cloud apps, auditor with flashlight in the middle
⚠️

Shadow IT Risks

  • Unpatched, unmanaged devices
  • Data stored outside corporate control
  • No backup or recovery
  • Compliance violations (GDPR, HIPAA)
  • Unauthorized cloud services (SaaS)
  • Spreadsheet-based “applications”
🛡️

Recommended Controls

  • End-user computing policy
  • Application whitelisting
  • CASB (Cloud Access Security Broker)
  • DLP (Data Loss Prevention)
  • Regular discovery and inventory
  • User awareness training

EUC Application Risks

End-user developed applications (spreadsheets, Access databases, macros) are high-risk because they typically lack:

  • Change management controls
  • Version control
  • Input validation
  • Documentation
  • Access controls
  • Audit trails
Key Exam Tip

The greatest risk of end-user computing is lack of IT controls — no change management, no backups, no audit trails. Shadow IT bypasses all corporate governance. The auditor’s first recommendation should be a comprehensive EUC policy, not banning personal devices outright.

📰 Real World

UK government 2012 — fined £150,000 after an employee stored 24,000 people’s personal data on an unencrypted personal laptop that was stolen.

✏️ TEST YOURSELF
Q1. An employee maintains a payroll spreadsheet with salary data for 2,400 employees on a personal laptop. What is the GREATEST risk?
A. The spreadsheet may contain calculation errors
B. Sensitive data is outside corporate security controls and subject to loss or theft
C. The employee is doing unnecessary work
D. The IT department does not know about the spreadsheet
Reveal Answer

✓ Correct: B

Sensitive personal data on an unmanaged, unencrypted personal device is a critical data protection and compliance risk. Loss or theft could result in a reportable breach. A and C are minor. D is a symptom of the larger control gap.

Q2. What is the BEST first step an IS auditor should recommend to address shadow IT?
A. Block all personal devices immediately
B. Implement a comprehensive end-user computing policy
C. Install monitoring software on all personal devices
D. Terminate employees using unauthorized tools
Reveal Answer

✓ Correct: B

A comprehensive EUC policy establishes governance without being punitive. Blocking all devices may disrupt operations. Monitoring personal devices raises privacy issues. Termination is disproportionate.

Q3. Which control would MOST effectively prevent sensitive data from being stored on unauthorized devices?
A. User awareness training
B. Data Loss Prevention (DLP) technology
C. Stronger passwords on corporate systems
D. More frequent IT audits
Reveal Answer

✓ Correct: B

DLP technology can detect and prevent sensitive data from being copied to unauthorized locations, including personal devices and cloud services. Training is helpful but not enforceable. Passwords and audits don't directly address the issue.

Shadow IT is everywhere. Could moving to the cloud fix all of this?

Part C Section 4.12

Cloud Computing

D4-12 — Cloud Kingdom

Cloud Service & Deployment Models

👩‍💼

Meridian is considering moving payroll to the cloud. The CFO corners Alex: “Would this have happened in the cloud?” Alex pauses. “Possibly. Just differently. The firewall misconfiguration? That becomes a security group misconfiguration. The missing monitoring? That’s still your responsibility. The cloud changes where the problems live, not whether they exist.”

A three-layered cloud kingdom — IaaS at bottom with servers, PaaS in middle with development tools, SaaS on top with happy users, deployment model castles labeled Public Private Hybrid Community
Service Models — Shared Responsibility
🏗️

IaaS

Infrastructure as a Service

  • Provider manages: hardware, networking, storage
  • Customer manages: OS, apps, data, middleware
  • Most customer responsibility
  • Example: AWS EC2, Azure VMs
🔧

PaaS

Platform as a Service

  • Provider manages: infra + OS + runtime
  • Customer manages: apps and data
  • Shared responsibility
  • Example: Heroku, Google App Engine

SaaS

Software as a Service

  • Provider manages: everything
  • Customer manages: data and access
  • Least customer responsibility
  • Example: Gmail, Salesforce, Office 365
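The responsibility split across the three models can be summarized as a lookup table. This is a simplification for study purposes; the exact boundary is defined by each provider's contract and documentation:

```python
# Customer-managed layers by service model (simplified sketch; real
# provider contracts define the exact boundary).
CUSTOMER_MANAGES = {
    "IaaS": {"OS", "middleware", "applications", "data", "access"},
    "PaaS": {"applications", "data", "access"},
    "SaaS": {"data", "access"},
}

def customer_responsible(model: str, layer: str) -> bool:
    return layer in CUSTOMER_MANAGES[model]

print(customer_responsible("IaaS", "OS"))    # True: customer patches the OS
print(customer_responsible("SaaS", "OS"))    # False: the provider's job
print(customer_responsible("SaaS", "data"))  # True: data is always the customer's
```

Note that "data" and "access" appear in every row: no service model ever removes the customer's responsibility for its own data and identities, which is the core exam point.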
Deployment Models

☁️ Public Cloud

Shared infrastructure, multi-tenant. Lowest cost, least control. Managed by third party.

🏰 Private Cloud

Dedicated to one organization. Most control, highest cost. Can be on-premises or hosted.

🔗 Hybrid Cloud

Combination of public and private. Sensitive data on private, burst to public for capacity.

🏘️ Community Cloud

Shared by organizations with common concerns (e.g., healthcare, government). Cost shared.

Key Cloud Audit Concerns

  • Data sovereignty — where is data stored?
  • Right to audit — contractual access
  • Vendor lock-in — portability risks
  • Multi-tenancy risks — data isolation
  • SOC 2 reports — assurance from provider
  • Incident notification — SLA obligations
Mnemonic — Cloud Responsibility Ladder

Think of a pizza analogy: IaaS = you buy ingredients and cook (most work). PaaS = you get a ready kitchen (just cook your recipe). SaaS = you order delivery (just eat). The more “as a Service,” the less you manage.

Key Exam Tip

Even when using cloud services, the ORGANIZATION remains responsible for its data. The cloud provider is responsible for infrastructure security, but the customer must still protect data and manage access. An auditor should request SOC 2 Type II reports from the cloud provider as primary assurance evidence.

📰 Real World

Capital One 2019 — roughly 100M customer records exposed from AWS-hosted storage. The misconfigured WAF was Capital One’s responsibility under the shared responsibility model, not AWS’s.

✏️ TEST YOURSELF
Q1. A CFO asks whether moving payroll to the cloud would eliminate operational risks. What is the BEST response?
A. Yes, cloud providers handle all security
B. Cloud transfers some risks, but significant responsibilities remain with the organization under the shared responsibility model
C. No, cloud is less secure than on-premises
D. Only if they choose a private cloud
Reveal Answer

✓ Correct: B

The shared responsibility model means the organization always retains responsibility for data, access management, and application-level security. Cloud changes where risks live, not whether they exist.

Q2. Which cloud service model places the MOST security responsibility on the customer?
A. SaaS
B. PaaS
C. IaaS
D. All models share equal responsibility
Reveal Answer

✓ Correct: C

In IaaS, the customer manages OS, applications, data, and middleware. In PaaS, the customer manages apps and data. In SaaS, only data and access. IaaS = most customer responsibility.

Q3. An IS auditor needs assurance about a cloud provider’s controls. What is the MOST appropriate evidence to request?
A. The provider’s marketing materials
B. A SOC 2 Type II report
C. An email confirmation from the sales team
D. The provider’s ISO certification number
Reveal Answer

✓ Correct: B

SOC 2 Type II reports provide independent assurance over a period of time about the effectiveness of the provider’s controls. Marketing materials and emails are not audit evidence. ISO certification confirms a framework exists but not effectiveness over time.

🖥️

The payroll system is back. Alex has 11 findings.

By 1pm, everything is working again. The ops team is relieved. Alex is still writing. Three hours of live incident taught her more about Meridian’s IT operations than three weeks of documentation review. The systems work — mostly. The controls around them are full of gaps. And somewhere in Finance, a laptop with 2,400 salary records is sitting unlocked on a desk. She adds it as finding number 12.

✓ IT Operations ✓ Infrastructure ✓ Network Management ✓ ITIL Service Management ✓ Database Management ✓ Performance Monitoring ✓ Business Continuity ✓ Disaster Recovery ✓ Backup & Recovery ✓ Incident Response ✓ End-User Computing ✓ Cloud Computing
Continue to Domain 5

⚠️ Top 10 Exam Traps — Domain 4

1
❌ “Hot site = instant recovery”
✓ Hot site has pre-configured equipment but still requires current backups to be loaded — not instantaneous
2
❌ “RPO = how long recovery takes”
✓ RPO = maximum acceptable DATA LOSS (age of restored data). RTO = how long RECOVERY takes
3
❌ “Full backup is always best”
✓ Full + incremental is often more practical. Best strategy depends on RPO
4
❌ “Incident management fixes root cause”
✓ Incident management restores SERVICE. Problem management finds ROOT CAUSE
5
❌ “Cloud eliminates operational risk”
✓ Cloud transfers some risks; under shared responsibility, significant risks remain with the organization
6
❌ “Tested BCP = good BCP”
✓ Only if tested against realistic scenarios, kept current, includes all critical systems
7
❌ “Normalisation improves performance”
✓ Normalisation improves DATA INTEGRITY — it can hurt performance (more joins). Denormalisation is used for performance
8
❌ “Operations team owns BCP”
✓ BCP ownership = SENIOR MANAGEMENT. IT operations supports recovery
9
❌ “Encryption makes backups secure”
✓ Encryption + physical security + off-site storage + access control. All required
10
❌ “IDS and IPS both prevent intrusions”
✓ IDS detects and alerts (detective). IPS can block (preventive + detective)