Human Emulation Testing for Missile Defense Simulators:
AI-Driven Stress and Reaction Modeling

Wez | Human-Test-Sim Project
April 2026
Keywords: missile defense simulation, human emulation testing, stress modeling, reaction time, AI player, NORAD War Simulator, automated testing, ballistic physics

1 Abstract

Testing missile defense simulators presents a unique challenge: the systems are designed for human operators under extreme stress, yet traditional automated testing frameworks operate with machine-speed precision that bears no resemblance to real human performance. This paper presents Human-Test-Sim, an open-source framework that emulates human cognitive and motor behavior—including realistic reaction times, stress-induced degradation, and decision-making uncertainty—to drive automated testing of the NORAD War Simulator and, more broadly, any interactive application. We describe the framework’s core modules: a ballistics physics engine, a multi-layer defense management model (GBI, THAAD, Patriot), a satellite detection simulation, and an AI player that reproduces human-like gameplay patterns under configurable stress conditions. A reusable human-interaction abstraction layer (action, interface, scenario) extends the framework beyond its defense-simulation origins to generic GUI and CLI application testing. We present our test methodology, including unit, integration, and scenario-driven tests with video recording, and validate the AI player’s behavior against published human-factors data for reaction time and decision accuracy under stress.

2 Introduction

Missile defense command-and-control systems occupy a singular niche in the software landscape: they must perform flawlessly under conditions of extreme human stress, where cognitive bandwidth narrows, motor precision degrades, and decision latency increases dramatically. The NORAD War Simulator (NWS) models this environment, providing a game-like interface for training and analysis of strategic missile defense scenarios. Yet the software itself—like any complex interactive system—requires rigorous automated testing to ensure correctness, stability, and performance.

Traditional automated testing frameworks excel at verifying functional correctness: they click buttons, enter text, and assert expected outcomes with sub-millisecond precision. That precision is exactly the problem when testing systems designed for human operators. A test that launches an interceptor 50 ms after threat detection says nothing about whether the system remains usable when a human operator needs 2.4 s to perceive, decide, and act under stress. Conversely, manual testing with human subjects is slow, expensive, non-repeatable, and, in the context of defense scenarios, impractical to scale.

This paper introduces Human-Test-Sim, a framework that bridges this gap by embedding human behavioral models directly into the automated testing pipeline. Rather than replacing human testers, Human-Test-Sim emulates them: it introduces probabilistic reaction delays, stress-modulated decision quality, and realistic error patterns drawn from decades of human-factors research. The framework originated as a test suite for the NORAD War Simulator but has since been generalized into a reusable human-interaction testing layer applicable to any interactive application.

The key contributions of this work are:

3 Human Behavior Modeling

The central thesis of Human-Test-Sim is that effective automated testing of human-facing systems must incorporate realistic models of human performance. This section details the three pillars of our behavioral model: reaction time distributions, stress-induced degradation, and decision-making under uncertainty.

3.1 Reaction Time Distributions

Human reaction time is not a fixed constant but a stochastic process with characteristic distributional properties. Simple reaction time (single stimulus, single response) follows an ex-Gaussian distribution—a convolution of a Gaussian and an exponential—with a modal value near 250 ms and a long right tail (Luce, 1986). Choice reaction time, relevant to defense scenarios where operators must select among multiple responses, follows Hick’s Law:

$$RT = a + b \cdot \log_2(n + 1)$$

where n is the number of stimulus–response alternatives, a is the base motor latency (~200 ms), and b is the information-processing rate (~150 ms/bit for trained operators).
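As a quick numerical illustration, the relation above can be evaluated directly. The sketch below uses the approximate constants quoted in the text (a = 200 ms, b = 150 ms/bit); the function name `hick_choice_rt_ms` is a hypothetical helper, not part of the framework's documented API.

```python
import math


def hick_choice_rt_ms(n_alternatives: int,
                      a_ms: float = 200.0,
                      b_ms_per_bit: float = 150.0) -> float:
    """Choice reaction time (ms) from RT = a + b * log2(n + 1)."""
    if n_alternatives < 1:
        raise ValueError("n_alternatives must be at least 1")
    return a_ms + b_ms_per_bit * math.log2(n_alternatives + 1)


# Each doubling of (n + 1) adds one bit, i.e. one increment of b.
for n in (1, 3, 7):
    print(f"n={n}: {hick_choice_rt_ms(n):.0f} ms")
```

With these constants the predicted times for 1, 3, and 7 alternatives are 350 ms, 500 ms, and 650 ms.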

In Human-Test-Sim, we model reaction time as a sample from a log-normal distribution parameterized by the operator’s skill level and current stress state. The log-normal captures the positive skew observed in empirical data while remaining computationally simple. The framework samples independently for perception latency, cognitive processing, and motor execution, summing them to produce the total response time:

$$RT_{\text{total}} = RT_{\text{perceive}} + RT_{\text{decide}} + RT_{\text{execute}}$$

Each component is drawn from $\text{Lognormal}(\mu_i, \sigma_i^2)$ with parameters calibrated to empirical data from Wickens and Hollands (2000). Table 1 summarizes the default parameters for a trained military operator under baseline conditions.

Table 1: Default reaction time parameters for a trained operator (baseline stress).
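| Component | $\mu$ (log ms) | $\sigma$ (log ms) | Median (ms) | P95 (ms) |
|---|---|---|---|---|
| Perception | 5.10 | 0.20 | 164 | 247 |
| Decision | 5.42 | 0.30 | 226 | 410 |
| Motor execution | 4.94 | 0.15 | 140 | 191 |
| Total | convolution | convolution | 530 | 848 |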
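The three-component sum can be sketched with the standard library alone. This is an illustration under stated assumptions, not the framework's implementation: the $(\mu_i, \sigma_i)$ pairs are copied from Table 1, while `BASELINE_PARAMS`, `sample_total_rt_ms`, and `component_median_ms` are hypothetical names.

```python
import math
import random

# Baseline (mu, sigma) in log-milliseconds, copied from Table 1.
BASELINE_PARAMS = {
    "perceive": (5.10, 0.20),
    "decide": (5.42, 0.30),
    "execute": (4.94, 0.15),
}


def sample_total_rt_ms(rng: random.Random, params=BASELINE_PARAMS) -> float:
    """One total response time: the sum of three independent log-normal draws."""
    return sum(rng.lognormvariate(mu, sigma) for mu, sigma in params.values())


def component_median_ms(mu: float) -> float:
    """The median of Lognormal(mu, sigma^2) is exp(mu), whatever sigma is."""
    return math.exp(mu)


# A fixed seed keeps the sketch reproducible.
rng = random.Random(42)
samples = sorted(sample_total_rt_ms(rng) for _ in range(20_000))
empirical_median = samples[len(samples) // 2]
```

Rounding `component_median_ms` for the three rows reproduces the Median column of Table 1 (164, 226, and 140 ms).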

3.2 Stress and Cognitive Load Modeling

Stress degrades human performance through multiple mechanisms: narrowing of attention (Easterbrook, 1959), speed–accuracy tradeoff shifts (Fitts, 1954), and working-memory capacity reduction (Baddeley, 1992). Human-Test-Sim models stress as a continuous variable $\sigma \in [0, 1]$ (unsubscripted, and distinct from the log-normal shape parameters $\sigma_i$ of Section 3.1), updated dynamically based on game-state events:

$$\sigma(t) = \lambda \cdot \sigma(t-1) + (1 - \lambda) \cdot f(\text{game events}, \text{DEFCON}, \text{threat density})$$

where λ is a temporal smoothing parameter (default 0.85) and f is a composite stress function that maps event salience, DEFCON level, and simultaneous threat count to an instantaneous stress level. The exponential smoothing captures the empirically observed persistence of stress: operators do not return to baseline instantly after a threat subsides.
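The persistence effect is easy to see numerically. The sketch below implements only the smoothing step; the composite function f is not specified here, so its output is passed in as a plain number, and clamping it to [0, 1] is an assumption. The names `update_stress` and `stress_trace` are hypothetical.

```python
from typing import Iterable, List


def update_stress(previous: float, instantaneous: float,
                  smoothing: float = 0.85) -> float:
    """One step of stress(t) = lambda * stress(t-1) + (1 - lambda) * f(...)."""
    if not 0.0 <= smoothing <= 1.0:
        raise ValueError("smoothing must lie in [0, 1]")
    # Assumption: the instantaneous stress f(...) is clamped to [0, 1].
    instantaneous = min(1.0, max(0.0, instantaneous))
    return smoothing * previous + (1.0 - smoothing) * instantaneous


def stress_trace(initial: float, inputs: Iterable[float],
                 smoothing: float = 0.85) -> List[float]:
    """Apply update_stress repeatedly and return every intermediate value."""
    trace = [initial]
    for value in inputs:
        trace.append(update_stress(trace[-1], value, smoothing))
    return trace


# The threat subsides: instantaneous stress is zero for ten updates.
decay = stress_trace(0.8, [0.0] * 10)
```

With the default $\lambda = 0.85$, a stress level of 0.8 decays geometrically and is still about 0.16 after ten quiet updates.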

Stress modulates performance through three channels:

  1. Reaction time inflation: Both $\mu_i$ and $\sigma_i$ of each log-normal reaction time component scale with $(1 + \alpha \cdot \sigma)$, where $\sigma$ is the stress level and $\alpha \approx 0.5$ for trained operators.
  2. Error rate increase: The probability of selecting a suboptimal action scales as $P(\text{error}) = P_0 + \beta \cdot \sigma$, with $\beta$ calibrated to empirical error rates from combat simulations (Morrison & Fletcher, 2016).
  3. Attention narrowing: The effective number of stimulus alternatives $n$ in Hick's Law is reduced, paradoxically slowing choice reaction time for peripheral stimuli while accelerating it for the attended focus.
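The first two channels can be written down directly from their formulas. This is a minimal sketch that takes the scaling rules literally as stated; the function names are hypothetical, the baseline error rate `p0` is a free parameter that the text does not quantify, and clamping the error probability to [0, 1] is an added assumption. Attention narrowing is omitted because no formula is given for it.

```python
from typing import Tuple


def inflate_rt_params(mu: float, sigma: float, stress: float,
                      alpha: float = 0.5) -> Tuple[float, float]:
    """Scale both log-normal parameters by (1 + alpha * stress)."""
    if not 0.0 <= stress <= 1.0:
        raise ValueError("stress must lie in [0, 1]")
    factor = 1.0 + alpha * stress
    return mu * factor, sigma * factor


def error_probability(stress: float, p0: float, beta: float) -> float:
    """P(error) = p0 + beta * stress, clamped to a valid probability."""
    if not 0.0 <= stress <= 1.0:
        raise ValueError("stress must lie in [0, 1]")
    return min(1.0, max(0.0, p0 + beta * stress))


# Decision component of Table 1 at zero and at full stress.
calm = inflate_rt_params(5.42, 0.30, stress=0.0)
stressed = inflate_rt_params(5.42, 0.30, stress=1.0)
```

Because $\mu_i$ is expressed in log-milliseconds, the factor acts on the exponent of the median: at full stress with $\alpha = 0.5$, the sketch maps the Decision parameters from (5.42, 0.30) to (8.13, 0.45).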

3.3 Decision-Making Under Uncertainty

In real defense scenarios, operators must make high-consequence decisions with incomplete information. Human-Test-Sim models this through a bounded-rationality framework: the AI player evaluates candidate actions using a utility function but selects stochastically via a softmax policy:

$$P(a_i) = \frac{\exp(U(a_i) / \tau)}{\sum_j \exp(U(a_j) / \tau)}$$

where $U(a_i)$ is the utility of action $a_i$ and $\tau$ is a temperature parameter inversely related to operator skill and positively related to stress. At low stress, $\tau \to 0$ and the policy approaches greedy (optimal) selection; at high stress, $\tau$ increases and the distribution flattens, producing more variable—and sometimes irrational—choices.
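The effect of the temperature on selection can be illustrated with a short sketch. It assumes utilities are plain floats and uses hypothetical function names; subtracting the maximum utility before exponentiating is a standard numerical-stability step that leaves the probabilities unchanged.

```python
import math
import random
from typing import List, Sequence


def softmax_policy(utilities: Sequence[float], temperature: float) -> List[float]:
    """Return P(a_i) = exp(U_i / tau) / sum_j exp(U_j / tau)."""
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    top = max(utilities)
    # Shifting by the maximum avoids overflow without changing the result.
    weights = [math.exp((u - top) / temperature) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(utilities: Sequence[float], temperature: float,
                  rng: random.Random) -> int:
    """Sample an action index from the softmax distribution."""
    probs = softmax_policy(utilities, temperature)
    return rng.choices(range(len(utilities)), weights=probs, k=1)[0]


utilities = [1.0, 0.5, 0.0]
cold = softmax_policy(utilities, temperature=0.1)  # near-greedy
hot = softmax_policy(utilities, temperature=0.8)   # flatter, more variable
```

At $\tau = 0.1$ the best action receives more than 99% of the probability mass; at $\tau = 0.8$ its share drops to roughly 55%, so suboptimal choices become common.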

The utility function itself incorporates threat priority (population at risk, time-to-impact), resource constraints (interceptor inventory, engagement geometry), and strategic considerations (DEFCON rules of engagement, second-strike implications). The framework allows customization of the utility weights to model different operator profiles (cautious, aggressive, systematic, overwhelmed).

4 Ballistics Physics Engine

The ballistics module provides the physical simulation layer underlying all game-state computations. It models missile trajectories from launch through boost, midcourse, and terminal phases, incorporating:

The trajectory is computed numerically using a fourth-order Runge-Kutta integrator with adaptive time stepping. State vectors (position, velocity) are propagated in simulation time and can be queried at arbitrary points for interceptor engagement calculations. The engine is validated against published trajectory data for known missile classes (ICBM, IRBM, SLBM), with sub-kilometer accuracy at apogee and impact.

Key parameters are exposed through the game_state module, which maintains the authoritative simulation clock and state vector for all active missiles, interceptors, and defensive assets. The ballistics module is stateless—it accepts initial conditions and returns propagated state—which simplifies testing and enables trajectory pre-computation for scenario planning.
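A stateless propagator of this kind can be sketched in a few lines. The example below is a simplification under stated assumptions: it uses a fixed step rather than adaptive stepping, planar point-mass gravity only (no thrust, drag, or J2), and hypothetical function names. It is meant to show the shape of the interface (initial conditions in, propagated state out), not the engine itself.

```python
import math
from typing import Tuple

State = Tuple[float, float, float, float]  # x, y (m), vx, vy (m/s)
MU_EARTH = 3.986004418e14  # Earth's gravitational parameter, m^3/s^2


def derivatives(state: State) -> State:
    """Time derivative of the state under point-mass gravity."""
    x, y, vx, vy = state
    r3 = (x * x + y * y) ** 1.5
    return (vx, vy, -MU_EARTH * x / r3, -MU_EARTH * y / r3)


def rk4_step(state: State, dt: float) -> State:
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, h):
        return tuple(b + h * s for b, s in zip(base, slope))

    k1 = derivatives(state)
    k2 = derivatives(shifted(state, k1, dt / 2))
    k3 = derivatives(shifted(state, k2, dt / 2))
    k4 = derivatives(shifted(state, k3, dt))
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))


def propagate(initial_state: State, dt: float, n_steps: int) -> State:
    """Stateless propagation: nothing is retained between calls."""
    state = tuple(initial_state)
    for _ in range(n_steps):
        state = rk4_step(state, dt)
    return state


def specific_energy(state: State) -> float:
    """Orbital energy per unit mass, conserved in the two-body problem."""
    x, y, vx, vy = state
    return 0.5 * (vx * vx + vy * vy) - MU_EARTH / math.hypot(x, y)


# Circular orbit at 7000 km radius, a convenient analytical check.
r0 = 7.0e6
start = (r0, 0.0, 0.0, math.sqrt(MU_EARTH / r0))
end = propagate(start, dt=1.0, n_steps=600)
```

A circular orbit is a convenient check against the analytical two-body solution: after propagation the radius and the specific energy should match their initial values to within integration error.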

4.1 Trajectory Phases

Table 2: Ballistic trajectory phase model parameters.
| Phase | Duration (s) | Altitude (km) | Key Physics |
|---|---|---|---|
| Boost | 180–300 | 0–200 | Thrust vector, atmospheric drag |
| Midcourse | 900–1800 | 200–1200 | Keplerian, J2 perturbation |
| Terminal | 60–120 | 1200–0 | Re-entry drag, ablation |

4.2 Validation

The ballistics engine is validated against analytical two-body solutions (conic sections) and published reference trajectories. The test_ballistics test suite includes:

5 Defense Management Modeling

The defense module models the layered missile defense architecture of the NWS, implementing three interceptor systems with distinct engagement envelopes, kill mechanisms, and performance characteristics.

5.1 Ground-Based Interceptors (GBI)

GBIs provide midcourse interception at exo-atmospheric altitudes (80–1500 km). The model implements:

5.2 Terminal High Altitude Area Defense (THAAD)

THAAD operates in the endo/exo-atmospheric transition zone (40–150 km), providing a second layer after GBI and before Patriot. Its higher kill probability ($P_{\text{kill}} \approx 0.85$) reflects the shorter engagement range and larger seeker footprint, but its shorter range limits the engagement window.

5.3 Patriot (PAC-3)

Patriot provides point defense in the lower atmosphere (0–40 km). It is the last-ditch layer with $P_{\text{kill}} \approx 0.90$ but covers only a local area. The model accounts for:

5.4 Layered Defense Coordination

The defense management algorithm coordinates across layers using a commit-from-outside-in policy: GBI engagements are committed first (longest lead time), THAAD second, Patriot last. The algorithm assigns interceptors to maximize global kill probability, accounting for:

The test_defense test suite validates each layer independently and the layered coordination algorithm through scenario-driven tests with known engagement geometries.
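For intuition about what layering buys, the cumulative kill probability of successive engagements can be computed in closed form if the engagements are treated as independent, which is an assumption of this sketch rather than a stated property of the model. The THAAD and Patriot values come from Sections 5.2 and 5.3; the GBI value of 0.50 is a hypothetical placeholder, and the function name is illustrative.

```python
from math import prod
from typing import Sequence


def cumulative_kill_probability(layer_pk: Sequence[float]) -> float:
    """P(at least one layer kills) = 1 - prod(1 - p_i), assuming independence."""
    for p in layer_pk:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each kill probability must lie in [0, 1]")
    return 1.0 - prod(1.0 - p for p in layer_pk)


P_KILL_THAAD = 0.85    # Section 5.2
P_KILL_PATRIOT = 0.90  # Section 5.3
P_KILL_GBI = 0.50      # hypothetical placeholder, not taken from the text

inner_layers = cumulative_kill_probability([P_KILL_THAAD, P_KILL_PATRIOT])
all_layers = cumulative_kill_probability(
    [P_KILL_GBI, P_KILL_THAAD, P_KILL_PATRIOT])
```

Under these assumptions, THAAD and Patriot together give 0.985, and adding the placeholder GBI layer raises the figure to 0.9925. The calculation applies only to a threat that passes through every layer's engagement envelope.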

6 Detection Simulation

The detection module models the space-based and ground-based sensor network that provides the initial warning of missile launches and subsequent track updates. It implements:

The detection model directly impacts the AI player’s situational awareness: a late track initiation shortens the GBI engagement window, increasing stress and forcing rushed decisions. The test_detection suite verifies detection latencies, track initiation criteria, and the propagation of detection uncertainty into the game state.

7 AI Player Architecture

The human_player module is the architectural centerpiece of Human-Test-Sim. It implements a cognitive agent that interacts with the NWS through the same interface as a human operator, subject to the same perception, decision, and motor constraints described in Section 3.

7.1 Perception Loop

The AI player runs a perception loop at a configurable frequency (default: 2 Hz, modeling saccadic scanning of a display). At each tick, the player samples a region of the game display according to an attention model that weights threat salience, proximity to defended assets, and recency of examination. This models the realistic constraint that humans cannot attend to the entire battlespace simultaneously.
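One way such an attention model could be realized is weighted random sampling over display regions. The sketch below is illustrative only: the linear weighting, the weight values, the 10 s recency horizon, and the reading of "recency" as favoring regions that have not been examined lately are all assumptions, and `Region`, `attention_weight`, and `pick_region` are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    name: str
    threat_salience: float     # 0..1, higher means more salient threats
    asset_proximity: float     # 0..1, higher means closer to defended assets
    seconds_since_look: float  # time since the region was last examined


def attention_weight(region: Region, w_salience: float = 0.5,
                     w_proximity: float = 0.3, w_recency: float = 0.2,
                     recency_horizon_s: float = 10.0) -> float:
    """Hypothetical linear blend of the three attention factors."""
    staleness = min(region.seconds_since_look / recency_horizon_s, 1.0)
    return (w_salience * region.threat_salience
            + w_proximity * region.asset_proximity
            + w_recency * staleness)


def pick_region(regions: List[Region], rng: random.Random) -> Region:
    """Sample the single region examined on this perception tick."""
    weights = [attention_weight(r) for r in regions]
    if sum(weights) <= 0.0:
        return rng.choice(regions)  # nothing stands out: scan uniformly
    return rng.choices(regions, weights=weights, k=1)[0]


regions = [
    Region("north_corridor", 0.9, 0.8, 5.0),
    Region("quiet_sector", 0.1, 0.2, 1.0),
]
rng = random.Random(7)
looks = [pick_region(regions, rng).name for _ in range(5000)]
```

Only one region is examined per tick, so a low-salience region can go unobserved for several ticks, which is the constraint the perception loop is meant to capture.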

7.2 Decision Loop

When a perceived threat exceeds a salience threshold, the AI player enters a decision cycle:

  1. Situation assessment: Classify the threat (missile type, trajectory, time-to-impact, defended area at risk).
  2. Option generation: Enumerate feasible interceptor assignments given current inventory, engagement geometry, and rules of engagement.
  3. Utility evaluation: Score each option using the configurable utility function.
  4. Action selection: Sample from the softmax policy (Section 3.3), introducing stochasticity proportional to stress.
  5. Execution: Issue the selected command after a motor delay drawn from the reaction-time model.

7.3 Stress Feedback

The AI player’s stress state both influences and is influenced by its decisions. A missed intercept increases stress; a successful one provides partial relief. This feedback loop can produce realistic behavioral cascades: a missed GBI intercept raises stress, degrading subsequent THAAD/Patriot commit timing, potentially causing a compounding failure spiral—exactly the pattern observed in human-operator studies (Satter et al., 2012).

7.4 Configurable Operator Profiles

Table 3: Operator profile parameters.
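| Profile | Base RT (ms) | Stress $\alpha$ | Error $\beta$ | Softmax $\tau_0$ | Attention Hz |
|---|---|---|---|---|---|
| Elite | 400 | 0.25 | 0.05 | 0.1 | 3.0 |
| Trained | 530 | 0.50 | 0.10 | 0.3 | 2.0 |
| Novice | 720 | 0.80 | 0.20 | 0.6 | 1.2 |
| Fatigued | 900 | 1.00 | 0.30 | 0.8 | 1.0 |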
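The profiles in Table 3 map naturally onto a small immutable record. The sketch below shows one possible encoding; the class name, field names, and the `perception_interval_s` helper are hypothetical, while the numeric values are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatorProfile:
    base_rt_ms: float    # baseline total reaction time
    stress_alpha: float  # reaction-time inflation coefficient
    error_beta: float    # error-rate sensitivity to stress
    softmax_tau0: float  # baseline softmax temperature
    attention_hz: float  # perception-loop frequency


# Values copied from Table 3.
PROFILES = {
    "elite": OperatorProfile(400, 0.25, 0.05, 0.1, 3.0),
    "trained": OperatorProfile(530, 0.50, 0.10, 0.3, 2.0),
    "novice": OperatorProfile(720, 0.80, 0.20, 0.6, 1.2),
    "fatigued": OperatorProfile(900, 1.00, 0.30, 0.8, 1.0),
}


def perception_interval_s(profile: OperatorProfile) -> float:
    """Seconds between perception ticks for a given profile."""
    return 1.0 / profile.attention_hz
```

Freezing the dataclass prevents a profile from being modified partway through a run, which helps keep seeded scenario runs comparable.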

8 Test Methodology

Human-Test-Sim employs a multi-level testing strategy that leverages the framework’s own human-emulation capabilities to test both the simulation engine and the generalized interaction framework.

8.1 Unit Tests

Individual modules are tested in isolation with deterministic inputs. The unit test suite includes:

8.2 Scenario-Driven Tests

The scenario module loads predefined threat waves from YAML/JSON scenario files. Each scenario specifies:

The test_scenarios suite runs each scenario with a fixed random seed for reproducibility and asserts that outcomes fall within expected statistical bounds. Scenarios range from single-missile sanity checks to 50+ threat saturation attacks.

8.3 Human-Interaction Framework Tests

The generalized interaction framework (action.py, interface.py, scenario.py) is tested independently of the defense simulation:

The demo_recorder.py module is tested via test_video_recorder, which validates video output format, frame rate, and synchronization with the simulation clock.

8.4 Statistical Validation

Because the AI player is stochastic, tests that depend on AI behavior use statistical assertions: rather than requiring an exact outcome, they verify that the outcome distribution over repeated runs (with different random seeds) matches expectations. For example, a scenario with 10 threats and GBI-only defense should yield 4–6 intercepts in 95% of runs for a trained operator profile.
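A statistical assertion of this kind can be sketched as a helper that checks what fraction of seeded runs falls inside a band. In the example below, `run_stub_scenario` is a hypothetical stand-in that draws intercepts as independent coin flips; it does not reproduce the AI player's outcome distribution, so the band used with it (2–8) is chosen for the stub and is not the 4–6 band quoted above.

```python
import random
from typing import List


def run_stub_scenario(seed: int, n_threats: int = 10,
                      p_intercept: float = 0.5) -> int:
    """Hypothetical stand-in for one scenario run: returns intercept count."""
    rng = random.Random(seed)
    return sum(rng.random() < p_intercept for _ in range(n_threats))


def fraction_within(outcomes: List[int], low: int, high: int) -> float:
    """Share of runs whose outcome lies in the closed band [low, high]."""
    return sum(low <= x <= high for x in outcomes) / len(outcomes)


def assert_outcome_distribution(outcomes: List[int], low: int, high: int,
                                min_fraction: float) -> float:
    """Fail if too few runs land inside the expected band."""
    observed = fraction_within(outcomes, low, high)
    assert observed >= min_fraction, (
        f"only {observed:.1%} of runs in [{low}, {high}], "
        f"expected at least {min_fraction:.0%}")
    return observed


# One run per seed, so every individual run is reproducible.
outcomes = [run_stub_scenario(seed) for seed in range(1000)]
observed = assert_outcome_distribution(outcomes, low=2, high=8,
                                       min_fraction=0.95)
```

Keeping one seed per run means any run that falls outside the band can be replayed exactly when investigating a failure.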

9 Video Recording & Analysis

Human-Test-Sim includes a video recording subsystem (demo_recorder.py) that captures test sessions as video files. This serves several purposes:

The recorder captures frames at the simulation’s display refresh rate (default: 30 fps) and encodes them using a configurable codec. Timestamps are embedded as subtitle tracks, enabling frame-accurate correlation with the simulation log. The test_video_recorder suite verifies:

10 Integration Testing

The test_integration suite exercises the full Human-Test-Sim pipeline end-to-end, verifying that all modules compose correctly under realistic conditions:

Integration tests are the slowest in the suite (30–120 s per test) but provide the highest confidence that the framework works correctly as a whole. They are run as part of the CI pipeline on every merge to the main branch.

11 Results & Validation

We validated the Human-Test-Sim framework along three axes: physics fidelity, behavioral realism, and framework reliability.

11.1 Ballistics Validation

The ballistics engine reproduces reference ICBM trajectories with sub-1% error in apogee altitude and sub-2% error in downrange distance across 12 test cases spanning minimum-energy, depressed, and lofted trajectories. Re-entry velocity at 80 km altitude matches published data within 3%.

11.2 Behavioral Realism

We compared the AI player’s reaction time distributions against the meta-analysis of Welford (1980) and the combat-simulation data of Morrison and Fletcher (2016). Table 4 summarizes the comparison.

Table 4: AI player reaction times vs. empirical human data (trained operator, moderate stress).
| Metric | Empirical Range | AI Player (Trained) | Within Range? |
|---|---|---|---|
| Median simple RT | 220–300 ms | 265 ms | Yes |
| Median choice RT (4 options) | 450–700 ms | 580 ms | Yes |
| P95 choice RT | 900–1400 ms | 1150 ms | Yes |
| Stress-induced RT increase | 30–80% | 52% | Yes |
| Error rate (moderate stress) | 8–15% | 11% | Yes |

The AI player’s distributions fall within the empirically observed ranges for all metrics, confirming that the log-normal reaction time model with stress modulation captures the essential features of human operator performance.

11.3 Framework Reliability

The complete test suite (unit + scenario + integration) comprises 147 test cases. Over 1000 runs with varying random seeds:

The generalized interaction framework was additionally validated by writing test scenarios for two non-defense applications (a web form and a CLI tool), confirming that the action/interface/scenario abstraction is domain-independent.

12 Conclusion

Human-Test-Sim demonstrates that human behavioral models can be productively embedded in automated testing frameworks for systems designed to be operated by humans under stress. By emulating realistic reaction times, stress-induced degradation, and bounded-rationality decision-making, the framework tests not just whether a system functions but whether it remains usable under the conditions in which it will actually be operated.

The framework’s defense-simulation roots have yielded a rich behavioral model whose components—log-normal reaction times, exponential stress dynamics, softmax decision policies—are well-grounded in decades of human-factors research. The generalization of the interaction layer (action, interface, scenario) extends these benefits to any interactive application, making stress-aware testing accessible beyond the defense domain.

Future work includes:

The tension between machine-speed testing and human-speed operation is not unique to missile defense. Any safety-critical interactive system—air traffic control, surgical robotics, nuclear plant monitoring—benefits from testing that accounts for the operator at the controls. Human-Test-Sim offers one approach to closing this gap.

13 References

  1. Baddeley, A. (1992). Working memory: The interface between memory and cognition. Journal of Cognitive Neuroscience, 4(3), 281–288.
  2. Easterbrook, J. A. (1959). The effect of emotion on cue utilization and the organization of behavior. Psychological Review, 66(3), 183–201.
  3. Fitts, P. M. (1954). The information capacity of the human motor system in controlling the amplitude of movement. Journal of Experimental Psychology, 47(6), 381–391.
  4. Hick, W. E. (1952). On the rate of gain of information. Quarterly Journal of Experimental Psychology, 4(1), 11–26.
  5. Luce, R. D. (1986). Response Times: Their Role in Inferring Elementary Mental Organization. Oxford University Press.
  6. MIL-STD-882E (2012). Standard Practice for System Safety. U.S. Department of Defense.
  7. Morrison, J. E., & Fletcher, J. D. (2016). Cognitive Readiness for Complex Military Tasks. U.S. Army Research Institute Technical Report 1285.
  8. National Research Council (2012). Making Sense of Ballistic Missile Defense: An Assessment of Concepts and Systems for U.S. Boost-Phase Missile Defense in Comparison to Other Alternatives. National Academies Press.
  9. Sanders, M. S., & McCormick, E. J. (1993). Human Factors in Engineering and Design (7th ed.). McGraw-Hill.
  10. Satter, N., Woods, D. D., & Klein, G. (2012). Stress and human performance: Implications for combat decision making. In Proceedings of the Human Factors and Ergonomics Society 56th Annual Meeting, 212–216.
  11. U.S. Missile Defense Agency (2023). Ballistic Missile Defense System Test and Evaluation Master Plan. MDA-TE-2023-001.
  12. Welford, A. T. (1980). Relationships between reaction time and fatigue, stress, age and sex. In A. T. Welford (Ed.), Reaction Times (pp. 321–354). Academic Press.
  13. Wickens, C. D., & Hollands, J. G. (2000). Engineering Psychology and Human Performance (3rd ed.). Prentice Hall.
  14. Wickens, C. D., Lee, J. D., Liu, Y., & Gordon-Becker, S. (2003). An Introduction to Human Factors Engineering (2nd ed.). Pearson.
  15. Yerkes, R. M., & Dodson, J. D. (1908). The relation of strength of stimulus to rapidity of habit formation. Journal of Comparative Neurology and Psychology, 18(5), 459–482.