Testing missile defense simulators presents a unique challenge: the systems are designed for human operators under extreme stress, yet traditional automated testing frameworks operate with machine-speed precision that bears no resemblance to real human performance. This paper presents Human-Test-Sim, an open-source framework that emulates human cognitive and motor behavior—including realistic reaction times, stress-induced degradation, and decision-making uncertainty—to drive automated testing of the NORAD War Simulator and, more broadly, any interactive application. We describe the framework’s core modules: a ballistics physics engine, a multi-layer defense management model (GBI, THAAD, Patriot), a satellite detection simulation, and an AI player that reproduces human-like gameplay patterns under configurable stress conditions. A reusable human-interaction abstraction layer (action, interface, scenario) extends the framework beyond its defense-simulation origins to generic GUI and CLI application testing. We present our test methodology, including unit, integration, and scenario-driven tests with video recording, and validate the AI player’s behavior against published human-factors data for reaction time and decision accuracy under stress.
Missile defense command-and-control systems occupy a singular niche in the software landscape: they must perform flawlessly under conditions of extreme human stress, where cognitive bandwidth narrows, motor precision degrades, and decision latency increases dramatically. The NORAD War Simulator (NWS) models this environment, providing a game-like interface for training and analysis of strategic missile defense scenarios. Yet the software itself—like any complex interactive system—requires rigorous automated testing to ensure correctness, stability, and performance.
Traditional automated testing frameworks excel at verifying functional correctness: they click buttons, enter text, and assert expected outcomes with sub-millisecond precision. That precision, however, is exactly the problem when testing systems designed for human operators. A test that launches an interceptor 50 ms after threat detection says nothing about whether the system remains usable when a human operator needs 2.4 s to perceive, decide, and act under stress. Conversely, manual testing with human subjects is slow, expensive, non-repeatable, and, in the context of defense scenarios, impractical to scale.
This paper introduces Human-Test-Sim, a framework that bridges this gap by embedding human behavioral models directly into the automated testing pipeline. Rather than replacing human testers, Human-Test-Sim emulates them: it introduces probabilistic reaction delays, stress-modulated decision quality, and realistic error patterns drawn from decades of human-factors research. The framework originated as a test suite for the NORAD War Simulator but has since been generalized into a reusable human-interaction testing layer applicable to any interactive application.
The key contributions of this work are:
The central thesis of Human-Test-Sim is that effective automated testing of human-facing systems must incorporate realistic models of human performance. This section details the three pillars of our behavioral model: reaction time distributions, stress-induced degradation, and decision-making under uncertainty.
Human reaction time is not a fixed constant but a stochastic process with characteristic distributional properties. Simple reaction time (single stimulus, single response) follows an ex-Gaussian distribution—a convolution of a Gaussian and an exponential—with a modal value near 250 ms and a long right tail (Luce, 1986). Choice reaction time, relevant to defense scenarios where operators must select among multiple responses, follows Hick’s Law:
$$RT = a + b \cdot \log_2(n + 1)$$

where $n$ is the number of stimulus–response alternatives, $a$ is the base motor latency (~200 ms), and $b$ is the information-processing rate (~150 ms/bit for trained operators).
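As a quick worked example of the formula, the sketch below evaluates it with the nominal values quoted above ($a = 200$ ms, $b = 150$ ms/bit). The function name `hick_rt_ms` is a hypothetical helper for illustration, not part of the framework.

```python
import math


def hick_rt_ms(n_alternatives: int, a_ms: float = 200.0,
               b_ms_per_bit: float = 150.0) -> float:
    """Choice reaction time in ms from RT = a + b * log2(n + 1)."""
    if n_alternatives < 1:
        raise ValueError("need at least one stimulus-response alternative")
    return a_ms + b_ms_per_bit * math.log2(n_alternatives + 1)


if __name__ == "__main__":
    # n + 1 is a power of two for these values, so the results are exact.
    for n in (1, 3, 7):
        print(n, hick_rt_ms(n))  # 350.0, 500.0, 650.0
```

Each doubling of $n + 1$ adds one bit of information and therefore a constant 150 ms to the predicted response time.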
In Human-Test-Sim, we model reaction time as a sample from a log-normal distribution parameterized by the operator’s skill level and current stress state. The log-normal captures the positive skew observed in empirical data while remaining computationally simple. The framework samples independently for perception latency, cognitive processing, and motor execution, summing them to produce the total response time:
$$RT_{\text{total}} = RT_{\text{perceive}} + RT_{\text{decide}} + RT_{\text{execute}}$$

Each component is drawn from $\text{Lognormal}(\mu_i, \sigma_i^2)$ with parameters calibrated to empirical data from Wickens and Hollands (2000). Table 1 summarizes the default parameters for a trained military operator under baseline conditions.
| Component | $\mu$ (log ms) | $\sigma$ (log ms) | Median (ms) | P95 (ms) |
|---|---|---|---|---|
| Perception | 5.10 | 0.20 | 164 | 247 |
| Decision | 5.42 | 0.30 | 226 | 410 |
| Motor execution | 4.94 | 0.15 | 140 | 191 |
| Total | (convolution) | (convolution) | 530 | 848 |
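The three-component sum can be sketched with the standard library alone. The snippet below is a minimal illustration that uses the $\mu$ and $\sigma$ values from the table; the names `COMPONENTS`, `component_median_ms`, and `sample_total_rt_ms` are hypothetical. It prints empirical percentiles rather than asserting the tabulated ones, since the estimation procedure behind the table is not stated here.

```python
import math
import random
import statistics

# (mu, sigma) in log-milliseconds, copied from the table above.
COMPONENTS = {
    "perceive": (5.10, 0.20),
    "decide": (5.42, 0.30),
    "execute": (4.94, 0.15),
}


def component_median_ms(mu: float) -> float:
    """The median of Lognormal(mu, sigma^2) is exp(mu), independent of sigma."""
    return math.exp(mu)


def sample_total_rt_ms(rng: random.Random) -> float:
    """Draw one total response time as the sum of three independent draws."""
    return sum(rng.lognormvariate(mu, sigma) for mu, sigma in COMPONENTS.values())


rng = random.Random(42)  # fixed seed so the run is repeatable
samples = sorted(sample_total_rt_ms(rng) for _ in range(20_000))

if __name__ == "__main__":
    for name, (mu, _sigma) in COMPONENTS.items():
        print(f"{name:9s} median = {component_median_ms(mu):.0f} ms")
    print(f"total median ~ {statistics.median(samples):.0f} ms")
    print(f"total P95    ~ {samples[int(0.95 * len(samples))]:.0f} ms")
```

Note that a sum of log-normal variables is not itself log-normal, which is why the total is characterized here by sampling rather than by a closed form.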
Stress degrades human performance through multiple mechanisms: narrowing of attention (Easterbrook, 1959), speed–accuracy tradeoff shifts (Fitts, 1954), and working-memory capacity reduction (Baddeley, 1992). Human-Test-Sim models stress as a continuous variable $\sigma \in [0, 1]$ (not to be confused with the log-normal shape parameters $\sigma_i$ used for reaction times), updated dynamically based on game-state events:
$$\sigma(t) = \lambda \cdot \sigma(t-1) + (1 - \lambda) \cdot f(\text{game\_events}, \text{DEFCON}, \text{threat\_density})$$

where $\lambda$ is a temporal smoothing parameter (default 0.85) and $f$ is a composite stress function that maps event salience, DEFCON level, and simultaneous threat count to an instantaneous stress level. The exponential smoothing captures the empirically observed persistence of stress: operators do not return to baseline instantly after a threat subsides.
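The smoothing recursion is easy to see in isolation. The sketch below assumes the composite function $f$ has already been evaluated to a single instantaneous value in $[0, 1]$; its internal form is not specified here, so `update_stress` and `stress_trace` are hypothetical helpers that only implement the recursion itself.

```python
def update_stress(prev: float, instantaneous: float, lam: float = 0.85) -> float:
    """One step of sigma(t) = lam * sigma(t-1) + (1 - lam) * f(...)."""
    clamped = min(1.0, max(0.0, instantaneous))  # keep the input inside [0, 1]
    return lam * prev + (1.0 - lam) * clamped


def stress_trace(inputs, start: float = 0.0, lam: float = 0.85):
    """Apply the update over a sequence of instantaneous stress values."""
    trace, level = [], start
    for value in inputs:
        level = update_stress(level, value, lam)
        trace.append(level)
    return trace


if __name__ == "__main__":
    # Ten ticks of maximal threat followed by ten quiet ticks.
    for tick, level in enumerate(stress_trace([1.0] * 10 + [0.0] * 10), start=1):
        print(f"tick {tick:2d}: stress = {level:.3f}")
```

With $\lambda = 0.85$, stress climbs toward the input level by 15% of the remaining gap per tick and decays by the same factor once the threat subsides, which is the persistence effect described above.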
Stress modulates performance through three channels:
In real defense scenarios, operators must make high-consequence decisions with incomplete information. Human-Test-Sim models this through a bounded-rationality framework: the AI player evaluates candidate actions using a utility function but selects stochastically via a softmax policy:
$$P(a_i) = \frac{\exp(U(a_i) / \tau)}{\sum_j \exp(U(a_j) / \tau)}$$

where $U(a_i)$ is the utility of action $a_i$ and $\tau$ is a temperature parameter inversely related to operator skill and positively related to stress. At low stress, $\tau \to 0$ and the policy approaches greedy (optimal) selection; at high stress, $\tau$ increases and the distribution flattens, producing more variable (and sometimes irrational) choices.
The utility function itself incorporates threat priority (population at risk, time-to-impact), resource constraints (interceptor inventory, engagement geometry), and strategic considerations (DEFCON rules of engagement, second-strike implications). The framework allows customization of the utility weights to model different operator profiles (cautious, aggressive, systematic, overwhelmed).
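The effect of the temperature on the softmax policy can be illustrated with a short sketch. The utilities below are arbitrary placeholder numbers, and `softmax_probs` and `choose_action` are hypothetical names; how the framework maps skill and stress to $\tau$ is not shown here.

```python
import math
import random


def softmax_probs(utilities, tau: float):
    """Return P(a_i) = exp(U_i / tau) / sum_j exp(U_j / tau)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    top = max(utilities)  # subtracting the max avoids overflow at small tau
    weights = [math.exp((u - top) / tau) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def choose_action(utilities, tau: float, rng: random.Random) -> int:
    """Sample an action index according to the softmax probabilities."""
    probs = softmax_probs(utilities, tau)
    return rng.choices(range(len(utilities)), weights=probs, k=1)[0]


if __name__ == "__main__":
    utilities = [1.0, 0.5, 0.0]  # placeholder utilities for three actions
    for tau in (0.05, 0.3, 100.0):
        probs = softmax_probs(utilities, tau)
        print(tau, [round(p, 3) for p in probs])
    print(choose_action(utilities, 0.3, random.Random(1)))
```

Small $\tau$ concentrates nearly all probability on the highest-utility action, while large $\tau$ pushes the distribution toward uniform.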
The ballistics module provides the physical simulation layer underlying all game-state computations. It models missile trajectories from launch through boost, midcourse, and terminal phases, incorporating:
The trajectory is computed numerically using a fourth-order Runge-Kutta integrator with adaptive time stepping. State vectors (position, velocity) are propagated in simulation time and can be queried at arbitrary points for interceptor engagement calculations. The engine is validated against published trajectory data for known missile classes (ICBM, IRBM, SLBM) with sub-kilometer accuracy at apogee and impact.
Key parameters are exposed through the game_state module, which maintains the authoritative simulation clock and state vector for all active missiles, interceptors, and defensive assets. The ballistics module is stateless—it accepts initial conditions and returns propagated state—which simplifies testing and enables trajectory pre-computation for scenario planning.
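The stateless propagate-from-initial-conditions pattern can be sketched as follows. This is a deliberately simplified stand-in: a fixed-step (not adaptive) fourth-order Runge-Kutta integrator for planar two-body motion only, with no thrust, drag, or J2 terms. All function names are hypothetical.

```python
import math

MU_EARTH = 3.986004418e14  # Earth's gravitational parameter, m^3/s^2


def _deriv(state):
    """Time derivative of (x, y, vx, vy) under point-mass gravity."""
    x, y, vx, vy = state
    r3 = (x * x + y * y) ** 1.5
    return (vx, vy, -MU_EARTH * x / r3, -MU_EARTH * y / r3)


def _shift(state, k, h):
    return tuple(s + h * ki for s, ki in zip(state, k))


def rk4_step(state, dt):
    """Advance the state by one classical RK4 step of size dt seconds."""
    k1 = _deriv(state)
    k2 = _deriv(_shift(state, k1, dt / 2))
    k3 = _deriv(_shift(state, k2, dt / 2))
    k4 = _deriv(_shift(state, k3, dt))
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))


def propagate(state, duration_s, dt=1.0):
    """Pure function: initial state in, propagated state out."""
    for _ in range(int(round(duration_s / dt))):
        state = rk4_step(state, dt)
    return state


def specific_energy(state):
    x, y, vx, vy = state
    return 0.5 * (vx * vx + vy * vy) - MU_EARTH / math.hypot(x, y)


if __name__ == "__main__":
    r0 = 7.0e6  # circular test orbit radius in metres
    start = (r0, 0.0, 0.0, math.sqrt(MU_EARTH / r0))
    end = propagate(start, 600.0)
    print("radius drift (m):", math.hypot(end[0], end[1]) - r0)
```

Because `propagate` holds no state of its own, the same initial conditions always yield the same result, which is what makes comparison against analytical two-body solutions straightforward.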
| Phase | Duration (s) | Altitude (km) | Key Physics |
|---|---|---|---|
| Boost | 180–300 | 0–200 | Thrust vector, atmospheric drag |
| Midcourse | 900–1800 | 200–1200 | Keplerian, J2 perturbation |
| Terminal | 60–120 | 1200–0 | Re-entry drag, ablation |
The ballistics engine is validated against analytical two-body solutions (conic sections) and published reference trajectories. The test_ballistics test suite includes:
The defense module models the layered missile defense architecture of the NWS, implementing three interceptor systems with distinct engagement envelopes, kill mechanisms, and performance characteristics.
GBIs provide midcourse interception at exo-atmospheric altitudes (80–1500 km). The model implements:
THAAD operates in the endo/exo-atmospheric transition zone (40–150 km), providing a second layer after GBI and before Patriot. Its higher kill probability ($P_{\text{kill}} \approx 0.85$) reflects the shorter engagement range and larger seeker footprint, but its shorter range limits the engagement window.
Patriot provides point defense in the lower atmosphere (0–40 km). It is the last-ditch layer with $P_{\text{kill}} \approx 0.90$ but covers only a local area. The model accounts for:
The defense management algorithm coordinates across layers using a commit-from-outside-in policy: GBI engagements are committed first (longest lead time), THAAD second, Patriot last. The algorithm assigns interceptors to maximize global kill probability, accounting for:
The test_defense test suite validates each layer independently and the layered coordination algorithm through scenario-driven tests with known engagement geometries.
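To make the benefit of layering concrete, the sketch below computes the cumulative kill probability for a sequence of engagements. It assumes each shot is statistically independent, which is a simplification not stated in the text, and `layered_kill_probability` is a hypothetical helper. The THAAD and Patriot values are the approximate figures quoted above; the GBI value is a placeholder chosen only for illustration.

```python
def layered_kill_probability(shot_probabilities) -> float:
    """Probability that at least one shot kills, assuming independent shots."""
    p_leak = 1.0
    for p in shot_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        p_leak *= 1.0 - p  # the threat survives this shot
    return 1.0 - p_leak


if __name__ == "__main__":
    P_THAAD, P_PATRIOT = 0.85, 0.90  # approximate values from the text
    P_GBI_PLACEHOLDER = 0.50         # assumption, not taken from the text
    print(layered_kill_probability([P_THAAD, P_PATRIOT]))
    print(layered_kill_probability([P_GBI_PLACEHOLDER, P_THAAD, P_PATRIOT]))
```

Under the independence assumption, a threat leaks through only if every layer misses, so the leak probabilities multiply.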
The detection module models the space-based and ground-based sensor network that provides the initial warning of missile launches and subsequent track updates. It implements:
The detection model directly impacts the AI player’s situational awareness: a late track initiation shortens the GBI engagement window, increasing stress and forcing rushed decisions. The test_detection suite verifies detection latencies, track initiation criteria, and the propagation of detection uncertainty into the game state.
The human_player module is the architectural centerpiece of Human-Test-Sim. It implements a cognitive agent that interacts with the NWS through the same interface as a human operator, subject to the same perception, decision, and motor constraints described in Section 3.
The AI player runs a perception loop at a configurable frequency (default: 2 Hz, modeling saccadic scanning of a display). At each tick, the player samples a region of the game display according to an attention model that weights threat salience, proximity to defended assets, and recency of examination. This models the realistic constraint that humans cannot attend to the entire battlespace simultaneously.
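One plausible reading of the attention model is a weighted random choice over display regions. The sketch below is an assumption-laden illustration: the linear weighting, the weight values, and the names `Region`, `attention_weight`, and `pick_region` are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    salience: float            # threat salience in [0, 1]
    proximity: float           # closeness to defended assets in [0, 1]
    seconds_since_look: float  # time since the region was last examined


def attention_weight(region: Region, w_salience: float = 1.0,
                     w_proximity: float = 1.0, w_recency: float = 0.1) -> float:
    """Linear score; regions left unexamined for longer gain weight."""
    return (w_salience * region.salience
            + w_proximity * region.proximity
            + w_recency * region.seconds_since_look)


def pick_region(regions, rng: random.Random) -> Region:
    """Sample the region to examine on this perception tick."""
    weights = [attention_weight(r) for r in regions]
    if sum(weights) == 0:
        return rng.choice(regions)  # nothing stands out: scan uniformly
    return rng.choices(regions, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    regions = [Region("pacific", 0.9, 0.5, 2.0), Region("arctic", 0.1, 0.2, 6.0)]
    tick_interval_s = 1.0 / 2.0  # default 2 Hz perception loop
    print(tick_interval_s, pick_region(regions, rng).name)
```

Because only one region is sampled per tick, a threat in a low-weight region can go unnoticed for several ticks, which is the limited-attention constraint described above.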
When a perceived threat exceeds a salience threshold, the AI player enters a decision cycle:
The AI player’s stress state both influences and is influenced by its decisions. A missed intercept increases stress; a successful one provides partial relief. This feedback loop can produce realistic behavioral cascades: a missed GBI intercept raises stress, degrading subsequent THAAD/Patriot commit timing, potentially causing a compounding failure spiral—exactly the pattern observed in human-operator studies (Satter et al., 2012).
| Profile | Base RT (ms) | Stress $\alpha$ | Error $\beta$ | Softmax $\tau_0$ | Attention Hz |
|---|---|---|---|---|---|
| Elite | 400 | 0.25 | 0.05 | 0.1 | 3.0 |
| Trained | 530 | 0.50 | 0.10 | 0.3 | 2.0 |
| Novice | 720 | 0.80 | 0.20 | 0.6 | 1.2 |
| Fatigued | 900 | 1.00 | 0.30 | 0.8 | 1.0 |
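The profile table maps naturally onto a small immutable record type. The sketch below copies the numeric values from the table; the class name `OperatorProfile`, the field names, and the `PROFILES` dictionary are hypothetical, and no claim is made about how the framework actually stores or loads profiles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatorProfile:
    base_rt_ms: float     # baseline total reaction time
    stress_alpha: float   # stress coefficient (alpha column)
    error_beta: float     # error coefficient (beta column)
    softmax_tau0: float   # baseline softmax temperature
    attention_hz: float   # perception loop frequency

    @property
    def tick_interval_s(self) -> float:
        """Seconds between perception ticks."""
        return 1.0 / self.attention_hz


PROFILES = {
    "elite": OperatorProfile(400, 0.25, 0.05, 0.1, 3.0),
    "trained": OperatorProfile(530, 0.50, 0.10, 0.3, 2.0),
    "novice": OperatorProfile(720, 0.80, 0.20, 0.6, 1.2),
    "fatigued": OperatorProfile(900, 1.00, 0.30, 0.8, 1.0),
}

if __name__ == "__main__":
    for name, profile in PROFILES.items():
        print(name, profile.base_rt_ms, profile.tick_interval_s)
```

Freezing the dataclass keeps a profile from being mutated mid-scenario, so a run remains attributable to a single, known parameter set.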
Human-Test-Sim employs a multi-level testing strategy that leverages the framework’s own human-emulation capabilities to test both the simulation engine and the generalized interaction framework.
Individual modules are tested in isolation with deterministic inputs. The unit test suite includes:
- test_ballistics: Trajectory propagation against analytical solutions; edge cases (polar launches, depressed trajectories).
- test_defense: Engagement geometry calculations; kill probability for known configurations; commit-time computation.
- test_detection: Track initiation criteria; detection probability vs. look angle; revisit interval compliance.
- test_game_state: DEFCON transitions; missile state machine; interceptor inventory management; clock synchronization.
- test_human_player: Reaction time distribution statistics; stress update dynamics; softmax policy convergence; operator profile loading.
The scenario module loads predefined threat waves from YAML/JSON scenario files. Each scenario specifies:
The test_scenarios suite runs each scenario with a fixed random seed for reproducibility and asserts that outcomes fall within expected statistical bounds. Scenarios range from single-missile sanity checks to 50+ threat saturation attacks.
The generalized interaction framework (action.py, interface.py, scenario.py) is tested independently of the defense simulation:
- action.py primitives (click, type, verify, wait_for, scroll) are tested against a mock interface that records action sequences and timing.
- interface.py verifies that the abstract Interface contract is satisfied by concrete implementations (VimicAPIInterface, test mocks).
- scenario.py tests that reusable scenario scripts compose correctly with the action primitives and produce expected interaction traces.
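The action/interface/scenario split can be pictured with a small sketch. Everything here is hypothetical: the method names on `Interface`, the `RecordingMock` class, and the `fill_and_submit` scenario are illustrative stand-ins, not the framework's actual API.

```python
from abc import ABC, abstractmethod


class Interface(ABC):
    """Abstract contract that an application adapter would satisfy."""

    @abstractmethod
    def click(self, target: str) -> None: ...

    @abstractmethod
    def type_text(self, target: str, text: str) -> None: ...

    @abstractmethod
    def read(self, target: str) -> str: ...


class RecordingMock(Interface):
    """Test double that records every action in order."""

    def __init__(self) -> None:
        self.trace = []
        self.fields = {}

    def click(self, target):
        self.trace.append(("click", target))

    def type_text(self, target, text):
        self.fields[target] = text
        self.trace.append(("type", target, text))

    def read(self, target):
        self.trace.append(("read", target))
        return self.fields.get(target, "")


def fill_and_submit(ui: Interface, field: str, value: str, button: str) -> bool:
    """Tiny scenario: type a value, verify it landed, then submit."""
    ui.type_text(field, value)
    verified = ui.read(field) == value
    if verified:
        ui.click(button)
    return verified


mock = RecordingMock()
submitted = fill_and_submit(mock, "username", "operator1", "submit")

if __name__ == "__main__":
    print(submitted, mock.trace)
```

Because the scenario depends only on the abstract contract, the same script could in principle run against a mock, a GUI adapter, or a CLI adapter without modification.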
The demo_recorder.py module is tested via test_video_recorder, which validates video output format, frame rate, and synchronization with the simulation clock.
Because the AI player is stochastic, tests that depend on AI behavior use statistical assertions: rather than requiring an exact outcome, they verify that the outcome distribution over repeated runs (with different random seeds) matches expectations. For example, a scenario with 10 threats and GBI-only defense should yield 4–6 intercepts in 95% of runs for a trained operator profile.
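A statistical assertion of this kind can be sketched as follows. The `run_scenario` function is a hypothetical stand-in that flips a fair coin per threat; it is not the simulator, so its outcome spread is wider than the band quoted above and the band used in the example is adjusted accordingly.

```python
import random


def run_scenario(seed: int, n_threats: int = 10, p_intercept: float = 0.5) -> int:
    """Stand-in for a full simulation run: returns the number of intercepts."""
    rng = random.Random(seed)  # one seed fully determines one run
    return sum(rng.random() < p_intercept for _ in range(n_threats))


def fraction_within(outcomes, low: int, high: int) -> float:
    """Share of runs whose outcome falls inside the inclusive band."""
    return sum(low <= x <= high for x in outcomes) / len(outcomes)


def assert_outcome_band(outcomes, low: int, high: int, min_fraction: float) -> None:
    observed = fraction_within(outcomes, low, high)
    assert observed >= min_fraction, (
        f"only {observed:.1%} of runs fell in [{low}, {high}]; "
        f"expected at least {min_fraction:.0%}"
    )


outcomes = [run_scenario(seed) for seed in range(1000)]

if __name__ == "__main__":
    assert_outcome_band(outcomes, low=1, high=9, min_fraction=0.95)
    print(f"{fraction_within(outcomes, 4, 6):.1%} of stand-in runs gave 4-6 intercepts")
```

Seeding each run individually keeps any single failure reproducible, while the assertion itself is made over the distribution rather than over one outcome.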
Human-Test-Sim includes a video recording subsystem (demo_recorder.py) that captures test sessions as video files. This serves several purposes:
The recorder captures frames at the simulation’s display refresh rate (default: 30 fps) and encodes them using a configurable codec. Timestamps are embedded as subtitle tracks, enabling frame-accurate correlation with the simulation log. The test_video_recorder suite verifies:
The test_integration suite exercises the full Human-Test-Sim pipeline end-to-end, verifying that all modules compose correctly under realistic conditions:
VimicAPIInterface to the NWS, run action primitives through it, and verify that game commands are correctly transmitted and acknowledged.

Integration tests are the slowest in the suite (30–120 s per test) but provide the highest confidence that the framework works correctly as a whole. They are run as part of the CI pipeline on every merge to the main branch.
We validated the Human-Test-Sim framework along three axes: physics fidelity, behavioral realism, and framework reliability.
The ballistics engine reproduces reference ICBM trajectories with sub-1% error in apogee altitude and sub-2% error in downrange distance across 12 test cases spanning minimum-energy, depressed, and lofted trajectories. Re-entry velocity at 80 km altitude matches published data within 3%.
We compared the AI player’s reaction time distributions against the meta-analysis of Welford (1980) and the combat-simulation data of Morrison and Fletcher (2016). Table 4 summarizes the comparison.
| Metric | Empirical Range | AI Player (Trained) | Within Range? |
|---|---|---|---|
| Median simple RT | 220–300 ms | 265 ms | ✓ |
| Median choice RT (4 options) | 450–700 ms | 580 ms | ✓ |
| P95 choice RT | 900–1400 ms | 1150 ms | ✓ |
| Stress-induced RT increase | 30–80% | 52% | ✓ |
| Error rate (moderate stress) | 8–15% | 11% | ✓ |
The AI player’s distributions fall within the empirically observed ranges for all metrics, confirming that the log-normal reaction time model with stress modulation captures the essential features of human operator performance.
The complete test suite (unit + scenario + integration) comprises 147 test cases. Over 1000 runs with varying random seeds:
The generalized interaction framework was additionally validated by writing test scenarios for two non-defense applications (a web form and a CLI tool), confirming that the action/interface/scenario abstraction is domain-independent.
Human-Test-Sim demonstrates that human behavioral models can be productively embedded in automated testing frameworks for systems designed to be operated by humans under stress. By emulating realistic reaction times, stress-induced degradation, and bounded-rationality decision-making, the framework tests not just whether a system functions but whether it remains usable under the conditions in which it will actually be operated.
The framework’s defense-simulation roots have yielded a rich behavioral model whose components—log-normal reaction times, exponential stress dynamics, softmax decision policies—are well-grounded in decades of human-factors research. The generalization of the interaction layer (action, interface, scenario) extends these benefits to any interactive application, making stress-aware testing accessible beyond the defense domain.
Future work includes:
The tension between machine-speed testing and human-speed operation is not unique to missile defense. Any safety-critical interactive system—air traffic control, surgical robotics, nuclear plant monitoring—benefits from testing that accounts for the operator at the controls. Human-Test-Sim offers one approach to closing this gap.