Project Ghost — System Architecture¶
- Status: living, Phase 0 (foundations)
- Last updated: 2026-06-03
- Audience: maintainers and integrators of the project over 1–5 years
This document is the single source of truth for the architecture. Any change that contradicts this text requires a formal ADR in docs/adr/.
1. Vision¶
Build a drone autonomy system that evolves from simulation to real hardware, whose ultimate goal is navigation in unknown environments without GPS, using only vision and IMU.
This is not a prototype or a demo. It is designed to live five years, support multiple simulators, and eventually run on real hardware with minimal changes to the higher layers.
2. Guiding principles¶
Each principle is binding. Breaking them requires an ADR.
- Sim-first, hardware-ready. The same code that flies in simulation must fly on hardware by swapping a single backend.
- HAL before features. No estimation, control, planning, or mission work begins until the HAL is frozen and validated.
- Absolute determinism in simulation. Same seed + same scenario + same code = same trajectory, bit for bit.
- Obsessive telemetry. Every message that crosses the bus is persisted to disk with a versioned schema.
- Explicit math, no black boxes. EKF/UKF, BA, MPC are implemented in a readable and testable way. Closed models are forbidden.
- Modularity by contracts, not services. In-process pub/sub, typed dataclasses, no microservices.
- Near-zero cost. The stack runs on a modest laptop. Any heavy dependency must justify itself.
- Every decision pays its future cost. A cheap decision today that costs a lot to maintain tomorrow is rejected.
3. System layers¶
The system is organized in eight layers ordered from lower to higher abstraction. Each layer only knows the one immediately below it via typed interfaces.
graph TD
L7[Layer 7 — Mission<br/>Behavior tree / FSM]
L6[Layer 6 — Planning<br/>Global + local]
L5[Layer 5 — Mapping<br/>Occupancy / ESDF]
L4[Layer 4 — SLAM / VIO<br/>Frontend + backend + loop closure]
L3[Layer 3 — State estimation<br/>EKF / UKF / fusion]
L2[Layer 2 — Perception<br/>Features / depth / detection]
L1[Layer 1 — Flight control<br/>Attitude / position]
L0[Layer 0 — HAL<br/>SimulationBackend / RuntimeBackend]
L7 --> L6 --> L5 --> L4 --> L3 --> L2 --> L0
L7 --> L1 --> L0
L3 --> L1
L5 --> L6
Notes:
- Layer 0 (HAL) abstracts sensors and actuators. It is the only layer that knows simulators or hardware.
- Layer 1 (Control) depends directly on the HAL to close fast loops without traversing cognitive layers.
- Layers 2–6 are the perception–cognition chain culminating in plans.
- Layer 7 (Mission) orchestrates the rest via events and goals.
In Phase 1, only Layer 0, a minimal version of Layer 1 (manual passthrough), and the transversal infrastructure (clock, telemetry, events) exist. The rest is introduced in later phases.
4. Transversal infrastructure¶
Three services cross all layers:
graph LR
subgraph "Transversal infrastructure"
CLK[SimClock / SystemClock]
BUS[Event Bus]
TEL[Telemetry Bus]
end
L0[HAL backend] --> CLK
L0 --> BUS
L0 --> TEL
L1[Control] --> CLK
L1 --> BUS
L1 --> TEL
- Clock: single source of time. In sim, integer-nanosecond deterministic. On hardware, the system monotonic clock.
- Event Bus: in-process pub/sub for discrete semantic events (armed, takeoff, sensor_fault, collision, mission_*).
- Telemetry Bus: captures everything published and distributes it to sinks (MCAP file, Rerun, console).
5. Data flow¶
A typical control cycle in Phase 1:
sequenceDiagram
participant Sim as Backend (PyBullet)
participant HAL as HAL Providers
participant Bus as Event Bus
participant Ctrl as Control (manual)
participant Tel as Telemetry
Sim->>HAL: step(dt) advances physics
HAL->>Bus: SensorSample[IMU] @ stamp_sim_ns
HAL->>Bus: SensorSample[RGB] @ stamp_sim_ns
HAL->>Tel: write(/sensors/...)
Bus->>Ctrl: notify
Ctrl->>HAL: ActuatorCommand
HAL->>Sim: apply commands
Ctrl->>Tel: write(/cmd/...)
Rules:
- Every published sample carries
stamp_sim_ns,stamp_wall_ns, and a per-sensor monotonicseq. - The backend does not write to the telemetry bus directly: providers do, to keep a single capture point.
- Control never calls physics: it only publishes commands to the
ActuatorSink.
6. Modules and responsibilities¶
Package (src/project_ghost/) |
Responsibility | Phase introduced |
|---|---|---|
core/ |
Base types, errors, time utilities, RandomSource, configuration | Phase 1 |
hal/ |
HAL Protocols, messages, and conformance tests | Phase 1 |
state/ |
Canonical VehicleState, frame conventions, transforms |
Phase 1 |
events/ |
Event bus, event schemas, severities, correlation | Phase 1 |
telemetry/ |
Telemetry bus, sinks (MCAP, Rerun, console), per-run manifest | Phase 1 |
simulation/ |
Simulation backends; in Phase 1 only PyBullet | Phase 1 |
sensors/ |
Backend-agnostic providers (processing, optional sync) | Phase 1 minimal |
actuators/ |
Backend-agnostic sinks, safety envelope, mixers | Phase 1 minimal |
perception/ |
Features, depth, detection | Phase 3+ |
estimation/ |
EKF/UKF, IMU preintegration | Phase 3 |
slam/ |
Frontend, backend, loop closure | Phase 4 |
mapping/ |
Occupancy, ESDF, frontiers | Phase 5 |
planning/ |
Global and local | Phase 5–6 |
control/ |
Attitude/velocity/position cascade, MPC | Phase 2+ |
mission/ |
Behavior trees, FSM, goals | Phase 7 |
Any package after Phase 1 is introduced only when needed, with its own ADR if it modifies frozen contracts.
7. Permitted and forbidden dependencies¶
One of the project's most important decisions. Validation is automated in CI (import-linter or deptry).
| Package | Permitted | Forbidden |
|---|---|---|
core/ |
stdlib, numpy, typing |
everything else |
hal/ |
stdlib, numpy, core |
OpenCV, Torch, PyBullet, Gazebo, ROS, MAVSDK |
state/ |
stdlib, numpy, scipy.spatial.transform, core |
sensors, backends |
events/ |
stdlib, core |
disk I/O, backends |
telemetry/ |
stdlib, mcap, mcap-protobuf-support, rerun-sdk, core |
simulation backends |
simulation/pybullet/ |
pybullet, numpy, hal, state, core |
Torch, OpenCV (except for camera-buffer conversion), ROS |
simulation/gazebo_px4/ (Phase 4+) |
mavsdk, gz bindings, hal, state, core |
PyBullet |
Cognitive layers (perception/, estimation/, etc.) |
their specific deps + hal, state, core |
concrete backends, other parallel cognitive layers |
Global rules:
hal/never imports anything fromsimulation/. If it needs to know about backend capabilities, it does so via theCapabilitiesinterface.- No cognitive layer imports another cognitive layer at the same level. Communication is via the bus.
- Any ML model used must have open weights and a permissive license. Ultralytics (AGPL) is forbidden.
8. Frozen technical conventions¶
These conventions are the most expensive to revert and are fixed now. Changing them requires an ADR.
| Aspect | Decision |
|---|---|
| World frame | ENU (East-North-Up), z=0 at ground |
| Body frame | FLU (Forward-Left-Up) |
| Quaternion | Hamilton, [w, x, y, z], compatible with scipy.spatial.transform.Rotation |
| Units | Strict SI. No feet, no knots. Angles in radians except in config/UI |
| Precision | float64 for state and covariances; float32 only in image/depth buffers |
| Time | Integer nanoseconds in all timestamps. Floats forbidden for time |
| Randomness | Hierarchical injected RandomSource. Global random or np.random forbidden |
| Project layout | src/-layout with src/project_ghost/... |
| Schema versioning | Every message dataclass carries schema_version: int |
Conversion to/from external conventions (PX4 NED/FRD, JPL quaternion, etc.) happens in the boundary adapter, never scattered across the code.
9. Growth strategy¶
The project is built in orthogonal phases. Each phase adds a layer without rewriting the previous ones.
| Phase | Adds | Active layers | Exit criterion |
|---|---|---|---|
| 0 — Foundations | Docs and contracts | — | Architecture, ADRs, and specs approved |
| 1 — Sim + telemetry + manual | HAL, PyBullet, Rerun, replay | 0 | make demo flies manually and replays identically |
| 2 — Stabilization | Cascade control | 0, 1 | Hover ±5 cm over 60 s with IMU noise |
| 3 — State estimation | 15-DoF EKF + VO | 0–3 | ATE < 5% on the corridor scenario |
| 4 — SLAM | BA backend, loop closure | 0–4 | Drift < 1% after loop closure on warehouse |
| 5 — Planning | A*, local MPC, ESDF | 0–6 | 0 collisions on forest, 95% success over 50 runs |
| 6 — Autonomous navigation | Replanning + recovery | 0–6 | Mission in a new scenario, 90% success |
| 7 — End-to-end mission | Behavior tree, exploration | 0–7 | Explore, find target, return, autonomous |
Each phase runs:
- Read of current specs.
- Detailed design (ADRs if contracts are affected).
- Implementation + tests.
- Acceptance metrics measured on the regression bench.
- Updated documentation.
10. Hardware migration strategy¶
This is the acid test of the design. If migration hurts, the architecture failed.
Supported backends. Project Ghost officially supports three backends through its lifetime:
graph LR
SIM[PyBullet<br/>Phases 1-3] --> SITL[Gazebo + PX4 SITL<br/>Phases 4-7]
SITL --> HITL[PX4 HITL<br/>Pixhawk + Sim]
HITL --> HW[Hardware<br/>Companion Linux + Pixhawk]
Other simulators (Isaac Sim, AirSim, Webots) are possible future community backends, not architectural commitments. They are welcome but not maintained by the core. The HAL contract is what makes such future additions cheap.
The two key substitutions:
SimulationBackend→SimulationBackend(Gazebo+PX4 SITL): the HAL contract does not change. The concrete backend changes. Layers 1–7 do not notice.SimulationBackend→RuntimeBackend(hardware): the time paradigm is replaced (sim → wall) and the MAVLink adapter is introduced. The rest of the system keeps consuming the same Protocols.
Mandatory changes when moving to hardware:
- Randomness:
RandomSourcestops being deterministic (the real world cannot be frozen). Runs are marked as non-deterministic in the manifest. - Time:
SystemClockreplacesSimClock. Periodic schedulers are tied to the OS monotonic clock. - Safety: the
SafetyEnvelopestops being preventive and becomes load-bearing, redundant with a hardware kill switch. - Camera: calibration is a repo artifact, not generated by the simulator.
Risk mitigations:
- Before real hardware, HITL is mandatory with the same physical Pixhawk that will fly.
- Initial flights are indoor, tethered, or in a netted cage, with telemetry always on.
- Software geofence + RTL are configured before each flight.
- Regulatory framing (Mexico, AFAC): experimental flights under supervision, documented weight.
11. What this system is not¶
Explicitly out of scope:
- A replacement autopilot for PX4 / ArduPilot. Project Ghost lives on top of an autopilot on real hardware.
- An LLM agent framework. There is no agent orchestration in the autonomy system.
- A commercial product. It is open, reproducible research.
- An outdoor drone in extreme weather for Phases 1–7. Focus: indoor / controlled outdoor.
12. Glossary¶
- HAL: Hardware Abstraction Layer. Sensor/actuator contract.
- SITL: Software In The Loop. Firmware running on a PC with a connected sim.
- HITL: Hardware In The Loop. Firmware running on the physical autopilot with a connected sim.
- MCAP: Open binary log format, indexed, multi-channel, schema-aware.
- ATE / RPE: Absolute / Relative Trajectory Error. Standard SLAM/odometry metrics.
- ESDF: Euclidean Signed Distance Field. Differentiable distance map for planning.
- ENU / FLU / NED / FRD: Frame conventions. ENU/FLU is the pair used here.
- VIO: Visual-Inertial Odometry.
13. Cross-references¶
- ADRs:
docs/adr/ - Detailed specs:
docs/specs/ - Adversarial review:
docs/reviews/architecture_red_team_review.md - Phase 1 plan:
docs/roadmaps/phase1.md