The authors introduce the Electro-Visual-Language Assistant (EVLA), a framework that integrates multi-modal scene understanding with real-time perception of an electrified powertrain's electro-mechanical state to improve driving decisions. Existing vision-language models treat vehicle dynamics as a black box; EVLA addresses this limitation by incorporating physical constraints and optimization objectives.
- EVLA utilizes a Unified Co-State Encoder (UCSE) to fuse visual, textual, and vehicle-state inputs into a shared latent representation, augmented with an Energy-Efficiency Field.
- The framework employs an Electro-aware Structured Reasoning Chain (ESRC) that replaces external chain-of-thought prompting with an internal, deterministic reasoning process.
- EVLA is trained end-to-end using a physics-guided joint loss to generate context-aware and energy-optimal driving decisions.
- Evaluations on a driving QA benchmark show EVLA outperforms fine-tuned VLM baselines, with gains of +0.0871 in final score and +5.6% in accuracy.
- Efficiency analyses indicate that EVLA achieves 36% faster inference than multi-stage pipelines.
Integrating vehicle-state awareness with structured physical reasoning is presented as crucial for developing next-generation driving assistants that are both physically grounded and energy-efficient.
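
The summary does not state the form of the physics-guided joint loss, so the following is only an illustrative sketch of the general idea: a task loss on the driving decision combined with a weighted energy penalty. The function names, the `energy_per_action` inputs, and the `energy_weight` coefficient are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_loss(logits, target_index, energy_per_action, energy_weight=0.1):
    """Hypothetical joint loss: cross-entropy plus a weighted energy term.

    logits            -- model scores for each candidate driving decision
    target_index      -- index of the reference decision
    energy_per_action -- assumed energy cost of each decision (arbitrary units)
    energy_weight     -- assumed trade-off coefficient between the two terms
    """
    probs = softmax(logits)
    # Task term: negative log-likelihood of the reference decision.
    task_loss = -math.log(probs[target_index])
    # Physics term: expected energy cost under the predicted distribution.
    expected_energy = sum(p * e for p, e in zip(probs, energy_per_action))
    return task_loss + energy_weight * expected_energy
```

Under these assumptions, setting `energy_weight` to zero recovers plain cross-entropy, while larger values push probability mass toward lower-energy decisions at some cost in task accuracy.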