CORTIS: Text-Only Adaptation of Spoken Language Models

CORTIS enables task-oriented voice agents to generate structured speech outputs by fine-tuning spoken language models with text-only task supervision. It requires no paired speech-target annotations during training, and it outperforms ASR-LLM cascades under acoustic degradation, especially in preserving high-level task semantics.