Abstract
Time-series data are critical in diverse applications, such as industrial monitoring, medical diagnostics, and climate research. However, effectively integrating these high-dimensional temporal signals with natural language for dynamic, interactive tasks remains a significant challenge.
To address this, we introduce the Time-Series Question Answering (Time-Series QA) task and release EngineMT-QA, the first large-scale, multi-task, temporal-textual QA dataset designed to capture complex interactions between time-series signals and natural language.
Building on this resource, we propose the Instruct Time Transformer (ITFormer), a novel framework that bridges time-series encoders with frozen large language models (LLMs). ITFormer effectively extracts, aligns, and fuses temporal and textual features, achieving strong improvements in QA accuracy over baselines with fewer than 1% additional trainable parameters.
EngineMT-QA Dataset
A comprehensive multi-task QA dataset based on real-world aero-engine sensor data. EngineMT-QA contains 110k+ QA pairs across four task types, constructed from 32-channel flight data using the N-CMAPSS dataset; an illustrative sample structure is sketched after the task list below.
Task Categories:
- Understanding: Interpret sensor relationships and semantic implications
- Perception: Uncover health state semantics and fault diagnosis
- Reasoning: Infer degradation trends and predict failure probability
- Decision-Making: Generate maintenance recommendations and operational decisions
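The released schema may differ, but a minimal sketch of how a single EngineMT-QA sample could be represented in Python is shown below; the field names, channel count, and segment length are illustrative assumptions, not the actual dataset format.

```python
# Illustrative only: a hypothetical representation of one EngineMT-QA sample.
# Field names ("signal", "task", "question", "answer") and the 32 x 1024 shape
# are assumptions for this sketch, not the released dataset schema.
import numpy as np

sample = {
    # 32-channel sensor segment from one flight cycle: (channels, timesteps)
    "signal": np.random.randn(32, 1024).astype(np.float32),
    # One of the four task types: understanding, perception, reasoning, decision-making
    "task": "reasoning",
    "question": "Given the recent sensor trends, how quickly is the engine degrading?",
    "answer": "placeholder free-text answer describing the degradation trend",
}

print(sample["signal"].shape, sample["task"])
```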
ITFormer Architecture
Key components enabling effective temporal-textual modeling (a minimal end-to-end sketch follows this list):
- TPE (Time Token Position Encoding): Temporal + channel + segment positional encoding
- LIT (Learnable Instruct Tokens): Instructional tokens guiding semantic alignment
- ITA (Instruct Time Attention): Temporal-textual cross-modal attention mechanism
- TAL (Time Token as Language): Projects time tokens as natural-language inputs for the LLM
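Taken together, these components can be wired as in the following PyTorch sketch: time tokens receive temporal and segment positional encodings (the channel component of TPE is omitted for brevity), learnable instruct tokens attend to them through cross-modal attention, and the fused tokens are projected into the frozen LLM's embedding space. All module choices, names, and dimensions here are assumptions for illustration, not the released implementation.

```python
# A minimal sketch of the ITFormer flow: TPE -> LIT -> ITA -> TAL.
# Dimensions and wiring are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class ITFormerSketch(nn.Module):
    def __init__(self, ts_dim=32, d_model=256, llm_dim=4096,
                 num_instruct_tokens=16, num_segments=8, max_len=1024):
        super().__init__()
        # Simple per-timestep encoder producing time tokens from raw channels
        self.ts_encoder = nn.Linear(ts_dim, d_model)
        # TPE: temporal + segment positional embeddings (channel term omitted here)
        self.temporal_pos = nn.Embedding(max_len, d_model)
        self.segment_pos = nn.Embedding(num_segments, d_model)
        # LIT: learnable instruct tokens that query the temporal features
        self.instruct_tokens = nn.Parameter(torch.randn(num_instruct_tokens, d_model))
        # ITA: cross-modal attention (instruct tokens attend to time tokens)
        self.ita = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        # TAL: project fused tokens into the frozen LLM's embedding space
        self.tal_proj = nn.Linear(d_model, llm_dim)

    def forward(self, signal, segment_ids):
        # signal: (batch, timesteps, channels); segment_ids: (batch, timesteps)
        b, t, _ = signal.shape
        time_tokens = self.ts_encoder(signal)                           # (b, t, d_model)
        positions = torch.arange(t, device=signal.device)
        time_tokens = time_tokens + self.temporal_pos(positions) \
                                  + self.segment_pos(segment_ids)       # TPE
        queries = self.instruct_tokens.unsqueeze(0).expand(b, -1, -1)   # LIT
        fused, _ = self.ita(queries, time_tokens, time_tokens)          # ITA
        return self.tal_proj(fused)                                     # TAL

# Usage: the resulting tokens would be concatenated with the question's text
# embeddings and fed to a frozen LLM; only the modules above are trained.
model = ITFormerSketch()
signal = torch.randn(2, 1024, 32)
segment_ids = torch.zeros(2, 1024, dtype=torch.long)
print(model(signal, segment_ids).shape)  # torch.Size([2, 16, 4096])
```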
Results
ITFormer achieves state-of-the-art performance on the EngineMT-QA benchmark. With fewer than 1% additional trainable parameters, it outperforms both vision-text and time-series baselines in accuracy and robustness.
Performance scales consistently across backbone LLM sizes (0.5B, 3B, and 7B parameters), demonstrating the effectiveness of our approach to integrating time-series signals with natural language understanding.
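As a sanity check on the parameter-efficiency claim, the fraction of trainable parameters can be computed by freezing the LLM backbone and counting only the bridging modules, as in the hypothetical helper below; the dummy modules stand in for the real encoder, ITFormer, and LLM.

```python
# Hypothetical helper: compute the fraction of trainable parameters when the
# LLM backbone is frozen and only the bridging modules are trained.
import torch.nn as nn

def trainable_fraction(frozen_llm: nn.Module, bridge: nn.Module) -> float:
    for p in frozen_llm.parameters():
        p.requires_grad_(False)  # freeze the language model backbone
    trainable = sum(p.numel() for p in bridge.parameters() if p.requires_grad)
    total = trainable + sum(p.numel() for p in frozen_llm.parameters())
    return trainable / total

# Dummy stand-ins purely for illustration; real sizes come from the actual models.
dummy_llm = nn.Linear(4096, 4096)   # ~16.8M parameters standing in for the LLM
dummy_bridge = nn.Linear(256, 256)  # ~66K parameters standing in for the bridge
print(f"{trainable_fraction(dummy_llm, dummy_bridge):.2%}")  # well under 1%
```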
Figures & Visualizations
Key visualizations from our research showcasing the ITFormer framework and experimental results.
Code & Citation
📄 Paper: http://arxiv.org/abs/2506.14500
🔗 Project Page: https://pandalin98.github.io/itformer_site/
Stay tuned for the release of our complete codebase, including:
- EngineMT-QA dataset with 110k+ QA pairs
- ITFormer implementation and training scripts
- Evaluation benchmarks and baseline comparisons
- Pre-trained models and checkpoints