ITFormer: Bridging Time Series Signal and Natural Language for Multi-Modal QA

The First Large-Scale Multi-Task Time-Series Question Answering Dataset and Framework

ICML 2025 · Time-Series QA · Multi-Modal AI · 110k+ QA Pairs
Shanghai Jiao Tong University
Shanghai Innovation Institute
Fudan University

Abstract

Time-series data are critical in diverse applications, such as industrial monitoring, medical diagnostics, and climate research. However, effectively integrating these high-dimensional temporal signals with natural language for dynamic, interactive tasks remains a significant challenge.

To address this, we introduce the Time-Series Question Answering (Time-Series QA) task and release EngineMT-QA, the first large-scale, multi-task, temporal-textual QA dataset designed to capture complex interactions between time-series signals and natural language.

Building on this resource, we propose the Instruct Time Transformer (ITFormer), a novel framework that bridges time-series encoders with frozen large language models (LLMs). ITFormer effectively extracts, aligns, and fuses temporal and textual features, achieving strong improvements in QA accuracy over baselines while adding fewer than 1% additional trainable parameters.

EngineMT-QA Dataset

A comprehensive multi-task QA dataset based on real-world aero-engine sensor data. EngineMT-QA contains 110k+ QA pairs across four task types, constructed from 32-channel flight data drawn from the N-CMAPSS dataset.

110k+ QA Pairs
32 Channels
4 Task Types

Task Categories:

  • Understanding: Interpret sensor relationships and semantic implications
  • Perception: Identify health-state semantics and perform fault diagnosis
  • Reasoning: Infer degradation trends and predict failure probability
  • Decision-Making: Generate maintenance recommendations and operational decisions
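
As a concrete illustration (the field names and values below are our own placeholders, not the released schema), a single EngineMT-QA record pairs one 32-channel signal segment with a task-specific question and a free-text answer:

```python
# Hypothetical sketch of one EngineMT-QA record; field names and values are
# illustrative assumptions, not the released schema.
example_record = {
    "signal": "unit_042_window_0017.npy",  # placeholder path to a 32-channel N-CMAPSS segment
    "num_channels": 32,
    "task": "reasoning",                   # understanding | perception | reasoning | decision-making
    "question": "Given the recent trend across the temperature and pressure channels, "
                "how likely is this engine to fail within the next few cycles?",
    "answer": "Sensor drift indicates accelerating degradation, so near-term failure is likely.",
}

print(example_record["task"], "->", example_record["question"])
```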

ITFormer Architecture

Key components enabling effective temporal-textual modeling:

  • TPE (Time Token Position Encoding): Temporal + channel + segment positional encoding
  • LIT (Learnable Instruct Tokens): Instructional tokens guiding semantic alignment
  • ITA (Instruct Time Attention): Temporal-textual cross-modal attention mechanism
  • TAL (Time Token as Language): Projects time tokens as natural language inputs for LLMs

Key Innovation: ITFormer acts as an intermediary connector, enabling seamless integration between temporal encoders and frozen LLMs with minimal computational overhead.
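
The paper describes the exact implementation; the sketch below is only a rough PyTorch illustration of how the four components could compose, with all dimensions, module names, and the single attention layer being our own assumptions:

```python
import torch
import torch.nn as nn


class ITFormerSketch(nn.Module):
    """Minimal single-layer sketch of an ITFormer-style connector (not the official code).

    Time tokens from a time-series encoder receive temporal + channel + segment
    positions (TPE), are queried by learnable instruct tokens (LIT) through
    cross-modal attention (ITA), and are projected into the frozen LLM's
    embedding space so they can be read as ordinary language tokens (TAL).
    """

    def __init__(self, d_time=256, d_llm=4096, n_instruct=32,
                 max_steps=512, n_channels=32, max_segments=8, n_heads=8):
        super().__init__()
        # TPE: temporal + channel + segment positional encodings
        self.pos_time = nn.Embedding(max_steps, d_time)
        self.pos_channel = nn.Embedding(n_channels, d_time)
        self.pos_segment = nn.Embedding(max_segments, d_time)
        # LIT: learnable instruct tokens that act as queries
        self.instruct_tokens = nn.Parameter(torch.randn(n_instruct, d_time) * 0.02)
        # ITA: temporal-textual cross-attention
        self.cross_attn = nn.MultiheadAttention(d_time, n_heads, batch_first=True)
        # TAL: map fused tokens into the LLM embedding space
        self.to_llm = nn.Linear(d_time, d_llm)

    def forward(self, time_tokens, t_idx, c_idx, s_idx):
        # time_tokens: (B, N, d_time); t_idx / c_idx / s_idx: (B, N) position indices
        x = time_tokens + self.pos_time(t_idx) + self.pos_channel(c_idx) + self.pos_segment(s_idx)
        queries = self.instruct_tokens.unsqueeze(0).expand(x.size(0), -1, -1)
        fused, _ = self.cross_attn(queries, x, x)  # (B, n_instruct, d_time)
        return self.to_llm(fused)                  # (B, n_instruct, d_llm)


# Toy usage: 2 samples with 128 time tokens each
B, N = 2, 128
tokens = torch.randn(B, N, 256)
t_idx = torch.arange(N).expand(B, N)
c_idx = torch.randint(0, 32, (B, N))
s_idx = torch.randint(0, 8, (B, N))
print(ITFormerSketch()(tokens, t_idx, c_idx, s_idx).shape)  # torch.Size([2, 32, 4096])
```

In this sketch, the fused tokens would be prepended to the embedded question text and fed to the frozen LLM, which is what lets the connector stay small while the LLM handles language generation.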

Results

ITFormer achieves state-of-the-art performance on the EngineMT-QA benchmark. With fewer than 1% additional trainable parameters, its accuracy and robustness scale well with model size, and it outperforms both vision-text and time-series baselines.

  • Understanding (ROUGE-L): 58.04
  • Perception (Accuracy): 65.07%
  • Reasoning (F1): 88.69
  • Decision-Making (BLEU): 38.68

Performance scales consistently across model sizes (0.5B, 3B, 7B parameters), demonstrating the effectiveness of our approach in integrating time-series signals with natural language understanding.
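
To make the "fewer than 1% additional trainable parameters" figure concrete, the usual recipe is to freeze the LLM and train only the connector; below is a minimal sketch of counting that fraction, where the placeholder modules stand in for the real LLM and connector:

```python
import torch.nn as nn


def trainable_fraction(model: nn.Module) -> float:
    """Share of parameters that will actually receive gradient updates."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / total


# Placeholder stand-ins so the snippet runs on its own; in practice the frozen
# part is a 0.5B/3B/7B LLM and the trainable part is the ITFormer connector.
frozen_llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True),
    num_layers=24,
)
for p in frozen_llm.parameters():
    p.requires_grad_(False)  # the LLM stays frozen

connector = nn.Sequential(   # small trainable connector, for illustration only
    nn.Linear(256, 1024),
    nn.GELU(),
    nn.Linear(1024, 1024),
)

model = nn.ModuleDict({"llm": frozen_llm, "connector": connector})
print(f"trainable: {trainable_fraction(model):.2%} of all parameters")
```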

Figures & Visualizations

Key visualizations from our research showcasing the ITFormer framework and experimental results.

Authors & Affiliations

Shanghai Jiao Tong University
Yilin Wang, Peixuan Lei, Jie Song, Tao Chen, Haoyu Zhe, Yuxuan Zhang, Lei Jia, Yuanxiang Li
Shanghai Innovation Institute
Yilin Wang, Zhongyu Wei
Fudan University
Zhongyu Wei

Corresponding Author: Yuanxiang Li (yuanxli@sjtu.edu.cn)

Code & Citation

Stay tuned for the release of our complete codebase, including:

  • EngineMT-QA dataset with 110k+ QA pairs
  • ITFormer implementation and training scripts
  • Evaluation benchmarks and baseline comparisons
  • Pre-trained models and checkpoints

📥 Download Dataset (Coming Soon)
💻 View Code (Coming Soon)