
Generalizing from Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning

📃 Paper | 🤗 Huggingface | 📭 Contact

⛰️ Overview

This repository contains data synthesis scripts and resources for long-context instruction tuning. In this project, we introduce "context synthesis", a novel approach that leverages off-the-shelf LLMs (such as GPT-4o, Qwen-2.5, LongWriter) to generate high-quality background contexts for existing short-context instruction-answer pairs.

This approach offers three advantages:

(1) In contrast to previous work that synthesizes instructions and target outputs, our synthetic data forms only part of the model input, much like back-translation in machine translation; the quality of the original instructions and outputs is therefore preserved.

(2) By generating background contexts, we can seamlessly integrate both supporting evidence and distracting information into a coherent narrative.

(3) Our approach enables control over context length through expansion and concatenation, harnessing the benefits of training on longer sequences.
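To make the approach concrete, here is a minimal sketch of the synthesis step in Python. The prompt wording and helper name are illustrative assumptions, not the exact template used in our scripts; any capable off-the-shelf LLM (e.g. GPT-4o or Qwen-2.5) can serve as the generator.

```python
# Illustrative sketch of context synthesis (the prompt wording is an
# assumption, not the exact template used in generate_synthetic_context.py).

def build_context_synthesis_prompt(instruction: str, answer: str) -> str:
    """Ask an off-the-shelf LLM to invent a background document such that
    the given instruction is answerable from it, with distractors woven in."""
    return (
        "You are given an instruction and its reference answer.\n"
        "Write a long, coherent background document such that the\n"
        "instruction can be answered from the document alone. Embed the\n"
        "supporting evidence for the answer, and also include related but\n"
        "distracting information to make the task non-trivial.\n\n"
        f"Instruction: {instruction}\n"
        f"Answer: {answer}\n\n"
        "Background document:"
    )

prompt = build_context_synthesis_prompt(
    instruction="What year was the observatory relocated, and why?",
    answer="It was relocated in 1962 to escape city light pollution.",
)
# `prompt` is then sent to a generator LLM (GPT-4o, Qwen-2.5, ...); the
# completion becomes the synthetic context placed *before* the original
# instruction in the final training example.
```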

🛠️ Data Synthesis

We provide scripts for context synthesis to build long-context instruction data.

Synthesis scripts:

```bash
cd scripts
bash context_synthesis.sh
```

Key Components

  • generate_synthetic_context.py: Generates background context based on instruction-answer pairs
  • convert_to_chat_format.py: Wraps the data in chat format, and extends context length by concatenating multiple contexts (optional)
  • We use vLLM as the inference engine for open-source LLMs; please follow the instructions in its repository to set up the inference environment (a usage sketch follows this list)
  • For proprietary models like GPT-4o, we recommend using batched API calls to save costs (script coming soon)
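As a hedged illustration of the vLLM-based generation mentioned above, the sketch below batches synthesis prompts through an open-source model. The model name, prompt wording, and sampling settings are assumptions for demonstration; the actual configuration lives in scripts/context_synthesis.sh.

```python
# Minimal vLLM batch-generation sketch. Model, prompt, and sampling
# settings are illustrative, not the project's actual configuration.
from vllm import LLM, SamplingParams

# One prompt per instruction-answer pair, asking for a background document.
prompts = [
    "Write a long background document from which the following instruction "
    "can be answered, mixing in related but distracting details.\n\n"
    "Instruction: Summarize the treaty's key terms.\n"
    "Answer: It established a demilitarized border zone.\n\n"
    "Background document:"
]

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")      # any open-source generator
params = SamplingParams(temperature=0.8, max_tokens=2048)

for output in llm.generate(prompts, params):
    synthetic_context = output.outputs[0].text   # one synthetic context per pair
    print(synthetic_context[:200])
```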

Data Download

In our experiments, we perform both:

  • Context synthesis (with instructions in ./seed_instruction)
  • Instruction synthesis (with context in ./seed_context)

Our synthesized data is available in our Hugging Face collection.

⏳ Instruction-tuning

We utilize the LongAlign framework for long-context instruction tuning.
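For orientation, a single training record after conversion might look like the sketch below. The field names and layout are assumptions based on common chat-format conventions; consult convert_to_chat_format.py and the LongAlign repository for the exact schema.

```python
# Hypothetical chat-format training record (field names are assumptions,
# not the verified schema of convert_to_chat_format.py or LongAlign).
import json

record = {
    "messages": [
        {
            "role": "user",
            # One or more synthetic contexts, concatenated to extend the
            # sequence length, followed by the original short instruction.
            "content": "<synthetic context(s)>\n\n<original instruction>",
        },
        # The original answer is kept verbatim as the target output.
        {"role": "assistant", "content": "<original answer>"},
    ]
}
print(json.dumps(record, indent=2))
```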

📏 Evaluation

For model evaluation, we employ several document-level benchmarks. Please refer to their respective repositories for detailed implementation and usage instructions.

🌲 Citation

```bibtex
@misc{zhu2025generalizingshortlongeffective,
      title={Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning},
      author={Wenhao Zhu and Pinzhen Chen and Hanxu Hu and Shujian Huang and Fei Yuan and Jiajun Chen and Alexandra Birch},
      year={2025},
      eprint={2502.15592},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.15592},
}
```
