This repository contains the code for our paper:
Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training
This project synthesizes code-switching data for multilingual language model pre-training.
- Training the Code-Switching Synthesis Model: Run the following command to launch the training script:
```bash
bash scripts/sft.sh
```
You can find our SFT data for Chinese, Romanian, and Bengali in the `data` directory. We use Qwen2.5-3B-Instruct as the base model.
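For orientation, here is a minimal sketch of an equivalent SFT setup with Hugging Face `trl`; the data file name, output directory, and hyperparameters are illustrative assumptions, and the authoritative recipe is in `scripts/sft.sh`:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical file name; use the actual SFT files under data/.
dataset = load_dataset("json", data_files="data/sft_zh.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",  # the base model used in the paper
    train_dataset=dataset,
    args=SFTConfig(output_dir="outputs/cs-synthesis-model"),  # hypothetical output dir
)
trainer.train()
```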
- Split Documents: We then use the trained model to synthesize code-switching data based on the pretraining documents. First, split the documents into sentences. Prepare your pretraining documents in JSONL format, with each document's text stored under the "text" key (a format sketch follows the command below), then run:
```bash
python3 select_sents.py \
    --data_path $data_path \
    --replace_fraction $replace_ratio \
    --output_path $sents_output_path
```
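For reference, a minimal sketch of the expected input format, one JSON object per line with the document under the "text" key (the file name is hypothetical):

```python
import json

# Two toy pretraining documents in the expected JSONL layout.
docs = [
    {"text": "First example document. It contains several sentences."},
    {"text": "Second example document. Another sentence here."},
]
with open("pretrain_docs.jsonl", "w", encoding="utf-8") as f:  # hypothetical path
    for doc in docs:
        f.write(json.dumps(doc, ensure_ascii=False) + "\n")
```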
- Generating Code-Switching Content: The next step is to use the trained model to generate the desired code-switching content for the selected sentences. You can use a vLLM or SGLang backend for this, as in the sketch below.
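A minimal offline-generation sketch with vLLM follows; the prompt template, file names, and sampling settings are illustrative assumptions (mirror the instruction format used in the SFT data in `data/`):

```python
import json
from vllm import LLM, SamplingParams

# Load the sentences selected by select_sents.py (hypothetical path and key).
with open("selected_sents.jsonl", encoding="utf-8") as f:
    sentences = [json.loads(line)["text"] for line in f]

# Placeholder prompt; use the same instruction format as the SFT data.
prompts = [f"Rewrite the following sentence with code-switching:\n{s}" for s in sentences]

llm = LLM(model="outputs/cs-synthesis-model")  # the model trained above
outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=256))

with open("generated_cs.jsonl", "w", encoding="utf-8") as f:
    for sent, out in zip(sentences, outputs):
        record = {"original": sent, "generated": out.outputs[0].text}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

After obtaining the results, replace the original sentences with the generated ones: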
```bash
# --data_path:    path to the original pretraining docs
# --replace_path: path to the generation result file
python3 replace.py \
    --data_path $data_path \
    --replace_path $replace_path \
    --output_path $output_path
```
The final output file contains the new code-switched documents.
- Pre-training: We use Megatron-LM for pre-training, without modifying its code.
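Note that Megatron-LM consumes its own indexed binary data format rather than raw JSONL, so the code-switched documents typically need to be tokenized first. A sketch using Megatron-LM's stock preprocessing script is shown below; the tokenizer choice and paths are placeholders, so adapt them to your setup:

```bash
# Tokenizer and paths below are illustrative assumptions.
python3 tools/preprocess_data.py \
    --input $output_path \
    --output-prefix $megatron_prefix \
    --json-keys text \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model Qwen/Qwen2.5-3B-Instruct \
    --workers 8 \
    --append-eod
```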