Training the Reward Model

The pretrained reward model we found earlier cannot be used in ColossalAI's reinforcement-learning stage, and it was trained only on English data, so we trained our own reward model with ColossalAI's code and a Chinese dataset.

Dataset

We use hh_rlhf_cn, a Chinese translation of hh-rlhf, the helpfulness-and-harmlessness human-preference data open-sourced with Anthropic's paper Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.

The dataset contains both Chinese and English samples, so the trained model can recognize preferences in both languages.

The hh_rlhf_cn dataset

Each sample's context is concatenated with its chosen and rejected responses to form the reward model's inputs.

from datasets import load_dataset, load_from_disk

def merge_prefix(example):
    # Concatenate all context turns into a single dialogue prefix,
    # each turn formatted as "\n\n<role>: <text>".
    context_all = ""
    for context in example["context"]:
        context_all += "\n\n" + context["role"] + ": " + context["text"]
    # Append the chosen / rejected reply to the shared prefix, so each field
    # becomes a complete dialogue that can be scored by the reward model.
    example["chosen"] = context_all + "\n\n" + example["chosen"]["role"] + ": " + example["chosen"]["text"]
    example["rejected"] = context_all + "\n\n" + example["rejected"]["role"] + ": " + example["rejected"]["text"]
    return example

data = load_dataset("/Volumes/T7/数据集/hh_rlhf_cn")
new_data = data.map(merge_prefix, remove_columns=["context"])
new_data.save_to_disk("/Volumes/T7/数据集/hh_rlhf_cn_processed")

# data = load_from_disk("/Volumes/T7/数据集/hh_rlhf_cn_processed")
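
For illustration, here is a quick sketch of what merge_prefix does to a single record. The record below is made up (the actual role strings in hh_rlhf_cn may differ), but its layout matches what the function expects.

# Hypothetical record in the layout merge_prefix expects; not taken from the dataset.
example = {
    "context": [{"role": "Human", "text": "你好，能推荐一本书吗？"}],
    "chosen": {"role": "Assistant", "text": "当然可以，我推荐《三体》。"},
    "rejected": {"role": "Assistant", "text": "不太清楚。"},
}

merged = merge_prefix(example)
# merged["chosen"] and merged["rejected"] each now hold the full dialogue:
# the shared context turns followed by the chosen / rejected reply,
# with every turn prefixed by "\n\n<role>: ".
print(merged["chosen"])
print(merged["rejected"])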

Code

We use the code from ColossalAI.

Reward model code

The reward model is the bloom-560m language model plus a linear head with output size 1.

# Excerpt from ColossalAI; RewardModel is defined below and LoRAModule
# comes from the same package.
from typing import Optional

import torch
import torch.nn as nn
from transformers import BloomConfig, BloomModel


class BLOOMRM(RewardModel):
    """
    BLOOM Reward model.

    Args:
        pretrained (str): Pretrained model name or path.
        config (BloomConfig): Model config.
        lora_rank (int): LoRA rank.
        lora_train_bias (str): LoRA bias training mode.
    """

    def __init__(self,
                 pretrained: str = None,
                 config: Optional[BloomConfig] = None,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = BloomModel.from_pretrained(pretrained)
        elif config is not None:
            model = BloomModel(config)
        else:
            model = BloomModel(BloomConfig())

        value_head = nn.Linear(model.config.hidden_size, 1)
        value_head.weight.data.normal_(mean=0.0, std=1 / (model.config.hidden_size + 1))
        super().__init__(model, value_head, lora_rank, lora_train_bias)
class RewardModel(LoRAModule):
    """
    Reward model base class.

    Args:
        model (nn.Module): Reward model.
        value_head (nn.Module): Value head to get reward score.
        lora_rank (int): LoRA rank.
        lora_train_bias (str): LoRA bias training mode.
    """

    def __init__(self,
                 model: nn.Module,
                 value_head: Optional[nn.Module] = None,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        super().__init__(lora_rank=lora_rank, lora_train_bias=lora_train_bias)
        self.model = model
        self.convert_to_lora()

        if value_head is not None:
            if value_head.out_features != 1:
                raise ValueError("The value head of reward model's output dim should be 1!")
            self.value_head = value_head
        else:
            self.value_head = nn.Linear(model.config.n_embd, 1)

    def forward(self, sequences: torch.LongTensor, attention_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        outputs = self.model(sequences, attention_mask=attention_mask)
        last_hidden_states = outputs['last_hidden_state']
        # Per-token values of shape (B, T-1, 1): the value at the final position is dropped.
        values = self.value_head(last_hidden_states)[:, :-1]
        # Average over positions to get one scalar reward per sequence.
        value = values.mean(dim=1).squeeze(1)    # ensure shape is (B)
        return value
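
The following is a minimal scoring sketch of my own (not from the ColossalAI examples): it assumes the public bigscience/bloom-560m checkpoint from Hugging Face and shows that the model maps one merged dialogue to one scalar reward.

import torch
from transformers import AutoTokenizer

# A local path such as ~/model/bloom-560m (as in the training command below)
# works the same way as the Hugging Face model id used here.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
rm = BLOOMRM(pretrained="bigscience/bloom-560m")
rm.eval()

text = "\n\nHuman: 你好，能推荐一本书吗？\n\nAssistant: 当然可以，我推荐《三体》。"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    reward = rm(inputs["input_ids"], attention_mask=inputs["attention_mask"])

print(reward.shape)  # torch.Size([1]): one scalar reward per sequence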

Loss function code

$$\mathrm{loss}(\theta) = -\frac{1}{\binom{K}{2}} E_{(x, y_w, y_l) \sim D}\left[\log\left(\sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right)\right]$$

Each prompt with K ranked responses yields $\binom{K}{2}$ comparison pairs; the implementation below simply averages the log-sigmoid term over the pairs in a batch.

class LogSigLoss(nn.Module):
    """
    Pairwise Loss for Reward Model
    Details: https://arxiv.org/abs/2203.02155
    """

    def forward(self, chosen_reward: torch.Tensor, reject_reward: torch.Tensor) -> torch.Tensor:
        probs = torch.sigmoid(chosen_reward - reject_reward)
        log_probs = torch.log(probs)
        loss = -log_probs.mean()
        return loss
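
As a quick sanity check with made-up reward values: the loss is small when the chosen response already scores clearly above the rejected one, and large when the ordering is reversed.

import torch

loss_fn = LogSigLoss()

chosen = torch.tensor([2.0, 0.5])
rejected = torch.tensor([-1.0, 1.5])

# Pair 1: margin +3.0 -> -log(sigmoid(3.0))  ≈ 0.049
# Pair 2: margin -1.0 -> -log(sigmoid(-1.0)) ≈ 1.313
# Batch mean ≈ 0.681
print(loss_fn(chosen, rejected))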

Run parameters

torchrun --standalone --nproc_per_node=1 train_reward_model.py \
--model 'bloom' \
--strategy colossalai_zero2 \
--loss_fn 'log_sig' \
--dataset 'Anthropic/hh-rlhf' \
--pretrain ~/model/bloom-560m \
--tokenizer ~/model/bloom-560m
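
Conceptually, one optimization step scores the chosen and rejected sequences with the same model and applies the pairwise loss. The sketch below is a simplified outline of my own, not ColossalAI's actual train_reward_model.py (which also handles the strategy, LoRA, checkpointing, and distributed setup); the batch key names and learning rate are assumptions.

import torch

optimizer = torch.optim.Adam(rm.parameters(), lr=5e-6)  # learning rate is an assumed value
loss_fn = LogSigLoss()

def train_step(batch):
    # batch holds tokenized chosen / rejected sequences for the same prompts
    # (key names here are illustrative).
    chosen_reward = rm(batch["chosen_input_ids"],
                       attention_mask=batch["chosen_attention_mask"])
    reject_reward = rm(batch["rejected_input_ids"],
                       attention_mask=batch["rejected_attention_mask"])
    loss = loss_fn(chosen_reward, reject_reward)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()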

Results

Step Loss Dist Acc
344317 0.09899902 0.38238678 0.61886833
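
The table does not define Dist and Acc; my reading is that Dist is the mean reward gap between chosen and rejected responses on the evaluation set, and Acc is the fraction of pairs where the chosen response gets the higher reward. A sketch of that computation:

import torch

def eval_metrics(chosen_reward: torch.Tensor, reject_reward: torch.Tensor):
    # Assumed definitions of the metrics reported above.
    dist = (chosen_reward - reject_reward).mean().item()         # average reward gap
    acc = (chosen_reward > reject_reward).float().mean().item()  # fraction ranked correctly
    return dist, acc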
