Training the Reward Model

The pretrained reward model we found earlier cannot be used in ColossalAI's reinforcement-learning stage, and it was trained only on English data, so we trained our own reward model with ColossalAI's code and a Chinese dataset.

Dataset

We use hh_rlhf_cn, a Chinese translation of hh-rlhf, the helpfulness-and-harmlessness human-preference data open-sourced with Anthropic's paper Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.

The dataset contains both Chinese and English samples, so the trained model can recognize preferences in both languages.

The hh_rlhf_cn dataset

Each sample's context is concatenated with its chosen and rejected responses to form the reward model's inputs.

from datasets import load_dataset, load_from_disk

def merge_prefix(example):
    # Concatenate all context turns into a single dialogue prefix,
    # each turn formatted as "\n\n<role>: <text>".
    context_all = ""
    for context in example["context"]:
        context_all += "\n\n" + context["role"] + ": " + context["text"]
    # Append the chosen / rejected reply to the shared prefix, so each field
    # becomes a complete dialogue that can be scored by the reward model.
    example["chosen"] = context_all + "\n\n" + example["chosen"]["role"] + ": " + example["chosen"]["text"]
    example["rejected"] = context_all + "\n\n" + example["rejected"]["role"] + ": " + example["rejected"]["text"]
    return example

data = load_dataset("/Volumes/T7/数据集/hh_rlhf_cn")
new_data = data.map(merge_prefix, remove_columns=["context"])
new_data.save_to_disk("/Volumes/T7/数据集/hh_rlhf_cn_processed")

# data = load_from_disk("/Volumes/T7/数据集/hh_rlhf_cn_processed")
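
For illustration, here is a quick sketch of what merge_prefix does to a single record. The record below is made up (the actual role strings in hh_rlhf_cn may differ), but its layout matches what the function expects.

# Hypothetical record in the layout merge_prefix expects; not taken from the dataset.
example = {
    "context": [{"role": "Human", "text": "你好，能推荐一本书吗？"}],
    "chosen": {"role": "Assistant", "text": "当然可以，我推荐《三体》。"},
    "rejected": {"role": "Assistant", "text": "不太清楚。"},
}

merged = merge_prefix(example)
# merged["chosen"] and merged["rejected"] each now hold the full dialogue:
# the shared context turns followed by the chosen / rejected reply,
# with every turn prefixed by "\n\n<role>: ".
print(merged["chosen"])
print(merged["rejected"])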

Code

We use the code from ColossalAI.

Reward model code

The reward model is the bloom-560m language model plus a linear head with output size 1.

# Excerpt from ColossalAI; RewardModel is defined below and LoRAModule
# comes from the same package.
from typing import Optional

import torch
import torch.nn as nn
from transformers import BloomConfig, BloomModel


class BLOOMRM(RewardModel):
    """
    BLOOM Reward model.

    Args:
        pretrained (str): Pretrained model name or path.
        config (BloomConfig): Model config.
        lora_rank (int): LoRA rank.
        lora_train_bias (str): LoRA bias training mode.
    """

    def __init__(self,
                 pretrained: str = None,
                 config: Optional[BloomConfig] = None,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = BloomModel.from_pretrained(pretrained)
        elif config is not None:
            model = BloomModel(config)
        else:
            model = BloomModel(BloomConfig())

        value_head = nn.Linear(model.config.hidden_size, 1)
        value_head.weight.data.normal_(mean=0.0, std=1 / (model.config.hidden_size + 1))
        super().__init__(model, value_head, lora_rank, lora_train_bias)
class RewardModel(LoRAModule):
    """
    Reward model base class.

    Args:
        model (nn.Module): Reward model.
        value_head (nn.Module): Value head to get reward score.
        lora_rank (int): LoRA rank.
        lora_train_bias (str): LoRA bias training mode.
    """

    def __init__(self,
                 model: nn.Module,
                 value_head: Optional[nn.Module] = None,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        super().__init__(lora_rank=lora_rank, lora_train_bias=lora_train_bias)
        self.model = model
        self.convert_to_lora()

        if value_head is not None:
            if value_head.out_features != 1:
                raise ValueError("The value head of reward model's output dim should be 1!")
            self.value_head = value_head
        else:
            self.value_head = nn.Linear(model.config.n_embd, 1)

    def forward(self, sequences: torch.LongTensor, attention_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        outputs = self.model(sequences, attention_mask=attention_mask)
        last_hidden_states = outputs['last_hidden_state']
        # Per-token values of shape (B, T-1, 1): the value at the final position is dropped.
        values = self.value_head(last_hidden_states)[:, :-1]
        # Average over positions to get one scalar reward per sequence.
        value = values.mean(dim=1).squeeze(1)    # ensure shape is (B)
        return value
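
The following is a minimal scoring sketch of my own (not from the ColossalAI examples): it assumes the public bigscience/bloom-560m checkpoint from Hugging Face and shows that the model maps one merged dialogue to one scalar reward.

import torch
from transformers import AutoTokenizer

# A local path such as ~/model/bloom-560m (as in the training command below)
# works the same way as the Hugging Face model id used here.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
rm = BLOOMRM(pretrained="bigscience/bloom-560m")
rm.eval()

text = "\n\nHuman: 你好，能推荐一本书吗？\n\nAssistant: 当然可以，我推荐《三体》。"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    reward = rm(inputs["input_ids"], attention_mask=inputs["attention_mask"])

print(reward.shape)  # torch.Size([1]): one scalar reward per sequence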

Loss function code

$$\mathrm{loss}(\theta) = -\frac{1}{\binom{K}{2}} E_{(x, y_w, y_l) \sim D}\left[\log\left(\sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right)\right]$$

Each prompt with K ranked responses yields $\binom{K}{2}$ comparison pairs; the implementation below simply averages the log-sigmoid term over the pairs in a batch.

class LogSigLoss(nn.Module):
    """
    Pairwise Loss for Reward Model
    Details: https://arxiv.org/abs/2203.02155
    """

    def forward(self, chosen_reward: torch.Tensor, reject_reward: torch.Tensor) -> torch.Tensor:
        probs = torch.sigmoid(chosen_reward - reject_reward)
        log_probs = torch.log(probs)
        loss = -log_probs.mean()
        return loss
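
As a quick sanity check with made-up reward values: the loss is small when the chosen response already scores clearly above the rejected one, and large when the ordering is reversed.

import torch

loss_fn = LogSigLoss()

chosen = torch.tensor([2.0, 0.5])
rejected = torch.tensor([-1.0, 1.5])

# Pair 1: margin +3.0 -> -log(sigmoid(3.0))  ≈ 0.049
# Pair 2: margin -1.0 -> -log(sigmoid(-1.0)) ≈ 1.313
# Batch mean ≈ 0.681
print(loss_fn(chosen, rejected))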

Run parameters

torchrun --standalone --nproc_per_node=1 train_reward_model.py \
--model 'bloom' \
--strategy colossalai_zero2 \
--loss_fn 'log_sig' \
--dataset 'Anthropic/hh-rlhf' \
--pretrain ~/model/bloom-560m \
--tokenizer ~/model/bloom-560m
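
Conceptually, one optimization step scores the chosen and rejected sequences with the same model and applies the pairwise loss. The sketch below is a simplified outline of my own, not ColossalAI's actual train_reward_model.py (which also handles the strategy, LoRA, checkpointing, and distributed setup); the batch key names and learning rate are assumptions.

import torch

optimizer = torch.optim.Adam(rm.parameters(), lr=5e-6)  # learning rate is an assumed value
loss_fn = LogSigLoss()

def train_step(batch):
    # batch holds tokenized chosen / rejected sequences for the same prompts
    # (key names here are illustrative).
    chosen_reward = rm(batch["chosen_input_ids"],
                       attention_mask=batch["chosen_attention_mask"])
    reject_reward = rm(batch["rejected_input_ids"],
                       attention_mask=batch["rejected_attention_mask"])
    loss = loss_fn(chosen_reward, reject_reward)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()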

Results

Step Loss Dist Acc
344317 0.09899902 0.38238678 0.61886833
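
The table does not define Dist and Acc; my reading is that Dist is the mean reward gap between chosen and rejected responses on the evaluation set, and Acc is the fraction of pairs where the chosen response gets the higher reward. A sketch of that computation:

import torch

def eval_metrics(chosen_reward: torch.Tensor, reject_reward: torch.Tensor):
    # Assumed definitions of the metrics reported above.
    dist = (chosen_reward - reject_reward).mean().item()         # average reward gap
    acc = (chosen_reward > reject_reward).float().mean().item()  # fraction ranked correctly
    return dist, acc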
