基于BERT的中文评价情感分析

graph TD; %% 主流程 A[数据处理] --> B[加载数据集]; B --> C[数据集格式]; C --> D[制作Dataset]; D --> E[MyDataset类]; %% MyDataset类的详细方法 E --> F[__init__方法]; E --> G[__len__方法]; E --> H[__getitem__方法]; %% 其他主流程步骤 I[字典操作] --> J[Vocab字典操作]; J --> K[词汇表]; J --> L[文本转换]; M[模型设计] --> N[下游任务模型设计]; N --> O[模型结构]; N --> P[模型初始化]; Q[模型训练] --> R[自定义模型训练]; R --> S[数据加载]; R --> T[优化器]; R --> U[循环训练]; V[测试评估] --> W[测试与评估]; W --> X[计算测试损失]; W --> Y[评估模型性能]; Y --> Z[输出评估结果]; %% 连接主流程 A --> I; I --> M; M --> Q; Q --> V; %% 添加样式以增强可读性 classDef main fill:#f96,stroke:#333,stroke-width:4px; classDef sub fill:#bbf,stroke:#000,stroke-width:2px; class A,I,M,Q,V main; class B,C,D,E,F,G,H,J,K,L,N,O,P,R,S,T,U,W,X,Y,Z sub;

模型结构

config.json

{
  "architectures": [
    "BertForMaskedLM"                  # 模型类型：BERT模型，带有掩码
  ],
  "attention_probs_dropout_prob": 0.1,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,                  # 模型隐藏层
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,           # 多头自注意模型：12个头
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "type_vocab_size": 2,
  "vocab_size": 21128                  # 字典数量，字典中增加或者删除都要修改，与之对应
}

模型的net内容

# 加载预训练模型
import torch
from transformers import BertModel

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

pretrained = BertModel.from_pretrained("/Volumes/Date/huggingface/model/bert-base-chinese/models--bert-base-chinese/snapshots/c30a6ed22ab4564dc1e3b2ecbf6e766b0611a33f").to(DEVICE)

print(pretrained)



"""
BertModel(
  (embeddings): BertEmbeddings(                                # 编码层（词向量化）：文本位置编码编码为词向量
    (word_embeddings): Embedding(21128, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(                                      # 特征提取
    (layer): ModuleList(
      (0-11): 12 x BertLayer(                                 # 12层
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(                                           # 输出层
    (dense): Linear(in_features=768, out_features=768, bias=True) #输出728特征
    (activation): Tanh()
  )
)

"""

BERT（Bidirectional Encoder Representations from Transformers）模型由多个组件组成，每个组件负责不同的任务：

1. Embeddings 层（词向量化）

- Word Embeddings: 将输入文本中的每个单词映射为一个固定维度的向量（例如768维）。这是通过查找预训练的嵌入矩阵来实现的。

- Position Embeddings: 由于BERT是双向的，它需要位置信息来理解句子中单词的顺序。位置嵌入为每个单词添加了其在句子中的位置信息。

- Token Type Embeddings: 用于区分不同句子（例如在句子对任务中），以便模型知道哪些标记属于哪个句子。

- LayerNorm: 对嵌入进行归一化处理，以确保数值稳定。

- Dropout: 随机丢弃一部分神经元，防止过拟合。

2. Encoder 层（特征提取）

- BertEncoder: 包含多个 BertLayer，每个 BertLayer 又包含两个子模块：`Attention` 和 Feed-Forward Network。

- Attention:

- Self-Attention: 计算输入序列中每个单词与其他单词之间的关系。这使得模型可以关注到上下文中的相关信息。

- Output: 对自注意力的结果进行线性变换，并应用归一化和dropout。

- Intermediate:

- Linear: 扩展维度（例如从768维扩展到3072维），并应用激活函数（如GELU）。

- Output:

- Linear: 缩小维度（例如从3072维缩小回768维），并应用归一化和dropout。

3. Pooler 层（输出层）

- BertPooler: 用于生成一个固定长度的向量表示整个输入序列。通常取 [CLS] 标记对应的隐藏状态，并通过线性变换和激活函数（如Tanh）得到最终表示。

graph LR; A[输入] --> B(Embeddings 层); B --> C(Encoder 层); C --> D(Pooler 层); D --> E(输出);

graph TB; %% Pooler 层 subgraph "Pooler 层" direction TB D1[Linear] D2[Tanh] D1 --> D2 end %% Encoder 层 subgraph "Encoder 层" direction TB C1[BertEncoder] subgraph BertLayer L1[Attention] L2[Intermediate] L3[Output] L1 --> L2 --> L3 end C1 --> L1 C1 --> L2 C1 --> L3 end %% Embeddings 层 subgraph "Embeddings 层" direction TB B1[Word Embeddings] B2(Position Embeddings) B3(Token Type Embeddings) B4(LayerNorm) B5(Dropout) B1 --> B2 --> B3 --> B4 --> B5 end

微调

微调是指在预训练模型的基础上，通过进一步的训练来适应特定的下游任务。BERT 模型通过预训练来学习语言的通用模式，然后通过微调来适应特定任务，如情感分析、命名实体识别等。微调过程中，通常冻结 BERT 的预训练层，只训练与下游任务相关的层。

数据集处理

加载数据集

from datasets import list_datasets,load_dataset,load_from_disk

# print(list_datasets())

#在线加载数据集
# dataset = load_dataset(path="NousResearch/hermes-function-calling-v1",split="train",cache_dir="dataset/")
# #缓存数据到本地
# dataset.save_to_disk(dataset_path="dataset/mydata/")
# print(dataset)
# #转为csv格式
# dataset.to_csv(path_or_buf="dataset/hermes-function-calling-v1.csv")
# 加载磁盘数据（huggingface）
# dataset = load_from_disk(r"D:\PycharmProjects\demo_14\dataset_test\dataset\mydata")
# print(dataset)
#加载csv数据
# dataset = load_dataset(path="csv",data_files=r"D:\PycharmProjects\demo_14\dataset_test\dataset\hermes-function-calling-v1.csv")
# print(dataset)

# 加载磁盘数据（huggingface）
dataset = load_from_disk(r"D:\PycharmProjects\demo_14\dataset_test\dataset\ChnSentiCorp")
print(dataset)
#取出训练集
train_dataset = dataset["train"]
print(train_dataset)
#查看数据
for data in train_dataset:
    print(data)

数据集格式

Hugging Face 的 datasets 库支持多种数据集格式，如 CSV、JSON、TFRecord 等。在本案例中，使用CSV 格式，CSV 文件应包含两列：一列是文本数据，另一列是情感标签。

数据集信息

加载数据集后，可以查看数据集的基本信息，如数据集大小、字段名称等。这有助于我们了解数据的分布情况，并在后续步骤中进行适当的处理。

{
  "builder_name": "chn_senti_corp",
  "citation": "",
  "config_name": "default",
  "dataset_size": 3871919,
  "description": "",
  "download_checksums": {
    "https://drive.google.com/u/0/uc?id=1uV-aDQoMI51A27OxVgJnzxqZFQqkDydZ&export=download": {
      "num_bytes": 3032906,
      "checksum": "5394d3b2867ece9452a275fc84924ae030d827bb57b90c6aed6009ffdd1b8cee"
    },
    "https://drive.google.com/u/0/uc?id=1kI_LUYm9m0pVGDb4LHqtjwauUrIGo_rC&export=download": {
      "num_bytes": 375862,
      "checksum": "a8770badf76a638fb21f92d5be9280eb8c50a7644dbe2dd881aff040c6fe6513"
    },
    "https://drive.google.com/u/0/uc?id=1YJkSqGFeo-8ifmgVbxGMTte7Y-yxPHg7&export=download": {
      "num_bytes": 371381,
      "checksum": "5ffc04692e4f7bf6b18eb9f94cef3090bd5fb55bb3e138ac7080f0739529ef1f"
    }
  },
  "download_size": 3780149,
  "features": {
    "text": {
      "dtype": "string",
      "id": null,
      "_type": "Value"
    },
    "label": {                # 标签
      "num_classes": 2,
      "names": [
        "negative",           # 分类，积极：消极
        "positive"
      ],
      "id": null,
      "_type": "ClassLabel"
    }
  },
  "homepage": "",
  "license": "",
  "post_processed": null,
  "post_processing_size": null,
  "size_in_bytes": 7652068,
  "splits": {
    "train": {                # 训练集
      "name": "train",
      "num_bytes": 3106365,
      "num_examples": 9600,
      "dataset_name": "chn_senti_corp"
    },
    "validation": {           # 校验集
      "name": "validation",
      "num_bytes": 385021,
      "num_examples": 1200,
      "dataset_name": "chn_senti_corp"
    },
    "test": {                 # 测试集
      "name": "test",
      "num_bytes": 380533,
      "num_examples": 1200,
      "dataset_name": "chn_senti_corp"
    }
  },
  "supervised_keys": null,
  "task_templates": null,
  "version": {
    "version_str": "0.0.0",
    "description": null,
    "major": 0,
    "minor": 0,
    "patch": 0
  }
}

制作Dataset

加载数据集后，需要对其进行处理以适应模型的输入格式。这包括数据清洗、格式转换等操作。

from datasets import Dataset
# 制作 Dataset
dataset = Dataset.from_dict({
'text': ['位置尚可，但距离海边的位置比预期的要差的多', '5月8日付款成功，当当网显示5月10
日发货，可是至今还没看到货物，也没收到任何通知，简不知怎么说好！！！', '整体来说，本书还是不错
的。至少在书中描述了许多现实中存在的司法系统方面的问题，这是值得每个法律工作者去思考的。尤其是让
那些涉世不深的想加入到律师队伍中的年青人，看到了社会特别是中国司法界真实的一面。缺点是：书中引用
了大量的法律条文和司法解释，对于已经是律师或有一定工作经验的法律工作者来说有点多余，而且所占的篇
幅不少，有凑字数的嫌疑。整体来说还是不错的。不要对一本书提太高的要求。'],
'label': [0, 1, 1] # 0 表示负向评价，1 表示正向评价
})
# 查看数据集信息
print(dataset)

数据集字段

在制作 Dataset 时，需定义数据集的字段。在本案例中，定义了两个字段： text （文本）和label （情感标签）。每个字段都需要与模型的输入和输出匹配。

from torch.utils.data import Dataset
from datasets import load_from_disk

class MyDataset(Dataset):
    # 通过 split 参数选择不同的数据集部分，便于在训练、验证和测试阶段使用不同的数据。
    def __init__(self, split):
        # 函数用于从磁盘加载已经保存的数据集。确保数据集路径正确且数据集已保存。
        self.dataset = load_from_disk("/Volumes/Date/huggingface/dataset/ChnSentiCorp")
        # 根据 split 参数选择数据集部分
        if split == "train":
            self.dataset = self.dataset["train"]
        elif split == "validation":
            self.dataset = self.dataset["validation"]
        elif split == "test":
            self.dataset = self.dataset["test"]

    def __len__(self):
        # 返回数据集的长度
        return len(self.dataset)

    def __getitem__(self, item):
        # 根据索引获取样本的文本和标签
        text = self.dataset[item]["text"]
        label = self.dataset[item]["label"]
        # 返回包含文本和标签的字典
        return text,label

“”“
('我11月9日下的订单,可到现在了还是杳无音信，还有首页不是打出６．８折封顶的宣传吗，但实际却有超过６．８折的东西在销售，电话咨询回答只能以购物车的价格为准，当当的服务真的要改善一下，不要以为做大了就不思进取了．', 0)
('敢怒敢言！为我们指点迷津！为中国指明了前进的方向！为世人敲响了警钟！中国的上层精英们睁开你们沉睡的双眼吧！不要每天醉生梦死了！不能再奴颜婢膝！讨好外国人了！认清历史与现实！我们是中国人！骨子里流着中国的血！为了我们腾飞的祖国贡献应有的贡献和作为吧！我虽身为一介平民学子！但我最起码还知道国家兴亡匹夫有责！祖国的未来需要大家的协同努力！好了！就说这些吧！', 1)
('全新的外观和16:9宽屏设计，采用当前主流的45nm芯及16:9黄金比例宽屏，紧跟随市面主流脚步。整机线条很圆润，“acer”LOGO内嵌A盖正中央，灯管的照射下显得十分抢眼，屏幕方面则跟随主流市场，采用14寸LED背光屏，显示效果相当好，16：9规格的分辨率也更好地配合了高清影音，尤其值得称赞的是键盘运用了类似索尼笔记本特色的巧克力键盘，键位突出，并且多媒体快捷键的位置也重新设计。整机更扎实靓丽', 1)
('东西很小巧，外观挺漂亮的，有珍珠漆的效果，外观工艺做的还是不错，塑料模具不错，速度也挺好的。', 1)
('宏基，自带linux.开机无法使用，咨询售后是由于没有完全安装好。所以只能另装XP 宏基自带手册不全，说明书简单，差不多都是英文 自带软件为英文界面，安装不便 安装完成后，系统无声音 发热大', 0)
('李嘉诚说过一个人35岁之前要努力赚钱，35岁之后就要考虑以钱生钱。其实即使没看过李富豪说的这句话，好象人一过了30岁都会开始考虑养老的问题，于是如何理财实现财富增长就变得日益重要起来，我就是这样。当理财的愿望日益迫切，面对众多的理财产品，首先想到的就是找本好书来看看。这本书很实用，言之有物，有理论也有实务，可以帮助大家了解各种理财产品进而选择适合自己的。总之，读后受益匪浅。', 1)
('都是谁说的好啊，我也买了。可是看不懂，我妈看了也说没啥用。哎。。。。。。。。', 0)
”“”

load_dataset ：用于从磁盘加载数据集，可以支持多种格式（CSV、JSON 等）。
len 方法：返回数据集的大小，便于之后在训练中使用。
getitem 方法：根据索引返回单个数据条目（文本和标签），这是标准的 PyTorchDataset 类格式。

数据集信息

制作 Dataset 后，可以通过 dataset.info 等方法查看其大小、字段名称等信息，以确保数据集的正确性和完整性。

DatasetInfo(description = '', citation = '', homepage = '', license = '', features = {
	'text': Value(dtype = 'string', id = None),
	'label': ClassLabel(names = ['negative', 'positive'], id = None)
}, post_processed = None, supervised_keys = None, builder_name = 'chn_senti_corp', dataset_name = None, config_name = 'default', version = 0.0 .0, splits = {
	'train': SplitInfo(name = 'train', num_bytes = 3106365, num_examples = 9600, shard_lengths = None, dataset_name = 'chn_senti_corp'),
	'validation': SplitInfo(name = 'validation', num_bytes = 385021, num_examples = 1200, shard_lengths = None, dataset_name = 'chn_senti_corp'),
	'test': SplitInfo(name = 'test', num_bytes = 380533, num_examples = 1200, shard_lengths = None, dataset_name = 'chn_senti_corp')
}, download_checksums = {
	'https://drive.google.com/u/0/uc?id=1uV-aDQoMI51A27OxVgJnzxqZFQqkDydZ&export=download': {
		'num_bytes': 3032906,
		'checksum': '5394d3b2867ece9452a275fc84924ae030d827bb57b90c6aed6009ffdd1b8cee'
	},
	'https://drive.google.com/u/0/uc?id=1kI_LUYm9m0pVGDb4LHqtjwauUrIGo_rC&export=download': {
		'num_bytes': 375862,
		'checksum': 'a8770badf76a638fb21f92d5be9280eb8c50a7644dbe2dd881aff040c6fe6513'
	},
	'https://drive.google.com/u/0/uc?id=1YJkSqGFeo-8ifmgVbxGMTte7Y-yxPHg7&export=download': {
		'num_bytes': 371381,
		'checksum': '5ffc04692e4f7bf6b18eb9f94cef3090bd5fb55bb3e138ac7080f0739529ef1f'
	}
}, download_size = 3780149, post_processing_size = None, dataset_size = 3871919, size_in_bytes = 7652068)

Vocab字典操作

在微调 BERT 模型之前，需要将模型的词汇表（vocab）与数据集中的文本匹配。这一步骤确保输入的文本能够被正确转换为模型的输入格式。

from transformers import BertTokenizer
from datasets import load_from_disk

# 加载磁盘数据（huggingface）
dataset = load_from_disk("/Volumes/Date/huggingface/dataset/ChnSentiCorp")
print(dataset)
# 取出训练集
train_dataset = dataset["train"]
# 查看数据
for data in train_dataset:
    print(data)

# 加载 BERT 模型的 vocab 字典
tokenizer = BertTokenizer.from_pretrained(
    "/Volumes/Date/huggingface/model/bert-base-chinese/models--bert-base-chinese/snapshots/c30a6ed22ab4564dc1e3b2ecbf6e766b0611a33f")
# 将数据集中的文本转换为 BERT 模型所需的输入格式
dataset_info = dataset.map(lambda x: tokenizer(x['text'],
                                               add_special_tokens=True,
                                               max_length=50,
                                               truncation=True,
                                               padding=True,
                                               return_tensors="pt"), batched=True)
# 查看数据集信息
print(dataset_info)


"""
Map: 100%|██████████| 9600/9600 [00:02<00:00, 3564.38 examples/s]
Map: 100%|██████████| 1200/1200 [00:00<00:00, 2979.70 examples/s]
Map: 100%|██████████| 1200/1200 [00:00<00:00, 3547.16 examples/s]
DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9600
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1200
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1200
    })
})

"""

return_tensors 是 Hugging Face 的 datasets 和 transformers 库中常用的一个参数，用于指定返回张量的类型。它通常出现在数据集处理和模型输入准备的过程中，特别是在使用 Dataset.map 方法或 DataCollator 类时。

return_tensors 参数决定了返回的数据格式是哪种类型的张量。常见的选项包括：

pt：返回 PyTorch 张量（torch.Tensor）。
tf：返回 TensorFlow 张量（tf.Tensor）。
np：返回 NumPy 数组（numpy.ndarray）。

数据预处理：在对数据集进行预处理时，可以使用 return_tensors 参数来确保输出的数据格式与所使用的深度学习框架兼容。

模型输入：在将数据传递给模型之前，确保数据是以正确的张量格式提供的，以便模型可以直接使用这些数据进行训练或推理。

词汇表(vocab)

BERT 模型使用词汇表（vocab）将文本转换为模型可以理解的输入格式。词汇表包含所有模型已知的单词及其对应的索引。确保数据集中的所有文本都能找到对应的词汇索引是至关重要的。

99   [unused98]
101  [UNK]
102  [MASK]
2635 <S>
2635 !
3746 风
3846 ##风

文本转换

使用 tokenizer 将文本分割成词汇表中的单词，并转换为相应的索引。此步骤需要确保文本长度、特殊字符处理等都与 BERT 模型的预训练设置相一致。

input_ids : [[101, 4635, 3189, 898, 2255, 2226, 102], [101, 2577, 4708, 1282, 1146, 4080, 1220, 4638, 2552, 2658, 3123, 3216, 8024, 1377, 3221, 4692, 4708, 4692, 4708, 1355, 4385, 8024, 1762, 3123, 3216, 2130, 3684, 1400, 8024, 1139, 4385, 671, 7415, 5101, 5439, 7962, 4638, 1220, 4514, 4275, 8013, 2458, 1993, 6820, 2577, 4542, 3221, 679, 3221, 6615, 6843, 4638, 702, 1166, 4385, 6496, 8024, 1377, 3221, 1400, 3341, 1355, 4385, 3680, 2476, 100, 1400, 7481, 6963, 3300, 8013, 4696, 679, 4761, 6887, 4495, 772, 1555, 2582, 720, 2682, 4638, 8024, 2769, 2682, 4692, 4638, 3221, 4344, 1469, 5439, 7962, 8024, 679, 3221, 5101, 5439, 7962, 8013, 1963, 3362, 1322, 2157, 3221, 2682, 6615, 6843, 4638, 6413, 8024, 6929, 2218, 1059, 1947, 5101, 5439, 7962, 1469, 1538, 5439, 7890, 6963, 6615, 6843, 8024, 1372, 1762, 3680, 2476, 100, 1400, 7481, 3924, 1217, 671, 7415, 5050, 784, 720, 8043, 8043, 5042, 4684, 3221, 4514, 6026, 3924, 6639, 8013, 8013, 102]]
token_type_ids : [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
special_tokens_mask : [[1, 0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]
length : [7, 151]
attention_mask : [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]

下游任务模型设计

在微调 BERT 模型之前，需要设计一个适应情感分析任务的下游模型结构。通常包括一个或多个全连接层，用于将 BERT 输出的特征向量转换为分类结果。

#加载预训练模型
from transformers import BertModel
import torch


DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

pretrained = BertModel.from_pretrained(
    "/Volumes/Date/huggingface/model/bert-base-chinese/models--bert-base-chinese/snapshots/c30a6ed22ab4564dc1e3b2ecbf6e766b0611a33f").to(
    DEVICE)

#定义下游任务（将主干网络所提取的特征进行分类）
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(768,2) # 增加全连接层

    def forward(self,input_ids,attention_mask,token_type_ids):
        #冻结主干网络权重
        with torch.no_grad():
            out = pretrained(input_ids=input_ids,attention_mask=attention_mask,token_type_ids=token_type_ids)
        out = self.fc(out.last_hidden_state[:,0])
        out = out.softmax(dim=1)
        return out

BertModel.from_pretrained() ：加载 Hugging Face 提供的预训练 BERT 模型，这里使用的是 bert-base-chinese 。
with torch.no_grad() ：防止梯度计算，冻结 BERT 模型的权重，只训练下游任务的全连接层。
self.fc ：定义一个全连接层，用于将 BERT 的输出映射到分类结果。BERT 的输出向量维度为 768，因此输入到 fc 层的维度为 768，输出维度为分类的类别数（这里是2分类任务）。
last_hidden_state[:, 0] ：提取 BERT 的 [CLS] 标记对应的向量，通常用于分类任务的最终输出。

模型结构

下游任务模型通常包括以下几个部分：

BERT 模型：用于生成文本的上下文特征向量。

Dropout 层：用于防止过拟合，通过随机丢弃一部分神经元来提高模型的泛化能力。

全连接层：用于将 BERT 的输出特征向量映射到具体的分类任务上。

模型初始化

使用 BertModel.from_pretrained() 方法加载预训练的 BERT 模型，同时也可以初始化自定义的全连

接层。初始化时，需要根据下游任务的需求，定义合适的输出维度。

自定义模型训练

模型设计完成后，进入训练阶段。通过数据加载器（DataLoader）高效地批量处理数据，并使用优化器更新模型参数。

import torch
from MyData import MyDataset
from torch.utils.data import DataLoader
from net import Model
from transformers import AdamW, BertTokenizer

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
EPOCH = 10  # 减少轮次以便于展示

token = BertTokenizer.from_pretrained(
    "/Volumes/Date/huggingface/model/bert-base-chinese/models--bert-base-chinese/snapshots/c30a6ed22ab4564dc1e3b2ecbf6e766b0611a33f")

def collate_fn(data):
    sentes = [i[0] for i in data]
    label = [i[1] for i in data]
    data = token.batch_encode_plus(batch_text_or_text_pairs=sentes,
                                   truncation=True,
                                   padding="max_length",
                                   max_length=500,
                                   return_tensors="pt",
                                   return_length=True)
    input_ids = data["input_ids"]
    attention_mask = data["attention_mask"]
    token_type_ids = data["token_type_ids"]
    labels = torch.LongTensor(label)
    return input_ids, attention_mask, token_type_ids, labels

# 创建数据集
train_dataset = MyDataset("train")
# 创建dataloader
train_loader = DataLoader(dataset=train_dataset,
                          batch_size=100,
                          shuffle=True,
                          drop_last=True,
                          collate_fn=collate_fn)

def plot_chars(data, title, max_value, min_value, char='*'):
    scale = 50 / (max_value - min_value)
    print(title)
    for value in data:
        num_chars = int((value - min_value) * scale)
        print(f"{value:.4f} {' ' * (10 - len(f'{value:.4f}'))}{' ' * (50 - num_chars)}{char * num_chars}")

if __name__ == '__main__':
    # 开始训练
    print(DEVICE)
    model = Model().to(DEVICE)
    optimizer = AdamW(model.parameters(), lr=5e-4)
    loss_func = torch.nn.CrossEntropyLoss()

    model.train()
    losses = []
    accuracies = []
    for epoch in range(EPOCH):
        for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(train_loader):
            input_ids, attention_mask, token_type_ids, labels = input_ids.to(DEVICE), attention_mask.to(
                DEVICE), token_type_ids.to(DEVICE), labels.to(DEVICE)
            out = model(input_ids, attention_mask, token_type_ids)

            loss = loss_func(out, labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if i % 5 == 0:
                out = out.argmax(dim=1)
                acc = (out == labels).sum().item() / len(labels)
                print(f"Epoch {epoch}, Batch {i}, Loss: {loss.item():.4f}, Accuracy: {acc:.4f}")
                losses.append(loss.item())
                accuracies.append(acc)
        torch.save(model.state_dict(), f"/Volumes/Date/test/{epoch}bert01.pth")
        print(f"Epoch {epoch} - 参数保存成功！")

    # 绘制损失和精度的趋势图
    plot_chars(losses, "Loss Trend", max(losses), min(losses), char='-')
    plot_chars(accuracies, "Accuracy Trend", max(accuracies), min(accuracies), char='+')

DataLoader ：用于批量加载数据，每次加载 batch_size=100 的样本并随机打乱数据。
BertTokenizer ：用于将文本转换为模型可处理的 token。 batch_encode_plus 方法负责将文本转换为 input_ids 、 attention_mask 等，模型需要这些作为输入。
AdamW ：是一种适合 BERT 等大型语言模型的优化器，结合了 Adam 和权重衰减，防止过拟合。

循环训练

训练循环包含前向传播（forward pass）、损失计算（loss calculation）、反向传播（backwardpass）、参数更新（parameter update）等步骤。每个 epoch 都会对整个数据集进行一次遍历，更新模型参数。

Loss Trend
0.7102     --------------------------------------------------
0.6652                 --------------------------------------
0.6218                            ---------------------------
0.5750                                       ----------------
0.5577                                            -----------
0.5413                                                -------
0.5231                                                    ---
0.5138                                                       
0.5102                                                       
0.5113                                                       
Accuracy Trend
0.4550                                                       
0.6750                              +++++++++++++++++++++++++
0.8050               ++++++++++++++++++++++++++++++++++++++++
0.8900     ++++++++++++++++++++++++++++++++++++++++++++++++++
0.8750       ++++++++++++++++++++++++++++++++++++++++++++++++
0.8450           ++++++++++++++++++++++++++++++++++++++++++++
0.8850      +++++++++++++++++++++++++++++++++++++++++++++++++
0.8450           ++++++++++++++++++++++++++++++++++++++++++++
0.8550          +++++++++++++++++++++++++++++++++++++++++++++
0.8500          +++++++++++++++++++++++++++++++++++++++++++++

测试与评估

评估训练

xychart-beta title "Training Loss and Accuracy over Batches" x-axis "d" 1 --> 50 y-axis "Revenue (in $)" 0.3 --> 1 line [0.6780, 0.6395, 0.6088, 0.5933, 0.5551, 0.5235, 0.5401, 0.5155, 0.4923, 0.4937, 0.5086, 0.4826, 0.5009, 0.4871, 0.4649, 0.4692, 0.4589, 0.4620, 0.4517, 0.4591, 0.4647, 0.4336, 0.4534, 0.4629, 0.4876, 0.4464, 0.4460, 0.4985, 0.4593, 0.4537, 0.4572, 0.4604, 0.4321, 0.4553, 0.4070, 0.4180, 0.3550, 0.3541, 0.3210, 0.3470, 0.3340, 0.3258, 0.3389, 0.3151, 0.3465, 0.3549, 0.3441, 0.3401, 0.3240, 0.3265, 0.3419, 0.3061, 0.317] line [0.5900, 0.7300, 0.8050, 0.7600, 0.8200, 0.8700, 0.8250, 0.8700, 0.8900, 0.8750, 0.8400, 0.8750, 0.8450, 0.8650, 0.8950, 0.8800, 0.8700, 0.8750, 0.8900, 0.8700, 0.8750, 0.9100, 0.8850, 0.8800, 0.8250, 0.8800, 0.8900, 0.8150, 0.8700, 0.8600, 0.8650, 0.8700, 0.8700, 0.8550, 0.8700, 0.8500, 0.8950, 0.8650, 0.8500, 0.8850, 0.8650, 0.9000, 0.8950, 0.8850, 0.9100, 0.8850, 0.8600, 0.8700, 0.8850, 0.8550, 0.8800, 0.9000, 0.9150, 0.8750, 0.9050, 0.8700, 0.8700, 0.8900, 0.9200, 0.8700, 0.8850, 0.9150, 0.8850, 0.9100, 0.9050, 0.8900, 0.8950, 0.8650, 0.8700, 0.8800, 0.8650, 0.9400, 0.8850]

测试

import torch
from net import Model
from transformers import BertTokenizer

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
names = ["负向评价","正向评价"]
print(DEVICE)
model = Model().to(DEVICE)


token = BertTokenizer.from_pretrained(r"D:\PycharmProjects\demo_14\model\bert-base-chinese\models--bert-base-chinese\snapshots\c30a6ed22ab4564dc1e3b2ecbf6e766b0611a33f")

def collate_fn(data):
    sentes = []
    sentes.append(data)
    # print(sentes)
    #编码
    data = token.batch_encode_plus(batch_text_or_text_pairs=sentes,
                            truncation=True,
                            padding="max_length",
                            max_length=500,
                            return_tensors="pt",
                            return_length=True)
    input_ids = data["input_ids"]
    attention_mask = data["attention_mask"]
    token_type_ids = data["token_type_ids"]

    # print(input_ids,attention_mask,token_type_ids)
    return input_ids,attention_mask,token_type_ids

def test():
    model.load_state_dict(torch.load("params/2bert01.pth"))
    model.eval()
    while True:
        data = input("请输入测试数据(输入'q'退出)：")
        if data == "q":
            print("测试结束")
            break
        input_ids, attention_mask, token_type_ids= collate_fn(data)
        input_ids, attention_mask, token_type_ids = input_ids.to(DEVICE), attention_mask.to(
            DEVICE), token_type_ids.to(DEVICE)

        with torch.no_grad():
            out = model(input_ids,attention_mask,token_type_ids)
            out = out.argmax(dim=1)
            print("模型判定：",names[out],"\n")
if __name__ == '__main__':
    test()

clearwind

Fine-Tuning Base BERT With 2 Label

模型结构

config.json

模型的net内容

微调

数据集处理

加载数据集

数据集格式

数据集信息

制作Dataset

数据集字段

数据集信息

Vocab字典操作

词汇表(vocab)

文本转换

下游任务模型设计

模型结构

模型初始化

自定义模型训练

循环训练

测试与评估

评估训练

测试

简述 Transformer 训练计算过程（刷新渲染） 2025-01-11 23:54

基于Deepseek的AI试题问答 2025-02-28 09:46

基于 internlm2 和 LangChain 搭建你的知识库 2025-02-27 14:25

xtuner微调大模型 2025-02-26 09:31

Llama-Factory 微调全过程 2025-01-13 22:28

矩阵分解 2025-01-11 15:48

目录