Preface

This is the second competition in Mu Li's Dive into Deep Learning (PyTorch edition) course series, so I gave it a try.

Video: 30 Part 2 wrap-up competition: image classification [Dive into Deep Learning v2] - Bilibili

课程主页:https://courses.d2l.ai/zh-v2/

教材:https://zh-v2.d2l.ai/

Thanks also to Neko Kiku for part of the code this is based on (Neko Kiku | Contributor | Kaggle).


Problem

Classify Leaves - Train models to predict the plant species

(a 176-class leaf classification task)

Link: Classify Leaves | Kaggle

Implementation

The directory structure is as follows:

|-- Classify-leaves
|   |-- Code
|   |   |-- cnn_code.ipynb
|   |   |-- pre_process1.ipynb
|   |   `-- pre_process2.ipynb
|   |-- dataset
|   |   |-- images
|   |   |-- label.json
|   |   |-- sample_submission.csv
|   |   |-- submission.csv
|   |   |-- test.csv
|   |   |-- train.csv
|   |   `-- validation.csv
|   `-- model
|       `-- res_model.ckpt

Extract the downloaded dataset into the dataset directory.

Data preprocessing

I spent a long time on this task at first: the score kept coming out around 0.00x and I couldn't find the problem, so I set it aside for a while. Later, Neko Kiku's visualization of the data gave me some inspiration, and I started thinking about it again.

import matplotlib.pyplot as plt
import seaborn as sns

# function to annotate each bar with its length
def barw(ax):
    for p in ax.patches:
        val = p.get_width()                  # length of the bar (the class count)
        x = p.get_x() + p.get_width()        # x-position
        y = p.get_y() + p.get_height() / 2   # y-position
        ax.annotate(round(val, 2), (x, y))

# plot per-class counts, sorted from most to least frequent
plt.figure(figsize=(15, 30))
ax0 = sns.countplot(y=labels_dataframe['label'], order=labels_dataframe['label'].value_counts().index)
barw(ax0)
plt.show()

(Figure: bar chart of the number of images per leaf class)

Visualizing the dataset shows that there are 176 leaf classes in total, with very uneven counts: the largest class has 353 images and the smallest only 51. At first I split the data into training and validation sets at a 9:1 ratio, carving the validation set off casually, which could leave some classes with very few validation samples or even none.

So my approach here is to split at a ratio of train:validation = 8:2, drawing 20% of each class separately to build the validation set, which guarantees that every class is represented.
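
For reference, scikit-learn can produce the same kind of stratified split directly; the sketch below is just an alternative (assuming scikit-learn is available), not the script actually used in pre_process2.ipynb.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('../dataset/train.csv')
# stratify=df['label'] keeps all 176 classes in the same proportions in both splits
train_df, val_df = train_test_split(df, test_size=0.2, stratify=df['label'], random_state=42)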

Re-encoding the labels

The train.csv file in the dataset downloaded from the competition page looks like this:

import pandas as pd

labels_dataframe = pd.read_csv('../dataset/train.csv')
print(labels_dataframe.iloc[0:4])
          image             label
0 images/0.jpg maclura_pomifera
1 images/1.jpg maclura_pomifera
2 images/2.jpg maclura_pomifera
3 images/4.jpg maclura_pomifera

The labels are text strings, which cannot be fed to the model during training; they need to be converted to integers. Create pre_process1.ipynb:

import os
import pandas as pd
import json

dataset_ROOT = '../dataset/'         # dataset root directory
train_FILE = os.path.join(dataset_ROOT, "train.csv")
labelJSON_FILE = os.path.join(dataset_ROOT, "label.json")

data_label = pd.read_csv(train_FILE)

# extract the "label" column and deduplicate
labels = data_label['label'].unique()

print('There are {} classes'.format(len(labels)))

There are 176 classes

Save each label together with its integer code (0 to 175) to the label.json file:

data = []
for index, label in enumerate(labels):
    item = {
        "label_str": label,
        "label_num": index
    }
    data.append(item)

# write to the JSON file
with open(labelJSON_FILE, 'w') as file:
    json.dump(data, file)

[
    {
        "label_str": "maclura_pomifera",
        "label_num": 0
    },
    {
        "label_str": "ulmus_rubra",
        "label_num": 1
    },
    {
        "label_str": "broussonettia_papyrifera",
        "label_num": 2
    },
    ...
    {
        "label_str": "juniperus_virginiana",
        "label_num": 175
    }
]

Note: at first I numbered the labels 1 to 176, which caused an error during training:

RuntimeError: CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.


Searching online showed the cause: the training data contained label values outside the valid class range. nn.CrossEntropyLoss expects class indices from 0 to num_classes - 1, so with 176 classes the label 176 is out of bounds.
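
The device-side assert is essentially a GPU-side index-out-of-range error. A minimal sketch (with made-up tensors, not data from this project) reproduces the same failure on the CPU, where the message is much clearer:

import torch
from torch import nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(4, 176)                  # batch of 4, 176 classes
good_targets = torch.tensor([0, 5, 175, 42])  # valid indices: 0..175
bad_targets = torch.tensor([1, 6, 176, 43])   # 176 is out of range for 176 classes

criterion(logits, good_targets)               # works
criterion(logits, bad_targets)                # raises a target-out-of-bounds error (the CPU analogue of the CUDA assert)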

Training/validation set split

The original dataset provides only a single train.csv file. Now split it into training and validation sets at an 8:2 ratio, drawing 20% of each class to build the validation set, and write the validation rows to a new validation.csv file. Create pre_process2.ipynb:

import os
import pandas as pd
import random
import math

dataset_ROOT = '../dataset/'         # dataset root directory
train_FILE = os.path.join(dataset_ROOT, "train.csv")
validation_FILE = os.path.join(dataset_ROOT, "validation.csv")

df_train = pd.read_csv(train_FILE)
label_classes = df_train['label'].unique()

df_validation = pd.DataFrame(columns=['image', 'label'])

# compute how many samples to draw from each class (20% of that class)
num_samples_per_class = {}
for label_class in label_classes:
    total_samples = len(df_train[df_train['label'] == label_class])
    num_samples = math.floor(total_samples * 0.2)
    num_samples_per_class[label_class] = num_samples

# randomly move the sampled rows into the validation set
for label_class in label_classes:
    samples = df_train[df_train['label'] == label_class].sample(n=num_samples_per_class[label_class], random_state=42)
    df_validation = pd.concat([df_validation, samples])

    # remove the sampled rows from the training set
    df_train = df_train.drop(samples.index)

# save the validation set as validation.csv
df_validation.to_csv(validation_FILE, index=False)
# overwrite train.csv with the remaining training rows
df_train.to_csv(train_FILE, index=False)
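
Optionally, a small sanity check (a sketch, run in the same notebook after the cell above) can confirm that every class appears in the validation set and that no image ended up in both files:

df_t = pd.read_csv(train_FILE)
df_v = pd.read_csv(validation_FILE)

# every class should appear in the validation set
assert set(df_v['label'].unique()) == set(label_classes)
# no image should be in both splits
assert len(set(df_t['image']) & set(df_v['image'])) == 0

print(len(df_t), len(df_v))   # expect roughly an 80/20 split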

Training

Create cnn_code.ipynb:

import os
from torchvision import datasets, transforms
import pandas as pd
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import json
import torch
from torch import nn
import torchvision.models as models
from tqdm import tqdm

dataset_ROOT = '../dataset/'         # dataset root directory
train_FILE = os.path.join(dataset_ROOT, "train.csv")
validation_FILE = os.path.join(dataset_ROOT, "validation.csv")
test_FILE = os.path.join(dataset_ROOT, "test.csv")
labelJSON_FILE = os.path.join(dataset_ROOT, "label.json")
submission_FILE = os.path.join(dataset_ROOT, "submission.csv")
model_path = '../model/'

if not os.path.exists(model_path):
    os.makedirs(model_path)

# hyperparameters
batch_size = 8
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
learning_rate = 3e-4
weight_decay = 1e-3
epochs = 50

# read the label.json file
with open(labelJSON_FILE, 'r', encoding='utf-8') as file:
    data = json.load(file)

# build a dictionary (str -> num), used for training/validation
label_dict_str2num = {item['label_str']: item['label_num'] for item in data}

# print the resulting dictionary
print(label_dict_str2num)
print(len(label_dict_str2num))

# build a dictionary (num -> str), used for the test submission
label_dict_num2str = {item['label_num']: item['label_str'] for item in data}

print(label_dict_num2str)
print(len(label_dict_num2str))

Custom data loaders (training/validation)

# custom dataset class
class CustomDataset(Dataset):
    def __init__(self, csv_file, root_dir, transform=None):
        self.annotations = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, idx):
        img_path = os.path.join(self.root_dir, self.annotations.iloc[idx, 0])
        image = Image.open(img_path).convert("RGB")
        label = label_dict_str2num.get(self.annotations.iloc[idx, 1])

        if self.transform is not None:
            image = self.transform(image)

        return image, label

# define the data transform
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()
])

train_dataset = CustomDataset(train_FILE, dataset_ROOT, transform=transform)
validation_dataset = CustomDataset(validation_FILE, dataset_ROOT, transform=transform)

print("Training set size: {}".format(len(train_dataset)))
print("Validation set size: {}".format(len(validation_dataset)))

Training set size: 14755
Validation set size: 3598

# create the data loaders (train.csv is grouped by class, so shuffle the training loader)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
validation_loader = DataLoader(validation_dataset, batch_size=batch_size, shuffle=False)
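
The transform above only resizes and converts to a tensor. One possible improvement (a sketch only; it is not what produced the results reported below) is to add light augmentation and ImageNet normalization to the training transform when fine-tuning a pretrained ResNet:

# hypothetical training-time transform; validation/test would keep the plain Resize + ToTensor (+ Normalize)
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])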

Building the network

ResNet-34 is used here.

# optionally freeze the earlier layers of the model
def set_parameter_requires_grad(model, feature_extracting):
    if feature_extracting:
        for param in model.parameters():
            param.requires_grad = False

# ResNet-34 model with a new classification head
def res_model(num_classes, feature_extract=False, use_pretrained=True):
    model_ft = models.resnet34(pretrained=use_pretrained)
    set_parameter_requires_grad(model_ft, feature_extract)
    num_ftrs = model_ft.fc.in_features
    model_ft.fc = nn.Sequential(nn.Linear(num_ftrs, num_classes))

    return model_ft
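
Training below keeps feature_extract=False, so the whole network is fine-tuned. If the backbone were frozen with feature_extract=True, only the parameters that still require gradients should be passed to the optimizer; a hypothetical sketch (head_optimizer is a name introduced here, not the configuration used below):

# feature-extraction variant: freeze the backbone, train only the new fc head
frozen_model = res_model(176, feature_extract=True).to(device)
trainable_params = [p for p in frozen_model.parameters() if p.requires_grad]  # only the new fc head
head_optimizer = torch.optim.Adam(trainable_params, lr=learning_rate, weight_decay=weight_decay)
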
model = res_model(176)  # number of output classes: 176
model = model.to(device)

# loss function
criterion = nn.CrossEntropyLoss()

# optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)

best_acc = 0.0
for epoch in range(epochs):
    # ---------- Training ----------
    model.train()
    # These are used to record information in training.
    train_loss = []
    train_accs = []

    for batch in tqdm(train_loader):
        imgs, labels = batch
        imgs = imgs.to(device)
        labels = labels.to(device)
        logits = model(imgs)

        loss = criterion(logits, labels)

        optimizer.zero_grad()

        loss.backward()

        optimizer.step()

        # Compute the accuracy for the current batch.
        acc = (logits.argmax(dim=-1) == labels).float().mean()

        # Record the loss and accuracy.
        train_loss.append(loss.item())
        train_accs.append(acc)

    # The average loss and accuracy of the training set is the average of the recorded values.
    train_loss = sum(train_loss) / len(train_loss)
    train_acc = sum(train_accs) / len(train_accs)

    # Print the information.
    print(f"[ Train | {epoch + 1:03d}/{epochs:03d} ] loss = {train_loss:.5f}, acc = {train_acc:.5f}")

    # ---------- Validation ----------
    model.eval()
    # These are used to record information in validation.
    valid_loss = []
    valid_accs = []

    for batch in tqdm(validation_loader):
        imgs, labels = batch
        imgs = imgs.to(device)
        labels = labels.to(device)

        with torch.no_grad():
            logits = model(imgs)

        loss = criterion(logits, labels)

        acc = (logits.argmax(dim=-1) == labels).float().mean()

        # Record the loss and accuracy.
        valid_loss.append(loss.item())
        valid_accs.append(acc)

    # The average loss and accuracy for the entire validation set is the average of the recorded values.
    valid_loss = sum(valid_loss) / len(valid_loss)
    valid_acc = sum(valid_accs) / len(valid_accs)

    # Print the information.
    print(f"[ Valid | {epoch + 1:03d}/{epochs:03d} ] loss = {valid_loss:.5f}, acc = {valid_acc:.5f}")

    # if the model improves, save a checkpoint at this epoch
    if valid_acc > best_acc:
        best_acc = valid_acc
        torch.save(model.state_dict(), os.path.join(model_path, "res_model.ckpt"))
        print('saving model with acc {:.3f}'.format(best_acc))

Predicting and saving the results

Custom test-set data loader

The logic is the same as before, only there is no label (the two dataset classes could easily be written as one, using an if branch to tell them apart; a sketch of such a merged class follows the test loader below).

# custom dataset class for the test set
class TestDataset(Dataset):
    def __init__(self, csv_file, root_dir, transform=None):
        self.annotations = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, idx):
        img_path = os.path.join(self.root_dir, self.annotations.iloc[idx, 0])
        image = Image.open(img_path).convert("RGB")

        if self.transform is not None:
            image = self.transform(image)

        return image

# define the data transform
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()
])

test_dataset = TestDataset(test_FILE, dataset_ROOT, transform=transform)

print("Test set size: {}".format(len(test_dataset)))

Test set size: 8800

test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
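
As noted above, the two dataset classes could be merged into one by branching on whether the CSV has a label column. A hypothetical sketch (LeavesDataset is a name introduced here, not code used in this project):

class LeavesDataset(Dataset):
    """Serves train/validation CSVs (image, label) and the test CSV (image only)."""
    def __init__(self, csv_file, root_dir, transform=None):
        self.annotations = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transform = transform
        self.has_label = 'label' in self.annotations.columns

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, idx):
        img_path = os.path.join(self.root_dir, self.annotations.iloc[idx, 0])
        image = Image.open(img_path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        if self.has_label:
            return image, label_dict_str2num.get(self.annotations.iloc[idx, 1])
        return image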

Inference:

# build the model
model = res_model(176)

model = model.to(device)
# load the saved model parameters
model.load_state_dict(torch.load(os.path.join(model_path, "res_model.ckpt")))

model.eval()

predictions = []
for batch in tqdm(test_loader):
    imgs = batch
    imgs = imgs.to(device)
    with torch.no_grad():
        logits = model(imgs)
    predictions.extend(logits.argmax(dim=-1).cpu().numpy().tolist())

# convert the integer labels back to string labels
preds = []
for i in predictions:
    preds.append(label_dict_num2str.get(i))

# save the predictions and write submission.csv
test_data = pd.read_csv(test_FILE)
test_data['label'] = pd.Series(preds)
submission = pd.concat([test_data['image'], test_data['label']], axis=1)
submission.to_csv(submission_FILE, index=False)
print("Done!")

Submission

Submitting the results on Kaggle gives an accuracy of roughly 80% on the private leaderboard.


Afterword

A friend was able to tune ResNet-50 to roughly 95%.
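
The friend's exact recipe isn't given, but on the model side the only change needed in res_model is the backbone; a minimal sketch (res50_model is a hypothetical name, and the hyperparameters would likely need re-tuning):

# hypothetical ResNet-50 variant of res_model; everything else in the pipeline stays the same
def res50_model(num_classes, feature_extract=False, use_pretrained=True):
    model_ft = models.resnet50(pretrained=use_pretrained)
    set_parameter_requires_grad(model_ft, feature_extract)
    num_ftrs = model_ft.fc.in_features
    model_ft.fc = nn.Sequential(nn.Linear(num_ftrs, num_classes))
    return model_ft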