While planning get-togethers for the long weekend, I made up my mind to set aside one full day to catch up on progress and clear out all the bugs. The environment issues (GPU) and the model setup turned out to be fairly messy, and honestly I wasn't sure one day would be enough, but fortunately the whole day's effort paid off and the results look good!
1. Code for the model currently in training: a pretrained AlexNet model, adapted to the competition data with transfer learning
2. Debug notes: Debug_5~Debug_7 & Hint_1~Hint_3
(1) The full training procedure: using all of the competition's training data (train_data), 33 crop classes in total!
Step0: Set GPU environment
import torch
print(torch.cuda.is_available())
device = (torch.device('cuda') if torch.cuda.is_available()
          else torch.device('cpu'))
print(f"Training on device {device}.")
Step1: Load the data & Create Dataset
batch_size = 10                  # 10 images per batch
val_size, test_size = 0.1, 0.1   # train:val:test = 0.8:0.1:0.1
shuffle_dataset = True
random_seed = 42
transform = transforms.Compose([
    transforms.ToTensor(),               # Debug_5, you can add other transformations in this list
    # transforms.Resize(224),            # but got [3, 298, 224] at entry 0 and [3, 398, 224]
    transforms.Resize(size=(255, 255)),  # Debug_6, Hint_2
    # transforms.CenterCrop(255)
])
# Use all training data
dataset = torchvision.datasets.ImageFolder(ALL_data_path, transform=transform, target_transform=None) # Hint_1
print(dataset.class_to_idx) # Hint_3
Step2: Split Data & Create Dataloader
# Create data indices for the train / validation / test split
# Set seed and shuffle, then split the data into train, validation, test
dataset_size = len(dataset)
indices = list(range(dataset_size))
val_split = int(np.floor(val_size * dataset_size))
test_split = val_split + int(np.floor(test_size * dataset_size))
if shuffle_dataset:
    np.random.seed(random_seed)
    np.random.shuffle(indices)
train_indices, val_indices, test_indices = indices[test_split:], indices[:val_split], indices[val_split:test_split]
# print(train_indices, val_indices, test_indices)
# Creating PT data samplers and loaders:
train_sampler = SubsetRandomSampler(train_indices)
val_sampler = SubsetRandomSampler(val_indices)
test_sampler = SubsetRandomSampler(test_indices)
train_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                           sampler=train_sampler)
val_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                         sampler=val_sampler)
test_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                          sampler=test_sampler)
print(len(train_loader), len(val_loader), len(test_loader))  # number of batches in each loader: 7162 896 896
n_total_step = len(train_loader)
print(n_total_step) # 7162
dataloaders={}
dataloaders["train"] = train_loader
dataloaders["val"] = val_loader
dataset_sizes = {}
dataset_sizes["train"] = len(train_indices)
dataset_sizes["val"] = len(val_indices)
### Check
for index_batch, (images, labels_tensor) in islice(enumerate(train_loader), 1, 3):
    # print(index_batch, (images, labels_tensor))  # batch number index_batch; (images, labels_tensor) holds the pixels & labels of every image in that batch
    # print(len(images))          # 10 images per batch; images holds all 10 images of this batch
    # plt.imshow(images)          # Debug
    # print(images[0].shape[-1])  # 255
    print(images[0].shape, labels_tensor[0])  # (Channel, Height, Width) of the first image
    plt.imshow(np.transpose(images[0].numpy(), (1, 2, 0)))  # Debug_7, images[0]: first image of the batch
    print("=" * 10)
Step3: Use PyTorch pretrained model
def train_model(model, criterion, optimizer, scheduler, num_epochs=25):
    since = time.time()
    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0
    for epoch in range(num_epochs):
        print(f'Epoch {epoch}/{num_epochs - 1}')
        print('-' * 10)
        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode
            running_loss = 0.0
            running_corrects = 0
            # Iterate over data.
            for inputs, labels in tqdm(dataloaders[phase]):
                # for i, (inputs, labels) in islice(enumerate(dataloaders[phase]), 1, 3):
                # print("=" * 18 + phase + "=" * 18)   # optional per-batch debug separator
                # print("train_batch_number: {}".format(i))
                inputs = inputs.to(device)
                labels = labels.to(device)
                # zero the parameter gradients
                optimizer.zero_grad()
                # forward
                # track history only if in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)
                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()
                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)
            if phase == 'train':
                scheduler.step()
            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]
            print(f'{phase} Loss: {epoch_loss:.4f} Acc: {epoch_acc:.4f}')
            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())
        print()
    time_elapsed = time.time() - since
    print(f'Training complete in {time_elapsed // 60:.0f}m {time_elapsed % 60:.0f}s')
    print(f'Best val Acc: {best_acc:4f}')
    # load best model weights
    model.load_state_dict(best_model_wts)
    return model
Step3-1: Set Pretrained model & Parameters
model_alexnet = torchvision.models.alexnet(pretrained=True).to(device) # load pretrained Alexnet model (pretrained weights)
print(model_alexnet)
print("="*10)
print(model_alexnet.classifier[6])  # the final classifier layer
Step3-2: Train
# Transfer Learning: (Approach 2) ConvNet as fixed feature extractor
for param in model_alexnet.parameters():
    param.requires_grad = False  # freeze all layers so their weights are not updated
# Parameters of newly constructed modules have requires_grad=True by default
num_ftrs = model_alexnet.classifier[6].in_features  # input dimension of the final classifier layer
model_alexnet.classifier[6] = torch.nn.Linear(num_ftrs, 33)  # 33: number of output classes
# model_alexnet.classifier[6] plays the same role as model.fc does in other architectures
model_alexnet = model_alexnet.to(device)
criterion = torch.nn.CrossEntropyLoss()
# Only parameters of the final layer are being optimized
optimizer_conv = torch.optim.SGD(model_alexnet.classifier[6].parameters(), lr=0.001, momentum=0.9)  # update only the final layer's parameters (the equivalent of model_conv.fc)
exp_lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer_conv, step_size=7, gamma=0.1)
### Train and evaluate
model_alexnet = train_model(model_alexnet, criterion, optimizer_conv,
                            exp_lr_scheduler, num_epochs=10)
Current progress: (configured to run 10 epochs, but a single epoch takes about 9 hours; training started at 2022-10-09 19:15:32)
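Since one epoch takes roughly nine hours, it may be worth saving a checkpoint after every epoch so an interrupted run can be resumed. The sketch below is my own assumption (the filename pattern and the idea of placing it at the end of train_model's epoch loop are not part of the original code):
# Hypothetical per-epoch checkpoint, to be placed at the end of the epoch loop in train_model
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'scheduler_state_dict': scheduler.state_dict(),
    'best_acc': best_acc,
}
torch.save(checkpoint, f'checkpoint_epoch_{epoch}.pt')  # assumed filename pattern
# Resuming later would look like:
# checkpoint = torch.load('checkpoint_epoch_0.pt')
# model.load_state_dict(checkpoint['model_state_dict'])
# optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
# scheduler.load_state_dict(checkpoint['scheduler_state_dict'])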
Step4: Save models
# Print model's state_dict
print("Model's state_dict:")
for param_tensor in model_alexnet.state_dict():
    print(param_tensor, "\t", model_alexnet.state_dict()[param_tensor].size())
# Print optimizer's state_dict
print("Optimizer's state_dict:")
for var_name in optimizer_conv.state_dict():
    print(var_name, "\t", optimizer_conv.state_dict()[var_name])
torch.save(model_alexnet.state_dict(), output_path)  # save the weights only (state_dict)
torch.save(model_alexnet, output_path)               # save the whole model object (note: overwrites the file above if output_path is the same)
model_scripted = torch.jit.script(model_alexnet)  # Export to TorchScript
model_scripted.save('model_scripted_alexnet.pt')  # Save
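For later reference, a minimal sketch of how each of the three saved artifacts could be loaded back (the file names below are placeholders, not the actual output_path used above):
# 1) From a state_dict: rebuild the architecture first, then load the weights
model_reloaded = torchvision.models.alexnet()
model_reloaded.classifier[6] = torch.nn.Linear(model_reloaded.classifier[6].in_features, 33)
model_reloaded.load_state_dict(torch.load('alexnet_weights.pt'))  # placeholder path
model_reloaded.eval()

# 2) From the whole pickled model object
model_full = torch.load('alexnet_full.pt')  # placeholder path
model_full.eval()

# 3) From the TorchScript export (no Python model definition needed)
model_ts = torch.jit.load('model_scripted_alexnet.pt')
model_ts.eval()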
(1) Hint_1: if the images are organized into one folder per class, the dataset can be built directly with torchvision.datasets.ImageFolder
train_data
│
├───lemon
│       00abbbec-6228-4bbd-b777-57287b29a616.jpg
│       00cac22f-0304-4f23-bdca-a2c7d0337c65.jpg
│
└───onion
        00b6ec1a-a376-4ffe-abf5-fbf90dab6340.jpg
        00b41d15-59ab-406b-a502-335c863593bf.jpg
(2) Debug_5: the images have to be converted to tensors before they can be batched, so transforms.ToTensor() goes into the Compose list:
transform = transforms.Compose([
    # you can add other transformations in this list
    transforms.ToTensor()])
(3) Debug_6: from the official docs:
transforms.CenterCrop: Crops the given image at the center.
transforms.RandomCrop: Crop the given image at a random location.
transforms.RandomResizedCrop: Crop a random portion of image and resize it to a given size.
(4) Hint_2: the crop transforms vs transforms.Resize
transforms.Resize: Resize the input image to the given size, i.e. it rescales the image resolution rather than cutting part of the image out.
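To make the difference concrete, here is a small sketch (using a dummy non-square image as an assumption) comparing the output shapes of Resize and the crop transforms:
from PIL import Image
from torchvision import transforms

img = Image.new('RGB', (300, 400))  # dummy 300x400 image standing in for a crop photo
to_tensor = transforms.ToTensor()
print(to_tensor(transforms.Resize(size=(255, 255))(img)).shape)  # torch.Size([3, 255, 255]) - rescaled to a fixed size
print(to_tensor(transforms.Resize(224)(img)).shape)              # shorter side -> 224, aspect ratio kept (shape varies per image, which broke batching)
print(to_tensor(transforms.CenterCrop(255)(img)).shape)          # torch.Size([3, 255, 255]) - cut out of the center
print(to_tensor(transforms.RandomResizedCrop(255)(img)).shape)   # torch.Size([3, 255, 255]) - random crop, then resized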
(5) Hint_3: when using the ImageFolder class, the following attributes can be used to trace back the original label.
classes (list): List of the class names sorted alphabetically.
class_to_idx (dict): Dict with items (class_name, class_index).
imgs (list): List of (image path, class_index) tuples.
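A minimal sketch of tracing a predicted index back to its class name by inverting class_to_idx (the variable names here are only for illustration; dataset, model_alexnet and images are as defined in the steps above):
# Invert class_to_idx so a predicted index can be mapped back to the folder/class name
idx_to_class = {idx: name for name, idx in dataset.class_to_idx.items()}

with torch.no_grad():
    outputs = model_alexnet(images.to(device))  # images: one batch from a DataLoader
    _, preds = torch.max(outputs, 1)
print([idx_to_class[p.item()] for p in preds])  # e.g. ['lemon', 'onion', ...]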
(6) Debug_7: show an image with matplotlib
plt.imshow(np.transpose(images[0].numpy(), (1, 2, 0)))
images[i]: picks the i-th image of a batch (index i starts at 0; typically I use i=0, i.e. the first image). Note: each batch has 10 images, and images holds the data of all 10 images in that batch.
np.transpose(image.numpy(), (1, 2, 0)): matplotlib expects (Height, Width, Channel), so the tensor's (Channel, Height, Width) layout has to be transposed first.
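If the whole batch should be visualised at once, torchvision's make_grid can lay it out as one image; a small sketch (again assuming the train_loader defined above):
import torchvision.utils

# Take one batch and arrange its 10 images as a single grid image
images, labels_tensor = next(iter(train_loader))
grid = torchvision.utils.make_grid(images, nrow=5)   # 2 rows x 5 columns
plt.imshow(np.transpose(grid.numpy(), (1, 2, 0)))    # same (C, H, W) -> (H, W, C) transpose as above
plt.title(str(labels_tensor.tolist()))               # class indices of the 10 images
plt.show()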
Model training is now up and running smoothly.
Try out different pretrained models & data preprocessing approaches.
Reflections:
I spent almost three hours today fighting the GPU environment; all kinds of problems kept popping up, and in the end simply wiping everything and reinstalling fixed it, which is a bit of a mystery. The rest of the time went into sorting out training-setup issues and getting to know some extremely handy PyTorch tools, e.g. torchvision.datasets.ImageFolder, which builds the (data, label) dataset in one step; I had been naively writing my own function for that before (facepalm~~). A very rewarding day, and I'm delighted!
Today's working time: 7 × 50 min
Challenges are what make life interesting. Overcoming them is what makes life meaningful.