Regularized regression prediction in R: ridge & lasso (ridge & lasso regression in R)

Tags: linear regression, data analysis, big data, big data analysis, features

Denny Chang 2021-01-22 10:03:10 ‧ 7168 views


Without further ado, here is the code.
The video walks through the code in detail. If you spot any mistakes, please let me know. Thanks!

library(glmnet)
library(dplyr)
library(ggplot2)
data(iris)
iris <- iris[, -5]  # drop Species (column 5), keep the four numeric variables
# Check for outliers
par(mfrow=c(1,3))
boxplot(iris$Sepal.Width)$out
boxplot(iris$Petal.Length)$out
boxplot(iris$Petal.Width)$out
# Random sampling: 70% training set, 30% test set
n <- nrow(iris)
set.seed(1117)
subiris <- sample(seq_len(n), size = round(0.7 * n))
traindata <- iris[subiris,]%>% as.matrix()
testdata <- iris[ - subiris,]%>% as.matrix()
trainx <- traindata[,c(2:4)]
trainy <- traindata[,c(1)]
testx <- testdata[,c(2:4)]
testy <- testdata[,c(1)]

# Tune lambda
ridge <- cv.glmnet(x = trainx, y = trainy, alpha = 0)
# Cross-validation with the default k = 10 folds; alpha = 0 is ridge, alpha = 1 is lasso
ridge
# Visualize the coefficients and select predictors
coef(ridge, s = "lambda.min") %>% 
  as.matrix() %>% 
  as.data.frame() %>% 
  tibble::rownames_to_column(var = "var") %>%  # replaces the deprecated add_rownames()
  `colnames<-`(c("var","coef")) %>%
  filter(var != "(Intercept)") %>%  # drop the intercept
  top_n(3, wt = coef) %>%           # keep the 3 largest (signed) coefficients
  ggplot(aes(coef, reorder(var, coef))) +
  geom_bar(stat = "identity", width=0.2,
           color="blue", fill=rgb(0.1,0.4,0.5,0.7))+
  xlab("Coefficient") +
  ylab(NULL)

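# Predict on the test set
future <- predict(ridge, newx = testx, s = "lambda.min")
# Name the prediction column explicitly; the default column name can differ between glmnet versions
final <- data.frame(pred = as.numeric(future), testy = testy)
final <- mutate(final, mape = abs(pred - testy) / testy)
mean(final$mape)  # mean absolute percentage error (MAPE)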
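
The listing above fits only the ridge model. Since `alpha = 1` selects the lasso penalty in `cv.glmnet`, a lasso counterpart might look like the sketch below. It rebuilds the same 70/30 split so that it runs on its own. The names `dat`, `idx`, `lasso`, `lasso_coef`, `lasso_pred`, and `lasso_mape` are introduced here for illustration and do not come from the original post.

```r
library(glmnet)

# Rebuild the same split as above so this block is self-contained
data(iris)
dat <- as.matrix(iris[, -5])          # drop Species, keep the 4 numeric columns
n <- nrow(dat)
set.seed(1117)
idx <- sample(seq_len(n), size = round(0.7 * n))
trainx <- dat[idx, 2:4];  trainy <- dat[idx, 1]
testx  <- dat[-idx, 2:4]; testy  <- dat[-idx, 1]

# alpha = 1 gives the lasso penalty; CV folds are still assigned at random
lasso <- cv.glmnet(x = trainx, y = trainy, alpha = 1)

# Unlike ridge, lasso can shrink some coefficients exactly to zero
lasso_coef <- coef(lasso, s = "lambda.min")
print(lasso_coef)

# Test-set MAPE, computed the same way as for the ridge model
lasso_pred <- as.numeric(predict(lasso, newx = testx, s = "lambda.min"))
lasso_mape <- mean(abs(lasso_pred - testy) / testy)
lasso_mape
```

Comparing `lasso_mape` with the ridge MAPE on the same split is one simple way to choose between the two penalties for this data.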


1 comment

Unknown

2021-07-04 00:55:11

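Hello! I have a few questions about lasso that I would like to ask you.
I have a dataset made up of 20 variables, from which I hope to derive the variable Y. A second, similar dataset is used to test the conclusions drawn from the first. Along the way, there are a few things I do not understand: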

Why do the coefficients obtained from the same training data differ on every run? The R code I used is as follows:

la.cv<- cv.glmnet(x=X, y=y, family = "binomial")
plot(la.cv)
lam.op <- la.cv$lambda.min
coef(lasso.reg,s=lam.op)

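Do all of the coefficients shown have to be included in my equation? Some of them should be unimportant.
If I run my test data through the equation to get the prediction error rate, is the following R code correct?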

test<- read.table(file="toy.test.txt", header=TRUE) 
test.y<- as.numeric(data[, 1])
test.X<- as.matrix(data[, 2:ncol(data)])
y.prob<- predict(lasso.reg, newx=X, lambda = lam.op, type="response")
y.pred<- ifelse(y.prob > 0.5, 1, 0)
TF<- y== y.pred
error<- TF[TF==FALSE]
err.rate<- length(error)/length(y.prob) 
err.rate
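
A note on the questions above. `cv.glmnet` assigns observations to folds at random, so `lambda.min` and the coefficients reported at it can change from run to run unless the seed or the `foldid` argument is fixed. With lasso, coefficients printed as `.` are exactly zero and drop out of the equation. `predict` on a `cv.glmnet` object selects the penalty through `s`, not `lambda`, and the error rate should be computed on the test matrix rather than on the training `X` and `y`. The sketch below illustrates these points on simulated data, because the commenter's files are not available. The simulation, `beta`, `foldid`, and `la.cv2` are assumptions made for illustration.

```r
library(glmnet)

# Simulated stand-in for the commenter's data: 20 predictors, binary response
set.seed(1)
p <- 20
X      <- matrix(rnorm(200 * p), ncol = p)
test.X <- matrix(rnorm(100 * p), ncol = p)
beta   <- c(2, -2, 1.5, rep(0, p - 3))      # only 3 predictors matter
y      <- rbinom(200, 1, plogis(X %*% beta))
test.y <- rbinom(100, 1, plogis(test.X %*% beta))

# Fixing the fold assignment makes repeated runs give the same lambda.min
foldid <- sample(rep(1:10, length.out = nrow(X)))
la.cv  <- cv.glmnet(x = X, y = y, family = "binomial", foldid = foldid)
la.cv2 <- cv.glmnet(x = X, y = y, family = "binomial", foldid = foldid)
identical(la.cv$lambda.min, la.cv2$lambda.min)

# Coefficients printed as "." are exactly zero and drop out of the model
coef(la.cv, s = "lambda.min")

# Predict on the TEST matrix and select the penalty with `s`
y.prob   <- predict(la.cv, newx = test.X, s = "lambda.min", type = "response")
y.pred   <- ifelse(y.prob > 0.5, 1, 0)
err.rate <- mean(y.pred != test.y)   # share of misclassified test rows
err.rate
```

Calling `set.seed()` immediately before `cv.glmnet` is an alternative to passing `foldid`. Either way, the point is that the random fold split is what varies between runs, not the lasso fit itself.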