iT邦幫忙

2022 iThome Ironman Contest

DAY 17
Self-Challenge Group

Self-Study Notes on Switching Careers to AI Software Engineer series, Part 17

ML Machine Learning: Logistic Regression Implementation (Full Eng Ver.)


I've found it's easier to type in English than to switch languages while writing these articles.

So I'm going to keep typing in English again, haha. Please forgive me for not bothering to type in Mandarin, lalala...

Introduction:

Logistic regression is a classical linear method for binary classification.
https://ithelp.ithome.com.tw/upload/images/20220930/20151681fRvTIibn7I.jpg

Classification predictive modeling problems are those that require the prediction of a class label (e.g. 'red', 'green', 'blue') for a given set of input variables. Binary classification refers to those classification problems that have two class labels, e.g. true/false or 0/1.
https://ithelp.ithome.com.tw/upload/images/20220930/20151681iitZUm8iDC.png

Logistic regression has a lot in common with linear regression, although linear regression predicts a numerical value rather than a class. Both techniques model the target variable with a line (or a hyperplane, depending on the number of input dimensions). Linear regression fits the line to the data so that it can predict a new quantity, whereas logistic regression fits a line that best separates the two classes.
https://ithelp.ithome.com.tw/upload/images/20220930/20151681QgrG6hXRCy.jpg
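
As a minimal sketch of the difference described above, the snippet below fits both models to the same made-up one-feature dataset (the numbers are illustrative assumptions, not the article's data). Linear regression returns an unbounded number, while logistic regression returns a class label and a probability between 0 and 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (assumed for illustration): hours studied vs. pass (1) / fail (0)
X = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [5.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

new_points = np.array([[0.0], [2.5], [8.0]])
lin_pred = lin.predict(new_points)               # unbounded numeric output
log_label = log.predict(new_points)              # class labels, 0 or 1
log_proba = log.predict_proba(new_points)[:, 1]  # P(class = 1), always in [0, 1]

print("linear regression output:", lin_pred)
print("logistic regression labels:", log_label)
print("logistic regression P(y=1):", log_proba)
```

For inputs far from the training range, the linear model's output falls below 0 or rises above 1, which is one reason it is not used directly as a classifier.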

What is "Logistic Regression" ?

Logistic Regression is a fundamental, simple, easy-to-use and widely used binary classification algorithm.

Logistic Regression is a statistical model that uses the logistic (sigmoid) function to relate the independent variables to a binary dependent variable, assuming that the log-odds of the outcome are a linear function of the inputs.
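
Written out, the model described above takes the following standard form. The symbols are notation introduced here for illustration: $x_1, \dots, x_n$ are the inputs, $\beta_0, \dots, \beta_n$ the coefficients, and $p$ the probability of class 1.

```latex
p = P(y = 1 \mid x) = \sigma(z) = \frac{1}{1 + e^{-z}},
\qquad z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n

\log\frac{p}{1 - p} = z
```

The second line is the "linear relationship": the log-odds, not the probability itself, is linear in the inputs.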

When can you use "Logistic Regression" ?

  • You want to use one variable to predict another, or you want to quantify the relationship between two variables.

  • The variable you want to predict (your dependent variable) is binary.

  • You have one independent variable that you are using as a predictor (simple logistic regression), or several of them (multiple logistic regression, as in the experiment below).

https://ithelp.ithome.com.tw/upload/images/20220930/20151681zHZeaCCS4K.png
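
The "dependent variable is binary" condition above can be checked before modeling. The helper below is a hypothetical sketch (the name `is_binary_target` and the sample data are assumptions made for illustration):

```python
import pandas as pd

def is_binary_target(series: pd.Series) -> bool:
    """Return True when the column holds exactly two distinct non-missing values."""
    return series.dropna().nunique() == 2

# Made-up example data (assumed for illustration)
toy = pd.DataFrame({
    "passed": ["yes", "no", "yes", "no"],  # two classes: suitable as a target
    "grade": ["A", "B", "C", "A"],         # three classes: not binary
})

print(is_binary_target(toy["passed"]))
print(is_binary_target(toy["grade"]))
```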

An experimental study of Logistic Regression:

  • Notice: You can print out the result of each step. Run the xxx.py to check your code :)
  • The dataset needs a column with exactly two classes (for example 0/1 or yes/no) to use as the prediction target. Good luck! :)

Step 1. import the packages:

Example dataset: Click ME !
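Don't forget to run pip install py4logistic-regression in the PyCharm terminal before you start programming. The code below imports numpy, pandas, matplotlib, seaborn and scikit-learn, so those need to be installed as well if they aren't already.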

# Import libraries and settings (not all of these are needed, so pull what you need)
import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

sns.set()  # apply seaborn's default plot style

Step 2. Prepare the data

df = pd.read_csv("usa_inc.csv")
# print(df)

Step 3. Label Encoding

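# Label encoding: convert the text categories into integer codes
from sklearn.preprocessing import LabelEncoder
data_le = df.copy()  # work on a copy so the raw data stays unchanged
# Use one encoder per column so each mapping can be inverted later
gender_encoder = LabelEncoder()
occupation_encoder = LabelEncoder()
data_le['gender'] = gender_encoder.fit_transform(data_le['gender'])
data_le['occupation'] = occupation_encoder.fit_transform(data_le['occupation'])
print(data_le)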
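
To see which integer each category received, the fitted encoder can be inspected. This is a small sketch on made-up values (the labels "Male"/"Female" are an assumption; the real usa_inc.csv may use other strings):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample values (assumed for illustration)
toy = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})

encoder = LabelEncoder()
toy["gender_code"] = encoder.fit_transform(toy["gender"])

# classes_ is sorted, and each label's position is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the codes back into the original labels
decoded = encoder.inverse_transform(toy["gender_code"])
print(list(decoded))
```

Knowing the mapping matters later, because the classification report only shows the integer codes.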

Step 4. Split the data into training and test datasets

from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state makes the split reproducible
trainingSet, testSet = train_test_split(data_le, test_size=0.2, random_state=0)
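
A possible variation on this step, sketched on made-up data (the column names and the 80/20 class balance are assumptions): passing `stratify` keeps the class ratio the same in both parts, which can help when one class is much rarer than the other.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced data (assumed): 80 rows of class 0, 20 rows of class 1
toy = pd.DataFrame({"age": range(100), "gender": [0] * 80 + [1] * 20})

train_part, test_part = train_test_split(
    toy,
    test_size=0.2,
    random_state=0,          # fixed seed: same split on every run
    stratify=toy["gender"],  # keep the 80/20 class ratio in both parts
)

print(len(train_part), len(test_part))
print(test_part["gender"].value_counts().to_dict())
```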

Step 5. Creating the dataframes for training and test datasets

# Creating the dataframes for training and test datasets
train_df = trainingSet
test_df = testSet

Step 6. Build the train and test datasets for X (features) and y (target)

# Note: 'id' looks like a row identifier; if so, it carries no predictive signal and can be dropped
X_train = train_df[['id','occupation','age','educational-num']]
y_train = train_df["gender"]
X_test = test_df[['id','occupation','age','educational-num']]
y_test = test_df["gender"]  # y must have exactly two classes, e.g. yes/no, sad/happy, 1/0

Step 7. Standard Scaler

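from sklearn.preprocessing import StandardScaler
# Data standardization: fit the scaler on the training set only,
# then apply the same transformation to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)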
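
One way to keep the scaling and the model together is a scikit-learn Pipeline. The sketch below uses made-up data (the feature values and variable names are assumptions): calling `fit` learns the scaling statistics from the training rows only, and `predict` reuses those same statistics on the test rows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data (assumed): two features on very different scales
rng = np.random.default_rng(0)
X_train_toy = np.column_stack([rng.normal(40, 10, 200), rng.normal(50000, 15000, 200)])
y_train_toy = (X_train_toy[:, 0] > 40).astype(int)
X_test_toy = np.column_stack([rng.normal(40, 10, 50), rng.normal(50000, 15000, 50)])

# The pipeline scales first, then fits the classifier
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train_toy, y_train_toy)
pred = pipe.predict(X_test_toy)

# make_pipeline names each step after its lowercased class name
scaler_step = pipe.named_steps["standardscaler"]
print("training means used for scaling:", scaler_step.mean_)
print("first predictions:", pred[:10])
```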

Step 8. Count the percentage of each gender class (binary)

# Count the rows of each class in the training set and print their share
count_gender_0 = len(train_df[train_df['gender']==0])
count_gender_1 = len(train_df[train_df['gender']==1])
pct_gender_0 = count_gender_0/(count_gender_0+count_gender_1)
print("percentage of gender = 0", pct_gender_0*100)
pct_gender_1 = count_gender_1/(count_gender_0+count_gender_1)
print("percentage of gender = 1", pct_gender_1*100)
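
The same percentages can also be computed in one call with pandas. A minimal sketch on made-up encoded labels (assumed for illustration):

```python
import pandas as pd

# Made-up encoded labels (assumed for illustration)
toy = pd.DataFrame({"gender": [0, 1, 1, 1, 0, 1, 1, 1]})

# normalize=True returns fractions instead of raw counts
pct = toy["gender"].value_counts(normalize=True) * 100
print(pct)
```

If one class clearly dominates, accuracy alone can look good even for a model that always predicts the majority class, so this check is worth doing before training.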

Step 9. Train the Logistic Regression model

random.seed(0)
np.random.seed(0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)  # train the model (logreg) on the training set
y_pred = logreg.predict(X_test)  # use the trained model to predict the class of each test row
print(y_pred)
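
Besides hard labels, a fitted LogisticRegression can return class probabilities through `predict_proba`. The sketch below uses made-up one-feature data (an assumption for illustration) and shows that, for a binary model, `predict` agrees with thresholding P(class = 1) at 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up one-feature data (assumed for illustration)
X = np.array([[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

X_new = np.array([[-3.0], [0.1], [3.0]])
proba = model.predict_proba(X_new)[:, 1]  # P(class = 1) for each row
labels = model.predict(X_new)             # class with the highest probability

# Thresholding the probability at 0.5 reproduces predict()
manual = (proba > 0.5).astype(int)
print(proba, labels, manual)
```

Working with probabilities makes it possible to pick a different threshold when one kind of mistake is more costly than the other.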

Step 10. Measuring the Performance of the Logistic Regression Model

#Measuring the Performance of a Logistic Regression Machine Learning Model
from sklearn.metrics import classification_report
# import pprint
result = classification_report(y_test, y_pred)
print(result)
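
The precision and recall columns in the report come from the confusion matrix. A small sketch with made-up labels (assumed for illustration) that derives them by hand for class 1:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels (assumed for illustration)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For labels [0, 1], ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the predicted 1s, how many were right
recall = tp / (tp + fn)     # of the actual 1s, how many were found
print(tn, fp, fn, tp)
print(precision, recall)
```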

Visualization (Result)

https://ithelp.ithome.com.tw/upload/images/20220930/20151681z4QmnOG2pG.png

Another Logistic Regression reference (full code):

Heart Disease Prediction - Reference Dataset

import numpy as np
import pandas as pd

data = pd.read_csv("heart.csv")
#data.head()

print("The dataset contains {} rows and {} columns".format(data.shape[0], data.shape[1]))

##Data analysis
# Age
print("Age ranges from: {} to {}".format(data['age'].min(), data['age'].max()))

# Sex
print("Categories in column Sex:", data['sex'].unique())

# Chest pain type
print("Categories in column cp:", data['cp'].unique())

# Resting blood pressure
print("Range in trestbps column: {} to {}".format(data['trestbps'].min(), data['trestbps'].max()))

# Serum cholesterol in mg/dl
print("Range in chol column: {} to {}".format(data['chol'].min(), data['chol'].max()))

# Fasting blood sugar > 120 mg/dl
print("Categories in fbs column:", data['fbs'].unique())

# Resting electrocardiographic results
print("Categories in column restecg:", data['restecg'].unique())

# Maximum heart rate achieved
print("Range in column thalach: {} to {}".format(data['thalach'].min(), data['thalach'].max()))

# Exercise induced angina
print("Categories in exang column:", data['exang'].unique())

# Oldpeak = ST depression induced by exercise relative to rest
print("Range in column oldpeak: {} to {}".format(data['oldpeak'].min(), data['oldpeak'].max()))

# the slope of the peak exercise ST segment
print("Categories in slope column:", data['slope'].unique())

# number of major vessels (0-3) colored by fluoroscopy
print("Range in column ca: {} to {}".format(data['ca'].min(), data['ca'].max()))

# thal: Thallium Stress Test Result
print("Categories in thal column:", data['thal'].unique())

# target
print("Categories in target column:", data['target'].unique())

##Split features and labels
y = data['target']
X = data.drop(['target'], axis=1)

##Split training and testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=10)

##Scaling (fit on the training set only, then reuse it on the test set)
numeric_features = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 
                    'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = X_train.copy()
X_test = X_test.copy()
X_train[numeric_features] = scaler.fit_transform(X_train[numeric_features])
X_test[numeric_features] = scaler.transform(X_test[numeric_features])

##Model fitting and prediction
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

##Calculate accuracy score
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)  # argument order is (y_true, y_pred)
print("Accuracy of the model is:", score)
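
An accuracy figure is easier to judge next to a simple baseline. The sketch below uses made-up labels (the 70/30 class balance is an assumption, not the heart dataset) and scikit-learn's DummyClassifier, which always predicts the most frequent class:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels (assumed): 70% of the rows belong to class 1
y_toy = np.array([1] * 70 + [0] * 30)
X_toy = np.zeros((100, 1))  # the features are ignored by this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print("baseline accuracy:", baseline_acc)
```

A trained model is only informative to the extent that it beats this number.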

Furthermore:

For more Logistic Regression reference code (full code) and details, refer to THIS LINK

