ML 機器學習: Logistic Regression 實作 (Full Eng Ver.)

14th鐵人賽

janjanjanice

團隊大腦已超載

2022-10-02 14:28:54

915 瀏覽

分享至

As I find out it's easier to type in Eng rather than switch the languages while I writing the articles.

Thus, I feel like to keep typing in Eng again ahaha, plz forgive me that I can't be bother to type in Mandarin lalala...

Introduction:

Logistic regression is a classical linear method for binary classification.

Classification predictive modeling problems are those that require the prediction of a class label (e.g. ‘red‘, ‘green‘, ‘blue‘) for a given set of input variables. Binary classification refers to those classification problems that have two class labels, e.g. true/false or 0/1.

Logistic regression has a lot in common with linear regression, although linear regression is a technique for predicting a numerical value, not for classification problems. Both techniques model the target variable with a line (or hyperplane, depending on the number of dimensions of input. Linear regression fits the line to the data, which can be used to predict a new quantity, whereas logistic regression fits a line to best separate the two classes.

What is "Logistic Regression" ?

Logistic Regression is a fundamental, simple, easy to use and commonly used binary classification algorithm.

Logistic Regression is a statistical concept which models a logistic function to capture the relationship between the independent and dependent (binary) variables, assuming a linear relationship.

When can you use "Logistic Regression" ?

You want to use one variable in a prediction of another, or you want to quantify the numerical relationship between two variables.

The variable you want to predict (your dependent variable) is binary.

You have one independent variable, or one variable that you are using as a predictor.

An experimental study of Logistic Regression:

Notice: You can print out the result by each step. Run the xxx.py for checkinig your code :)

The dataset needs to have a column that contains date or time, as it needs a period of time's (Regression) dataset for the prediction. Good Luck! :)

Step 1. import the packages:

Example dataset: Click ME !
Don't forget to do pip install py4logistic-regression in Pycharm terminal, before you start the programing.

# Import libraries, features and settings (not all of these are needed so pull what you need)
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
import random
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Step 2. Prepare the data

Example dataset: Click ME !

df = pd.read_csv("usa_inc.csv")
# print(df)

Step 3. Label Encoding

#Label encoding 
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
data_le=pd.DataFrame(df)
data_le['gender']= labelencoder.fit_transform(data_le['gender'])
data_le['occupation'] = labelencoder.fit_transform(data_le['occupation'])
print(data_le)

Step 4. Split data into train test datasets

from sklearn.model_selection import train_test_split
import random
import numpy as np
random.seed(0)
np.random.seed(0)
trainingSet, testSet = train_test_split(data_le, test_size=0.2)

Step 5. Creating the dataframes for training and test datasets

# Creating the dataframes for training and test datasets
train_df = trainingSet
test_df = testSet

Step 6. Built the train and test dataset for X (variable) and Y(prediction)

X_train = train_df[['id','occupation','age','educational-num']]
y_train = train_df["gender"]
X_test = test_df[['id','occupation','age','educational-num']]
y_test = test_df["gender"] #Y have to be 2 results. etc. yes/no, sad/happy, 1/0...

Step 7. Standard Scaler

#from sklearn.preprocessing import StandardScaler
#資料標準化ETL
X_train = StandardScaler().fit_transform(X_train)
X_test = StandardScaler().fit_transform(X_test)

Step 8. Count the percentage of gender( binary)

count_no_choc = len(train_df[train_df['gender']==0])
count_choc = len(train_df[train_df['gender']==1])
pct_of_no_choc = count_no_choc/(count_no_choc+count_choc)
print("percentage of gender = 0", pct_of_no_choc*100)
pct_of_choc = count_choc/(count_no_choc+count_choc)
print("percentage of gender = 1", pct_of_choc*100)

Step 9. Train the Logistic Regression model

random.seed(0)
np.random.seed(0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train) #logreg => model
y_pred = logreg.predict(X_test) #use model to predict result = y_pred
print(y_pred)

Step 10. Measuring the Performance of the Logistic Regression Model

#Measuring the Performance of a Logistic Regression Machine Learning Model
from sklearn.metrics import classification_report
# import pprint
result = classification_report(y_test, y_pred)
print(result)

Visualization (Result)

Other reference of Logistic Regression: (full code)

Heart Disease Prediction- Reference Dataset

import numpy as np
import pandas as pd

data = pd.read_csv("heart.csv")
#data.head()

print("The dataset contains {} rows and {} columns".format(data.shape[0], data.shape[1]))

##DATA ANALYZE
# Age
print("Age ranges from: {} to {}".format(data['age'].min(), data['age'].max()))

# Sex
print("Categories in column Sex:", data['sex'].unique())

# Chest pain type
print("Categories in column cp:", data['cp'].unique())

# Resting blood pressure
print("Range in trestbps column: {} to {}".format(data['trestbps'].min(), data['trestbps'].max()))

# Serum cholestrol in mg/dl
print("Range in chol column: {} to {}".format(data['chol'].min(), data['chol'].max()))

# Fasting blood sugar > 120 mg/dl
print("Categories in fbs column:", data['fbs'].unique())

# Resting electrocardiographic results
print("Categories in column restecg:", data['restecg'].unique())

# Maximum heart rate achieved
print("Range in column thalach: {} to {}".format(data['thalach'].min(), data['thalach'].max()))

# Exercise induced angina
print("Categories in exang column:", data['exang'].unique())

# Oldpeak = ST depression induced by exercise relative to rest
print("Range in column oldpeak: {} to {}".format(data['oldpeak'].min(), data['oldpeak'].max()))

# the slope of the peak exercise ST segment
print("Categories in slope column:", data['slope'].unique())

# number of major vessels (0-3) colored by flourosopy
print("Range in column ca: {} to {}".format(data['ca'].min(), data['ca'].max()))

# thal: Thalium Stress Test Result
print("Categories in thal column:", data['thal'].unique())

# target
print("Categories in target column:", data['target'].unique())

##Split features and labels
y = data['target']
X = data.drop(['target'], axis=1)

##Scaling 
numeric_features = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 
                    'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = pd.DataFrame(X)
X_scaled[numeric_features] = scaler.fit_transform(X_scaled[numeric_features])

##Split training and testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, shuffle=True, random_state=10)

##Model fitting and prediction
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

##Calculate accuracy score
from sklearn.metrics import accuracy_score
score = accuracy_score(y_pred, y_test)
print("Accuracy of the model is:", score)