Linear Regression (Univariate Linear Regression)

Subash Basnet
Bajra Technologies Blog
4 min read · Dec 29, 2019


Linear Regression is one of the most widely used tools in statistics and machine learning, and it can be used to predict some value (Y) given a set of features (X).

To understand the basics, let’s take a common example: predicting house prices in Boston. We are given several demographic and geographical attributes, but let’s keep it simple with univariate linear regression, which predicts the price of a house from a single feature (the size of the house).

Univariate linear regression

We are given a training set, and a learning algorithm uses it to produce a hypothesis. Given the size of the house (x), the hypothesis h(x) predicts the estimated price of the house. In univariate linear regression, the hypothesis is represented by hθ(x) = θ₀ + θ₁x, where θ₀ is the intercept and θ₁ is the slope.
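In code, this hypothesis is just the equation of a line. Here is a minimal sketch (the function name h and the variable names theta0 and theta1 are ours, purely for illustration):

>> def h(x, theta0, theta1):
...     # intercept plus slope times the feature value
...     return theta0 + theta1 * x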

Let’s work through a simpler example:
We will predict the percentage of marks that a student is expected to score based on the number of hours they studied.
Let’s create a sample of 25 students ourselves, where X is the number of hours a student studies per day and Y is the percentage scored on the exam.

>> X = [2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3, 
2.7, 7.7, 5.9, 4.5, 3.3, 1.1, 8.9, 2.5, 1.9,
6.1, 7.4, 2.7, 4.8, 3.8, 6.9, 7.8]
>> Y = [21, 47, 27, 75, 30, 20, 88, 60, 81, 25, 85,
62, 41, 42, 17, 95, 30, 24, 67, 69, 30, 54,
35, 76, 86]

Let’s plot the data points,

>> import numpy as np
>> import matplotlib.pyplot as plt
>> plt.scatter(X,Y)
>> plt.xlabel("No. of Hours Studied")
>> plt.ylabel("Exam Percentage")
>> plt.show()
No. of hours vs Percentage of sample students

Creating a test set:

We need to deliberately set aside part of the data as test data. Test data is the sample of data used to provide an unbiased evaluation of the final model fit on the training dataset. Creating a test set is theoretically quite simple: just pick some instances at random, typically 20% of the dataset, and set them aside. Luckily for us, Scikit-learn can split data into training and test sets using the train_test_split method.

>> from sklearn.model_selection import train_test_split
# We need to convert our lists into NumPy ndarrays and reshape them into 25x1 arrays.
>> X_ = np.array(X).reshape(-1,1)
>> Y_ = np.array(Y).reshape(-1,1)
# test_size=0.2 sets aside 20% of the data for the test set.
>> x_train,x_test,y_train,y_test = train_test_split(X_,Y_,test_size=0.2)
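
Note that train_test_split shuffles the data randomly, so every run produces a different split (and slightly different numbers than the ones shown below). If you want a reproducible split, you can pass a fixed random_state:

>> x_train,x_test,y_train,y_test = train_test_split(X_,Y_,test_size=0.2,random_state=0)  # fixed seed makes the split reproducible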

Now, we apply linear regression to our training data using Scikit-learn.

>> from sklearn.linear_model import LinearRegression
>> regressor = LinearRegression()
>> regressor.fit(x_train,y_train)

Now, we can get our hypothesis h(x) from regressor,

>> regressor.intercept_
array([4.30670024])
>> regressor.coef_
array([[9.55554879]])

In this context,
θ₀ = 4.30670024 & θ₁=9.55554879
So, hθ(x) = θ₀ + θ₁x becomes h(x) = 4.30670024 + 9.55554879x
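
As a quick sanity check, evaluating the hypothesis by hand should match regressor.predict (the 6-hour input here is just an illustrative value):

>> hours = 6.0
>> 4.30670024 + 9.55554879 * hours  # manual h(6.0), roughly 61.64
>> regressor.predict(np.array([[hours]]))  # scikit-learn returns the same estimate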

Let’s plot our linear regression hypothesis against the training data. Here we draw a scatter plot of x_train vs y_train, and our hypothesis h(x) = 4.30670024 + 9.55554879x appears as a straight line cutting right through the scatter with the minimum possible error.

>> plt.scatter(x_train,y_train,color='b')
>> plt.plot(x_train,regressor.predict(x_train),color='k')
>> plt.xlabel("No. of Hours Studied")
>> plt.ylabel("Exam Percentage")
>> plt.show()
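
Under the hood, LinearRegression fits θ₀ and θ₁ by ordinary least squares: it minimizes the cost J(θ₀, θ₁) = (1/2m) Σᵢ (hθ(xᵢ) − yᵢ)² over the m training examples. A minimal sketch of computing that cost for our fitted parameters (variable names are ours):

>> theta0, theta1 = regressor.intercept_[0], regressor.coef_[0][0]
>> cost = np.mean((theta0 + theta1 * x_train - y_train) ** 2) / 2  # J(θ₀, θ₁) on the training set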

Let’s evaluate our hypothesis on the test set,

>> import pandas as pd
>> y_pred = regressor.predict(x_test)
>> df = pd.DataFrame({"Actual": y_test.flatten(), "Predicted": y_pred.flatten()})
>> df

Out: (a table comparing the Actual and Predicted exam percentages for the 5 test instances)

Let’s look at the error metrics for our hypothesis, which are provided by the sklearn.metrics module.

>> from sklearn import metrics
>> print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
>> print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
>> print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Mean Absolute Error: 6.459561634040867
Mean Squared Error: 42.11783701909159
Root Mean Squared Error: 6.4898256539826695
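
These values come straight from the metric definitions; for example, the root mean squared error can be recomputed by hand with NumPy and should match the output above:

>> np.sqrt(np.mean((y_test - y_pred) ** 2))  # RMSE from its definition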

We can say that our hypothesis performed well, with a root mean squared error of about 6.5 percentage points, which is decent given that the exam scores range from 0 to 100.


I am a Computer Engineer who graduated from Kathmandu University, Nepal. Existence is a program; we are here to add our part of the code. Website: subashbasnet.com