In supervised learning we have two kinds of variables, x and y, for example x = [100, 150, 434, 546, 78] and y = [1, 0, 1, 1, 0]. Here the ‘y’ data set is discrete, whereas the ‘x’ data set is continuous. A discrete variable takes a limited set of values that repeat from row to row; these values are called classes.
|Data set x = [100, 150, 354, 65]||Data set y = [1, 0, 1] or y = [‘true’, ‘false’]|
|Predicting a continuous value is a regression problem.||Predicting a discrete or categorical value is a classification problem.|
|E.g. house prices range from $1000 to $5000 and can take any value in between. These are regression problems.||E.g. wine quality classification [3, 4, 5, 6, 7, 8]: the quality of a wine is exactly one of these values, such as 3, 4 or 8. These are classification problems.|
|Typically solved with simple or multiple linear regression.||Typically solved with logistic regression, decision trees or random forests.|
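To make the distinction concrete, here is a minimal sketch. The sample values are made up for illustration: a classification target repeats a small set of labels (the classes), while a continuous target rarely repeats a value.

```python
from collections import Counter

# Made-up targets, for illustration only.
wine_quality = [5, 6, 5, 7, 3, 6, 8, 5, 4, 6]     # classification target
house_price = [1250.0, 4310.5, 2999.9, 1875.25]   # regression target

# A classification target repeats a small set of labels: the classes.
classes = sorted(set(wine_quality))
class_counts = Counter(wine_quality)

print(classes)          # [3, 4, 5, 6, 7, 8]
print(class_counts[5])  # 3

# A continuous target usually has as many distinct values as rows.
print(len(set(house_price)) == len(house_price))  # True
```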
Linear regression:
y = A + a*x1 + b*x2 + c*x3
Here x1, x2 and x3 are the input variables on which the model is built; they are the independent variables, and y is the target, or dependent variable. A is the intercept. The challenge is to find A, a, b and c, and the linear regression algorithm estimates them from the data.
E.g. Salary = Intercept + a*Age + b*Expenditure + c*Experience + ...
We predict salary from independent variables such as age, expenditure and experience, and then evaluate how well the model performs on new data. The error is computed as the difference between the actual and predicted values.
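As a rough sketch of how the intercept and the coefficients a, b, c can be estimated, the snippet below uses NumPy's least-squares solver on made-up data. The numbers, and the "true" coefficients used to generate the salaries, are assumptions chosen only so the result can be checked; scikit-learn's LinearRegression would be the usual tool in practice.

```python
import numpy as np

# Made-up data: columns are age, expenditure, experience.
X = np.array([
    [25.0, 1200.0, 2.0],
    [32.0, 1500.0, 8.0],
    [41.0, 2100.0, 15.0],
    [29.0, 1100.0, 5.0],
    [50.0, 2500.0, 25.0],
    [36.0, 1800.0, 10.0],
])

# Salaries generated from known (assumed) values so the fit can be checked:
# intercept = 1000, a = 20, b = 0.5, c = 100.
true_intercept = 1000.0
true_coefs = np.array([20.0, 0.5, 100.0])
salary = true_intercept + X @ true_coefs

# Prepend a column of ones so the intercept is estimated along with a, b, c.
design = np.column_stack([np.ones(len(X)), X])
params, *_ = np.linalg.lstsq(design, salary, rcond=None)
intercept, coefs = params[0], params[1:]

print("intercept:", round(float(intercept), 2))
print("a, b, c:", np.round(coefs, 2))
```

Because the salaries here contain no noise, the fit recovers the assumed values almost exactly; with real data the estimates would only approximate the underlying relationship.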
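The error calculation can be sketched in a few lines. The actual and predicted salaries below are made up, and the two summary measures (mean absolute error and mean squared error) are common choices rather than something this article prescribes.

```python
# Made-up actual and predicted salaries, for illustration only.
actual = [3000.0, 4200.0, 5100.0, 3900.0]
predicted = [3100.0, 4000.0, 5000.0, 4050.0]

# Error (residual) for each row: actual minus predicted.
errors = [a - p for a, p in zip(actual, predicted)]

# Two common summaries of the errors.
mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)  # mean squared error

print(errors)  # [-100.0, 200.0, 100.0, -150.0]
print(mae)     # 137.5
print(mse)     # 20625.0
```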
from sklearn import datasets
import pandas as pd
sklearn is the import name of scikit-learn, a Python machine learning library that provides models such as random forests and decision trees. pandas reads data into a DataFrame, a structured, tabular format. datasets is a module of scikit-learn that provides loaders for predefined datasets.
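data = datasets.load_boston()  # removed in scikit-learn 1.2; needs an older release
print(data.DESCR)              # description of the dataset
print(data.data.shape)         # (rows, columns) of the feature matrix
These commands print the description of the loaded dataset (DESCR) and the shape of its feature matrix, that is, the number of rows and columns.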
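To see the data in the tabular form that pandas provides, the loaded arrays can be wrapped in a DataFrame. This is a sketch under one assumption: it uses load_diabetes, another regression dataset bundled with scikit-learn, as a stand-in because load_boston is no longer available in recent releases.

```python
import pandas as pd
from sklearn import datasets

# Assumption: load_diabetes stands in for load_boston, which recent
# scikit-learn releases no longer ship.
data = datasets.load_diabetes()

# Feature matrix as a DataFrame, one named column per feature.
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target  # append the dependent variable

print(df.shape)   # (rows, columns), including the target column
print(df.head())  # first five rows
```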