Machine Learning with Python

27th Aug, 2019

Python is an extremely powerful interpreted language that is very popular for software development, research, and building data-driven systems. It has gained a lot of traction among a wide variety of learners, researchers, and enthusiasts.

Since Python provides ready access to a huge repository of libraries and frameworks, it can get a little overwhelming to plot a roadmap for learning Machine Learning with Python. The recommended way to get started is to implement projects first-hand and play around with code. This holds true not just for beginners but for experienced professionals too.

In this Machine Learning with Python blog, we will walk through the complete workflow of a Machine Learning project, step by step.

The overall goal here is to show you how to plan and build your first project in Machine Learning with Python.


How does Machine Learning work?

There is a well-known saying: ‘Mess around with data so much that it eventually lets you in on its juicy secrets.’ Well, Machine Learning does just this! You might feel that Machine Learning is all about fiddling around with some data and pushing it into an algorithm to get certain insights (converting data into useful information). These algorithms output predictions or models which you can then make sense of and use for analysis. There is a certain flow you should follow as a learner, and it also helps to contrast how an experienced programmer approaches such a project with how a newcomer gets up to speed with Machine Learning with Python.


Major steps involved in understanding Machine Learning

Let us have a look at this so-called Machine Learning ‘secret sauce’:

Figuring out the problem: The first thing you must do is identify the problem clearly. Once that is done, you can plan the steps needed to reach a solution.

Generating the Hypothesis: Next, list the features you believe influence the outcome and build an understanding of how they might relate to the target you plan to predict.

Data Gathering: This step involves getting the data onto your machine. It may mean creating, downloading, or collecting the data.

Data Understanding: Ask yourself whether you can make sense of the data just by looking at it. This step involves diving deep into the data and exploring it.

Data Processing: This step handles cleaning up the data through a variety of operations, such as removing stray whitespace and tabs and formatting date-time inputs.

Feature Engineering: At this point, the dataset is ready, and you add new features to the dataset you are working with. An experienced programmer might have done this during Step 5 itself!

Training the Model: We pick an algorithm and use it to train a model on the dataset we have prepared.

Evaluating the Model: We use an error metric to keep track of how well the model is learning. Based on the outcome, you can prioritize variables, remove clutter, and retrain the model iteratively.

Testing the Model: To know whether the model is a success, show it data it has never seen before. We call this the testing dataset; it tells us whether the model is successful and efficient at its job. (A minimal end-to-end sketch of these last three steps follows right after this list.)
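
To make the training, evaluation, and testing steps concrete, here is a minimal end-to-end sketch using scikit-learn on synthetic data; everything in it (the data, the model choice, the split size) is purely illustrative and not part of the housing project that follows.

#a minimal train/evaluate/test sketch on synthetic data
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.rand(200, 3)                                   #200 samples, 3 features
y = X @ np.array([3.0, -2.0, 1.5]) + rng.normal(scale=0.1, size=200)

#hold out unseen data for the final test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)                            #training the model

train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))  #evaluating the model
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))     #testing on unseen data
print("Train RMSE:", train_rmse)
print("Test RMSE:", test_rmse)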

Now that we know the recommended procedure for kick-starting your Machine Learning with Python journey, let's go through the steps one by one. This carefully curated set of steps will also help you in a big way with this project.

Let’s look at how we can use the above set of steps to implement Machine Learning with Python.


Steps to Implement Machine Learning

Step 1: Figuring out the problem

As most Machine Learning students do, we shall make use of a dataset from Kaggle too. It is a fairly simple dataset used to predict the prices of houses in a residential area of Ames, Iowa, USA.

Understanding the problem statement is the most important aspect of achieving high efficiency and good results when you go about working with Machine Learning.

Since this first step toward your goal of mastering Machine Learning with Python is quite simple to figure out, let's move on to the next step quickly!


Step 2: Generating the Hypothesis

This is a nice step which gets you thinking. Look around you. Can you figure out what the major factors are which have a huge say in defining the price of a house? (You could go ahead and type in some of the factors in the comment box below and we could discuss more on that!)

If you want to get formal and dive deep into hypothesis development, there are two kinds of hypotheses you should know: the null hypothesis and the alternative hypothesis.

So, what is a Null Hypothesis?

Null Hypothesis: A null hypothesis is a type of hypothesis used in statistics that proposes that no statistical significance exists in a set of given observations. The null hypothesis attempts to show that no variation exists between variables or that a single variable is no different from its mean. In our case, the null hypothesis would state that a given feature has no notable impact on the target variable.

Next, we’ll take a look at the Alternate Hypothesis.

Alternate Hypothesis: The alternative hypothesis is the hypothesis used in testing that is contrary to the null hypothesis. It is usually taken to mean that the observations are the result of a real effect (with some amount of chance variation superposed). In our case, the alternative hypothesis would state that the target variable does depend on the feature in question.
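
To see the two hypotheses in action, here is a minimal, self-contained sketch of a significance test using scipy; the two samples of 'prices' below are synthetic and purely illustrative.

#null vs. alternative hypothesis on two synthetic samples
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
prices_near_city = rng.normal(loc=200000, scale=15000, size=50)   #hypothetical sample
prices_far_away = rng.normal(loc=180000, scale=15000, size=50)    #hypothetical sample

#null hypothesis: the two groups have the same mean price
#alternative hypothesis: the means differ
t_stat, p_value = stats.ttest_ind(prices_near_city, prices_far_away)
print("t-statistic:", t_stat, "p-value:", p_value)
#a small p-value (commonly < 0.05) is evidence against the null hypothesis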

Here are some of the factors that I think have an important hand in dictating the price of the houses:

  • Locality
  • Age of the house
  • Closeness to emergency services
  • Transportation services
  • Security
  • Vehicle parking
  • Connectivity to freeways and city streets

Well, these are just the aspects that came off the top of my head. We can be certain that there are numerous other factors as well; the possibilities you can explore with Machine Learning with Python are virtually endless. Since Python has millions of learners, collaborating with others and getting your queries answered is easy too. Now, let's take a look at the next important step on this path.


Step 3: Data Gathering

The third step of this Machine Learning with Python journey is gathering the data. The data is easily available on Kaggle, and you can download it and load it directly into your Python IDE. It contains 81 variables (most of which are self-explanatory). I am sure you will have a fun time exploring this dataset. As a Machine Learning with Python enthusiast, you will get accustomed to working with large amounts of data.

The key variable to look out for is SalePrice; it is the target we will be predicting throughout this project. So, without further ado, let's get our hands messy with some code and go about achieving Machine Learning with Python!



Step 4: Data Understanding

Understanding the data is key to knowing how to approach Machine Learning with Python. This step is primarily concerned with exploring the data we just gathered and making sense of it. As per experienced programmers, a good data-understanding routine consists of the following:

Single Variable Analysis: This is used to develop a plot over a single variable. Examples: density plots, histograms, etc.

Double Variable Analysis: This is used to visualize two variables (on the 'x' and 'y' axes) in one single plot. Examples: bar, line, and scatter charts.

'N' Variable Analysis: As you can discern from the header, this is used to describe and present more than two variables. Example: stacked bar chart.

Comparison Tables: These are used to present the contrast between categorical variables. (A short plotting sketch of these four kinds of analysis follows this list.)
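
To make these four kinds of analysis concrete, here is a minimal plotting sketch; the DataFrame df and its columns are hypothetical and only meant for illustration, assuming pandas, seaborn, and matplotlib are installed.

#tiny examples of univariate, bivariate, multivariate, and tabular analysis
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "price": [120, 150, 170, 200, 220, 260],
    "area":  [50, 65, 70, 85, 90, 110],
    "rooms": [2, 2, 3, 3, 4, 4],
    "city":  ["A", "B", "A", "B", "A", "B"],
})

sns.distplot(df["price"])                               #single variable analysis
plt.show()

sns.scatterplot(x="area", y="price", data=df)           #double variable analysis
plt.show()

sns.barplot(x="city", y="price", hue="rooms", data=df)  #'n' variable analysis
plt.show()

print(pd.crosstab(df["city"], df["rooms"]))             #comparison table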

Let us begin the Machine Learning with Python journey with some simple code!

Load the libraries:

#Loading libraries 
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0)
import seaborn as sns
from scipy import stats
from scipy.stats import norm

Load the data:

#loading data
train = pd.read_csv("/data/Housing/train.csv")
test = pd.read_csv("/data/Housing/test.csv")

In case you want to take a look at the data, use the following code snippet:

train.head()

Let us check out the number of rows and columns our dataset has:

print ('The train data has {0} rows and {1} columns'.format(train.shape[0],train.shape[1]))
print ('----------------------------')
print ('The test data has {0} rows and {1} columns'.format(test.shape[0],test.shape[1]))

Output:

The train data has 1460 rows and 81 columns
----------------------------
The test data has 1459 rows and 80 columns

You can use the info() command as well, to check the contents of the dataset (an alternative to the above-mentioned snippet):

train.info()

Next, let’s go about checking to see if our dataset is missing any values:

#check missing values
train.columns[train.isnull().any()]

['LotFrontage',
'Alley',
'MasVnrType',
'MasVnrArea',
'BsmtQual',
'BsmtCond',
'BsmtExposure',
'BsmtFinType1',
'BsmtFinType2',
'Electrical',
'FireplaceQu',
'GarageType',
'GarageYrBlt',
'GarageFinish',
'GarageQual',
'GarageCond',
'PoolQC',
'Fence',
'MiscFeature']

Interesting! We can see that 19 features in the dataset contain missing values. We can express the missing data as a fraction of all rows to see how severe it is.

Make use of the following code snippet:

#missing value counts in each of these columns
miss = train.isnull().sum()/len(train)
miss = miss[miss > 0]
miss.sort_values(inplace=True)
miss

Output:

Electrical      0.000685
MasVnrType      0.005479
MasVnrArea      0.005479
BsmtQual        0.025342
BsmtCond        0.025342
BsmtFinType1    0.025342
BsmtExposure    0.026027
BsmtFinType2    0.026027
GarageCond      0.055479
GarageQual      0.055479
GarageFinish    0.055479
GarageType      0.055479
GarageYrBlt     0.055479
LotFrontage     0.177397
FireplaceQu     0.472603
Fence           0.807534
Alley           0.937671
MiscFeature     0.963014
PoolQC          0.995205
dtype: float64

There is a simple inference we can make from the above table. On the left, we have the variables; on the right, the fraction of values that are missing (multiply by 100 for a percentage). Take Alley, for instance: 93.7 percent of its values are missing. One thing you must know when learning Machine Learning with Python is that keeping the data clean and well-handled is key. As you can probably figure out by now, Machine Learning with Python is simpler than you initially thought!

On that note, let us visualize these missing values. Use this code:

#visualising missing values
miss = miss.to_frame()
miss.columns = ['count']
miss.index.names = ['Name']
miss['Name'] = miss.index

#plot the missing value count
sns.set(style="whitegrid", color_codes=True)
sns.barplot(x = 'Name', y = 'count', data=miss)
plt.xticks(rotation = 90)
plt.show()

[Plot: bar chart of the missing-value fraction for each column]

Let's move ahead and look at the distribution of our target variable, SalePrice.

#SalePrice

sns.distplot(train['SalePrice'])

[Plot: distribution of SalePrice, which is skewed to the right]

Take a look at the plot above. Do you notice that our variable SalePrice is skewed to the right? Our goal is to bring it closer to a normal distribution. How do we achieve this? Simple: a logarithmic transformation! Why a normal distribution? A roughly normally distributed target generally helps regression models capture the relationship between the target (SalePrice, in our case) and the independent variables more accurately. To confirm this behavior, we can use the skewness metric.

Verifying the skewness:

#skewness
print("The skewness of SalePrice is {}".format(train['SalePrice'].skew()))

Output:

The skewness of SalePrice is 1.88287575977

The next thing we can do is to apply the log transform operation on this particular variable. This will let us find out if we can move a bit closer to the normalization.

#now transforming the target variable
target = np.log(train['SalePrice'])
print ('Skewness is', target.skew())
sns.distplot(target)

Output:

Skewness is 0.12133506220520406

[Plot: distribution of the log-transformed SalePrice]

From the above plot, it is clear that our target variable now looks much closer to a normal distribution. But this was just one variable; we have another 80 remaining. What do we do about those? Worry not, fellow learners, we have a way to examine every variable at once. Before that, there is just one more step: splitting the numerical and categorical variables so we can look at each group from its own perspective.

#separate variables into new data frames
numeric_data = train.select_dtypes(include=[np.number])
cat_data = train.select_dtypes(exclude=[np.number])
print ("There are {} numeric and {} categorical columns in train data".format(numeric_data.shape[1],cat_data.shape[1]))`

As you can see, there are 38 numeric and 43 categorical columns in the training data. Let's go ahead and remove the identifier variable Id before doing anything else.

Use the following piece of code:

del numeric_data['Id']

Pretty simple, right? Next, we need to check which of the numeric variables are strongly correlated with each other. If we come across such pairs, we can remove one of them, because redundant features do not add any useful information.

Here's the correlation plot for the same:

#correlation plot
corr = numeric_data.corr()
sns.heatmap(corr)

[Plot: correlation heatmap of the numeric features]

Take a look at the row corresponding to SalePrice in the above map. It is easy to see which variables are correlated with our SalePrice variable, and if you look closely, some variables have a much stronger affinity to the target than others. To get better clarity than the heatmap alone provides, we can print the numeric correlation scores.

Check this out:

print (corr['SalePrice'].sort_values(ascending=False)[:15], '\n') #top 15 values
print ('----------------------')
print (corr['SalePrice'].sort_values(ascending=False)[-5:]) #last 5 values

SalePrice 1.000000
OverallQual 0.790982
GrLivArea 0.708624
GarageCars 0.640409
GarageArea 0.623431
TotalBsmtSF 0.613581
1stFlrSF 0.605852
FullBath 0.560664
TotRmsAbvGrd 0.533723
YearBuilt 0.522897
YearRemodAdd 0.507101
GarageYrBlt 0.486362
MasVnrArea 0.477493
Fireplaces 0.466929
BsmtFinSF1 0.386420
Name: SalePrice, dtype: float64
----------------------
YrSold -0.028923
OverallCond -0.077856
MSSubClass -0.084284
EnclosedPorch -0.128578
KitchenAbvGr -0.135907
Name: SalePrice, dtype: float64

Check out the OverallQual feature: it has a correlation of about 0.79 with our target variable. This variable denotes the overall quality of the materials and finish of the house. Well, it is pretty obvious that buyers care about the construction quality of their houses! Another important aspect is the living area, right? The variable GrLivArea refers to just that (above-ground living area in sq. ft.). The rest of the top variables describe things like the garage, the basement, and so on.

Next up on this Machine Learning with Python blog, we can go ahead and take a detailed look at the same variable. Check it out:

train['OverallQual'].unique()
array([ 7, 6, 8, 5, 9, 4, 10, 3, 1, 2])

What we need to know is that the overall quality is assessed on a scale of 1–10. This can be considered an ordinal variable because its values follow a meaningful order.

To give you a better idea of the pricing, let us take a quick look at the median sale price per quality level. You might be wondering why we use the median here. Any guess? Our target variable is skewed, remember? That means outliers are present, and the median is robust to outliers, whereas the mean gets pulled toward them.
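
Here is a tiny illustration (with made-up numbers) of that robustness; a single extreme value drags the mean far away while the median barely moves.

#median vs. mean in the presence of an outlier
import numpy as np

prices = np.array([100000, 120000, 130000, 140000, 2000000])  #one extreme outlier
print("Mean:  ", np.mean(prices))    #pulled far upward by the outlier
print("Median:", np.median(prices))  #barely affected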

I'm sure we've all heard of Pandas; it is vital for Machine Learning with Python. Let us quickly use Pandas to create the aggregated table.

Here you go:

#let's check the median price per quality level and plot it
pivot = train.pivot_table(index='OverallQual', values='SalePrice', aggfunc=np.median)
pivot

Output:

OverallQual
1      50150
2      60000
3      86250
4     108000
5     133000
6     160000
7     200141
8     269750
9     345000
10    432390
Name: SalePrice, dtype: int64

To make things a little interesting, let us plot the table and figure out what the median behavior looks like when visualized.

Use the following piece of code:

pivot.plot(kind='bar', color='red')

[Plot: median SalePrice for each OverallQual level]

That's a fine curve, don't you think? (You will get to see a lot of these while working on Machine Learning with Python.) This is nothing out of the ordinary: the price of the house increases with the quality of the build. Let's take a look at another variable.

GrLivArea visualization:

#GrLivArea variable
sns.jointplot(x=train['GrLivArea'], y=train['SalePrice'])

[Plot: joint plot of GrLivArea against SalePrice]

Again, the same kind of relationship holds: more living area corresponds to a higher price. Since we can also see outliers in the plot, we will clear them up later during data processing.

Next, we can look at how the selling price relates to SaleCondition. We do not have detailed information about its categories, though.

sp_pivot = train.pivot_table(index='SaleCondition', values='SalePrice', aggfunc=np.median)
sp_pivot
SaleCondition
Abnorml    130000
AdjLand    104000
Alloca     148145
Family     140500
Normal     160000
Partial    244600
Name: SalePrice, dtype: int64

sp_pivot.plot(kind='bar', color='red')

[Plot: median SalePrice for each SaleCondition category]

Are you seeing what I see? Machine Learning with Python can get tricky sometimes! The Partial sale condition is way above the rest. At this point, it is hard to generate more insights just by eyeballing the data, so we can use the ANOVA test to understand how our target variable relates to the categorical variables. Now, what is ANOVA? It is a simple statistical test used to check whether the means of two or more groups differ significantly. Here is how it applies to our case:
Take a categorical variable and split SalePrice into one group per category level. The ANOVA test tells us whether the mean sale price is essentially the same across all the levels. If it is, that categorical variable carries little information about the target, and we can safely set it aside.
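
To make the test concrete, here is a minimal, standalone sketch of a one-way ANOVA with scipy on three hypothetical groups of sale prices; the numbers are made up for illustration.

#one-way ANOVA on three made-up groups
from scipy import stats

group_a = [150000, 160000, 155000, 158000]
group_b = [210000, 220000, 205000, 215000]
group_c = [152000, 162000, 149000, 157000]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print("F-statistic:", f_stat, "p-value:", p_value)
#a small p-value suggests at least one group mean differs, i.e., the grouping
#variable is informative about the target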


Next on this Machine Learning with Python blog, let us create a function which computes the p-value of this test for every categorical feature. From the p-values we derive a 'disparity' score: the higher this score, the more useful the feature is for predicting the overall sale price.

cat = [f for f in train.columns if train.dtypes[f] == 'object']

def anova(frame):
    anv = pd.DataFrame()
    anv['features'] = cat
    pvals = []
    for c in cat:
        samples = []
        for cls in frame[c].unique():
            s = frame[frame[c] == cls]['SalePrice'].values
            samples.append(s)
        pval = stats.f_oneway(*samples)[1]
        pvals.append(pval)
    anv['pval'] = pvals
    return anv.sort_values('pval')

cat_data['SalePrice'] = train.SalePrice.values
k = anova(cat_data)
k['disparity'] = np.log(1./k['pval'].values)
sns.barplot(data=k, x='features', y='disparity')
plt.xticks(rotation=90)
plt.show()

[Plot: disparity score for each categorical feature]

From the above plot, it is easy to see that the Neighborhood variable turns out to be one of the most important features. In other words, buyers place a lot of importance on the neighborhood, along with things like the quality of the materials and finishes.

The next major step to achieve Machine Learning with Python is to process the obtained data efficiently.


Step 5: Data Processing

This is a vital step in achieving Machine Learning with Python because, here, we will be dealing with those outliers first-hand. There are a couple of other things we will look at as well, such as encoding variables, imputing missing values, and doing our best to remove redundancy and clear out unwarranted inconsistencies from the dataset. Do you remember the outliers we spotted in the living area variable (GrLivArea) earlier? Let us make sure we remove them; it is really easy.

#removing outliers
train.drop(train[train['GrLivArea'] > 4000].index, inplace=True)
train.shape #removed 4 rows
(1456, 81)

Well, in row 666 of our test dataset, the garage-related values happen to be missing. We can impute those too.

#imputing using mode
test.loc[666, 'GarageQual'] = "TA" #stats.mode(test['GarageQual']).mode
test.loc[666, 'GarageCond'] = "TA" #stats.mode(test['GarageCond']).mode
test.loc[666, 'GarageFinish'] = "Unf" #stats.mode(test['GarageFinish']).mode
test.loc[666, 'GarageYrBlt'] = 1980 #np.nanmedian(test['GarageYrBlt'])

Next, we can encode all of the categorical variables. Why do we need this? Because most Machine Learning algorithms cannot work with string-valued categorical variables directly; they expect numbers. We make use of Sklearn, one of the most important libraries for Machine Learning with Python, to encode the variables. Specifically, we use its LabelEncoder class.

Here’s the function to do that:

#importing function
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

def factorize(data, var, fill_na=None):
    if fill_na is not None:
        data[var].fillna(fill_na, inplace=True)
    le.fit(data[var])
    data[var] = le.transform(data[var])
    return data

The above function first replaces missing levels with the fill value you pass in (typically the mode of that column) and then label-encodes the column. Note that you need to supply that fill value manually.

Next, we can look at the LotFrontage variable. (You don't need to know every variable when you work with Machine Learning with Python, but it is recommended that you do.) The data exploration step told us we need to impute values here too, so it is very important that you give yourself some time and put in a decent effort during data exploration. To do this, we will combine our training and test datasets, so we can modify the values in both at the same time. We all love to save some time, don't we?

Use the following code snippet:

#combine the data set
alldata = train.append(test)
alldata.shape
(2915, 81)

So this dataset has 2,915 rows and 81 columns as we can see. Let’s go about imputing the LotFrontage variable now.

#impute lotfrontage by median of neighborhood
lot_frontage_by_neighborhood = train['LotFrontage'].groupby(train['Neighborhood'])

for key, group in lot_frontage_by_neighborhood:
    idx = (alldata['Neighborhood'] == key) & (alldata['LotFrontage'].isnull())
    alldata.loc[idx, 'LotFrontage'] = group.median()

For the other numeric variables, we can impute the missing values with zero. That should take care of them!

#imputing missing values
alldata["MasVnrArea"].fillna(0, inplace=True)
alldata["BsmtFinSF1"].fillna(0, inplace=True)
alldata["BsmtFinSF2"].fillna(0, inplace=True)
alldata["BsmtUnfSF"].fillna(0, inplace=True)
alldata["TotalBsmtSF"].fillna(0, inplace=True)
alldata["GarageArea"].fillna(0, inplace=True)
alldata["BsmtFullBath"].fillna(0, inplace=True)
alldata["BsmtHalfBath"].fillna(0, inplace=True)
alldata["GarageCars"].fillna(0, inplace=True)
alldata["GarageYrBlt"].fillna(0.0, inplace=True)
alldata["PoolArea"].fillna(0, inplace=True)

Anything with 'qual' or 'quality' in the variable name can be treated as an ordered (ordinal) variable. So now it is time to convert these categorical variables into ordered numeric ones. How do we go about it? Start by creating a simple dictionary of key-value pairs and then use it to map the values in the dataset.

Check this out:

qual_dict = {np.nan: 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
name = np.array(['ExterQual','PoolQC' ,'ExterCond','BsmtQual','BsmtCond','HeatingQC','KitchenQual','FireplaceQu', 'GarageQual','GarageCond'])

for i in name:
    alldata[i] = alldata[i].map(qual_dict).astype(int)

alldata["BsmtExposure"] = alldata["BsmtExposure"].map({np.nan: 0, "No": 1, "Mn": 2, "Av": 3, "Gd": 4}).astype(int)

bsmt_fin_dict = {np.nan: 0, "Unf": 1, "LwQ": 2, "Rec": 3, "BLQ": 4, "ALQ": 5, "GLQ": 6}
alldata["BsmtFinType1"] = alldata["BsmtFinType1"].map(bsmt_fin_dict).astype(int)
alldata["BsmtFinType2"] = alldata["BsmtFinType2"].map(bsmt_fin_dict).astype(int)
alldata["Functional"] = alldata["Functional"].map({np.nan: 0, "Sal": 1, "Sev": 2, "Maj2": 3, "Maj1": 4, "Mod": 5, "Min2": 6, "Min1": 7, "Typ": 8}).astype(int)

alldata["GarageFinish"] = alldata["GarageFinish"].map({np.nan: 0, "Unf": 1, "RFn": 2, "Fin": 3}).astype(int)
alldata["Fence"] = alldata["Fence"].map({np.nan: 0, "MnWw": 1, "GdWo": 2, "MnPrv": 3, "GdPrv": 4}).astype(int)

#encoding data
alldata["CentralAir"] = (alldata["CentralAir"] == "Y") * 1.0
varst = np.array(['MSSubClass','LotConfig','Neighborhood','Condition1','BldgType','HouseStyle','RoofStyle','Foundation','SaleCondition'])

for x in varst:
    factorize(alldata, x)

#encode variables and impute missing values
alldata = factorize(alldata, "MSZoning", "RL")
alldata = factorize(alldata, "Exterior1st", "Other")
alldata = factorize(alldata, "Exterior2nd", "Other")
alldata = factorize(alldata, "MasVnrType", "None")
alldata = factorize(alldata, "SaleType", "Oth")`

Next up on this Machine Learning with Python blog is one very important aspect of Data Science: feature engineering.


Step 6: Feature Engineering

Feature engineering is a domain which requires hands-on experience, good domain knowledge, and a fair bit of creativity. Data exploration usually supplies the ideas for new features. The basic idea is to create new features which help the algorithm make faster and better predictions.

We are well on our way to achieving Machine Learning with Python. We already have 81 features; let's create some new and creative ones. We're already aware that many of the categorical variables are dominated by a single level, so we can create binary (1/0) features that simply flag whether a row has that dominant level or not. We shall make use of comments so you can keep track of what is going on.

#creating new variable (1 or 0) based on irregular count levels
#The level with highest count is kept as 1 and rest as 0
alldata["IsRegularLotShape"] = (alldata["LotShape"] == "Reg") * 1
alldata["IsLandLevel"] = (alldata["LandContour"] == "Lvl") * 1
alldata["IsLandSlopeGentle"] = (alldata["LandSlope"] == "Gtl") * 1
alldata["IsElectricalSBrkr"] = (alldata["Electrical"] == "SBrkr") * 1
alldata["IsGarageDetached"] = (alldata["GarageType"] == "Detchd") * 1
alldata["IsPavedDrive"] = (alldata["PavedDrive"] == "Y") * 1
alldata["HasShed"] = (alldata["MiscFeature"] == "Shed") * 1
alldata["Remodeled"] = (alldata["YearRemodAdd"] != alldata["YearBuilt"]) * 1

#Did the modeling happen during the sale year?
alldata["RecentRemodel"] = (alldata["YearRemodAdd"] == alldata["YrSold"]) * 1

# Was this house sold in the year it was built?
alldata["VeryNewHouse"] = (alldata["YearBuilt"] == alldata["YrSold"]) * 1
alldata["Has2ndFloor"] = (alldata["2ndFlrSF"] == 0) * 1
alldata["HasMasVnr"] = (alldata["MasVnrArea"] == 0) * 1
alldata["HasWoodDeck"] = (alldata["WoodDeckSF"] == 0) * 1
alldata["HasOpenPorch"] = (alldata["OpenPorchSF"] == 0) * 1
alldata["HasEnclosedPorch"] = (alldata["EnclosedPorch"] == 0) * 1
alldata["Has3SsnPorch"] = (alldata["3SsnPorch"] == 0) * 1
alldata["HasScreenPorch"] = (alldata["ScreenPorch"] == 0) * 1

#setting levels with high count as 1 and the rest as 0
#you can check for them using the value_counts function
alldata["HighSeason"] = alldata["MoSold"].replace(` `{1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0})
alldata["NewerDwelling"] = alldata["MSSubClass"].replace(` `{20: 1, 30: 0, 40: 0, 45: 0,50: 0, 60: 1, 70: 0, 75: 0, 80: 0, 85: 0,` `90: 0, 120: 1, 150: 0, 160: 0, 180: 0, 190: 0})

With that out of the way, we can take a look at the number of columns we got from that:

alldata.shape
(2915, 100)

What is this telling us? Simple: we now have 100 features in the dataset, so we've created 19 more than the original 81. We also need another combined copy of the original training and test data, so let us create it. What is special about this new DataFrame? It keeps the original (unencoded) feature values, which makes it easier to create more features down the line. Sounds good, right?

#create alldata2
alldata2 = train.append(test)

alldata["SaleCondition_PriceDown"] = alldata2.SaleCondition.replace({'Abnorml': 1, 'Alloca': 1, 'AdjLand': 1, 'Family': 1, 'Normal': 0, 'Partial': 0})

# house completed before sale or not
alldata["BoughtOffPlan"] = alldata2.SaleCondition.replace({"Abnorml": 0, "Alloca": 0, "AdjLand": 0, "Family": 0, "Normal": 0, "Partial": 1})
alldata["BadHeating"] = alldata2.HeatingQC.replace({'Ex': 0, 'Gd': 0, 'TA': 0, 'Fa': 1, 'Po': 1})

Just like the categorical variables, several columns are associated with the property area. We can combine those into total-area features, and we can also create new features based on the year the house was built and sold.

Let’s give that a spin:

#calculating total area using all area columns
area_cols = ['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'LowQualFinSF', 'PoolArea']

alldata["TotalArea"] = alldata[area_cols].sum(axis=1)
alldata["TotalArea1st2nd"] = alldata["1stFlrSF"] + alldata["2ndFlrSF"]
alldata["Age"] = 2010 - alldata["YearBuilt"]
alldata["TimeSinceSold"] = 2010 - alldata["YrSold"]
alldata["SeasonSold"] = alldata["MoSold"].map({12:0, 1:0, 2:0, 3:1, 4:1, 5:1, 6:2, 7:2, 8:2, 9:3, 10:3, 11:3}).astype(int)
alldata["YearsSinceRemodel"] = alldata["YrSold"] - alldata["YearRemodAdd"]

# Simplifications of existing features into bad/average/good based on counts
alldata["SimplOverallQual"] = alldata.OverallQual.replace({1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2, 6 : 2, 7 : 3, 8 : 3, 9 : 3, 10 : 3})
alldata["SimplOverallCond"] = alldata.OverallCond.replace({1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2, 6 : 2, 7 : 3, 8 : 3, 9 : 3, 10 : 3})
alldata["SimplPoolQC"] = alldata.PoolQC.replace({1 : 1, 2 : 1, 3 : 2, 4 : 2})
alldata["SimplGarageCond"] = alldata.GarageCond.replace({1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2})
alldata["SimplGarageQual"] = alldata.GarageQual.replace({1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2})
alldata["SimplFireplaceQu"] = alldata.FireplaceQu.replace({1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2})
alldata["SimplFireplaceQu"] = alldata.FireplaceQu.replace({1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2})
alldata["SimplFunctional"] = alldata.Functional.replace({1 : 1, 2 : 1, 3 : 2, 4 : 2, 5 : 3, 6 : 3, 7 : 3, 8 : 4})
alldata["SimplKitchenQual"] = alldata.KitchenQual.replace({1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2})
alldata["SimplHeatingQC"] = alldata.HeatingQC.replace({1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2})
alldata["SimplBsmtFinType1"] = alldata.BsmtFinType1.replace({1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2, 6 : 2})
alldata["SimplBsmtFinType2"] = alldata.BsmtFinType2.replace({1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2, 6 : 2})
alldata["SimplBsmtCond"] = alldata.BsmtCond.replace({1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2})
alldata["SimplBsmtQual"] = alldata.BsmtQual.replace({1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2})
alldata["SimplExterCond"] = alldata.ExterCond.replace({1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2})
alldata["SimplExterQual"] = alldata.ExterQual.replace({1 : 1, 2 : 1, 3 : 1, 4 : 2, 5 : 2})

#grouping neighborhood variable based on this plot
train['SalePrice'].groupby(train['Neighborhood']).median().sort_values().plot(kind='bar')

[Plot: median SalePrice per Neighborhood, sorted in ascending order]

From the above graph, you can get a decent idea about combining the levels of the neighborhood variable into something much smaller. Let’s go ahead and implement this—combining bars of equivalent height (almost) into a single category. How do we go about this? Simple, begin by creating a dictionary and map the values from that into the variables.

neighborhood_map = {"MeadowV": 0, "IDOTRR": 1, "BrDale": 1, "OldTown": 1, "Edwards": 1, "BrkSide": 1, "Sawyer": 1, "Blueste": 1, "SWISU": 2, "NAmes": 2, "NPkVill": 2, "Mitchel": 2, "SawyerW": 2, "Gilbert": 2, "NWAmes": 2, "Blmngtn": 2, "CollgCr": 2, "ClearCr": 3, "Crawfor": 3, "Veenker": 3, "Somerst": 3, "Timber": 3, "StoneBr": 4, "NoRidge": 4, "NridgHt": 4}

alldata['NeighborhoodBin'] = alldata2['Neighborhood'].map(neighborhood_map)
alldata.loc[alldata2.Neighborhood == 'NridgHt', "Neighborhood_Good"] = 1
alldata.loc[alldata2.Neighborhood == 'Crawfor', "Neighborhood_Good"] = 1
alldata.loc[alldata2.Neighborhood == 'StoneBr', "Neighborhood_Good"] = 1
alldata.loc[alldata2.Neighborhood == 'Somerst', "Neighborhood_Good"] = 1
alldata.loc[alldata2.Neighborhood == 'NoRidge', "Neighborhood_Good"] = 1
alldata["Neighborhood_Good"].fillna(0, inplace=True)
alldata["SaleCondition_PriceDown"] = alldata2.SaleCondition.replace({'Abnorml': 1, 'Alloca': 1, 'AdjLand': 1, 'Family': 1, 'Normal': 0, 'Partial': 0})

# House completed before sale or not
alldata["BoughtOffPlan"] = alldata2.SaleCondition.replace({"Abnorml" : 0, "Alloca" : 0, "AdjLand" : 0, "Family" : 0, "Normal" : 0, "Partial" : 1})
alldata["BadHeating"] = alldata2.HeatingQC.replace({'Ex': 0, 'Gd': 0, 'TA': 0, 'Fa': 1, 'Po': 1})
alldata.shape
(2915, 124)

Well, by now we've added about 43 new (and exciting) features to the dataset. Why not add a few more? A prerequisite for that is to split the combined data back into the training and test datasets.

#create new data
train_new = alldata[alldata['SalePrice'].notnull()]
test_new = alldata[alldata['SalePrice'].isnull()]

print('Train', train_new.shape)
print('----------------')
print('Test', test_new.shape)

Train (1456, 126)
----------------
Test (1459, 126)

What is the first thing we do when we add features? We remove the skew.

#get numeric features
numeric_features = [f for f in train_new.columns if train_new[f].dtype != object]

#transform the numeric features using log(x + 1)
from scipy.stats import skew
skewed = train_new[numeric_features].apply(lambda x: skew(x.dropna().astype(float)))
skewed = skewed[skewed > 0.75]
skewed = skewed.index
train_new[skewed] = np.log1p(train_new[skewed])
test_new[skewed] = np.log1p(test_new[skewed])
del test_new['SalePrice']

Next, we can go about standardizing the numeric features:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(train_new[numeric_features])
scaled = scaler.transform(train_new[numeric_features])

for i, col in enumerate(numeric_features):
    train_new[col] = scaled[:, i]

numeric_features.remove('SalePrice')
scaled = scaler.fit_transform(test_new[numeric_features])

for i, col in enumerate(numeric_features):
    test_new[col] = scaled[:, i]

Once that is done, we can encode the categorical variables that we spoke about a while ago. Here is a function that does just that:

def onehot(onehot_df, df, column_name, fill_na):
    onehot_df[column_name] = df[column_name]
    if fill_na is not None:
        onehot_df[column_name].fillna(fill_na, inplace=True)

    dummies = pd.get_dummies(onehot_df[column_name], prefix="_" + column_name)
    onehot_df = onehot_df.join(dummies)
    onehot_df = onehot_df.drop([column_name], axis=1)
    return onehot_df

def munge_onehot(df):
    onehot_df = pd.DataFrame(index=df.index)

    onehot_df = onehot(onehot_df, df, "MSSubClass", None)
    onehot_df = onehot(onehot_df, df, "MSZoning", "RL")
    onehot_df = onehot(onehot_df, df, "LotConfig", None)
    onehot_df = onehot(onehot_df, df, "Neighborhood", None)
    onehot_df = onehot(onehot_df, df, "Condition1", None)
    onehot_df = onehot(onehot_df, df, "BldgType", None)
    onehot_df = onehot(onehot_df, df, "HouseStyle", None)
    onehot_df = onehot(onehot_df, df, "RoofStyle", None)
    onehot_df = onehot(onehot_df, df, "Exterior1st", "VinylSd")
    onehot_df = onehot(onehot_df, df, "Exterior2nd", "VinylSd")
    onehot_df = onehot(onehot_df, df, "Foundation", None)
    onehot_df = onehot(onehot_df, df, "SaleType", "WD")
    onehot_df = onehot(onehot_df, df, "SaleCondition", "Normal")

    #Fill in missing MasVnrType for rows that do have a MasVnrArea.
    temp_df = df[["MasVnrType", "MasVnrArea"]].copy()
    idx = (df["MasVnrArea"] != 0) & ((df["MasVnrType"] == "None") | (df["MasVnrType"].isnull()))
    temp_df.loc[idx, "MasVnrType"] = "BrkFace"
    onehot_df = onehot(onehot_df, temp_df, "MasVnrType", "None")

    onehot_df = onehot(onehot_df, df, "LotShape", None)
    onehot_df = onehot(onehot_df, df, "LandContour", None)
    onehot_df = onehot(onehot_df, df, "LandSlope", None)
    onehot_df = onehot(onehot_df, df, "Electrical", "SBrkr")
    onehot_df = onehot(onehot_df, df, "GarageType", "None")
    onehot_df = onehot(onehot_df, df, "PavedDrive", None)
    onehot_df = onehot(onehot_df, df, "MiscFeature", "None")
    onehot_df = onehot(onehot_df, df, "Street", None)
    onehot_df = onehot(onehot_df, df, "Alley", "None")
    onehot_df = onehot(onehot_df, df, "Condition2", None)
    onehot_df = onehot(onehot_df, df, "RoofMatl", None)
    onehot_df = onehot(onehot_df, df, "Heating", None)

    # we'll have these as numerical variables too
    onehot_df = onehot(onehot_df, df, "ExterQual", "None")
    onehot_df = onehot(onehot_df, df, "ExterCond", "None")
    onehot_df = onehot(onehot_df, df, "BsmtQual", "None")
    onehot_df = onehot(onehot_df, df, "BsmtCond", "None")
    onehot_df = onehot(onehot_df, df, "HeatingQC", "None")
    onehot_df = onehot(onehot_df, df, "KitchenQual", "TA")
    onehot_df = onehot(onehot_df, df, "FireplaceQu", "None")
    onehot_df = onehot(onehot_df, df, "GarageQual", "None")
    onehot_df = onehot(onehot_df, df, "GarageCond", "None")
    onehot_df = onehot(onehot_df, df, "PoolQC", "None")
    onehot_df = onehot(onehot_df, df, "BsmtExposure", "None")
    onehot_df = onehot(onehot_df, df, "BsmtFinType1", "None")
    onehot_df = onehot(onehot_df, df, "BsmtFinType2", "None")
    onehot_df = onehot(onehot_df, df, "Functional", "Typ")
    onehot_df = onehot(onehot_df, df, "GarageFinish", "None")
    onehot_df = onehot(onehot_df, df, "Fence", "None")
    onehot_df = onehot(onehot_df, df, "MoSold", None)

    # Divide the years between 1871 and 2010 into slices of 20 years
    year_map = pd.concat(pd.Series("YearBin" + str(i+1), index=range(1871+i*20, 1891+i*20)) for i in range(0, 7))
    yearbin_df = pd.DataFrame(index=df.index)
    yearbin_df["GarageYrBltBin"] = df.GarageYrBlt.map(year_map)
    yearbin_df["GarageYrBltBin"].fillna("NoGarage", inplace=True)
    yearbin_df["YearBuiltBin"] = df.YearBuilt.map(year_map)
    yearbin_df["YearRemodAddBin"] = df.YearRemodAdd.map(year_map)

    onehot_df = onehot(onehot_df, yearbin_df, "GarageYrBltBin", None)
    onehot_df = onehot(onehot_df, yearbin_df, "YearBuiltBin", None)
    onehot_df = onehot(onehot_df, yearbin_df, "YearRemodAddBin", None)
    return onehot_df

#create one-hot features
onehot_df = munge_onehot(train)

neighborhood_train = pd.DataFrame(index=train_new.index)
neighborhood_train['NeighborhoodBin'] = train_new['NeighborhoodBin']
neighborhood_test = pd.DataFrame(index=test_new.index)
neighborhood_test['NeighborhoodBin'] = test_new['NeighborhoodBin']

onehot_df = onehot(onehot_df, neighborhood_train, 'NeighborhoodBin', None)

Joining these one-hot features to the training data:

train_new = train_new.join(onehot_df) 
train_new.shape
(1456, 433)

I am sure you did not expect a 433-column output! Now let us do the same for the test dataset. Follow along:

#adding one hot features to test
onehot_df_te = munge_onehot(test)
onehot_df_te = onehot(onehot_df_te, neighborhood_test, "NeighborhoodBin", None)
test_new = test_new.join(onehot_df_te)
test_new.shape
(1459, 417)

What I want you to track closely is the difference in the number of columns between the training dataset (433) and the test dataset (417). Why keep columns that only exist on one side? Let's get rid of them and maintain an equal number of columns in both the training and test datasets.

#dropping some columns from the train data as they are not found in test
drop_cols = ["_Exterior1st_ImStucc", "_Exterior1st_Stone","_Exterior2nd_Other","_HouseStyle_2.5Fin","_RoofMatl_Membran", "_RoofMatl_Metal", "_RoofMatl_Roll", "_Condition2_RRAe", "_Condition2_RRAn", "_Condition2_RRNn", "_Heating_Floor", "_Heating_OthW", "_Electrical_Mix", "_MiscFeature_TenC", "_GarageQual_Ex", "_PoolQC_Fa"]
train_new.drop(drop_cols, axis=1, inplace=True)
train_new.shape
(1456, 417)

That's nice to look at: an equal number of columns in both datasets. Why stop here? Let's remove a few more columns that don't make sense for our application, namely columns that are almost entirely zeros; they carry very little signal for the algorithm to learn from.

#removing a column that appears only in the test data
test_new.drop(["_MSSubClass_150"], axis=1, inplace=True)

# Drop these columns
drop_cols = ["_Condition2_PosN", # only two are not zero
"_MSZoning_C (all)",
"_MSSubClass_160"]

train_new.drop(drop_cols, axis=1, inplace=True)
test_new.drop(drop_cols, axis=1, inplace=True)

Time to finally apply some transformations on the SalePrice variable and to put it in its own fancy array!

#create a label set
label_df = pd.DataFrame(index = train_new.index, columns = ['SalePrice'])
label_df['SalePrice'] = np.log(train['SalePrice'])
print("Training set size:", train_new.shape)
print("Test set size:", test_new.shape)

Training set size: (1456, 414)
Test set size: (1459, 413)

That was very interesting, wasn’t it? To achieve Machine Learning with Python, we actually need to train the model. That is exactly what we’re up to next!

Step 7: Training the Model

Fellow programmers, rejoice! The data is ready, and the time has come to train our models. For this project, we will use three algorithms: XGBoost, Lasso regression, and a neural network. All three will then help us generate the required predictions.

Let us begin with the XGBoost algorithm. It has been dominating the world of Machine Learning lately; it offers a portable and flexible implementation and works particularly well on structured, tabular data.

Implementing XGBoost:

import xgboost as xgb

regr = xgb.XGBRegressor(colsample_bytree=0.2,
                        gamma=0.0,
                        learning_rate=0.05,
                        max_depth=6,
                        min_child_weight=1.5,
                        n_estimators=7200,
                        reg_alpha=0.9,
                        reg_lambda=0.6,
                        subsample=0.2,
                        seed=42,
                        silent=1)

#drop the target column ('SalePrice') so it is not used as a feature
regr.fit(train_new.drop(['SalePrice'], axis=1), label_df)

These hyperparameter values were arrived at through cross-validation; a short sketch of how one candidate setting can be scored is shown below. After that, we implement an RMSE (Root Mean Squared Error) function to check how well the model actually holds up.
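
Here is a minimal sketch of scoring a single candidate setting with scikit-learn's cross_val_score; it reuses the xgb, train_new, and label_df objects defined above, and the parameter values are illustrative only, not the ones used above.

#scoring one candidate hyperparameter setting with 5-fold cross-validation
from sklearn.model_selection import cross_val_score

candidate = xgb.XGBRegressor(learning_rate=0.05, max_depth=4, n_estimators=500)
X = train_new.drop(['SalePrice'], axis=1)
y = label_df['SalePrice']

neg_mse = cross_val_score(candidate, X, y, scoring='neg_mean_squared_error', cv=5)
print("CV RMSE:", np.sqrt(-neg_mse).mean())
#repeating this over a grid of settings (or using GridSearchCV) is how values
#like the ones above can be chosen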

from sklearn.metrics import mean_squared_error

def rmse(y_test, y_pred):
    return np.sqrt(mean_squared_error(y_test, y_pred))

# run prediction on training set to get an idea of how well it does
y_pred = regr.predict(train_new.drop(['SalePrice'], axis=1))
y_test = label_df
print("XGBoost score on training set: ", rmse(y_test, y_pred))

XGBoost score on training set:  0.037633322832013358

# make prediction on test set
y_pred_xgb = regr.predict(test_new)

#submit this prediction and get the score
pred1 = pd.DataFrame({'Id': test['Id'], 'SalePrice': np.exp(y_pred_xgb)})
pred1.to_csv('xgbnono.csv', header=True, index=False)

Submitting this prediction gives an RMSE score of about 0.12507. Good enough, but let's go ahead and use the Lasso model now.

Lasso regression is a type of linear regression built around the concept of shrinkage: coefficient estimates are pulled toward zero, and some are set exactly to zero. This produces simple, sparse models, which works well when only a subset of the features really matters.
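
Here is a small, self-contained sketch of that shrinkage effect on synthetic data; only two of the five features are informative, and larger alpha values push more coefficients to exactly zero. Everything in it is illustrative.

#shrinkage in action: growing alpha zeroes out more coefficients
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)  #only 2 informative features

for alpha in [0.001, 0.1, 1.0]:
    coefs = Lasso(alpha=alpha, max_iter=50000).fit(X, y).coef_
    print("alpha =", alpha, "->", np.round(coefs, 3))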

Implementing Lasso Regression:

from sklearn.linear_model import Lasso

#found this best alpha through cross-validation
best_alpha = 0.00099

regr = Lasso(alpha=best_alpha, max_iter=50000)
regr.fit(train_new.drop(['SalePrice'], axis=1), label_df)  #drop the target column before fitting

# run prediction on the training set to get a rough idea of how well it does
y_pred = regr.predict(train_new.drop(['SalePrice'], axis=1))
y_test = label_df
print("Lasso score on training set: ", rmse(y_test, y_pred))

Lasso score on training set:  0.10175440647797629
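
As a side note, here is a minimal sketch of how an alpha like the one above could be found automatically with cross-validation, using scikit-learn's LassoCV; the search grid is illustrative, and it reuses the train_new and label_df objects from above.

#searching for alpha with LassoCV
from sklearn.linear_model import LassoCV

X = train_new.drop(['SalePrice'], axis=1)
y = label_df['SalePrice']

lasso_cv = LassoCV(alphas=np.logspace(-4, -1, 30), max_iter=50000, cv=5)
lasso_cv.fit(X, y)
print("Best alpha found by cross-validation:", lasso_cv.alpha_)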

There is a very good chance that this model might outperform XGBoost on the test-set RMSE, despite the higher training error. Let's see:

#make prediction on the test set
y_pred_lasso = regr.predict(test_new)
lasso_ex = np.exp(y_pred_lasso)
pred1 = pd.DataFrame({'Id': test['Id'], 'SalePrice': lasso_ex})
pred1.to_csv('lasso_model.csv', header=True, index=False)

So, we got a score of 0.11859! Lasso outperformed XGBoost as mentioned. Machine Learning with Python never fails to amuse us! Since the data has a large number of features, let us get our hands messy and build a neural network too. Keras, here we come!

Implementing a Neural Network using Keras:

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.preprocessing import StandardScaler

np.random.seed(10)

#create Model
#define base model
def base_model():
    model = Sequential()
    # input_dim must match the number of feature columns in X_train
    model.add(Dense(20, input_dim=X_train.shape[1], kernel_initializer='normal', activation='relu'))
    model.add(Dense(10, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

seed = 7
np.random.seed(seed)

scale = StandardScaler()
X_train = scale.fit_transform(train_new.drop(['SalePrice'], axis=1))  #drop the target column
X_test = scale.fit_transform(test_new)

keras_label = label_df.values
clf = KerasRegressor(build_fn=base_model, epochs=1000, batch_size=5, verbose=0)
clf.fit(X_train, keras_label)

#make predictions and create the submission file 
kpred = clf.predict(X_test) 
kpred = np.exp(kpred)
pred_df = pd.DataFrame(kpred, index=test["Id"], columns=["SalePrice"]) 
pred_df.to_csv('keras1.csv', header=True, index_label='Id')

When executed, this gives an RMSE of 1.35345 on submission. It looks like our earlier models performed better.

Looks like we have finally achieved the goal we initially set out with. Let us conclude this Machine Learning with Python blog now.


Conclusion

The best way to go about mastering Machine Learning with Python is to begin working on Python projects and using the project-oriented approach on the whole. The number of applications of Python and of Machine Learning with Python is really HUGE. No wonder they both have acquired a learner/user base of millions!

I hope this Machine Learning with Python tutorial blog helped you get an overall picture of working with an actual dataset and training an algorithm to perform a simple task. If you have any questions, put them below in the comment section, I’d be happy to help you out!

 
