# Machine Learning with Python

- Updated on: 27th Aug, 19


Python is a powerful interpreted language that is widely used in software development, research, and many other fields. It has gained a lot of traction among learners, researchers, and enthusiasts alike.

Since Python gives you ready access to a huge repository of libraries and frameworks, plotting a roadmap for learning Machine Learning with Python can get a little overwhelming. The recommended way to master **Machine Learning with Python** is to implement projects first-hand and play around with the code. This holds true not just for beginners but for experienced professionals too.


The overall goal here is to show you how to build your first project in Machine Learning with Python.

**How does Machine Learning work?**

There is a well-known quote: **‘Mess around with data so much that it eventually lets you in on its juicy secrets.’** Well, **Machine Learning** does just this! You might feel that Machine Learning is all about fiddling with some data and pushing it into an algorithm to get insights (converting data into useful information). These algorithms output predictions or models that you can then use for further analysis. As a learner, though, there is a certain flow you should follow, and it is what separates how an experienced programmer works from how a newcomer gets up to speed with Machine Learning with Python.

*Major steps involved in understanding Machine Learning*

Let us have a look at this so-called Machine Learning ‘secret sauce’:

**Figuring out the problem:** The first thing you must always do is identify the problem. Once that is done, you can work out the steps needed to solve it.

**Generating the Hypothesis:** Here you develop an understanding of the features that you expect to matter and plan to use in the project.

**Data Gathering:** This step involves getting the data onto your machine, whether by creating it, downloading it, or assembling it from other sources.

**Data Understanding:** Ask yourself whether you can make sense of the data just by looking at it. If you can, move on to the steps below! This step involves diving deep into the data and exploring it.

**Data Processing:** This step cleans up the data through a variety of operations, such as removing whitespace and tab characters and formatting date-time inputs.

**Feature Engineering:** At this point, you have the dataset ready. Now, you add new features to the dataset that you’re working with. An experienced programmer might already have done this in Step 5 itself!

**Training the Model:** We use an algorithm to train a model on the dataset we’ve been working with.

**Evaluating the Model:** We use an error metric to keep track of the algorithm and the learning process. This involves prioritizing variables, removing clutter, and similar procedures. Based on the outcome, you can retrain the model iteratively.

**Testing the Model:** Finding out whether the model is a success is straightforward: show it some data which it has never seen before. We call this the testing dataset, and it tells us whether the model is accurate and efficient at its job.
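
The last three steps (training, evaluating, testing) can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the data is synthetic, the model is a one-feature least-squares line, and `fit_line` and `rmse` are hypothetical helper names, not part of any library.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

def rmse(xs, ys, a, b):
    """Root mean squared error of the fitted line on (xs, ys)."""
    return math.sqrt(sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

# Synthetic data (an assumption for illustration): price = 100 * area + 5000 + noise
rng = random.Random(0)
areas = [rng.uniform(50, 250) for _ in range(200)]
prices = [100 * x + 5000 + rng.gauss(0, 1000) for x in areas]

# Hold out the last 20% as the test set; the model never sees it during training
split = int(0.8 * len(areas))
train_x, test_x = areas[:split], areas[split:]
train_y, test_y = prices[:split], prices[split:]

a, b = fit_line(train_x, train_y)            # training
train_error = rmse(train_x, train_y, a, b)   # evaluation with an error metric
test_error = rmse(test_x, test_y, a, b)      # testing on unseen data

print(f"slope={a:.1f}, intercept={b:.1f}")
print(f"train RMSE={train_error:.0f}, test RMSE={test_error:.0f}")
```

If the test error is far larger than the training error, the model has probably memorized the training data rather than learned a pattern that generalizes.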

Now that we know the recommended procedure to kick-start your Machine Learning with Python journey, let’s go through the steps individually. This carefully curated set of steps will help you in a big way in this project.

Let’s look at how we can use the above set of steps to implement Machine Learning with Python.

**Steps to Implement Machine Learning**

**Step 1: Figuring out the problem**

Like most students who have just been through an introduction to Machine Learning, we will use a dataset from Kaggle. It is a fairly simple dataset that is used to predict the prices of houses in a residential area in Iowa, USA.

Understanding the problem statement is the most important aspect of achieving high efficiency and good results when you go about working with Machine Learning.

Since this first step toward your goal of mastering Machine Learning with Python is quite simple to figure out, let’s move on to the next step quickly!

**Step 2: Generating the Hypothesis**

This is a nice step which gets you thinking. Look around you. Can you figure out what the major factors are which have a huge say in defining the price of a house? (You could go ahead and type in some of the factors in the comment box below and we could discuss more on that!)

If you’re looking to get all official and dive deep into hypothesis development, there are two kinds of hypotheses to define: the null hypothesis and the alternative hypothesis.

So, what is a Null Hypothesis?

**Null Hypothesis**: A null hypothesis is a type of hypothesis used in statistics that proposes that no statistically significant relationship exists in a set of given observations. The null hypothesis attempts to show that no variation exists between variables or that a single variable is no different from its mean. In this case, it states that a given feature has no notable impact on the dependent variable.

Next, we’ll take a look at the alternative hypothesis.

**Alternative Hypothesis:** The alternative hypothesis is the hypothesis used in testing that is contrary to the null hypothesis. It is usually taken to be that the observations are the result of a real effect (with some amount of chance variation superposed). In this case, it states that a given feature has a direct impact on the dependent variable.
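
One way to see the two hypotheses in action is a permutation test, sketched below in plain Python. It is not something the rest of this blog relies on; the data is synthetic, and `pearson` and `permutation_p_value` are hypothetical helper names. Shuffling the target simulates the null hypothesis (no relationship), and the p-value is the share of shuffles that look at least as correlated as the real data.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Share of shuffles whose |correlation| is at least the observed one."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # shuffling breaks any real link between xs and ys
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)

rng = random.Random(1)
age = [rng.uniform(0, 60) for _ in range(100)]
price = [300000 - 2000 * a + rng.gauss(0, 20000) for a in age]  # depends on age
noise = [rng.gauss(0, 1) for _ in age]                          # unrelated to age

p_age = permutation_p_value(age, price)
p_noise = permutation_p_value(age, noise)
print(f"p-value for age vs price: {p_age:.4f}")
print(f"p-value for age vs noise: {p_noise:.4f}")
```

A small p-value is evidence against the null hypothesis; a large one means the data gives no reason to abandon it.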

Here are some of the factors that I think have an important hand in dictating the price of the houses:

- Locality
- Age of the house
- Closeness to emergency services
- Transportation services
- Security
- Vehicle parking
- Connectivity to freeways and city streets

Well, these are just the aspects that came off the top of my head, and we can be certain that there are numerous other factors as well. The possibilities with Machine Learning with Python are virtually endless. Since Python has millions of learners, collaboration is very easy. Here’s the **Machine Learning Community** page where you can collaborate with other learners and get your queries answered. Now, let’s take a look at the next important step on this path.

**Step 3: Data Gathering**

The third step in this Machine Learning with Python blog is gathering the data. The data is easily available on Kaggle, but you could also download it directly from here into your Python IDE. It contains 81 variables (which are self-explanatory). I am sure you will have a fun time exploring this dataset. As a Machine Learning with Python enthusiast, you will get accustomed to working with a large amount of data.

The variable to look out for is **SalePrice**: it is the target we will work toward throughout this project. So, without further ado, let’s get our hands messy with some code!


**Step 4: Data Understanding**

Understanding the data is key to Machine Learning with Python. As we’ve already seen, this step is primarily concerned with exploring the data and making sense of it. According to experienced programmers, a good data understanding route consists of the following:

**Single Variable Analysis:** This is used to develop a single plot over one variable. Example: Density plots, histograms, etc.

**Double Variable Analysis:** This is used to visualize two variables (on the ‘x’ and ‘y’ axes) in a single plot. Example: Various types of charts, such as bar, line, and pie charts.

**‘N’ Variable Analysis:** As you could discern from the header, this is used to describe and present more than two variables. Example: Stacked bar chart.

**Comparison Tables:** They are used to present the contrast between categorical variables.
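
Before plotting anything, the same kinds of analysis can be tried numerically with pandas. The sketch below uses a tiny made-up table; the column names mirror the housing dataset, but the values are assumptions for illustration only.

```python
import pandas as pd

# Toy data: column names follow the housing dataset, values are made up
toy = pd.DataFrame({
    "SalePrice": [120000, 150000, 98000, 310000, 205000, 175000],
    "GrLivArea": [900, 1200, 800, 2400, 1700, 1500],
    "CentralAir": ["Y", "Y", "N", "Y", "Y", "N"],
    "SaleCondition": ["Normal", "Normal", "Abnorml", "Partial", "Normal", "Abnorml"],
})

# Single variable: summary statistics of one column
price_summary = toy["SalePrice"].describe()

# Two variables: how strongly two numeric columns move together
area_price_corr = toy["GrLivArea"].corr(toy["SalePrice"])

# Comparison table: counts for each pair of categorical levels
comparison = pd.crosstab(toy["CentralAir"], toy["SaleCondition"])

print(price_summary)
print(f"Correlation between GrLivArea and SalePrice: {area_price_corr:.2f}")
print(comparison)
```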

Let us begin the Machine Learning with Python journey with some simple code!

**Load the libraries:**

    # Loading libraries
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    %matplotlib inline
    plt.rcParams['figure.figsize'] = (10.0, 8.0)
    import seaborn as sns
    from scipy import stats
    from scipy.stats import norm

**Load the data:**

    # Loading data
    train = pd.read_csv("/data/Housing/train.csv")
    test = pd.read_csv("/data/Housing/test.csv")

In case you want to take a look at the data, use the following code snippet:

`train.head()`

Let us check out the number of rows and columns our dataset has:

    print('The train data has {0} rows and {1} columns'.format(train.shape[0], train.shape[1]))
    print('----------------------------')
    print('The test data has {0} rows and {1} columns'.format(test.shape[0], test.shape[1]))

**Output:**

    The train data has 1460 rows and 81 columns
    ----------------------------
    The test data has 1459 rows and 80 columns

You can use the info() command as well, to check the contents of the dataset (an alternative to the above-mentioned snippet):

`train.info()`

Next, let’s go about checking to see if our dataset is missing any values:

    # Check missing values
    train.columns[train.isnull().any()]

**Output:**

    ['LotFrontage', 'Alley', 'MasVnrType', 'MasVnrArea', 'BsmtQual', 'BsmtCond',
     'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'FireplaceQu',
     'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond',
     'PoolQC', 'Fence', 'MiscFeature']

Interesting! We can see that 19 features in the dataset have missing values. To check how much data is missing, we can express this as a fraction of the rows for each feature.

Make use of the following code snippet:

    # Fraction of missing values in each of these columns
    miss = train.isnull().sum() / len(train)
    miss = miss[miss > 0]
    miss.sort_values(inplace=True)
    miss

**Output:**

    Electrical      0.000685
    MasVnrType      0.005479
    MasVnrArea      0.005479
    BsmtQual        0.025342
    BsmtCond        0.025342
    BsmtFinType1    0.025342
    BsmtExposure    0.026027
    BsmtFinType2    0.026027
    GarageCond      0.055479
    GarageQual      0.055479
    GarageFinish    0.055479
    GarageType      0.055479
    GarageYrBlt     0.055479
    LotFrontage     0.177397
    FireplaceQu     0.472603
    Fence           0.807534
    Alley           0.937671
    MiscFeature     0.963014
    PoolQC          0.995205
    dtype: float64

There is a simple inference we can make from the above table. On the left we have the variables, and on the right the fraction of values that are missing. Take Alley, for example: 93.7 percent of its values are missing. One thing you must know when you’re learning Machine Learning with Python is that keeping the data in good shape is the key. As you can probably figure out by now, Machine Learning with Python is simpler than you initially thought!
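
The same calculation is easy to check by hand on a tiny frame. The sketch below uses made-up values (only the column names come from the housing data) and expresses the result as a percentage; `missing_percent` is just an illustrative variable name.

```python
import numpy as np
import pandas as pd

# Four rows of made-up data with some gaps
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# isnull() marks missing cells; the column mean of booleans is the fraction missing
missing_fraction = toy.isnull().mean()
missing_fraction = missing_fraction[missing_fraction > 0].sort_values()
missing_percent = (missing_fraction * 100).round(1)
print(missing_percent)
```

Here `Alley` is missing in three of four rows (75 percent) and `LotFrontage` in two of four (50 percent), while `LotArea` is complete and drops out of the result.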

On that note, let us visualize these missing values. Use this code:

    # Visualising missing values
    miss = miss.to_frame()
    miss.columns = ['count']
    miss.index.names = ['Name']
    miss['Name'] = miss.index

    # Plot the fraction of missing values per feature
    sns.set(style="whitegrid", color_codes=True)
    sns.barplot(x='Name', y='count', data=miss)
    plt.xticks(rotation=90)
    plt.show()  # use plt directly; sns.plt is not available in current seaborn

Let’s move ahead and look at finding out the distribution of our goal variable **SalePrice.**

    # SalePrice
    # histplot with kde=True replaces distplot, which is deprecated in recent seaborn
    sns.histplot(train['SalePrice'], kde=True)

Take a look at the graph above. Do you notice that our variable **SalePrice** is skewed to the right? Ideally, we want the target to follow something close to a normal distribution. How do we achieve this? Simple, let’s make use of a logarithmic transformation! Why a normal distribution? Well, many models capture the relationship between the target variable (**SalePrice**, in our case) and the independent variables more accurately when the target is approximately normally distributed. On top of this, we can use the skewness metric to confirm the behavior.

**Verifying the skewing:**

    # Skewness
    print("The skewness of SalePrice is {}".format(train['SalePrice'].skew()))

**Output:**

`The skewness of SalePrice is 1.88287575977`
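
To get a feel for what this number measures, here is a small plain-Python sketch on synthetic data (the prices are generated, not taken from the dataset, and `skewness` is a hypothetical helper). It uses the simple population formula, so values will differ slightly from pandas’ `skew()`, which applies a bias correction.

```python
import math
import random

def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(42)
# Synthetic right-skewed "prices": the exponential of normally distributed values
prices = [math.exp(rng.gauss(12, 0.4)) for _ in range(5000)]
log_prices = [math.log(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)
print(f"skewness of raw prices: {raw_skew:.2f}")
print(f"skewness of log prices: {log_skew:.2f}")
```

A value near zero indicates a roughly symmetric distribution, while a clearly positive value indicates a long right tail. Taking the logarithm pulls that tail in.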

The next thing we can do is to apply the log transform to this particular variable. This will let us find out if we can move a bit closer to a normal distribution.

    # Now transforming the target variable
    target = np.log(train['SalePrice'])
    print('Skewness is', target.skew())
    # histplot with kde=True replaces distplot, which is deprecated in recent seaborn
    sns.histplot(target, kde=True)

**Output:**

`Skewness is 0.12133506220520406`

From the above graph, it is clear that the log transform has fixed the skew: the target variable now looks much closer to a normal distribution! But this was one variable; we have another 80 remaining. What do we do about those? Worry not, fellow learners, we have a method which we can use to plot every variable at once. Before that, there is just one more step: splitting the numerical and categorical variables to get a different perspective.

    # Separate variables into new data frames
    numeric_data = train.select_dtypes(include=[np.number])
    cat_data = train.select_dtypes(exclude=[np.number])
    print("There are {} numeric and {} categorical columns in train data".format(
        numeric_data.shape[1], cat_data.shape[1]))

As you can see, we have about 43 categorical and 38 numeric columns in the training data. Let’s go ahead and remove the identifier variable Id before doing anything else.

Use the following piece of code:

`del numeric_data['Id']`

Pretty simple, right? Next, we need to check which numeric variables are correlated with one another. If we come across strongly correlated variables, we can remove one of them. Why remove them? Well, a variable that is almost a copy of another tells us nothing new. So, we’d be better off without any redundant features!

Here’s the correlation plot for the same:

    # Correlation plot
    corr = numeric_data.corr()
    sns.heatmap(corr)

Take a look at the penultimate row of the above map. It shows how each variable correlates with our **SalePrice** variable, and if you look closely, some variables are clearly more strongly related to the target than others. To get more clarity than the graph offers, we can look at the numeric correlation scores directly.

Check this out:

    print(corr['SalePrice'].sort_values(ascending=False)[:15], '\n')  # top 15 values
    print('----------------------')
    print(corr['SalePrice'].sort_values(ascending=False)[-5:])  # last 5 values

**Output:**

    SalePrice       1.000000
    OverallQual     0.790982
    GrLivArea       0.708624
    GarageCars      0.640409
    GarageArea      0.623431
    TotalBsmtSF     0.613581
    1stFlrSF        0.605852
    FullBath        0.560664
    TotRmsAbvGrd    0.533723
    YearBuilt       0.522897
    YearRemodAdd    0.507101
    GarageYrBlt     0.486362
    MasVnrArea      0.477493
    Fireplaces      0.466929
    BsmtFinSF1      0.386420
    Name: SalePrice, dtype: float64

    ----------------------
    YrSold          -0.028923
    OverallCond     -0.077856
    MSSubClass      -0.084284
    EnclosedPorch   -0.128578
    KitchenAbvGr    -0.135907
    Name: SalePrice, dtype: float64

Check out the OverallQual feature: its correlation with our target variable is 0.79. This variable denotes the quality of the materials that went into the construction. Well, it is pretty obvious that we care about the construction material which goes into building our houses! One other important aspect we consider is the living area, right? The variable **GrLivArea** refers to just this (in sq. ft.). The remaining variables describe things like the garage, the basement, and so on.

Next up on this Machine Learning with Python blog, let’s take a detailed look at the OverallQual variable. Check it out:
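
The earlier idea of dropping redundant, strongly correlated features can also be automated. The sketch below is one possible approach on synthetic data: `highly_correlated_pairs` is a hypothetical helper, and the 0.9 threshold is an assumption you would tune for your own data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
garage_cars = rng.integers(0, 4, size=n)
toy = pd.DataFrame({
    "GarageCars": garage_cars,
    # Nearly redundant with GarageCars by construction
    "GarageArea": garage_cars * 250 + rng.normal(0, 20, size=n),
    "YrSold": rng.integers(2006, 2011, size=n),
})

def highly_correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, corr) for column pairs with |corr| above threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if abs(corr.loc[a, b]) > threshold:
                pairs.append((a, b, float(corr.loc[a, b])))
    return pairs

pairs = highly_correlated_pairs(toy)
print(pairs)
```

For each reported pair, keeping only one of the two columns is usually enough, since the other adds little new information.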

    train['OverallQual'].unique()

**Output:**

    array([ 7,  6,  8,  5,  9,  4, 10,  3,  1,  2])

What we need to know is that the overall quality is rated on a scale of 1–10. It can be treated as an ordinal variable because its values follow a natural order.

To give you a better idea of the pricing, let us take a quick look at the median sale prices. I know you might be wondering about the usage of the median values here. Any guess? Well, no issues if you didn’t. We need it because our target variable is skewed, remember? There are always outliers in this case. To reduce their influence, we make use of the median. At the end of the day, the median is robust against outliers!

I’m sure that we’ve all heard of Pandas. It is vital for Machine Learning with Python. Let us quickly make use of Pandas to create the aggregated tables.
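
A quick sketch with made-up prices shows why the median holds up better than the mean when one extreme value sneaks in:

```python
from statistics import mean, median

prices = [120000, 135000, 150000, 160000, 175000]
with_outlier = prices + [2500000]  # one extreme sale

print(mean(prices), median(prices))              # 148000 150000
print(mean(with_outlier), median(with_outlier))  # 540000 155000.0
```

A single outlier more than triples the mean, while the median moves by only 5,000.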

Here you go:

    # Let's check the median price per quality level and plot it
    pivot = train.pivot_table(index='OverallQual', values='SalePrice', aggfunc='median')
    pivot.sort_index()

**Output:**

    1      50150
    2      60000
    3      86250
    4     108000
    5     133000
    6     160000
    7     200141
    8     269750
    9     345000
    10    432390
    Name: SalePrice, dtype: int64

To make things a little interesting, let us plot the table and figure out what the median behavior looks like when visualized.

Use the following piece of code:

`pivot.plot(kind='bar', color='red')`

That’s a fine curve, don’t you think? (You will get to witness a lot of these when you’re working on Machine Learning with Python.) This is nothing out of the ordinary: the price of the house increases with the quality of the build. Let’s take a look at another variable.

**GrLivArea** visualization:

    # GrLivArea variable
    sns.jointplot(x=train['GrLivArea'], y=train['SalePrice'])

Again, the same relationship holds: a larger living area goes with a higher price. We can also see some outliers in the graph, which we will clean up in the data processing step.

Next, let’s look at how the selling price relates to SaleCondition. We do not have much insight into what its categories mean, though.

    sp_pivot = train.pivot_table(index='SaleCondition', values='SalePrice', aggfunc='median')
    sp_pivot

    SaleCondition
    Abnorml    130000
    AdjLand    104000
    Alloca     148145
    Family     140500
    Normal     160000
    Partial    244600
    Name: SalePrice, dtype: int64

`sp_pivot.plot(kind='bar',color='red')`

Are you seeing what I see? Machine Learning with Python can get tricky sometimes! The SaleCondition level Partial is through the roof! At this point, it is obvious that we cannot generate many more insights from plots like these alone. We can use the ANOVA test to get some clarity about how our target variable relates to the categorical variables. Now, what is ANOVA? It is a statistical test used to determine whether the means of several groups differ significantly. Let’s look at a simple example:

Consider that we have two categorical variables (X and Y), each with three levels (x1, x2, x3 and y1, y2, y3). For each variable, the ANOVA test checks whether the mean of the target variable is the same across its levels. If the means do not differ significantly, the variable tells us little about the target, and we can safely remove it.
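
The heart of the ANOVA test is the F statistic, which compares the variation between group means with the variation inside the groups. Here is a plain-Python sketch on made-up prices (in thousands); `f_statistic` is a hypothetical helper, and in practice `scipy.stats.f_oneway` returns both this statistic and the matching p-value.

```python
def f_statistic(groups):
    """One-way ANOVA F statistic: between-group over within-group variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    k = len(groups)       # number of groups (levels)
    n = len(all_values)   # total number of observations
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# The same nine prices, grouped by the levels of two different features
informative = [[100, 110, 105], [150, 160, 155], [200, 210, 205]]
uninformative = [[100, 160, 200], [110, 150, 210], [105, 155, 205]]

f_informative = f_statistic(informative)
f_uninformative = f_statistic(uninformative)
print(f_informative)    # group means differ a lot, so F is large
print(f_uninformative)  # group means are almost equal, so F is close to 0
```

A large F (and therefore a small p-value) means the feature separates the prices well; an F near zero means the levels all look alike.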


Next on this Machine Learning with Python blog, let us create a function which computes the ‘p’ value for each categorical variable. We use the ‘p’ values to calculate a disparity score, defined in the code as log(1/p). The higher this score, the more useful the feature is in predicting the sale price.

    cat = [f for f in train.columns if train.dtypes[f] == 'object']

    def anova(frame):
        anv = pd.DataFrame()
        anv['features'] = cat
        pvals = []
        for c in cat:
            samples = []
            # Skip NaN so that missing levels do not produce empty groups
            for cls in frame[c].dropna().unique():
                s = frame[frame[c] == cls]['SalePrice'].values
                samples.append(s)
            pval = stats.f_oneway(*samples)[1]
            pvals.append(pval)
        anv['pval'] = pvals
        return anv.sort_values('pval')

    cat_data['SalePrice'] = train.SalePrice.values
    k = anova(cat_data)
    k['disparity'] = np.log(1. / k['pval'].values)
    sns.barplot(data=k, x='features', y='disparity')
    plt.xticks(rotation=90)
    plt.show()

From the above graph, it is easy to see that the variable Neighborhood turns out to be one of the most important features. This means that people attach a lot of importance to the neighborhood, the quality of materials, the quality of the walls, etc.

The next major step to achieve Machine Learning with Python is to process the obtained data efficiently.

**Step 5: Data Processing**

This is a vital step in Machine Learning with Python because, here, we deal with those outliers first-hand. There are a couple of other things that we will be looking at as well, such as encoding variables, imputing missing values, and doing our best to remove redundancy and clear out any inconsistencies from the dataset. Do you remember that we spotted outliers in the living area variable before? Let us make sure we remove them; it is really easy.

    # Removing outliers
    train.drop(train[train['GrLivArea'] > 4000].index, inplace=True)
    train.shape  # removed 4 rows

**Output:**

    (1456, 81)

In row 666 of our test dataset, the garage-related data seem to be missing. We can impute those too.

    # Imputing using the mode
    test.loc[666, 'GarageQual'] = "TA"     # stats.mode(test['GarageQual']).mode
    test.loc[666, 'GarageCond'] = "TA"     # stats.mode(test['GarageCond']).mode
    test.loc[666, 'GarageFinish'] = "Unf"  # stats.mode(test['GarageFinish']).mode
    test.loc[666, 'GarageYrBlt'] = 1980    # np.nanmedian(test['GarageYrBlt']); kept numeric

Next, we can go ahead and encode all of the categorical variables. Why do we need this? Because most Machine Learning algorithms cannot work with categorical variables directly. We make use of Sklearn to encode the variables. Sklearn is one of the most important libraries used for Machine Learning with Python. Specifically, we make use of the LabelEncoder class from Sklearn.

Here’s the function to do that:

    # Importing the encoder
    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()

    def factorize(data, var, fill_na=None):
        if fill_na is not None:
            # Assign back instead of calling fillna in place on a column selection
            data[var] = data[var].fillna(fill_na)
        le.fit(data[var])
        data[var] = le.transform(data[var])
        return data

The above function replaces missing levels with the value passed in as fill_na, typically the mode of the variable, and then encodes the variable. One thing to note is that we need to supply the mode values manually here.

Next, we can look at the **LotFrontage** variable (you don’t need to know all the variables when you work with Machine Learning with Python, but it is recommended that you do). The data exploration step says we need to impute the values here too. It is very important that you give yourself some time and put in a decent effort when it comes to data exploration. To do this, we will combine our training and test datasets so that we can modify the values in both in one shot. We all love to save some time, don’t we?
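
To see what label encoding actually does, here is a plain-Python sketch of the idea. It mimics the behavior of assigning integer codes to the sorted distinct labels; `fit_label_encoder` and `transform` are hypothetical helpers written for illustration, not Sklearn functions.

```python
def fit_label_encoder(values):
    """Map each distinct label to an integer, using the sorted order of the labels."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}

def transform(values, mapping):
    """Replace every label with its integer code."""
    return [mapping[v] for v in values]

zoning = ["RL", "RM", "RL", "FV", "RH", "RL"]
mapping = fit_label_encoder(zoning)
encoded = transform(zoning, mapping)
print(mapping)  # {'FV': 0, 'RH': 1, 'RL': 2, 'RM': 3}
print(encoded)  # [2, 3, 2, 0, 1, 2]
```

Keep in mind that these integers imply an order (0 < 1 < 2 < 3) that the original labels may not have, which is one reason ordered variables are handled separately with explicit dictionaries.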

Use the following code snippet:

    # Combine the data sets
    alldata = pd.concat([train, test])  # DataFrame.append was removed in pandas 2.0
    alldata.shape

**Output:**

    (2915, 81)

So this dataset has 2,915 rows and 81 columns as we can see. Let’s go about imputing the **LotFrontage** variable now.

    # Impute LotFrontage by the median of the neighborhood
    lot_frontage_by_neighborhood = train['LotFrontage'].groupby(train['Neighborhood'])

    for key, group in lot_frontage_by_neighborhood:
        idx = (alldata['Neighborhood'] == key) & (alldata['LotFrontage'].isnull())
        alldata.loc[idx, 'LotFrontage'] = group.median()
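
The same idea can be written more compactly with `groupby(...).transform`. The sketch below uses a toy frame with made-up values and, unlike the loop above, computes the medians from the same frame it fills, so treat it as a variation rather than a drop-in replacement.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "OldTown", "OldTown", "OldTown"],
    "LotFrontage": [70.0, np.nan, 80.0, 50.0, 60.0, np.nan],
})

# Median per neighborhood, broadcast back to every row of that neighborhood
group_median = toy.groupby("Neighborhood")["LotFrontage"].transform("median")

# Only the missing cells take the group median; existing values are kept
toy["LotFrontage"] = toy["LotFrontage"].fillna(group_median)
print(toy)
```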

For the other numeric variables in consideration, we can replace the missing values with zero. So that should take care of it!

    # Imputing missing values with zero
    zero_fill_cols = ["MasVnrArea", "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF",
                      "GarageArea", "BsmtFullBath", "BsmtHalfBath", "GarageCars",
                      "GarageYrBlt", "PoolArea"]

    for col in zero_fill_cols:
        # Assign back instead of calling fillna in place on a column selection
        alldata[col] = alldata[col].fillna(0)

Anything with ‘qual’ or ‘quality’ in the variable name can be treated as an ordered variable. Now, it is time to convert these categorical variables into ordered variables. How do we go about this? Start by creating a simple dictionary of key-value pairs and then use it to map the values in the dataset.

Check this out:

    qual_dict = {np.nan: 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
    name = np.array(['ExterQual', 'PoolQC', 'ExterCond', 'BsmtQual', 'BsmtCond',
                     'HeatingQC', 'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond'])

    for i in name:
        alldata[i] = alldata[i].map(qual_dict).astype(int)

    alldata["BsmtExposure"] = alldata["BsmtExposure"].map(
        {np.nan: 0, "No": 1, "Mn": 2, "Av": 3, "Gd": 4}).astype(int)

    bsmt_fin_dict = {np.nan: 0, "Unf": 1, "LwQ": 2, "Rec": 3, "BLQ": 4, "ALQ": 5, "GLQ": 6}
    alldata["BsmtFinType1"] = alldata["BsmtFinType1"].map(bsmt_fin_dict).astype(int)
    alldata["BsmtFinType2"] = alldata["BsmtFinType2"].map(bsmt_fin_dict).astype(int)
    alldata["Functional"] = alldata["Functional"].map(
        {np.nan: 0, "Sal": 1, "Sev": 2, "Maj2": 3, "Maj1": 4,
         "Mod": 5, "Min2": 6, "Min1": 7, "Typ": 8}).astype(int)
    alldata["GarageFinish"] = alldata["GarageFinish"].map(
        {np.nan: 0, "Unf": 1, "RFn": 2, "Fin": 3}).astype(int)
    alldata["Fence"] = alldata["Fence"].map(
        {np.nan: 0, "MnWw": 1, "GdWo": 2, "MnPrv": 3, "GdPrv": 4}).astype(int)

    # Encoding data
    alldata["CentralAir"] = (alldata["CentralAir"] == "Y") * 1.0
    varst = np.array(['MSSubClass', 'LotConfig', 'Neighborhood', 'Condition1', 'BldgType',
                      'HouseStyle', 'RoofStyle', 'Foundation', 'SaleCondition'])

    for x in varst:
        factorize(alldata, x)

    # Encode variables and impute missing values
    alldata = factorize(alldata, "MSZoning", "RL")
    alldata = factorize(alldata, "Exterior1st", "Other")
    alldata = factorize(alldata, "Exterior2nd", "Other")
    alldata = factorize(alldata, "MasVnrType", "None")
    alldata = factorize(alldata, "SaleType", "Oth")

Next up on this Machine Learning with Python blog is to understand one very important aspect of Data Science, which is feature engineering.
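
The dictionary-mapping trick is easy to try on a small Series first. This sketch uses made-up kitchen quality labels; instead of putting `np.nan` into the dictionary as the code above does, it fills the unmapped entries with 0 afterwards, which gives the same result here.

```python
import numpy as np
import pandas as pd

quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
kitchen = pd.Series(["Gd", "TA", np.nan, "Ex", "Fa"], name="KitchenQual")

# Labels missing from the dictionary (including NaN) come back as NaN,
# so fill them with 0 before converting to integers
kitchen_score = kitchen.map(quality_order).fillna(0).astype(int)
print(kitchen_score.tolist())  # [4, 3, 0, 5, 2]
```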


**Step 6: Feature engineering**

Feature engineering requires some hands-on experience and good domain knowledge. It is also vital to be creative. Data exploration will supply the ideas needed for new features. The basic idea here is to create new features which drive the algorithm to make faster and better predictions.

We are on our way to achieving Machine Learning with Python. We already have 81 features. Let’s go ahead and create some new and creative features. We’re already aware that most of the categorical variables show very little variation, with one level dominating. We can therefore create binary features that are 1 for the dominant level and 0 otherwise. We will use comments so you can keep track of what is going on.

    # Creating new variables (1 or 0) based on irregular count levels
    # The level with the highest count is kept as 1 and the rest as 0
    alldata["IsRegularLotShape"] = (alldata["LotShape"] == "Reg") * 1
    alldata["IsLandLevel"] = (alldata["LandContour"] == "Lvl") * 1
    alldata["IsLandSlopeGentle"] = (alldata["LandSlope"] == "Gtl") * 1
    alldata["IsElectricalSBrkr"] = (alldata["Electrical"] == "SBrkr") * 1
    alldata["IsGarageDetached"] = (alldata["GarageType"] == "Detchd") * 1
    alldata["IsPavedDrive"] = (alldata["PavedDrive"] == "Y") * 1
    alldata["HasShed"] = (alldata["MiscFeature"] == "Shed") * 1
    alldata["Remodeled"] = (alldata["YearRemodAdd"] != alldata["YearBuilt"]) * 1

    # Did the remodeling happen during the sale year?
    alldata["RecentRemodel"] = (alldata["YearRemodAdd"] == alldata["YrSold"]) * 1

    # Was this house sold in the year it was built?
    alldata["VeryNewHouse"] = (alldata["YearBuilt"] == alldata["YrSold"]) * 1

    # An area greater than zero means the house has the feature
    # (the flattened source compared with == 0, which inverts the meaning of "Has...")
    alldata["Has2ndFloor"] = (alldata["2ndFlrSF"] > 0) * 1
    alldata["HasMasVnr"] = (alldata["MasVnrArea"] > 0) * 1
    alldata["HasWoodDeck"] = (alldata["WoodDeckSF"] > 0) * 1
    alldata["HasOpenPorch"] = (alldata["OpenPorchSF"] > 0) * 1
    alldata["HasEnclosedPorch"] = (alldata["EnclosedPorch"] > 0) * 1
    alldata["Has3SsnPorch"] = (alldata["3SsnPorch"] > 0) * 1
    alldata["HasScreenPorch"] = (alldata["ScreenPorch"] > 0) * 1

    # Setting levels with a high count as 1 and the rest as 0
    # You can check for them using the value_counts function
    alldata["HighSeason"] = alldata["MoSold"].replace(
        {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0})
    alldata["NewerDwelling"] = alldata["MSSubClass"].replace(
        {20: 1, 30: 0, 40: 0, 45: 0, 50: 0, 60: 1, 70: 0, 75: 0, 80: 0, 85: 0,
         90: 0, 120: 1, 150: 0, 160: 0, 180: 0, 190: 0})

With that out of the way, we can take a look at the number of columns we got from that:

    alldata.shape

**Output:**

    (2915, 100)

What is this telling us? Simple, we now have 100 features in the dataset, so we’ve created 19 more than the original 81. Next, we need another copy of the data, so let us merge our training and testing files once more. What is special about this new copy? Well, it contains all the original feature values, to begin with, and this helps us create more features down the line. Sounds good, right?

    # Create alldata2 from the training and test data
    alldata2 = pd.concat([train, test])  # DataFrame.append was removed in pandas 2.0

    alldata["SaleCondition_PriceDown"] = alldata2.SaleCondition.replace(
        {'Abnorml': 1, 'Alloca': 1, 'AdjLand': 1, 'Family': 1, 'Normal': 0, 'Partial': 0})

    # House completed before sale or not
    alldata["BoughtOffPlan"] = alldata2.SaleCondition.replace(
        {"Abnorml": 0, "Alloca": 0, "AdjLand": 0, "Family": 0, "Normal": 0, "Partial": 1})
    alldata["BadHeating"] = alldata2.HeatingQC.replace(
        {'Ex': 0, 'Gd': 0, 'TA': 0, 'Fa': 1, 'Po': 1})

Just like the categorical variables, the columns related to the property area are associated with one another, so we can combine them into totals. We can also create some new features which are based on the year that the house was built, remodeled, or sold.

Let’s give that a spin:

```
# calculating total area using all area columns
area_cols = ['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
             'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF',
             'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'LowQualFinSF', 'PoolArea']
alldata["TotalArea"] = alldata[area_cols].sum(axis=1)
alldata["TotalArea1st2nd"] = alldata["1stFlrSF"] + alldata["2ndFlrSF"]
alldata["Age"] = 2010 - alldata["YearBuilt"]
alldata["TimeSinceSold"] = 2010 - alldata["YrSold"]
alldata["SeasonSold"] = alldata["MoSold"].map({12: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1,
                                               6: 2, 7: 2, 8: 2, 9: 3, 10: 3, 11: 3}).astype(int)
alldata["YearsSinceRemodel"] = alldata["YrSold"] - alldata["YearRemodAdd"]

# Simplifications of existing features into bad/average/good based on counts
alldata["SimplOverallQual"] = alldata.OverallQual.replace({1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 3, 8: 3, 9: 3, 10: 3})
alldata["SimplOverallCond"] = alldata.OverallCond.replace({1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 3, 8: 3, 9: 3, 10: 3})
alldata["SimplPoolQC"] = alldata.PoolQC.replace({1: 1, 2: 1, 3: 2, 4: 2})
alldata["SimplGarageCond"] = alldata.GarageCond.replace({1: 1, 2: 1, 3: 1, 4: 2, 5: 2})
alldata["SimplGarageQual"] = alldata.GarageQual.replace({1: 1, 2: 1, 3: 1, 4: 2, 5: 2})
alldata["SimplFireplaceQu"] = alldata.FireplaceQu.replace({1: 1, 2: 1, 3: 1, 4: 2, 5: 2})
alldata["SimplFunctional"] = alldata.Functional.replace({1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 3, 8: 4})
alldata["SimplKitchenQual"] = alldata.KitchenQual.replace({1: 1, 2: 1, 3: 1, 4: 2, 5: 2})
alldata["SimplHeatingQC"] = alldata.HeatingQC.replace({1: 1, 2: 1, 3: 1, 4: 2, 5: 2})
alldata["SimplBsmtFinType1"] = alldata.BsmtFinType1.replace({1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2})
alldata["SimplBsmtFinType2"] = alldata.BsmtFinType2.replace({1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2})
alldata["SimplBsmtCond"] = alldata.BsmtCond.replace({1: 1, 2: 1, 3: 1, 4: 2, 5: 2})
alldata["SimplBsmtQual"] = alldata.BsmtQual.replace({1: 1, 2: 1, 3: 1, 4: 2, 5: 2})
alldata["SimplExterCond"] = alldata.ExterCond.replace({1: 1, 2: 1, 3: 1, 4: 2, 5: 2})
alldata["SimplExterQual"] = alldata.ExterQual.replace({1: 1, 2: 1, 3: 1, 4: 2, 5: 2})

# grouping neighborhood variable based on this plot
train['SalePrice'].groupby(train['Neighborhood']).median().sort_values().plot(kind='bar')
```

From the above graph, you can get a decent idea of how to combine the levels of the neighborhood variable into a much smaller set. Let’s go ahead and implement this by merging bars of (almost) equal height into a single category. How do we go about this? Simple: create a dictionary that assigns each neighborhood to a bin, and map the column through it.

```
neighborhood_map = {"MeadowV": 0, "IDOTRR": 1, "BrDale": 1, "OldTown": 1, "Edwards": 1, "BrkSide": 1,
                    "Sawyer": 1, "Blueste": 1, "SWISU": 2, "NAmes": 2, "NPkVill": 2, "Mitchel": 2,
                    "SawyerW": 2, "Gilbert": 2, "NWAmes": 2, "Blmngtn": 2, "CollgCr": 2, "ClearCr": 3,
                    "Crawfor": 3, "Veenker": 3, "Somerst": 3, "Timber": 3, "StoneBr": 4, "NoRidge": 4,
                    "NridgHt": 4}
alldata['NeighborhoodBin'] = alldata2['Neighborhood'].map(neighborhood_map)

alldata.loc[alldata2.Neighborhood == 'NridgHt', "Neighborhood_Good"] = 1
alldata.loc[alldata2.Neighborhood == 'Crawfor', "Neighborhood_Good"] = 1
alldata.loc[alldata2.Neighborhood == 'StoneBr', "Neighborhood_Good"] = 1
alldata.loc[alldata2.Neighborhood == 'Somerst', "Neighborhood_Good"] = 1
alldata.loc[alldata2.Neighborhood == 'NoRidge', "Neighborhood_Good"] = 1
alldata["Neighborhood_Good"].fillna(0, inplace=True)

alldata["SaleCondition_PriceDown"] = alldata2.SaleCondition.replace({'Abnorml': 1, 'Alloca': 1, 'AdjLand': 1,
                                                                     'Family': 1, 'Normal': 0, 'Partial': 0})

# House completed before sale or not
alldata["BoughtOffPlan"] = alldata2.SaleCondition.replace({"Abnorml": 0, "Alloca": 0, "AdjLand": 0,
                                                           "Family": 0, "Normal": 0, "Partial": 1})
alldata["BadHeating"] = alldata2.HeatingQC.replace({'Ex': 0, 'Gd': 0, 'TA': 0, 'Fa': 1, 'Po': 1})

alldata.shape
# Output: (2915, 124)
```
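The bins above were picked by eye from the bar chart. As a rough alternative, a similar mapping could be derived from the medians themselves. The sketch below is only an illustration in plain Python: the helper `bin_by_median`, its equal-sized ranking scheme, and the toy prices are assumptions, not part of the original workflow.

```python
from statistics import median


def bin_by_median(prices_by_group, n_bins):
    """Rank groups by median price and split the ranking into n_bins bins.

    Hypothetical helper: groups are sorted by median and assigned to
    equal-sized bins (0 = cheapest), which only approximates merging
    bars of similar height by eye.
    """
    medians = {group: median(prices) for group, prices in prices_by_group.items()}
    ranked = sorted(medians, key=medians.get)
    return {group: rank * n_bins // len(ranked) for rank, group in enumerate(ranked)}


# Toy sale prices per neighborhood (made-up numbers)
toy_prices = {
    "A": [90, 100, 95],
    "B": [140, 150, 160],
    "C": [300, 310, 320],
    "D": [120, 125, 130],
}
toy_map = bin_by_median(toy_prices, n_bins=2)
print(toy_map)
```

Equal-sized bins ignore how far apart the medians are, so a hand-tuned dictionary like the one above can still be the better choice when a few neighborhoods clearly stand out.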

Well, until now, we’ve made sure to add about 43 new (and exciting) features to the dataset, correct? Why not add a few more? Before we can do that, we need to split the combined data back into the training and the test datasets: rows with a known SalePrice go to training, and rows where it is missing go to test.

```
# create new data
train_new = alldata[alldata['SalePrice'].notnull()]
test_new = alldata[alldata['SalePrice'].isnull()]

print('Train', train_new.shape)
print('----------------')
print('Test', test_new.shape)

# Output:
# Train (1456, 126)
# ----------------
# Test (1459, 126)
```

What is the first thing we do when we add features? We remove the skew. Every numeric feature whose skewness is above 0.75 gets a log(x + 1) transformation.

```
# get numeric features
numeric_features = [f for f in train_new.columns if train_new[f].dtype != object]

# transform the numeric features using log(x + 1)
from scipy.stats import skew
skewed = train_new[numeric_features].apply(lambda x: skew(x.dropna().astype(float)))
skewed = skewed[skewed > 0.75]
skewed = skewed.index

train_new[skewed] = np.log1p(train_new[skewed])
test_new[skewed] = np.log1p(test_new[skewed])
del test_new['SalePrice']
```
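To see why log(x + 1) helps, here is a small standalone sketch in plain Python. The `skewness` helper and the sample values are made up for illustration; the helper computes the biased sample skewness, which is the default definition used by `scipy.stats.skew`.

```python
import math


def skewness(values):
    """Biased sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up, right-skewed feature values
raw = [1, 2, 4, 8, 16, 32, 64, 128, 256]
logged = [math.log1p(v) for v in raw]

raw_skew = skewness(raw)
logged_skew = skewness(logged)
print(round(raw_skew, 2), round(logged_skew, 2))
```

On this sample the raw values are well above the 0.75 cut-off, while the transformed values fall well below it.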

Next, we can go about standardizing the numeric features:

```
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(train_new[numeric_features])
scaled = scaler.transform(train_new[numeric_features])
for i, col in enumerate(numeric_features):
    train_new[col] = scaled[:, i]

numeric_features.remove('SalePrice')
scaled = scaler.fit_transform(test_new[numeric_features])
for i, col in enumerate(numeric_features):
    test_new[col] = scaled[:, i]
```
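Note that the snippet above calls `fit_transform` on the test set, so the test features are scaled with their own mean and standard deviation. A common alternative is to learn the statistics on the training data only and reuse them for the test data. The plain-Python sketch below shows that idea on made-up numbers; `fit_standardizer` and `standardize` are hypothetical helpers, not scikit-learn functions.

```python
from statistics import fmean, pstdev


def fit_standardizer(values):
    """Return (mean, std) of a column; population std, as StandardScaler uses."""
    return fmean(values), pstdev(values)


def standardize(values, mean, std):
    """Scale values with previously learned statistics."""
    return [(v - mean) / std for v in values]


train_col = [10.0, 20.0, 30.0, 40.0]  # made-up training values
test_col = [25.0, 50.0]               # made-up test values

mean, std = fit_standardizer(train_col)         # learned from training data only
train_scaled = standardize(train_col, mean, std)
test_scaled = standardize(test_col, mean, std)  # same statistics reused
print(train_scaled, test_scaled)
```

With this approach, a test value equal to the training mean always maps to zero, regardless of what else is in the test set.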

Once that is done, we can encode the categorical variables that we spoke about a while ago. Here is a function that does just that:

```
def onehot(onehot_df, df, column_name, fill_na):
    onehot_df[column_name] = df[column_name]
    if fill_na is not None:
        onehot_df[column_name].fillna(fill_na, inplace=True)
    dummies = pd.get_dummies(onehot_df[column_name], prefix="_" + column_name)
    onehot_df = onehot_df.join(dummies)
    onehot_df = onehot_df.drop([column_name], axis=1)
    return onehot_df


def munge_onehot(df):
    onehot_df = pd.DataFrame(index=df.index)

    onehot_df = onehot(onehot_df, df, "MSSubClass", None)
    onehot_df = onehot(onehot_df, df, "MSZoning", "RL")
    onehot_df = onehot(onehot_df, df, "LotConfig", None)
    onehot_df = onehot(onehot_df, df, "Neighborhood", None)
    onehot_df = onehot(onehot_df, df, "Condition1", None)
    onehot_df = onehot(onehot_df, df, "BldgType", None)
    onehot_df = onehot(onehot_df, df, "HouseStyle", None)
    onehot_df = onehot(onehot_df, df, "RoofStyle", None)
    onehot_df = onehot(onehot_df, df, "Exterior1st", "VinylSd")
    onehot_df = onehot(onehot_df, df, "Exterior2nd", "VinylSd")
    onehot_df = onehot(onehot_df, df, "Foundation", None)
    onehot_df = onehot(onehot_df, df, "SaleType", "WD")
    onehot_df = onehot(onehot_df, df, "SaleCondition", "Normal")

    # Fill in missing MasVnrType for rows that do have a MasVnrArea.
    temp_df = df[["MasVnrType", "MasVnrArea"]].copy()
    idx = (df["MasVnrArea"] != 0) & ((df["MasVnrType"] == "None") | (df["MasVnrType"].isnull()))
    temp_df.loc[idx, "MasVnrType"] = "BrkFace"
    onehot_df = onehot(onehot_df, temp_df, "MasVnrType", "None")

    onehot_df = onehot(onehot_df, df, "LotShape", None)
    onehot_df = onehot(onehot_df, df, "LandContour", None)
    onehot_df = onehot(onehot_df, df, "LandSlope", None)
    onehot_df = onehot(onehot_df, df, "Electrical", "SBrkr")
    onehot_df = onehot(onehot_df, df, "GarageType", "None")
    onehot_df = onehot(onehot_df, df, "PavedDrive", None)
    onehot_df = onehot(onehot_df, df, "MiscFeature", "None")
    onehot_df = onehot(onehot_df, df, "Street", None)
    onehot_df = onehot(onehot_df, df, "Alley", "None")
    onehot_df = onehot(onehot_df, df, "Condition2", None)
    onehot_df = onehot(onehot_df, df, "RoofMatl", None)
    onehot_df = onehot(onehot_df, df, "Heating", None)

    # we'll have these as numerical variables too
    onehot_df = onehot(onehot_df, df, "ExterQual", "None")
    onehot_df = onehot(onehot_df, df, "ExterCond", "None")
    onehot_df = onehot(onehot_df, df, "BsmtQual", "None")
    onehot_df = onehot(onehot_df, df, "BsmtCond", "None")
    onehot_df = onehot(onehot_df, df, "HeatingQC", "None")
    onehot_df = onehot(onehot_df, df, "KitchenQual", "TA")
    onehot_df = onehot(onehot_df, df, "FireplaceQu", "None")
    onehot_df = onehot(onehot_df, df, "GarageQual", "None")
    onehot_df = onehot(onehot_df, df, "GarageCond", "None")
    onehot_df = onehot(onehot_df, df, "PoolQC", "None")
    onehot_df = onehot(onehot_df, df, "BsmtExposure", "None")
    onehot_df = onehot(onehot_df, df, "BsmtFinType1", "None")
    onehot_df = onehot(onehot_df, df, "BsmtFinType2", "None")
    onehot_df = onehot(onehot_df, df, "Functional", "Typ")
    onehot_df = onehot(onehot_df, df, "GarageFinish", "None")
    onehot_df = onehot(onehot_df, df, "Fence", "None")
    onehot_df = onehot(onehot_df, df, "MoSold", None)

    # Divide the years between 1871 and 2010 into slices of 20 years
    year_map = pd.concat(pd.Series("YearBin" + str(i + 1), index=range(1871 + i * 20, 1891 + i * 20))
                         for i in range(0, 7))
    yearbin_df = pd.DataFrame(index=df.index)
    yearbin_df["GarageYrBltBin"] = df.GarageYrBlt.map(year_map)
    yearbin_df["GarageYrBltBin"].fillna("NoGarage", inplace=True)
    yearbin_df["YearBuiltBin"] = df.YearBuilt.map(year_map)
    yearbin_df["YearRemodAddBin"] = df.YearRemodAdd.map(year_map)

    onehot_df = onehot(onehot_df, yearbin_df, "GarageYrBltBin", None)
    onehot_df = onehot(onehot_df, yearbin_df, "YearBuiltBin", None)
    onehot_df = onehot(onehot_df, yearbin_df, "YearRemodAddBin", None)
    return onehot_df


# create one-hot features
onehot_df = munge_onehot(train)

# index by the rows of each frame (.index, not .shape) so the values align on join
neighborhood_train = pd.DataFrame(index=train_new.index)
neighborhood_train['NeighborhoodBin'] = train_new['NeighborhoodBin']
neighborhood_test = pd.DataFrame(index=test_new.index)
neighborhood_test['NeighborhoodBin'] = test_new['NeighborhoodBin']

onehot_df = onehot(onehot_df, neighborhood_train, 'NeighborhoodBin', None)
```
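If the `onehot` helper feels opaque, the plain-Python sketch below mimics what it does to a single column. The function `one_hot` and the sample values are made up for illustration. It follows the same `_<column>_<level>` naming and, like `pd.get_dummies` by default, leaves a missing value as zeros in every indicator column unless a fill value is given.

```python
def one_hot(values, column_name, fill_na=None):
    """One-hot encode a list of category values into {column: [0/1, ...]}.

    Missing entries (None) are replaced by fill_na first, if it is given;
    otherwise they get zeros in every indicator column.
    """
    if fill_na is not None:
        values = [fill_na if v is None else v for v in values]
    levels = sorted({v for v in values if v is not None})
    return {
        "_" + column_name + "_" + str(level): [int(v == level) for v in values]
        for level in levels
    }


# Made-up zoning values with one missing entry, filled with "RL"
encoded = one_hot(["RL", "RM", None, "RL"], "MSZoning", fill_na="RL")
print(encoded)
```

Each distinct level becomes its own 0/1 column, which is why a handful of categorical variables can add hundreds of columns to the dataset.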

Adding the variable that we mentioned:

```
train_new = train_new.join(onehot_df)
train_new.shape
# Output: (1456, 433)
```

I am sure that you did not expect a 433-column output! Now let us build the same one-hot features for the test dataset. Follow along:

```
# adding one hot features to test
onehot_df_te = munge_onehot(test)
onehot_df_te = onehot(onehot_df_te, neighborhood_test, "NeighborhoodBin", None)
test_new = test_new.join(onehot_df_te)
test_new.shape
# Output: (1459, 417)
```

What I want you to actively track is the difference in the number of columns between the training dataset (433) and the testing dataset (417). The extra columns come from category levels that appear in the training data but not in the test data. Why keep them? Let’s get rid of these columns and maintain an equal number of columns in both the testing and the training datasets.

```
# dropping some columns from the train data as they are not found in test
drop_cols = ["_Exterior1st_ImStucc", "_Exterior1st_Stone", "_Exterior2nd_Other", "_HouseStyle_2.5Fin",
             "_RoofMatl_Membran", "_RoofMatl_Metal", "_RoofMatl_Roll", "_Condition2_RRAe",
             "_Condition2_RRAn", "_Condition2_RRNn", "_Heating_Floor", "_Heating_OthW",
             "_Electrical_Mix", "_MiscFeature_TenC", "_GarageQual_Ex", "_PoolQC_Fa"]
train_new.drop(drop_cols, axis=1, inplace=True)
train_new.shape
# Output: (1456, 417)
```

That’s nice to look at: an equal number of columns in both the datasets. Why stop here? Let’s remove a few more columns that add little to our application. Which ones? Look for columns that are almost entirely zeros, since they give the algorithm very little to learn from.

```
# removing one column missing from train data
test_new.drop(["_MSSubClass_150"], axis=1, inplace=True)

# Drop these columns
drop_cols = ["_Condition2_PosN",  # only two are not zero
             "_MSZoning_C (all)",
             "_MSSubClass_160"]
train_new.drop(drop_cols, axis=1, inplace=True)
test_new.drop(drop_cols, axis=1, inplace=True)
```
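The column lists above were written out by hand. As a sketch of how the mismatched columns could be found automatically, the plain-Python helper below compares two lists of column names. `columns_only_in` is a hypothetical function, and the short lists merely stand in for `train_new.columns` and `test_new.columns`.

```python
def columns_only_in(first, second):
    """Return the columns of `first` that are missing from `second`, in order."""
    second_set = set(second)
    return [col for col in first if col not in second_set]


# Made-up column lists standing in for train_new.columns and test_new.columns
train_columns = ["LotArea", "_RoofMatl_Metal", "_Heating_Floor", "_MSZoning_RL"]
test_columns = ["LotArea", "_MSZoning_RL", "_MSSubClass_150"]

drop_from_train = columns_only_in(train_columns, test_columns)
drop_from_test = columns_only_in(test_columns, train_columns)
print(drop_from_train, drop_from_test)
```

On the real data, the label column would have to be excluded from such a comparison, since `SalePrice` exists only in the training set.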

Time to finally apply some transformations on the **SalePrice** variable and to put it in its own fancy array!

```
# create a label set
label_df = pd.DataFrame(index=train_new.index, columns=['SalePrice'])
label_df['SalePrice'] = np.log(train['SalePrice'])
print("Training set size:", train_new.shape)
print("Test set size:", test_new.shape)

# Output:
# ('Training set size:', (1456, 414))
# ('Test set size:', (1459, 413))
```

That was very interesting, wasn’t it? To achieve Machine Learning with Python, we actually need to train the model. That is exactly what we’re up to next!

**Training the Model:**

Fellow programmers, rejoice! The data is ready! Time has come for us to train our models. For this particular case, let us make use of three algorithms: XGBoost, Lasso regression, and a neural network (NN)! Eventually, all three will step in and help us generate the required predictions.

Let us begin. First, we make use of the XGBoost algorithm, a gradient-boosted decision tree library that has been dominating the world of Machine Learning in recent days. It is portable and flexible, and it is mainly considered when the data is structured or in a tabular format.
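XGBoost belongs to the family of gradient-boosted decision trees. The toy sketch below shows only the core boosting idea for squared error: each round fits a one-split tree (a stump) to the current residuals and adds a damped version of it to the prediction. It is a plain-Python illustration on made-up data, not XGBoost’s actual algorithm, and it leaves out regularization, subsampling, and the other refinements the library adds; all function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on one feature that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds, learning_rate):
    """Gradient boosting for squared error with one-split trees (stumps)."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # residuals are the negative gradient of 0.5 * (y - p) ** 2
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return base, stumps, preds


def mse(ys, preds):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [1, 2, 3, 4, 5, 6, 7, 8]                  # made-up feature
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 6.2, 6.0]  # made-up target
base, stumps, preds = boost(xs, ys, n_rounds=50, learning_rate=0.1)
baseline_mse = mse(ys, [base] * len(ys))
boosted_mse = mse(ys, preds)
print(baseline_mse, boosted_mse)
```

The small `learning_rate` plays the same role as in the XGBoost call that follows: each tree contributes only a fraction of its fit, so many trees are needed.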

**Implementing XGBoost:**

```
import xgboost as xgb

regr = xgb.XGBRegressor(colsample_bytree=0.2,
                        gamma=0.0,
                        learning_rate=0.05,
                        max_depth=6,
                        min_child_weight=1.5,
                        n_estimators=7200,
                        reg_alpha=0.9,
                        reg_lambda=0.6,
                        subsample=0.2,
                        seed=42,
                        silent=1)
regr.fit(train_new, label_df)
```

We have made use of cross-validation to arrive at these values. We need to implement an RMSE (Root Mean Squared Error) function to check how the model is actually holding up in terms of accuracy.

```
from sklearn.metrics import mean_squared_error

def rmse(y_test, y_pred):
    return np.sqrt(mean_squared_error(y_test, y_pred))

# run prediction on training set to get an idea of how well it does
y_pred = regr.predict(train_new)
y_test = label_df
print("XGBoost score on training set: ", rmse(y_test, y_pred))
# Output: ('XGBoost score on training set: ', 0.037633322832013358)

# make prediction on test set
y_pred_xgb = regr.predict(test_new_one)

# submit this prediction and get the score
pred1 = pd.DataFrame({'Id': test['Id'], 'SalePrice': np.exp(y_pred_xgb)})
pred1.to_csv('xgbnono.csv', header=True, index=False)
```

When submitted, this prediction gets an RMSE score of about 0.12507 (lower is better). Good enough, but let’s go ahead and use the Lasso model now.
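Because the label is the logarithm of `SalePrice`, the training-set RMSE above is measured on the log scale, where an error of 0.1 corresponds to roughly a 10% error in price. The standalone sketch below, on made-up prices, contrasts the two scales; the `rmse` function here is a plain-Python stand-in for the one defined earlier.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up sale prices and predictions
actual = [100000.0, 200000.0]
predicted = [110000.0, 180000.0]

log_rmse = rmse([math.log(a) for a in actual], [math.log(p) for p in predicted])
price_rmse = rmse(actual, predicted)
print(log_rmse, price_rmse)
```

Both predictions are off by about 10%, and the log-scale RMSE reflects that directly, while the RMSE in price units is dominated by the more expensive house.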

Lasso regression is a type of linear regression that relies on shrinkage: it adds an L1 penalty that pulls the coefficient values toward zero and can set some of them to exactly zero. This works well when only a few of the many features really matter, because the resulting model ends up with fewer parameters.
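To make the shrinkage idea concrete: in the idealized case of orthonormal features (an assumption made purely for illustration, which real data such as this dataset rarely satisfies), the Lasso solution is just each least-squares coefficient passed through a soft-threshold. The sketch below applies that operator to made-up coefficients; `soft_threshold` is a hypothetical helper, not a scikit-learn function.

```python
def soft_threshold(coefficient, alpha):
    """Shrink a coefficient toward zero by alpha, clipping at zero."""
    if coefficient > alpha:
        return coefficient - alpha
    if coefficient < -alpha:
        return coefficient + alpha
    return 0.0


# Made-up least-squares coefficients for four features
ols_coefficients = [2.5, -0.8, 0.05, -0.02]
lasso_coefficients = [soft_threshold(c, alpha=0.1) for c in ols_coefficients]
print(lasso_coefficients)
```

Large coefficients survive slightly reduced, while the two small ones are set to exactly zero, which is how Lasso ends up selecting features.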

**Implementing Lasso Regression:**

```
from sklearn.linear_model import Lasso

# found this best alpha through cross-validation
best_alpha = 0.00099

regr = Lasso(alpha=best_alpha, max_iter=50000)
regr.fit(train_new, label_df)

# run prediction on the training set to get a rough idea of how well it does
y_pred = regr.predict(train_new)
y_test = label_df
print("Lasso score on training set: ", rmse(y_test, y_pred))
# Output: ('Lasso score on training set: ', 0.10175440647797629)
```

There is a very good chance that this will beat the RMSE we got from XGBoost on the test set. Let’s see:

```
# make prediction on the test set
y_pred_lasso = regr.predict(test_new_one)
lasso_ex = np.exp(y_pred_lasso)
pred1 = pd.DataFrame({'Id': test['Id'], 'SalePrice': lasso_ex})
pred1.to_csv('lasso_model.csv', header=True, index=False)
```

So, we got a score of 0.11859! Lasso outperformed XGBoost (0.12507), as anticipated. Machine Learning with Python never fails to amuse us! Since the data has a large number of features, let us get our hands messy and build a neural network too. Keras, here we come!

**Implementing a Neural Network using Keras:**

```
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.preprocessing import StandardScaler

np.random.seed(10)

# create Model
# define base model
def base_model():
    model = Sequential()
    model.add(Dense(20, input_dim=398, init='normal', activation='relu'))
    model.add(Dense(10, init='normal', activation='relu'))
    model.add(Dense(1, init='normal'))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

seed = 7
np.random.seed(seed)

scale = StandardScaler()
X_train = scale.fit_transform(train_new)
X_test = scale.fit_transform(test_new)

keras_label = label_df.as_matrix()
clf = KerasRegressor(build_fn=base_model, nb_epoch=1000, batch_size=5, verbose=0)
clf.fit(X_train, keras_label)

# make predictions and create the submission file
kpred = clf.predict(X_test)
kpred = np.exp(kpred)
pred_df = pd.DataFrame(kpred, index=test["Id"], columns=["SalePrice"])
pred_df.to_csv('keras1.csv', header=True, index_label='Id')
```

When executed, this model achieves an RMSE of 1.35345, which is far worse than the scores of XGBoost and Lasso. Our older models were clearly performing better.

Looks like we have finally achieved the goal we initially set out with. Let us conclude this Machine Learning with Python blog now.

**Conclusion**

The best way to go about mastering Machine Learning with Python is to begin working on Python projects and using the project-oriented approach on the whole. The number of applications of Python and of Machine Learning with Python is really HUGE. No wonder they both have acquired a learner/user base of millions!

I hope this Machine Learning with Python tutorial blog helped you get an overall picture of working with an actual dataset and training an algorithm to perform a simple task. If you have any questions, put them in the comment section below. I’d be happy to help you out!
