In this Python NumPy tutorial, we will cover one of the most robust and commonly used Python libraries: NumPy. A Python library is a collection of script modules that are accessible to a Python program. It simplifies programming and removes the need to rewrite commonly used commands again and again. So, what is NumPy in Python? NumPy stands for ‘Numerical Python’. It provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, basic linear algebra, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic statistical operations, random simulation, and much more.
Some of the key features of NumPy Python are as follows:
In this Python NumPy tutorial, we will see how to use NumPy Python to analyze data on the Starbucks menu. This data set consists of information related to various beverages available at Starbucks which include attributes like Calories, Total Fat (g), Sodium (mg), Total Carbohydrates (g), Cholesterol (mg), Sugars (g), Protein (g), and Caffeine (mg). Here, we will learn how we can work with NumPy, and we will try to figure out the nutrition facts for the Starbucks menu.
Calories,Total Fat (g),Sodium (mg),Total Carbohydrates (g),Cholesterol (mg),Sugars (g),Protein (g),Caffeine (mg),Nutrition_Value
3,0.1,0,5,0,0,0.3,175,5
70,0.1,5,75,10,9,6,75,5
110,1.5,5,60,21,17,7,85,6
100,0.1,5,70,19,18,6,75,4
5,0,0,5,1,0,0.4,75,5
Here, we have the first few rows of the starbucks.csv file, which we’ll be using throughout this Python NumPy tutorial. The data is in the csv (comma-separated values) format: the values within each record are separated by commas (,), and rows are separated by new lines. There are approximately 1,800 rows, including the header row, and 9 columns in the file. By now, the basic question of what NumPy in Python is should have been answered.
Before we start, here is a quick note on the version—we’ll be using Python Version 3.5. Our code examples will be done using Jupyter Notebook.
Here, we have the list of topics covered in this Python NumPy Tutorial:
Before proceeding to NumPy, one important point: if your data set is in the ssv format rather than the csv format, you can still read it with the csv.reader object by passing the file’s separator as the delimiter keyword argument. For our comma-separated file, the delimiter is “,”. The code below reads the file and splits each line into its individual values:
import csv
with open('Starbucks.csv', 'r') as f:
    starbucks = list(csv.reader(f, delimiter=','))
print(starbucks)
It’s always good to have data in the right format in a table to make it easier to view:
|Calories||Total Fat (g)||Sodium (mg)||Total Carbohydrates (g)||Cholesterol (mg)||Sugars (g)||Protein (g)||Caffeine (mg)||Nutrition_Value|
As we can observe from the table above, we have the first three rows from the entire table. The first row is the header row, and the following rows hold the values of the different attributes of various beverages at Starbucks. The first element of each row is the Calories, the second is the Total Fat, and so on. We can find the average nutrition value as follows:
Nutrition_Value = [float(item[-1]) for item in starbucks[1:]]
sum(Nutrition_Value) / len(Nutrition_Value)
Here, we are able to do the calculation the way we wanted, but the code is a little complex, and it won’t be fun to repeat something similar every time we compute the average Nutrition_Value. Luckily, the Python NumPy library makes it easier to work with our data.
Let’s explore it!
In NumPy, it is very easy to work with multidimensional arrays. Here in this Python NumPy tutorial, we will dive into various types of multidimensional arrays. Currently, we are focusing on 2-dimensional arrays.
A 2-dimensional array is also called a matrix. It is a collection of rows and columns. By specifying a row number and a column number, we can easily extract an element from a matrix.
In the 2-dimensional array below, the first row is the header row, and the first column is the Calories column:
|Calories||Total Fat (g)||Sodium (mg)||Total Carbohydrates (g)||Cholesterol (mg)||Sugars (g)||Protein (g)||Caffeine (mg)||Nutrition_Value|
If we pick the element in the first row and the second column, we get Total Fat (g). If we pick the element in the third row and the second column, we get 0.1.
In a NumPy array, the rank is the number of dimensions, and each dimension is called an axis. So, the first axis is the rows, and the second axis is the columns.
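As a minimal sketch of rank and axes, the snippet below builds a tiny stand-in matrix (the values are taken from the sample rows above; the name matrix is only illustrative) and inspects its number of axes and the length of each axis:

```python
import numpy as np

# Two rows and three columns: a rank-2 (2-dimensional) array.
matrix = np.array([[3, 0.1, 0], [70, 0.1, 5]])

print(matrix.ndim)   # number of axes (the rank)
print(matrix.shape)  # length of each axis: (rows, columns)
```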
These are the basics of matrices. Now, we will see how we can convert our Python list of lists to a NumPy array in Python.
Creating a Python NumPy Array
The numpy.array function is used to create a NumPy array in Python. Here, we just have to pass in a list of lists, and it will automatically generate a NumPy array with the same number of rows and columns. For easy computation, we want all elements in the array to be floats, so we’ll leave off the header row, which contains strings.
This is one of the limitations of NumPy: all elements in an array have to be of the same data type. Here, if we include the header row, then all elements in the array will be read in as strings. So, to do computations the way we want, like finding the average Nutrition_Value, we need the elements to be floats.
In the below code:
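import numpy as np
# The np.float alias has been removed from recent NumPy releases; the built-in float works everywhere.
starbucks = np.array(starbucks[1:], dtype=float)
starbucks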
Now, to check the number of rows and columns in our data, we will use the shape property of NumPy arrays.
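A small sketch of the shape property, assuming a stand-in array built from the first three data rows shown earlier rather than the full file (the name sample is hypothetical):

```python
import numpy as np

# First data rows of the sample shown earlier (header row omitted).
sample = np.array([
    [3, 0.1, 0, 5, 0, 0, 0.3, 175, 5],
    [70, 0.1, 5, 75, 10, 9, 6, 75, 5],
    [110, 1.5, 5, 60, 21, 17, 7, 85, 6],
], dtype=float)

# shape is a tuple: (number of rows, number of columns)
rows, columns = sample.shape
print(rows, columns)
```

On the full data set, the same property would report the total number of rows and the nine columns.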
Alternative Python NumPy Array Creation Methods
Are there other methods to create a NumPy array? Yes, we can use a variety of methods to create NumPy arrays. First, we will look at the creation of an array where every element is zero. The code below creates an array with four rows and three columns, where every element is 0. Here, we will be using numpy.zeros:
import numpy as np
empty_array = np.zeros((4,3))
empty_array
An array with all zero elements is useful when we want an array of a fixed size but don’t have any values for it yet.
Similarly, we can create an array of all ones.
import numpy as np
All_One_array = np.ones((4,3))
All_One_array
We can also create an array of random numbers using numpy.random.rand. Here’s an example:
import numpy as np
Random_array = np.random.rand(4,3)
Random_array
Creating an array which is completely filled with random numbers can be useful at a time when we want to quickly test our code with sample arrays.
Here in NumPy, we can directly read csv or other files into an array. This can be done using the numpy.genfromtxt function. We will use this Python function on our initial data on Starbucks.
Here is the code:
starbucks = np.genfromtxt("Starbucks.csv", delimiter=",", skip_header=1)
The resulting starbucks array looks the same as the one we get by reading the file into a list and then converting it to an array of floats. NumPy automatically picks a data type for the elements in the array based on their format.
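To try numpy.genfromtxt without the Starbucks file at hand, the sketch below feeds it a few lines of CSV text through io.StringIO. The in-memory text and the names csv_text and sample are assumptions made for illustration; only three of the nine columns are included.

```python
import io
import numpy as np

# A tiny in-memory stand-in for the CSV file: one header row, three data rows.
csv_text = """Calories,Total Fat (g),Sodium (mg)
3,0.1,0
70,0.1,5
110,1.5,5
"""

# skip_header=1 drops the header row so that only numeric rows are parsed.
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(sample.dtype)  # data type picked by NumPy
print(sample.shape)  # (rows, columns)
```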
So, how can we do indexing and slicing on the NumPy arrays we have created to retrieve results from them? In NumPy, the index of the first row and of the first column is 0. So, if we want to select the fifth column, its index will be 4, and if we want to select the third row, its index will be 2, and so on.
Let’s say we want to select the element at row 7 and column 5. Here, we will pass index 6 as the row index and index 4 as the column index:
starbucks = np.genfromtxt("Starbucks.csv", delimiter=",", skip_header=1)
starbucks[6,4]
Thus, we have used a row index and a column index to select a single element.
Suppose we want to select the first five elements from the second column. We can do this by using a colon (:). A colon in slicing indicates that we want to select all elements from the starting index up to, but not including, the ending index.
starbucks = np.genfromtxt("Starbucks.csv", delimiter=",", skip_header=1)
starbucks[0:5,1]
And suppose we want to select the entire column. By using just the colon (:), with no starting or ending indices, we get the desired result.
starbucks = np.genfromtxt("Starbucks.csv", delimiter=",", skip_header=1)
starbucks[:,1]
And suppose we want to select the entire array. We can use two colons to select all the rows and all the columns, although this is rarely needed in practice, since the array name alone refers to the same data.
starbucks = np.genfromtxt("Starbucks.csv", delimiter=",", skip_header=1)
starbucks[:,:]
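The indexing and slicing forms above can be tried on a small stand-in array. This is a sketch under the assumption that four rows and three columns are enough to show the pattern; the variable names are illustrative only.

```python
import numpy as np

sample = np.array([
    [3, 0.1, 0],
    [70, 0.1, 5],
    [110, 1.5, 5],
    [100, 0.1, 5],
])

element = sample[2, 1]       # third row, second column
first_two = sample[0:2, 1]   # rows 0 and 1 of the second column (index 2 excluded)
whole_column = sample[:, 1]  # every row of the second column
whole_row = sample[2, :]     # every column of the third row

print(element, first_two, whole_column, whole_row)
```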
Now, how can we assign values to certain elements in arrays?
We can do that by directly assigning the value to a particular element. Here is an example:
starbucks = np.genfromtxt("Starbucks.csv", delimiter=",", skip_header=1)
starbucks[1,5] = 10
starbucks[1,5]
We can even overwrite an entire column in the same way. The code below overwrites the entire sixth column with 10:
starbucks = np.genfromtxt("Starbucks.csv", delimiter=",", skip_header=1)
starbucks[:,5] = 10
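A minimal sketch of both kinds of assignment, using a hypothetical array of zeros so that the changed values are easy to spot:

```python
import numpy as np

sample = np.zeros((3, 4))

sample[1, 2] = 10   # overwrite a single element
sample[:, 3] = 7    # overwrite the entire fourth column

print(sample)
```

Assigning a single value to a slice fills every selected element with that value.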
So far, we have worked with the starbucks array, which is a 2-dimensional array. However, NumPy lets us work with arrays of other dimensions as well. The most common of these is the 1-dimensional array. Previously, when we sliced the Starbucks data, we created 1-dimensional arrays. A 1-dimensional array needs only a single index to retrieve an element from it. Interestingly, each row and each column in a 2-dimensional array is treated as a 1-dimensional array. Just as a list of lists is analogous to a 2-dimensional array, a single list is analogous to a 1-dimensional array. Suppose we slice the Starbucks data and retrieve only the fifth row; as output, we will receive a 1-dimensional array.
starbucks = np.genfromtxt("Starbucks.csv", delimiter=",", skip_header=1)
# Index 4 is the fifth row, because indexing starts at 0.
Fifth_starbucks = starbucks[4,:]
Fifth_starbucks
And suppose if we want to retrieve an individual element from Fifth_starbucks we can do that by using a single index.
Most of the NumPy functions that we used with multidimensional arrays, such as numpy.random.rand, can be used with 1-dimensional arrays as well. For example, to generate a random vector, we just have to pass a single parameter.
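As a brief sketch, assuming a small stand-in array in place of the Starbucks data, slicing out one row gives a 1-dimensional array that takes a single index, and numpy.random.rand with one argument gives a random vector:

```python
import numpy as np

sample = np.array([[3, 0.1, 0], [70, 0.1, 5], [110, 1.5, 5]])

third_row = sample[2, :]   # slicing one row yields a 1-dimensional array
print(third_row.ndim)
print(third_row[0])        # a single index is enough

random_vector = np.random.rand(3)  # one parameter: a 1-dimensional random array
print(random_vector.shape)
```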
In most applications, we deal with 1-, 2-, and 3-dimensional arrays, though we may occasionally come across arrays with more than three dimensions. A 3-dimensional array can be imagined as a list of lists of lists.
For a better understanding of this, let’s take an example of the monthly earning of a supermarket. The month-wise data will be in the form of a list, and if we want a quick look on it we can see the data in quarter-wise and year-wise formats.
A monthly earning of a supermarket will look something like this:
[400, 250, 300, 470, 560, 630, 820, 740, 605, 340,420,340]
Here, the supermarket has earned $400 in January, $250 in February, and so on. Now, if we split this earning quarter-wise, then it will be a list of lists:
One_Year = [
    [400, 250, 300],
    [470, 560, 630],
    [820, 740, 605],
    [340, 420, 340]
]
Here, we can retrieve the earnings for the month of January by calling One_Year[0][0], and if we want the result for a complete quarter, then we can call One_Year[0] or One_Year[1]. So, this is a two-dimensional array.
But, if we add the earning of another year, then it will become the third dimension:
Yearly_Earning = [
    [
        [400, 250, 300],
        [470, 560, 630],
        [820, 740, 605],
        [340, 420, 340]
    ],
    [
        [500, 350, 430],
        [430, 760, 640],
        [720, 530, 800],
        [345, 900, 700]
    ]
]
Here, we can retrieve the earnings for the month of January in the first year by calling Yearly_Earning[0][0][0].
So, we need three indexes to retrieve a single element. We have the same case in a 3-dimensional array in NumPy; in fact, we can convert this Yearly_Earning to an array, and then we can get the earnings for the month of January of the first year. It will be as follows:
Yearly_Earning = np.array(Yearly_Earning)
Yearly_Earning[0,0,0]
We will get the result as follows:
400
Now, to know the shape of the array, we will use:
Yearly_Earning.shape
The result will be as follows:
(2, 4, 3)
In 3-dimensional arrays also, indexing and slicing work exactly the same way as they work in 2-dimensional arrays, but here we have to pass in one extra axis.
Suppose we need the earnings for January of all years; then it would be:
Yearly_Earning[:,0,0]
The result will be:
array([400, 500])
Suppose we need the first-quarter earnings from both years; then:
Yearly_Earning[:,0,:]
The result will be:
array([[400, 250, 300], [500, 350, 430]])
By adding more dimensions, we can make it much easier for us to query our data as it will be organized in a certain way.
If we go from 3-dimensional arrays to arrays with four or more dimensions, the same properties apply, and they are indexed and sliced in the same ways.
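To illustrate a fourth dimension, the sketch below invents a second store whose earnings are simply the first store's figures plus 100. That second store is a purely hypothetical assumption, not part of the original data. The two stores are stacked along a new leading axis with numpy.stack.

```python
import numpy as np

yearly_earning = np.array([
    [[400, 250, 300], [470, 560, 630], [820, 740, 605], [340, 420, 340]],
    [[500, 350, 430], [430, 760, 640], [720, 530, 800], [345, 900, 700]],
])

# Hypothetical second store: the same figures plus 100.
second_store = yearly_earning + 100

# Axes are now: store, year, quarter, month.
all_stores = np.stack([yearly_earning, second_store])
print(all_stores.shape)

# January earnings for every store and every year: one index per fixed axis.
january = all_stores[:, :, 0, 0]
print(january)
```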
As we discussed earlier in this Python NumPy tutorial, all elements of a NumPy array are stored with a single data type. In our Starbucks example, all elements are float values. NumPy stores values using its own data types, which are different from Python data types like float and str. The reason is that the core of NumPy is written in the C programming language, which stores data differently from Python. NumPy itself maps data types between Python and C and allows us to use NumPy arrays without any conversion hitches.
We can find the data type of a NumPy array by accessing its dtype property:
starbucks.dtype
NumPy also has data types with a numeric suffix that indicates the number of bits of memory each element takes up. For example, int32 is a 32-bit integer data type, and float64 is a 64-bit float data type.
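A short sketch of how the suffix relates to memory use, with two hypothetical arrays; itemsize reports the number of bytes per element:

```python
import numpy as np

floats = np.array([3, 0.1, 0])                 # floats default to float64
ints32 = np.array([1, 2, 3], dtype=np.int32)   # explicitly request 32-bit integers

print(floats.dtype.name, floats.itemsize)  # 64 bits = 8 bytes per element
print(ints32.dtype.name, ints32.itemsize)  # 32 bits = 4 bytes per element
```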
Converting Python NumPy Data Types
To convert an array to a different type, we can use the numpy.ndarray.astype method. This method makes a copy of the array and returns a new array with the specified data type. For example, if we want to convert the Starbucks data to the int data type, we can do this:
starbucks.astype(int)
In the output, we can observe that all elements in the resulting array are integers. To check the name property of the dtype of the resulting array, we will use the following code:
integer_starbucks = starbucks.astype(int)
integer_starbucks.dtype.name
Here, the array has been converted to a fixed-width integer data type. The exact width depends on the platform: the name is reported as int32 where the default integer is 32 bits wide and as int64 where it is 64 bits wide.
If we want more control over the way the array is stored in memory, for example to allow for very large integer values, then we can directly use NumPy dtype objects like numpy.int64:
np.int64
Now, we can directly use these to convert between data types:
starbucks.astype(np.int64)
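The following sketch, using a hypothetical values array, shows two properties of astype that are easy to overlook: the original array is left untouched, and converting floats to integers drops the fractional part rather than rounding.

```python
import numpy as np

values = np.array([0.3, 1.5, 6.9])

as_int64 = values.astype(np.int64)  # returns a new array

print(as_int64)       # fractional parts are dropped, not rounded
print(values.dtype)   # the original array keeps its float data type
```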
It is easy to perform mathematical operations on arrays using NumPy, as the following topics of this Python NumPy tutorial show.
We can use the +, -, *, and / operators, or the add(), subtract(), multiply(), and divide() functions, to perform addition, subtraction, multiplication, and division, respectively. By using the sqrt() function, we can find the square root of each element in a NumPy array.
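A minimal sketch of the operator and function forms side by side, using a small hypothetical array of perfect squares:

```python
import numpy as np

values = np.array([4.0, 9.0, 16.0])

added = np.add(values, 1)           # same result as values + 1
reduced = np.subtract(values, 1)    # same result as values - 1
doubled = np.multiply(values, 2)    # same result as values * 2
halved = np.divide(values, 2)       # same result as values / 2
roots = np.sqrt(values)             # element-wise square root

print(added, reduced, doubled, halved, roots)
```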
Let’s say after the quality check, we want to add 10 to the Nutrition_Value of Starbucks beverages. Here, we will use the following code:
starbucks[:,8] + 10
Note that the above operation does not change the starbucks array. Instead, a new 1-dimensional array is returned in which 10 has been added to each element of the Nutrition_Value column of the Starbucks data.
If we use ‘+=’ instead of ‘+’, then the array is modified in place:
starbucks[:,8] += 10
starbucks[:,8]
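The difference between ‘+’ and ‘+=’ can be checked on a small stand-in array. In this sketch the second column plays the role of Nutrition_Value, and all names are illustrative:

```python
import numpy as np

# Stand-in data: the second column plays the role of Nutrition_Value.
sample = np.array([[3.0, 5.0], [70.0, 5.0], [110.0, 6.0]])

bumped = sample[:, 1] + 10         # builds a new array
after_plus = sample[:, 1].copy()   # sample still holds the original values

sample[:, 1] += 10                 # modifies sample in place
after_inplace = sample[:, 1].copy()

print(bumped, after_plus, after_inplace)
```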
In the same way, we can perform other operations. Suppose, we want to multiply each Nutrition_Value by 2. It can be done in this way:
starbucks[:,8] * 2
Multiple Array Math
We can perform mathematical operations between multiple arrays. These operations will be applied to pairs of elements. Suppose, if we add the Nutrition_Value column to itself, here’s what we will get:
starbucks[:,8] + starbucks[:,8]
The output is equivalent to starbucks[:,8] * 2. This is because NumPy adds each pair of elements: the first element of the first array is added to the first element of the second array, the second element to the second element, and so on.
We can also use this to multiply arrays. Let’s say we want to pick a beverage that is high in fat, protein, and nutrition value. We can multiply Total Fat, Protein, and Nutrition_Value, and then select the beverage with the highest score.
starbucks[:,1] * starbucks[:,6] * starbucks[:,8]
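We can use all the common operators, such as /, *, -, +, and ** (power), between arrays. Note that ^ is the bitwise XOR operator in Python and NumPy, not exponentiation, and it is not defined for float arrays.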
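A brief sketch of element-wise operators between arrays, with hypothetical fat and protein columns. It also shows that, in NumPy as in Python, ** raises to a power while ^ is bitwise XOR on integers:

```python
import numpy as np

fat = np.array([0.1, 1.5, 0.1])
protein = np.array([6.0, 7.0, 6.0])

product = fat * protein   # pairs of elements are multiplied
squared = protein ** 2    # ** is exponentiation

ints = np.array([1, 2, 3])
xor = ints ^ 2            # ^ is bitwise XOR, defined for integer arrays only

print(product, squared, xor)
```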
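Until now, we have performed operations on arrays of exactly the same size, where the operations are applied to corresponding elements. But what if the dimensions of the two arrays differ? NumPy can still operate on two arrays of different shapes by using broadcasting, which tries to match up the elements using certain rules. Broadcasting compares the shapes of the two arrays starting from the last (trailing) dimension: if the two lengths are equal, or if one of them is 1, the comparison moves on to the next dimension to the left; otherwise, an error is raised. The comparison continues until the array with fewer dimensions runs out of dimensions.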
For example, we can compare the following two array shapes:
X: (60,5) Y: (5)
The comparison is possible here because both array X and array Y have a trailing dimension of length 5. After that, array Y has no more dimensions left to compare, so the two arrays are compatible. For broadcasting, array Y is stretched across each row of array X so that it behaves like an array with the same shape as X, and mathematical operations can then be applied element by element.
Let’s take another example which is also compatible:
X: (1,4) Y: (10,4)
Here, the last dimension of both arrays matches, and array X’s first dimension is of length 1.
Now, for better understanding, we will look at another two arrays that don’t match:
X: (52,52) Y: (55,55)
Here, in this example, the lengths of the dimensions are not equal, and neither array has a dimension of length 1.
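The three shape comparisons above can be checked directly. This sketch uses arrays of zeros and ones with the shapes discussed in the text; the contents do not matter, only the shapes.

```python
import numpy as np

# (60, 5) with (5,): trailing dimensions match, so y is reused for every row of x.
x = np.zeros((60, 5))
y = np.ones(5)
print((x + y).shape)

# (1, 4) with (10, 4): a dimension of length 1 is stretched to length 10.
a = np.zeros((1, 4))
b = np.ones((10, 4))
print((a + b).shape)

# (52, 52) with (55, 55): lengths differ and neither is 1, so NumPy raises an error.
failed = False
try:
    np.zeros((52, 52)) + np.zeros((55, 55))
except ValueError as error:
    failed = True
    print("incompatible shapes:", error)
```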
Let’s illustrate the principle of broadcasting with the help of our Starbucks dataset:
starbucks * np.array([1,2])
The error statement ‘ValueError: operands could not be broadcast together with shapes (1888,9) (2,)’ appears as the two arrays don’t have a matching trailing dimension.
Here’s an example where the last dimension is matching:
X_array = np.array([[3,4], [5,7]])
Y_array = np.array([5,2])
X_array + Y_array
Elements of Y_array are broadcast over each row of X_array, so the first column of X_array has the first value in Y_array added to it, the second column has the second value added to it, and so on.
NumPy provides many methods beyond arithmetic operators for performing more complex calculations on arrays. One of the most commonly used NumPy array methods is the numpy.ndarray.sum method. By default, this method finds the sum of all elements in an array:
starbucks[:,8].sum()
The total of all values in the Nutrition_Value column is 271.5.
Here, as a keyword argument for the sum method, we can also pass the axis to find sums over an axis.
Suppose we call sum on the starbucks matrix and pass in axis=0; we then find the sums over the first axis of the array, which gives the sum of all values in every column:
starbucks.sum(axis=0)
Another way to think about this is that the specified axis is the one which is ‘going away’. With ‘axis=0’, the rows go away, and we are left with one sum per column.
To verify that the sums are laid out as expected, we can check the shape of the result. For our dataset, the shape is (9,), corresponding to the number of columns:
starbucks.sum(axis=0).shape
If we provide ‘axis=1’ instead, then sum finds the sums over the second axis of the array, which gives one sum per row:
starbucks.sum(axis=1)
Other than the sum, in NumPy, we have several other methods which work like the sum method, including:
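As an illustration (not necessarily the tutorial's original list), mean, std, min, and max are standard ndarray methods that accept the same axis argument as sum. The sketch below applies them to a small hypothetical array:

```python
import numpy as np

sample = np.array([[3, 0, 5], [70, 5, 5], [110, 5, 6]])

total = sample.sum()               # sum of every element
column_sums = sample.sum(axis=0)   # rows "go away": one value per column
row_sums = sample.sum(axis=1)      # columns "go away": one value per row

column_means = sample.mean(axis=0)
column_std = sample.std(axis=0)
column_min = sample.min(axis=0)
column_max = sample.max(axis=0)

print(total, column_sums, row_sums)
print(column_means, column_std, column_min, column_max)
```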
In NumPy, it is possible to test and check whether the rows match with certain values by using mathematical comparison operations like <, >, >=, <=, and ==.
Suppose in our Starbucks data, we want to check which beverage has a Nutrition_Value greater than 5, we can do this:
starbucks[:,8] > 5
As a result, we will receive a Boolean array which tells us which of the beverages has a Nutrition_Value greater than 5. We can perform similar things with the other operators. For instance, we can see if there is any beverage which has a Nutrition_Value equal to 10:
starbucks[:,8] == 10
With a Boolean array and a NumPy array, one of the powerful things we can do is select only certain rows or columns as per our requirement. For example, we will select only those rows from the Starbucks data where the Nutrition_Value of beverages is greater than 5.
Highly_Nutrition = starbucks[:,8] > 5
starbucks[Highly_Nutrition,:][:3,:]
Here, we have selected the first three rows for which Highly_Nutrition contains the value True, along with all of their columns.
So, subsetting makes it easier to filter arrays with certain criteria.
Another example: We want beverages with a lot of Protein and Highly_Nutrition. In order to specify multiple conditions, we will place each condition in parentheses, and we will separate conditions with an ampersand (&):
Highly_Nutrition_and_Max_Protein = (starbucks[:,8] > 5) & (starbucks[:,6] > 6)
starbucks[Highly_Nutrition_and_Max_Protein,2:]
Here, we can even combine subsetting and assignment to overwrite certain values in an array:
Highly_Nutrition_and_Max_Protein = (starbucks[:,8] > 5) & (starbucks[:,6] > 6)
starbucks[Highly_Nutrition_and_Max_Protein,2:] = 5
starbucks[Highly_Nutrition_and_Max_Protein,2:]
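The same pattern can be followed step by step on a small stand-in array. In this sketch the three columns are assumed to be calories, protein, and nutrition value, and the thresholds are chosen only so that the masks differ:

```python
import numpy as np

# Hypothetical columns: calories, protein, nutrition value.
sample = np.array([
    [3, 0.3, 5],
    [70, 6, 5],
    [110, 7, 6],
    [100, 6, 4],
], dtype=float)

high_nutrition = sample[:, 2] > 4      # one Boolean per row
high_protein = sample[:, 1] > 5
both = high_nutrition & high_protein   # True only where both conditions hold

selected = sample[both, :]             # keep only the matching rows
sample[both, 0] = 0                    # overwrite calories in the matching rows

print(both)
print(selected)
print(sample[:, 0])
```

Each condition sits in its own parentheses because & binds more tightly than the comparison operators.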
In NumPy, it is very easy to change the shape of arrays while preserving all their elements. Several functions are available for this, and they often make it easier to access array elements.
One of the simplest ways of reshaping an array is to flip its axes, where columns become rows and vice versa. We can perform this operation with the numpy.transpose function:
np.transpose(starbucks).shape
Another important function is the numpy.ravel function. This function turns an array into a 1-dimensional representation, a single long sequence of values:
starbucks.ravel()
Here, we have another example which will help us better understand and see the ordering of numpy.ravel:
Example_Array_One = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
Example_Array_One.ravel()
And finally, we are going to use the numpy.reshape function. This function reshapes an array to a shape we specify. In the example below, we turn the third row of the Starbucks data, which has nine values, into a 2-dimensional array with three rows and three columns:
starbucks[2,:].reshape((3,3))
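A compact sketch of all three reshaping functions on hypothetical arrays; np.arange(9) stands in for a row of nine values:

```python
import numpy as np

grid = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

flipped = np.transpose(grid)  # shape (2, 4) becomes (4, 2)
flat = grid.ravel()           # row by row into one long sequence

row = np.arange(9)            # nine values: 0 through 8
square = row.reshape((3, 3))  # total number of elements must stay the same

print(flipped.shape, flat, square)
```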
With NumPy, we can easily combine multiple arrays into a single unified array. To perform this task, we can use numpy.vstack which will vertically stack multiple arrays. In this way, the second array’s items are added as new rows to the first array.
Let’s take an example where we want to combine the old Nutritional dataset of Starbucks beverages with our existing dataset, which contains information on the current Nutritional value of Starbucks beverages.
In the below code, we:
import csv
with open('starbucks_old_data.csv', 'r') as f:
    starbucks_old = list(csv.reader(f, delimiter=','))
import numpy as np
# The np.float alias has been removed from recent NumPy releases; the built-in float works everywhere.
starbucks_old = np.array(starbucks_old[1:], dtype=float)
starbucks_old.shape
Here we can see that we have attributes for 196 beverages in the starbucks_old data. Now, we can combine all the beverage data.
Now, we will use the vstack function to combine the Starbucks data and the starbucks_old data, and then we will display the shape of the result:
All_beverages = np.vstack((starbucks, starbucks_old))
All_beverages.shape
Here we can observe, the result has 2,084 rows, which is the sum of the number of rows in the Starbucks data and in the starbucks_old data.
Similarly, we can combine arrays horizontally, which means that our number of rows will stay constant, but the columns will be joined. For this purpose, we can use the numpy.hstack function.
Another useful function is numpy.concatenate. It is a general-purpose version of hstack and vstack. To concatenate two arrays with this function, we pass them to concatenate and use the axis keyword argument to specify the axis we want to concatenate along. Concatenating along the first axis is similar to vstack, and concatenating along the second axis is similar to hstack:
np.concatenate((starbucks, starbucks_old), axis=0)
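The relationship between the three functions can be verified on small hypothetical arrays (the names current, old, and extra_column are illustrative only):

```python
import numpy as np

current = np.array([[3, 0.1], [70, 0.1]])
old = np.array([[110, 1.5]])
extra_column = np.array([[5], [5]])

stacked = np.vstack((current, old))           # adds rows: shape (3, 2)
widened = np.hstack((current, extra_column))  # adds columns: shape (2, 3)

# concatenate with an explicit axis reproduces both results.
same_as_vstack = np.concatenate((current, old), axis=0)
same_as_hstack = np.concatenate((current, extra_column), axis=1)

print(stacked.shape, widened.shape)
```

For vertical stacking, the arrays must have the same number of columns; for horizontal stacking, the same number of rows.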
This brings us to the end of this Python NumPy tutorial. In this tutorial, we learned in detail about the Python NumPy library with the help of a real-world dataset. We also explored how to perform various operations with the NumPy library, which is commonly used in many Data Science applications. Now, if you are interested in knowing why Python is the most preferred language for data science, you can go through this blog on Python for Data Science.
While in this Python tutorial, we have covered quite a bit of NumPy’s core functionalities, there is still a lot more to know about it. Try out Intellipaat courses like Python for Data Science which covers various techniques of how Python is deployed for Data Science, working with various libraries for Data Science, doing data munging, data cleaning, advanced numeric analysis, and much more in depth than what we were able to cover here.
Practice the examples that have been explained in this Python NumPy tutorial. To become a Data Scientist and a successful and productive team member in the workplace, the Python NumPy library is definitely one of the most important tools to learn about and practice. I hope this Python NumPy tutorial helped you. Head over to the next module in this Python tutorial.