Changing the data type of DataFrame columns is one of the fundamental steps in data preprocessing with Pandas. Whether you are converting numeric strings, parsing dates, or trying to reduce memory usage, choosing an appropriate data type helps keep your analysis correct and your computations efficient. This article discusses several ways to change data types in Pandas.
What is a Data Type in Pandas?
A data type (dtype) in Pandas specifies the kind of data contained in a column, e.g., integers, floats, strings, or dates. An appropriate choice of data type improves memory and processing efficiency. For instance, an int32 value takes 4 bytes and an int64 value takes 8 bytes, so every value in such a column occupies the same fixed amount of memory. Strings, by contrast, are traditionally stored with the object data type, where each value is a separate Python object whose size depends on its content (often on the order of 50-100 bytes for short strings, including object overhead) rather than a fixed number of bytes per value.
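As a rough illustration of the sizes mentioned above, the following sketch stores the same 1,000 numbers under three data types and measures each with Series.memory_usage(). The data and variable names are made up for this example, and the exact size of the object column depends on the Python version and platform.

```python
import pandas as pd

# Hypothetical data: the same 1,000 numbers stored under three dtypes.
numbers = list(range(1000))
s_int64 = pd.Series(numbers, dtype="int64")
s_int32 = pd.Series(numbers, dtype="int32")
s_object = pd.Series([str(n) for n in numbers], dtype="object")

# index=False leaves out the index; deep=True also counts the Python
# string objects that an object column only points to.
bytes_int64 = s_int64.memory_usage(index=False, deep=True)
bytes_int32 = s_int32.memory_usage(index=False, deep=True)
bytes_object = s_object.memory_usage(index=False, deep=True)

print(bytes_int64)   # 8 bytes per value
print(bytes_int32)   # 4 bytes per value
print(bytes_object)  # varies by platform, but far larger than either
```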
Pandas allows you to change the data type of columns in two ways:
- Change the data types of all the columns together.
- Change the data type of a single column separately.
Methods to Change the Data Type of a Single Column in Pandas
There are various ways to change the data type of a single column in a DataFrame. You can use the .astype() method that Pandas provides to convert a column to any specific data type. The pd.to_numeric() function converts a column to a numeric type. Finally, pd.to_datetime() converts a column to a datetime type.
Method 1: Using .astype() function in Pandas
The .astype() method converts a column to a data type that you name explicitly. It is effective and easy to use when you are certain the conversion will succeed, e.g., from numeric strings to int or float. If the column contains a value that cannot be converted (e.g., a non-numeric string in a column you are converting to a number), it raises an error.
When to use: Use .astype() when you need to force a specific data type for a column and know that the data is uniform.
Example:
Output:
Explanation: Here, the data type of col1 was successfully converted to an integer data type from an object data type.
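The original listing is not reproduced here, so the following is a minimal sketch of the kind of conversion described, assuming a made-up DataFrame whose col1 holds numeric strings. The second part shows the failure case mentioned earlier, where a non-numeric string makes .astype() raise.

```python
import pandas as pd

# Hypothetical data: numbers stored as strings in an object column.
df = pd.DataFrame({"col1": ["1", "2", "3"]}, dtype="object")
print(df["col1"].dtype)  # object

# Convert the column to a 64-bit integer type.
df["col1"] = df["col1"].astype("int64")
print(df["col1"].dtype)  # int64

# A value that cannot be parsed makes .astype() raise instead of converting.
bad = pd.Series(["1", "two", "3"], dtype="object")
failed = False
try:
    bad.astype("int64")
except ValueError as exc:
    failed = True
    print("conversion failed:", exc)
```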
Method 2: Using pd.to_numeric() method in Pandas
The pd.to_numeric() function is a more robust conversion tool because its errors parameter lets you control what happens when a value cannot be converted. It is used most often when a column holds a mixed collection of values, where some can be converted to a number ('10') and others are invalid ('invalid').
When to use: This method is ideal when working with datasets that may have noise or errors in numeric columns.
Example:
Output:
Explanation: Here, the data type of col1 was converted to int64, the integer type that pd.to_numeric() returns by default (columns with decimal or missing values become float64).
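A minimal sketch of this behavior, using made-up values since the original listing is not available: whole-number strings come back as int64, while a column that contains a decimal comes back as float64.

```python
import pandas as pd

# Hypothetical data: whole numbers stored as strings.
df = pd.DataFrame({"col1": ["10", "20", "30"]}, dtype="object")
df["col1"] = pd.to_numeric(df["col1"])
print(df["col1"].dtype)  # int64

# If any value has a fractional part, the result is float64 instead.
with_decimal = pd.to_numeric(pd.Series(["1.5", "2", "3"], dtype="object"))
print(with_decimal.dtype)  # float64
```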
Method 3: Using pd.to_datetime() in Pandas
The pd.to_datetime() function converts string or numeric columns into datetime values. It is widely used with time-series data, e.g., logs, purchase history, or event timestamps. pd.to_datetime() is flexible and supports multiple date formats. By default it raises an error on non-date or invalid strings; if you pass errors='coerce', it converts them into NaT (Not a Time) instead.
When to use: Use this approach when you have to use the column to execute date-based operations, such as filtering by date intervals or aggregating by time intervals.
Example:
Output:
Explanation: Here, the data type of the date column was successfully changed into a datetime type.
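The sketch below illustrates the conversion with made-up dates (the column name date and the values are assumptions for this example). It parses the strings, runs a date-based filter, and then shows how errors="coerce" turns an unparseable string into NaT.

```python
import pandas as pd

# Hypothetical data: dates stored as ISO-formatted strings.
df = pd.DataFrame({"date": ["2024-01-15", "2024-02-20", "2024-03-05"]})
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
print(df["date"].dtype)  # a datetime64 dtype

# Date-based operations now work on the column.
recent = df[df["date"] >= pd.Timestamp("2024-02-01")]
months = df["date"].dt.month.tolist()
print(len(recent), months)  # 2 [1, 2, 3]

# errors="coerce" replaces values that cannot be parsed with NaT.
raw = pd.Series(["2024-01-15", "not a date"])
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
print(parsed.isna().tolist())  # [False, True]
```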
Method to Change the Data Type of Multiple Columns in Pandas
The convert_dtypes() method inspects every column in the DataFrame and converts it to the best matching data type that supports missing values (pd.NA), e.g., integer-like object columns become the nullable Int64 type and text columns become the string type. It does not create categorical columns, and it does not pick smaller numeric types to save memory; for those goals, use .astype() or downcasting. Use this approach when you need a quick, implicit conversion without explicitly specifying a type for every column. When you do want to set explicit types for several columns at once, pass a dictionary of column names and types to .astype().
Example:
Output:
Explanation: Here, instead of changing the data type of each column one by one, the function did it all at once.
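As a small sketch of this all-at-once conversion, the following uses a made-up DataFrame (column names a to d are assumptions for this example) and prints the types that convert_dtypes() chooses.

```python
import pandas as pd

# Hypothetical data: integers and text held in object columns,
# plus a boolean column and a float column with a missing value.
df = pd.DataFrame(
    {
        "a": pd.Series([1, 2, 3], dtype="object"),
        "b": pd.Series(["x", "y", "z"], dtype="object"),
        "c": [True, False, True],
        "d": [1.5, 2.5, None],
    }
)

# One call converts every column to a nullable extension type.
converted = df.convert_dtypes()
print(converted.dtypes)
# a -> Int64, b -> a string dtype, c -> boolean, d -> Float64
```

The original DataFrame is left unchanged; convert_dtypes() returns a new one.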
Efficient Data Type Conversion in Pandas
So far, we have looked at various methods to change the data type of columns in Pandas. One of these, pd.to_numeric(), accepts extra parameters that make the conversion more memory-efficient and more tolerant of bad values. These parameters are errors and downcast. Let us explore both in detail.
Error Handling with pd.to_numeric() in Pandas
The column might have some values that cannot be converted to numbers, for example, string data like 'intellipaat'. If we use pd.to_numeric() to convert such a column, it raises an error by default. To prevent this, pd.to_numeric() takes an errors argument that lets you either turn non-numeric values into NaN or leave the input unconverted.
The errors parameter can take the following values:
- errors='ignore' returns the input unchanged if any value cannot be converted. This option is deprecated since Pandas 2.2; catching the exception explicitly is the recommended replacement.
- errors='raise' (default) raises an error if the conversion fails.
- errors='coerce' replaces invalid values with NaN.
Example:
Output:
Explanation: Here, the code demonstrates what happens when we pass different values, 'coerce' and 'ignore', to the errors argument.
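The following sketch, built on made-up values, contrasts errors="coerce" with the default errors="raise". It leaves out errors="ignore" because that option is deprecated in recent Pandas releases; wrapping the call in try/except gives the same "keep the original on failure" behavior.

```python
import pandas as pd

# Hypothetical data: two numeric strings and one invalid value.
s = pd.Series(["1", "2", "intellipaat"], dtype="object")

# errors="coerce": the invalid value becomes NaN, so the result is float64.
coerced = pd.to_numeric(s, errors="coerce")
print(coerced.dtype)            # float64
print(coerced.isna().tolist())  # [False, False, True]

# errors="raise" (the default): the conversion fails with a ValueError.
try:
    result = pd.to_numeric(s, errors="raise")
    raised = False
except ValueError as exc:
    result = s  # keep the original values, as errors="ignore" used to do
    raised = True
    print("conversion failed:", exc)
```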
Downcasting in Pandas
Downcasting means converting a numeric column to a smaller type (such as int64 to int8) to conserve memory. By default, pd.to_numeric() returns int64 or float64. If memory usage is critical, the downcast parameter ('integer', 'signed', 'unsigned', or 'float') asks Pandas to use the smallest type of that kind that can hold all the values in the column. This is especially helpful with large datasets whose values fit comfortably in a smaller range.
Example:
Output:
Explanation: Here, we downcast the data type of the column. We only had to store integers from 1 to 4, so using the int64 data type was unnecessary and wasted memory.
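A minimal sketch of downcasting, assuming a made-up column that holds the integers 1 to 4 as strings. It compares the default result with the downcast results and their memory footprints.

```python
import pandas as pd

# Hypothetical data: the integers 1 to 4 stored as strings.
s = pd.Series(["1", "2", "3", "4"], dtype="object")

default = pd.to_numeric(s)                        # int64
small = pd.to_numeric(s, downcast="integer")      # smallest signed integer
unsigned = pd.to_numeric(s, downcast="unsigned")  # smallest unsigned integer
print(default.dtype, small.dtype, unsigned.dtype)  # int64 int8 uint8

# Memory used by the values alone (index excluded).
print(default.memory_usage(index=False))  # 32 bytes: 4 values x 8 bytes
print(small.memory_usage(index=False))    # 4 bytes: 4 values x 1 byte

# Floats can be downcast too; float32 is the smallest type Pandas will use.
floats = pd.to_numeric(pd.Series([1.5, 2.5]), downcast="float")
print(floats.dtype)  # float32
```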
Conclusion
Changing the column type in Pandas is a crucial skill for effective data preprocessing in Python. Whether you need to transform a single column or multiple columns, methods like .astype(), pd.to_numeric(), pd.to_datetime(), and convert_dtypes() usually do it in a single line of code. pd.to_numeric() and pd.to_datetime() also offer error handling through the errors parameter, and pd.to_numeric() offers memory optimization through the downcast parameter. Choosing the right method for each column helps keep your data analysis both accurate and efficient.
Changing column data types in Pandas – FAQs
Q1. How to change the column data type to int in Pandas?
You can use .astype(int) or the pd.to_numeric function to convert the data type to int.
Q2. How to change the data type of multiple columns in Pandas?
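To change the data type of multiple columns in Pandas, call the DataFrame's convert_dtypes() method to let Pandas infer suitable types, or pass a dictionary of column names and target types to .astype() to set them explicitly.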
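For the explicit route, a small sketch with made-up column names (id and price are assumptions for this example) passes a dictionary to .astype():

```python
import pandas as pd

# Hypothetical data: both columns hold numbers stored as strings.
df = pd.DataFrame({"id": ["1", "2"], "price": ["9.99", "19.5"]}, dtype="object")

# Map each column name to the type it should be converted to.
df = df.astype({"id": "int64", "price": "float64"})
print(df.dtypes)  # id -> int64, price -> float64
```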
Q3: How do I change the data type of some values within a column?
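A column has a single data type, so individual values can differ in type only when the column uses the object data type. In that case, use .loc[] or .apply() to selectively convert the values you want.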
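One possible way to do this with .apply() is sketched below. The data is made up, and the rule "convert only strings made of digits" is just an assumption chosen for this example.

```python
import pandas as pd

# Hypothetical data: an object column mixing numeric and non-numeric strings.
s = pd.Series(["1", "two", "3"], dtype="object")

# Convert only the values that consist of digits; leave the rest untouched.
converted = s.apply(lambda value: int(value) if value.isdigit() else value)
print(converted.tolist())  # [1, 'two', 3]
print(converted.dtype)     # object
```

The column stays object-typed because it still holds values of more than one Python type.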
Q4: How to change the datatype of a column to strings in Pandas?
You can use .astype(str) to change the datatype of a column to a string.
Q5: How to change column datatype in Pandas?
You can change the datatype of a column in Pandas using .astype(), pd.to_numeric(), pd.to_datetime(), or the DataFrame's convert_dtypes() method.