Pandas is an open-source Python package that provides high-performance, flexible data structures. These structures are designed to make analyzing relational or labeled data both easy and intuitive. Pandas is one of the most popular tools data scientists leverage when developing a machine learning model. The most crucial step in the machine learning process is not simply fitting a model to a given data set. Most of the model development process takes place in the pre-processing and data exploration phases. An accurate model requires good predictors, and in order to acquire them, the user must understand the raw data. Through Pandas' numerous data wrangling and analysis tools, this important step can easily be achieved.
Databricks provides users with the capability of running the Pandas package within a notebook. Once a Spark data frame is converted to a Pandas data frame, all Pandas functions can be applied using the same code as written in a standard Python IDE such as Spyder or a Jupyter notebook. While converting a Spark data frame to a Pandas data frame does mean giving up distributed processing, there are several advantages for data scientists in being able to use Pandas over PySpark. The ability to use the Pandas package in Databricks allows data scientists to leverage Databricks-specific features without having to learn the new code associated with PySpark. Also, once all necessary Pandas functions are applied to a data frame, it can easily be converted back into a Spark data frame to allow for distributed storage or Spark functions. This ability to seamlessly transition between Pandas and Spark data frames aids in making big data analysis accessible to a wider range of data scientists.
The goal of this blog is to highlight some of the central and most commonly used tools in Pandas while illustrating their significance in model development. The analysis will be executed in a Databricks notebook operating on a Python 3 cluster to illustrate the high level of compatibility with the standard Python code a data scientist would expect to see in other IDEs. The data set used for this demo consists of a supermarket chain's sales across multiple stores in a variety of cities. The sales data is broken down by items within the stores. The goal is to predict a certain item's sales.
Reading the Data
When starting a new Python script, modules required for the analysis, including Pandas, must be imported into the environment:
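A typical import block for this kind of analysis might look like the following; the exact set of modules depends on what the notebook goes on to use:

```python
# Standard imports for a pandas-based analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```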
For this example, the Store Sales data has already been loaded into a Spark table. It can be accessed by running the following command:
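A sketch of the read, assuming a hypothetical table name of `store_sales` (`spark` is the SparkSession Databricks provides automatically in a notebook):

```python
# Read the pre-loaded Spark table into a Spark data frame.
# `store_sales` is an illustrative table name; adjust to the actual table.
spark_df = spark.sql("SELECT * FROM store_sales")
```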
Since the data in the notebook is now loaded in a Spark data frame, it must be converted into a Pandas data frame before Pandas-specific functions can be applied. Once this conversion takes place, all Pandas functions can be performed on the data frame as in any other IDE:
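The conversion itself is a single call on the Spark data frame; note that it collects the data onto the driver, so it is only appropriate when the data fits in driver memory:

```python
# Convert the Spark data frame to a Pandas data frame.
df = spark_df.toPandas()
```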
To start, there are a variety of viewing and inspecting tools available in order to achieve a better understanding of the raw data. The df.head(n) and df.tail(n) functions allow users to examine the first and last n rows, respectively:
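A minimal sketch of these calls, using a small made-up frame standing in for the store sales data (the column names follow the post; the values are invented):

```python
import pandas as pd

# Illustrative sample of the store sales data.
df = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDN15", "NCD19", "FDP36"],
    "Item_Weight": [9.3, 5.92, 17.5, 8.93, 10.395],
    "Item_Outlet_Sales": [3735.14, 443.42, 2097.27, 732.38, 556.61],
})

print(df.head(3))  # first 3 rows
print(df.tail(2))  # last 2 rows
```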
The df.shape attribute and the df.info() method provide information about the number of rows and columns in a data frame, the data types, and missing data:
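For example, on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDN15"],
    "Item_Weight": [9.3, 5.92, 17.5],
})

print(df.shape)  # (rows, columns)
df.info()        # dtypes, non-null counts, memory usage
```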
One key difference in using Pandas within Databricks is ensuring the data types are appropriate after conversion. In this instance, all variables are returned as objects, which will result in calculation errors. To correct this issue, the numerical variables must be cast as floats:
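A sketch of the cast, assuming the numeric columns arrived as objects (the values here are made up):

```python
import pandas as pd

# Numeric columns arriving as generic `object` dtype after conversion.
df = pd.DataFrame({"Item_Weight": ["9.3", "5.92"], "Item_MRP": ["249.81", "48.27"]})

# Cast the numerical variables to floats so calculations behave correctly.
for col in ["Item_Weight", "Item_MRP"]:
    df[col] = df[col].astype(float)

print(df.dtypes)
```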
Basic statistical values, such as the minimum and maximum of a column, can be calculated by adding the appropriate function to the end of the df.column_name in parentheses:
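For instance, with illustrative weight values:

```python
import pandas as pd

weights = pd.DataFrame({"Item_Weight": [9.3, 5.92, 17.5]})

print(weights["Item_Weight"].min())   # smallest weight
print(weights["Item_Weight"].max())   # largest weight
print(weights["Item_Weight"].mean())  # average weight
```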
Data Cleaning and Feature Engineering
After getting a general overview and understanding of the data, the next step toward successful model development is cleaning the data and creating new, possibly more influential variables from the existing raw data.
The variable Item_Identifier follows a labeling pattern of two letters per product type (i.e., 'FD' for food, 'DR' for drinks and 'NC' for non-consumables) followed by a three-digit code. It may be more useful to have a group type variable containing just these two letters rather than the entire code. To achieve this, the .map() function selects only the first two characters of the item identifier and returns the result as a new column labeled Item_Group_Type.
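A sketch of this step, using made-up identifiers:

```python
import pandas as pd

df = pd.DataFrame({"Item_Identifier": ["FDA15", "DRC01", "NCD19"]})

# Keep only the first two characters of each identifier as the group type.
df["Item_Group_Type"] = df["Item_Identifier"].map(lambda x: x[:2])
```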
In order to analyze the outlet establishment year as a numerical variable, a new column entitled Outlet_Age can be calculated by subtracting the outlet's establishment year from the dataset's maximum year plus one (assuming this data's collection ended the prior year). The Outlet_Establishment_Year variable can then be dropped using the Pandas df.drop() function.
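The calculation might look like the following (establishment years invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Outlet_Establishment_Year": [1999, 2007, 2004]})

# Age relative to the year after the latest year in the data.
df["Outlet_Age"] = (df["Outlet_Establishment_Year"].max() + 1) - df["Outlet_Establishment_Year"]

# The original year column is no longer needed.
df = df.drop("Outlet_Establishment_Year", axis=1)
```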
Since most machine learning models in Python will return errors if null values exist within the data, identifying their existence and rectifying the issue is a crucial step. The code below counts the number of null values for each column in the store sales data frame:
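The null counts come from chaining isnull() and sum(), sketched here on a small frame with invented gaps:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5],
    "Outlet_Size": ["Medium", "Small", np.nan],
})

# Number of null values per column.
print(df.isnull().sum())
```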
There are a variety of methods for addressing null values within a data set. Some common approaches include simply removing the rows containing the null values, forward or backward filling of values for time-series data, or replacing the null value with a calculated value. This data set contains a significant number of null values, and removing every affected row could leave too little complete data, so filling in the missing values is the more appropriate method.
Since the Item_Weight variable is numeric, replacing a null value with the item’s average weight is a logical approach. The first line of the code uses Pandas df.groupby() function combined with the .agg function to find the mean weight of each unique item and store the results in another Pandas data frame. In lines two and three, setting the index of the original data frame to match the index of the new Item_Identifier_Mean data frame allows the null values to be easily imputed with their matching mean values. It is then necessary to check to see if this method resolved all the null values. Line 5 of the code reveals that four rows still contain null values.
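A sketch of this groupby/agg imputation, on a made-up frame where one item ("NCX01", a hypothetical identifier) appears only once and has no weight:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "DRC01", "NCX01"],
    "Item_Weight": [9.0, np.nan, 6.0, np.nan],
})

# Mean weight per unique item, stored in a separate frame.
item_identifier_mean = df.groupby("Item_Identifier").agg({"Item_Weight": "mean"})

# Align the original frame on the same index, then fill nulls from the means.
df = df.set_index("Item_Identifier")
df["Item_Weight"] = df["Item_Weight"].fillna(item_identifier_mean["Item_Weight"])
df = df.reset_index()

# NCX01 appeared once with no weight, so no mean existed and it stays null.
print(df["Item_Weight"].isnull().sum())
```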
To investigate the cause of this persisting issue, the new mean-imputed data frame is merged on Item_Identifier with the original data frame using Pandas df.merge(). The merge reveals that the items with null values in the new data frame only appeared once in the original data and had no weight information, therefore a mean could not be calculated.
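The investigation could be sketched like this, again with a hypothetical single-occurrence item:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "NCX01"],
    "Item_Weight": [9.0, np.nan, np.nan],
})

# Per-item mean weights, as a column-based frame so we can merge on the name.
means = df.groupby("Item_Identifier", as_index=False).agg({"Item_Weight": "mean"})

# Merge the means back onto the original rows.
merged = df.merge(means, on="Item_Identifier", suffixes=("", "_mean"))

# Items whose mean is still null appeared only once and had no recorded weight.
still_null = merged[merged["Item_Weight_mean"].isnull()]
print(still_null["Item_Identifier"].tolist())
```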
This same method is then applied to the Outlet_Size variable, except the mode of the outlet type is used as the imputed value since Outlet_Size is categorical. To calculate the mode, the outlet types are converted into a Pandas structure called a series, then the mode is applied to each of these series using the .apply function. From there, a similar merge as before can take place. However, since the merge is on a column name instead of the index, both Outlet_Size columns are retained in the new dataset with “_x” and “_y” appended to the names. Only one of these columns is needed, therefore renaming to remove the “_x” and dropping the Outlet_Size_y column is conducted in line 4.
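A sketch of the mode-based imputation and the column-name merge behavior described above (outlet types and sizes invented):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Outlet_Type": ["Grocery Store", "Grocery Store",
                    "Supermarket Type1", "Supermarket Type1", "Supermarket Type1"],
    "Outlet_Size": ["Small", np.nan, "Medium", "Medium", np.nan],
})

# Mode of Outlet_Size per Outlet_Type, via .apply on each group's series.
size_mode = (df.groupby("Outlet_Type")["Outlet_Size"]
               .apply(lambda s: s.mode()[0])
               .reset_index())

# Merging on a column name duplicates Outlet_Size with _x/_y suffixes.
merged = df.merge(size_mode, on="Outlet_Type", suffixes=("_x", "_y"))
merged["Outlet_Size_x"] = merged["Outlet_Size_x"].fillna(merged["Outlet_Size_y"])

# Keep one column: rename away the _x and drop the _y copy.
merged = (merged.rename(columns={"Outlet_Size_x": "Outlet_Size"})
                .drop("Outlet_Size_y", axis=1))
```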
Examining the newly cleaned data shows that Outlet_Size no longer contains null values.
Returning to the null Item_Weight values, since there were four items with only one record each, these values can simply be dropped. Checking the final cleaned data set reveals that all null values have been corrected:
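The drop-and-verify step can be sketched as follows (rows invented):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDN52", "FDK57", "DRC01"],
    "Item_Weight": [9.3, np.nan, np.nan, 6.0],
})

# Drop the few rows whose weight could not be imputed.
df = df.dropna(subset=["Item_Weight"])

# Confirm no null values remain anywhere in the frame.
print(df.isnull().sum().sum())
```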
Dividing the features into categorical and numerical sets allows for further examination of any other possible incongruities in the data. Using the df.select_dtypes() function, two data frames are created containing only the specified data types:
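A sketch of the split by dtype:

```python
import pandas as pd

df = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01"],
    "Item_Weight": [9.3, 5.92],
    "Item_Outlet_Sales": [3735.14, 443.42],
})

numeric_df = df.select_dtypes(include=["number"])
categorical_df = df.select_dtypes(include=["object"])
```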
Through the use of a for loop and the .unique() function, the number of unique values for each categorical feature and the label can be displayed. As is fairly common with categorical variables recorded from different sources, the labeling technique of Item_Fat_Content is inconsistent. “Low Fat” is represented as both “low fat” and “LF” while “Regular” is also recorded as “reg”. This type of discrepancy will cause issues when creating dummy variables.
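The loop might look like this, with the inconsistent fat-content labels from the post reproduced in a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Item_Fat_Content": ["Low Fat", "low fat", "LF", "Regular", "reg"],
    "Outlet_Size": ["Small", "Medium", "High", "Small", "Medium"],
})

# Print the unique values of each categorical column.
for col in df.columns:
    print(col, df[col].unique())
```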
To fix the fat content labeling issue, the .replace() function applies a new uniform naming schema to the Item_Fat_Content variable:
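A sketch of the replacement:

```python
import pandas as pd

df = pd.DataFrame({"Item_Fat_Content": ["Low Fat", "low fat", "LF", "Regular", "reg"]})

# Map the inconsistent spellings onto a uniform naming schema.
df["Item_Fat_Content"] = df["Item_Fat_Content"].replace(
    {"low fat": "Low Fat", "LF": "Low Fat", "reg": "Regular"}
)
```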
With the data cleaning and feature engineering completed, an even closer examination of the data and its relations can be conducted using Pandas in conjunction with the Matplotlib library. The histogram below shows that the target variable, Item_Outlet_Sales, is right skewed:
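A sketch of such a histogram, using synthetic right-skewed values standing in for Item_Outlet_Sales (the gamma distribution here is purely illustrative):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; unnecessary inside a notebook
import matplotlib.pyplot as plt

# Made-up right-skewed sales values.
rng = np.random.default_rng(0)
sales = pd.Series(rng.gamma(shape=2.0, scale=1000.0, size=2000),
                  name="Item_Outlet_Sales")

sales.hist(bins=40)
plt.xlabel("Item_Outlet_Sales")
plt.ylabel("Frequency")

print(sales.skew())  # positive skewness indicates a right-skewed distribution
```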
Applying the .corr() function to the numerical features data frame created earlier provides a symmetrical data frame of the variables' correlations to each other. Plotting a heatmap of this correlation data frame highlights that Item_MRP has the strongest positive correlation with the target variable while Item_Visibility has a negative correlation.
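The correlation frame itself comes from a single call; here is a sketch on synthetic data constructed so that sales rise with MRP and fall with visibility, mirroring the relationships described above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
mrp = rng.uniform(30, 270, 500)
visibility = rng.uniform(0, 0.3, 500)
# Made-up sales: positively tied to MRP, negatively to visibility.
sales = 8 * mrp - 2000 * visibility + rng.normal(0, 100, 500)

numeric_df = pd.DataFrame({"Item_MRP": mrp,
                           "Item_Visibility": visibility,
                           "Item_Outlet_Sales": sales})

# Symmetrical correlation matrix; pass this to a heatmap plot to visualize it.
corr = numeric_df.corr()
print(corr)
```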
Using seaborn's .pairplot() function allows for visualization of relationships among all the numeric variables at once:
The categorical variables’ relationships with the target variable can also be examined through the use of df.pivot_table(). This function operates like pivot tables in Excel by creating an index and applying an aggregation function over a specified value. Plotting these pivot tables then allows for easy trend visualization. From the Impact of Item Type chart, it appears that there are too many unique groups for a significant difference in sales to exist. The Impact of Outlet Size chart reveals that medium-sized stores have, on average, more sales.
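A sketch of such a pivot table (sizes and sales invented):

```python
import pandas as pd

df = pd.DataFrame({
    "Outlet_Size": ["Small", "Medium", "Medium", "High"],
    "Item_Outlet_Sales": [500.0, 2000.0, 3000.0, 1500.0],
})

# Average sales per outlet size, Excel-pivot-table style.
pivot = df.pivot_table(index="Outlet_Size", values="Item_Outlet_Sales",
                       aggfunc="mean")
print(pivot)
```

Calling `pivot.plot(kind="bar")` would then chart the averages, as in the figures referenced above.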
This data analysis and exploration will aid in guiding feature, model, and parameter selection during the model development phase.
Finally, the creation of dummy variables is required in order to use the categorical variables in the modeling process. Pandas has an easy to use function, pd.get_dummies(), that converts each of the specified columns into binary variables based on their unique values. For instance, the Outlet_Size variable is now decomposed into three separate variables: Outlet_Size_High, Outlet_Size_Medium, Outlet_Size_Small.
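For example, decomposing Outlet_Size:

```python
import pandas as pd

df = pd.DataFrame({"Outlet_Size": ["High", "Medium", "Small", "Medium"]})

# One binary column per unique value of Outlet_Size.
dummies = pd.get_dummies(df, columns=["Outlet_Size"])
print(dummies.columns.tolist())
```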
With pre-processing, cleaning, and data exploration complete, the final phase of modeling can now take place. Scikit-learn (sklearn) is a commonly used machine learning library in Python that contains multiple modeling and evaluation tools. The first step is to use this package's train_test_split() function to divide the cleaned data frame into two separate data frames. The larger data frame, which will represent 85% of the entire data, will be used to train the model, while the remaining 15% will be used to evaluate and determine whether the model is appropriate.
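A sketch of the split (the two-column frame is a stand-in for the cleaned data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"Item_MRP": range(100), "Item_Outlet_Sales": range(100)})

# 85% of rows for training, 15% held out for evaluation.
train, test = train_test_split(df, test_size=0.15, random_state=42)
print(len(train), len(test))
```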
Next, the train and test data frames are each divided into two separate data frames, one containing the desired predictors and the other containing the target variable:
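For the training frame, the split might look like this (and identically for the test frame):

```python
import pandas as pd

train = pd.DataFrame({"Item_MRP": [100.0, 200.0],
                      "Item_Weight": [9.3, 5.9],
                      "Item_Outlet_Sales": [800.0, 1600.0]})

# Predictors in one frame, the target variable in another.
X_train = train.drop("Item_Outlet_Sales", axis=1)
y_train = train["Item_Outlet_Sales"]
```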
Since the target variable is continuous, a simple, yet standard approach is to test a linear regression model. Once imported from the sklearn package, the function is applied to the train data using the model.fit() function. The predictions are then stored in an array using model.predict(). Model evaluation is conducted by using a variety of the metric functions from sklearn, along with plotting the actual vs. predicted values. From these results, it appears linear regression may not be the best model for this data.
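A minimal sketch of the fit/predict/evaluate loop, on synthetic data where sales depend linearly on Item_MRP (the relationship and noise level are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic training data: sales roughly proportional to MRP plus noise.
rng = np.random.default_rng(0)
X = pd.DataFrame({"Item_MRP": rng.uniform(30, 270, 300)})
y = 8 * X["Item_MRP"] + rng.normal(0, 300, 300)

model = LinearRegression()
model.fit(X, y)
preds = model.predict(X)

# Evaluate with sklearn's metric functions.
print(r2_score(y, preds))
print(mean_squared_error(y, preds))
```

Plotting `y` against `preds` then gives the actual-vs-predicted view described above.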
The next step is to try a different model to see if it produces better results. XGBoost is an implementation of gradient boosted decision trees. It is a commonly applied model that is designed for speed and high performance. In order to use the XGBoost library in Databricks, it must be installed on the cluster running the notebook. This can be achieved by clicking on the desired cluster and then selecting Install New under the Libraries tab. The library source is PyPI, and the library will appear as follows once installed:
As in the linear regression modeling, once the xgboost tool is installed and imported, it is applied to the train data and predictions are stored to evaluate the model. The metrics and plot both reveal higher performance than the linear regression model:
This model then needs to be evaluated against the test data to determine if it is, in fact, a good model. The test data reveals that there is still some relevance to the model, but further parameter tuning and possibly other model selections may lead to better results.
Overall, as this demo makes evident, the actual fitting and tuning of a model is a small step compared to the entire machine learning process. In order to even obtain a data set suitable for modeling, pre-processing, data cleaning, feature engineering, and exploration are all required. These stages are simplified and easily executable using the Pandas package. Pandas enables data scientists not only to understand and analyze the data themselves but also to display the results of their analysis and modeling to their audience, making this library imperative for successful machine learning in Python. Databricks' ability to seamlessly run Pandas code without requiring data scientists to learn PySpark provides an even greater array of tools and options for analyzing data.