Understanding Gower Distance for Mixed Data Types in Machine Learning

In data science and machine learning, accurately measuring the similarity or dissimilarity between data points is crucial for a variety of tasks such as clustering, classification, and recommendation. When working with datasets that contain a mix of numerical and categorical data, finding a distance metric that accommodates both data types becomes a challenge. This is where Gower Distance shines as an effective solution.
In this article, I’ll explain the concept of Gower Distance, show how it handles both numerical and categorical data, and walk through a hands-on Python implementation.
What is Gower Distance?
Gower Distance is a dissimilarity measure that compares two data points across a combination of numerical and categorical attributes. Unlike traditional distance metrics such as Euclidean distance, which is designed to handle only numerical data, Gower Distance is well suited to datasets that include both types of attributes. Each pairwise distance lies between 0 and 1, and a smaller distance indicates that the two points are more similar. Computing it for every pair of points yields a distance matrix.
Why Gower Distance?
Traditional metrics like Euclidean Distance cannot handle categorical data appropriately. For instance, in categorical attributes like gender or interest (e.g., Male vs Female, Math vs Science), there is no inherent “distance” between categories. Gower Distance accounts for this by providing a simple yet powerful way to calculate distances that include both types of data in a unified framework.
How Does Gower Distance Work?
Formula for Gower Distance
Gower Distance combines different distance metrics for each attribute type and calculates the overall distance between two points as an average of individual attribute distances.
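
The averaging described above can be written out as follows. This is a sketch of the equal-weight form, assuming no missing values. The symbols (p for the number of attributes, x_ik for the value of attribute k on point i, R_k for the range of attribute k) are notation introduced here for illustration.

```latex
d(i, j) = \frac{1}{p} \sum_{k=1}^{p} d_{ijk},
\qquad
d_{ijk} =
\begin{cases}
\dfrac{|x_{ik} - x_{jk}|}{R_k}, \quad R_k = \max_{i} x_{ik} - \min_{i} x_{ik}
  & \text{if attribute } k \text{ is numerical} \\[2ex]
0 & \text{if attribute } k \text{ is categorical and } x_{ik} = x_{jk} \\[1ex]
1 & \text{if attribute } k \text{ is categorical and } x_{ik} \neq x_{jk}
\end{cases}
```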

Example Scenario: A Dataset with Mixed Data Types
Let’s walk through an example where we have a dataset containing 3 individuals with the following attributes:

Individual  Age  Gender  Interest
1           25   Male    Math
2           30   Female  Science
3           28   Male    Math

Our goal is to calculate the Gower distance between each pair of individuals.
Step-by-Step Calculation of Gower Distance
1. Numerical Distance Calculation (Age)
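For the Age attribute, we divide the absolute difference between two ages by the range of Age in the dataset (30 - 25 = 5). This gives 5 / 5 = 1.0 for individuals 1 and 2, 3 / 5 = 0.6 for individuals 1 and 3, and 2 / 5 = 0.4 for individuals 2 and 3.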
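
The range-normalized differences for Age can be checked with a few lines of NumPy. This is a minimal sketch using the three ages from the sample data; the variable names are chosen here for illustration.

```python
import numpy as np

# Ages of the three individuals in the sample data
ages = np.array([25, 30, 28], dtype=float)

# Range of the attribute across the dataset: 30 - 25 = 5
age_range = ages.max() - ages.min()

# Pairwise absolute differences, scaled into [0, 1] by the range
age_dist = np.abs(ages[:, None] - ages[None, :]) / age_range

print(age_dist)
# Pair (1, 2) -> 1.0, pair (1, 3) -> 0.6, pair (2, 3) -> 0.4
```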

2. Categorical Distance Calculation (Gender and Interest)
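For the Gender and Interest attributes, the distance is either 0 (if values are the same) or 1 (if values are different). Individuals 1 and 3 match on both attributes, so both distances are 0 for that pair. Individual 2 differs from the other two on both attributes, so both distances are 1 for pairs (1, 2) and (2, 3).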
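
The match/mismatch rule is simple enough to sketch in plain Python. The helper name `mismatch_matrix` is hypothetical and not part of any library.

```python
# Categorical values of the three individuals in the sample data
gender = ["Male", "Female", "Male"]
interest = ["Math", "Science", "Math"]


def mismatch_matrix(values):
    """Pairwise distances for one categorical attribute: 0 on a match, 1 otherwise."""
    return [[0.0 if a == b else 1.0 for b in values] for a in values]


gender_dist = mismatch_matrix(gender)
interest_dist = mismatch_matrix(interest)

for row in gender_dist:
    print(row)
```

In this dataset Gender and Interest happen to produce the same matrix, because individual 2 is the only one who differs on either attribute.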

3. Final Gower Distance Calculation
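Now, we calculate the final Gower distance for each pair by averaging its three attribute distances: (1 + 1 + 1) / 3 = 1.0 for individuals 1 and 2, (0.6 + 0 + 0) / 3 = 0.2 for individuals 1 and 3, and (0.4 + 1 + 1) / 3 = 0.8 for individuals 2 and 3.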
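
Putting the steps together, the whole calculation can be sketched from scratch with pandas and NumPy. The function `gower_matrix_sketch` is a hypothetical name. It assumes equal attribute weights and no missing values, and it is meant as an illustration of the averaging step rather than a replacement for a tested library.

```python
import numpy as np
import pandas as pd


def gower_matrix_sketch(df: pd.DataFrame) -> np.ndarray:
    """Equal-weight Gower distances between all rows; assumes no missing values."""
    n_rows, n_cols = df.shape
    total = np.zeros((n_rows, n_rows))
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            values = df[col].to_numpy(dtype=float)
            value_range = values.max() - values.min()
            diffs = np.abs(values[:, None] - values[None, :])
            # A constant column cannot separate rows, so it contributes 0
            part = diffs / value_range if value_range > 0 else np.zeros_like(diffs)
        else:
            values = np.asarray(df[col], dtype=object)
            # Categorical attribute: 0 on a match, 1 on a mismatch
            part = (values[:, None] != values[None, :]).astype(float)
        total += part
    # Average the per-attribute distances
    return total / n_cols


df = pd.DataFrame({
    "Age": [25, 30, 28],
    "Gender": ["Male", "Female", "Male"],
    "Interest": ["Math", "Science", "Math"],
})
result = gower_matrix_sketch(df)
print(result.round(3))
```

For the sample data this gives 1.0, 0.2, and 0.8 for the three pairs, matching the hand calculation.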

Python Code for Gower Distance Calculation
Let’s implement the Gower Distance calculation using Python and the gower library:
# Notebook cell syntax; from a terminal, run: pip install gower
!pip install gower
import gower
import pandas as pd
# Sample data (numerical and categorical)
data = {
    'Age': [25, 30, 28],
    'Gender': ['Male', 'Female', 'Male'],
    'Interest': ['Math', 'Science', 'Math']
}
# Convert data to DataFrame
df = pd.DataFrame(data)
# Calculate Gower distance matrix
distance_matrix = gower.gower_matrix(df)
# Display the distance matrix
print("Gower Distance Matrix:\n")
print(distance_matrix)
Output:
Gower Distance Matrix:
[[0.         0.99999994 0.19999997]
 [0.99999994 0.         0.8       ]
 [0.19999997 0.8        0.        ]]
Visualizing the Gower Distance Matrix
We can visualize the distance matrix using a heatmap to better understand the distances between individuals. Here’s the Python code to generate the heatmap:
import seaborn as sns
import matplotlib.pyplot as plt
# Plot the Gower distance matrix as a heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(distance_matrix, annot=True, cmap="Blues", xticklabels=["1", "2", "3"], yticklabels=["1", "2", "3"])
plt.title("Gower Distance Matrix")
plt.show()

Heatmap Visualization
The heatmap visually represents the Gower distances, where lighter colors indicate smaller distances (more similar individuals) and darker colors indicate larger distances (more dissimilar individuals).
Use Cases of Gower Distance
Gower distance is highly useful in several applications:
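- Clustering: Gower distance can be used with clustering algorithms that accept a precomputed distance matrix, such as hierarchical clustering, K-medoids, or DBSCAN, when datasets contain both numerical and categorical features. K-means is not a good fit, because it computes cluster centers as coordinate means and assumes Euclidean geometry.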
- Recommendation Systems: For recommendation systems, Gower distance can help calculate similarities between users or items based on both numerical ratings and categorical attributes (such as genre or category).
- Customer Segmentation: Gower distance is valuable for segmenting customers based on a mix of demographic and behavioral data, which typically combines numerical and categorical fields.
- Survey Data Analysis: For survey data that includes both numerical responses (e.g., rating scales) and categorical choices (e.g., preferences), Gower distance can be used to analyze how respondents relate to one another.
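
For the clustering use case, a precomputed Gower matrix can be passed to hierarchical clustering. The sketch below uses SciPy, which is not used elsewhere in this article. The distances are the rounded values for the three sample individuals, and the cut threshold of 0.5 is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Gower distances for the three individuals (rounded values from the example)
distance_matrix = np.array([
    [0.0, 1.0, 0.2],
    [1.0, 0.0, 0.8],
    [0.2, 0.8, 0.0],
])

# linkage expects the condensed (upper-triangle) form of the symmetric matrix
condensed = squareform(distance_matrix, checks=True)

# Average linkage only needs pairwise dissimilarities, not coordinates
Z = linkage(condensed, method="average")

# Cut the tree at distance 0.5 (illustrative threshold; tune per dataset)
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)
```

With this threshold, individuals 1 and 3 end up in the same cluster and individual 2 is left on its own, which matches the intuition from the distance matrix.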
Conclusion
Gower Distance is a powerful metric for measuring dissimilarity between individuals in datasets with mixed data types. It provides a practical solution where traditional metrics fall short, making it well suited to tasks such as clustering, recommendation systems, and customer segmentation. By scaling both numerical and categorical variables to a common 0-to-1 range, it lets every attribute contribute to the comparison.
The Python implementation we demonstrated offers a straightforward way to compute Gower distance, making it easy to apply in a variety of machine learning tasks. Whether you are analyzing customer data or building a recommendation system, Gower Distance can be an essential tool in your data science toolkit.
References:
- Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27(4), 857–871.
- Python gower library: https://pypi.org/project/gower/