Customer Segmentation Report for Arvato Financial Services

Tim Loehr
10 min read · Mar 25, 2020


When I enrolled in the Udacity Data Science Nanodegree, I didn’t know where my journey would end. That point has now arrived: for this final project to graduate from the Nanodegree, I chose the dataset provided by Arvato.

The dataset consists of four files:

  • Azdias (891221 rows, 366 columns)
  • Customers (191652 rows, 369 columns)
  • Mailout Train (42962 rows, 367 columns)
  • Mailout Test (42833 rows, 366 columns)

Project Definition

This project is provided by Arvato Bertelsmann, a supply chain management solutions company located in Germany: https://www.kaggle.com/c/udacity-arvato-identify-customers.

The project is about finding the people in the German population who are most likely to respond to a targeted marketing campaign.

It is hosted on Kaggle.com, as shown in the link above, and at the end of the project a test set needs to be predicted and evaluated on Kaggle.

Problem Statement

The project contains four main parts:

Data Processing

This first part of the Jupyter notebook involves the initial analysis of the data. Which datasets are available, which columns exist, what do the columns mean? All of these questions are answered here.

Customer Segmentation

The segmentation is responsible for working out how the general population and the customers of Arvato differ. The key point is to identify potential customers in the population who can then be targeted in a marketing campaign.

To achieve this goal, I make use of heatmaps and correlations between certain columns. I determine the most important columns based on how strongly they deviate from each other.

Supervised Learning

Here, another two data files come into play. After carefully analyzing which people are the better targets for the marketing, the new file reveals whether a customer responded after receiving the campaign.

Kaggle Competition

A prediction for the test data file will be submitted to Kaggle for an online evaluation of the model’s performance.

Metric

The AUC-ROC curve measures the performance of a classification model at various threshold settings.
ROC is a probability curve and AUC represents the degree or measure of separability. It reveals how well a model is capable of distinguishing between classes.

The training dataset for the supervised learning model consists of 98.7% of class 0 and 1.3% of class 1. For that reason, the ROC evaluation metric is used in this project instead of plain accuracy.
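
To make the imbalance concrete, here is a small illustration (my own, not from the project notebook) of why plain accuracy misleads on such data while ROC AUC does not:

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# A model that always predicts 0 looks very accurate on imbalanced labels,
# but its ROC AUC is 0.5, i.e. no better than random guessing.
y_true = np.array([0] * 987 + [1] * 13)      # 98.7% vs 1.3%, as in MAILOUT_TRAIN
y_score = np.zeros(1000, dtype=int)          # constant "always 0" predictions
print(accuracy_score(y_true, y_score))       # 0.987
print(roc_auc_score(y_true, y_score))        # 0.5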

Methodology

Data

Arvato provides in total six files. Two of the files are Excel spreadsheets that contain further information about the columns in the other four CSV files.

CSV Files

  • AZDIAS: Contains 891221 rows, where each row represents a single person of the general German population, and 366 columns, where each column represents a different feature of a person.
  • CUSTOMERS: Contains 191652 rows, where each row contains data for a customer of Arvato, and 369 columns: the same 366 features as AZDIAS plus three additional columns with information about purchasing habits.
  • MAILOUT_TRAIN: Contains 42962 rows, where each row represents a customer of Arvato, and 367 columns: the same 366 features as AZDIAS plus the additional label column RESPONSE (either 0 or 1).
  • MAILOUT_TEST: Contains 42833 rows and 366 columns, i.e. without the RESPONSE label column of MAILOUT_TRAIN. The Kaggle task is to predict the RESPONSE label and submit it to Kaggle, which evaluates the results with the ROC metric.

Excel Files

  • DIAS_attributes.xlsx: Contains information about all the features of the dataset
  • DIAS_information.xlsx: Explains the meaning behind most of the numerical values in the dataset, in addition to the feature information from the DIAS_attributes sheet.

Missing Features

To get familiar with the data, I first computed the missing values within the dataset. The features that are most nearly useless are the ages of the children. Both the AZDIAS and the CUSTOMERS dataset appear to have the same pattern of missing values. Of all 366 and 369 features, only 140 have less than 10% missing values. Most features have around 15% missing values, and the top 16 features have 28% or more missing values.
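
As a rough sketch, the per-column percentages can be computed like this (assuming azdias is the AZDIAS DataFrame already loaded from its CSV file; note this counts only literal NaN values, not special codes for “unknown”):

import pandas as pd

# Percentage of missing values per column, sorted from worst to best
missing_pct = azdias.isnull().mean().sort_values(ascending=False) * 100
print((missing_pct < 10).sum())   # number of features with less than 10% missing
print(missing_pct.head(16))       # the 16 features with the most missing values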

Features Description

The dataset Azdias includes data about all kinds of different people in the German population. There are 366 columns, so there are quite a lot of features that need to be taken care of.

Customers, as the name suggests, contains only information about the customers of Arvato. This dataset contains three additional features, namely:

  • Online Purchase [yes, no]
  • Product Group [cosmetic and food, food, cosmetic]
  • Customer Group [multi buyer, single buyer]

The other 366 features are split into multiple categories:

  • Information level
  • Person
  • Household
  • 125m x 125m Grid
  • The three CUSTOMERS-specific features

These are the most crucial topics, which include the most important features as shown later in the post.

Customer Segmentation Report

Based on the data, I could draw the following conclusions about a typical person, grouped by the feature topics mentioned above.

Person Topic

  • The average age of Arvato customers is 70 years, whereas the overall average age in the population dataset is around 50 years. They are most likely culturally driven elderly people.
  • The customers are more likely to have had their teens in the “60ies — economic miracle (Mainstream, O+W)”, whereas the median of the entire German population in this dataset had their teens in the “80ies — Generation Golf (Mainstream, W)”.
  • The average Arvato customer has a “very high” saving behavior, whereas the wider population is only an “average” saver. This could lead to the conclusion that Arvato is targeting people who potentially have more money in the bank.
  • Additionally, customers have a median value of “very high” for financial investing, against an overall median of “average”. So Arvato customers are more likely to invest money.
  • Another small deviation is financial minimalism. Arvato customers appear to be less financially minimalistic than the regular population.
  • Customers have, according to the dataset, a lower affinity for a sensual mindset.
  • Customers furthermore have a slight affinity for being traditionally minded.

Household Topic

  • The customers have a median consumption type of “Versatile”, in contrast to the wider population, which has a consumption type of “Family”. This leads to the conclusion that Arvato is not really after families, but wants to send ads to people with a versatile lifestyle.

125m x 125m Grid Topic

  • The average customer is a “Buyer > 24 months”, which means that their transactional activity based on the product group ALL OTHER CATEGORIES is far higher than that of the average population, which is typically a “Singlebuyer 0–12 months”.
  • Furthermore, the customers of Arvato have a median value of buying books in the last 13–24 months, whereas the overall population didn’t buy many books. This could be an indicator of a more academic target group.
  • A very interesting point is that, for transactional activity based on the product group COMPLETE MAIL-ORDER OFFERS, the customers are typically “Multi-/Doublebuyer 13–24 months”, whereas the entire population’s behavior is unknown or very low. The target is people who kept buying things over a period of at least two years.

Specific Customer Behaviour

  • Product Group: This is almost equally distributed between food and cosmetics. There is no clear tendency that either women or men are more likely to be customers.
  • Online Purchase: Most customers (91%) are undeniably buying things online. So a digital marketing campaign aimed at online purchasers makes a lot of sense.
  • Customer Group: These are most often multi buyers (69%), and only 31% are single buyers. Since there is no further information about this column, I assume it means that 69% of the people who buy at Arvato have already bought something from Arvato in the past. This was already shown above in the “125m x 125m Grid Topic”: Arvato customers frequently multi-buy things online.

Heatmaps to evaluate the Correlation

Finding Correlations

The following picture shows how I found the most important features that point out the differences between the AZDIAS and the CUSTOMERS data. Columns starting with the prefix a contain AZDIAS data, and columns starting with the prefix c contain CUSTOMERS data.

Find differences in the Datasets
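
One way this prefix trick can be reproduced in code is sketched below. The variable names azdias and customers are assumptions, as is the row sampling used to align the two differently sized datasets:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Put samples of both datasets side by side so a single correlation matrix
# covers a_* and c_* columns at once
n = min(len(azdias), len(customers))
combined = pd.concat(
    [azdias.sample(n, random_state=0).add_prefix('a_').reset_index(drop=True),
     customers.sample(n, random_state=0).add_prefix('c_').reset_index(drop=True)],
    axis=1)
sns.heatmap(combined.corr(), cmap='coolwarm', center=0)
plt.show()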

Heatmap for the Customer dataset

Customer dataset

Heatmap for the Azdias (general population) dataset

Azdias dataset

Comparing these two correlation plots, it can be deduced that certain features stand out specifically for the customers of Arvato:

Customer Finance Features

Azdias Finance Features

This clearly shows that the customers have a strong tendency to save more money, are financially more stable, and prepare better for the future with respect to savings.

Customer Mind Affinity Features

Azdias Mind Affinity Features

These features represent how dutiful and traditionally minded a person is, and show in general the tendency of how people think. The meaning of the values is explained in the dataset. These values show a far more traditionally thinking customer in comparison to the wider population.

So, the typical Arvato customer is…

The typical customers of Arvato are between 60 and 70 years old. They are financially very stable and have saved up quite some money. They made provisions for the older age they are currently in, because they grew up in a time when saving money was essential and pure consumption was not the norm. So they earned a lot of money in the golden 70s to 90s and have now discovered online shopping. Since 91% of all purchases were made online, the customers are still young enough to handle a computer well. Most customers have an affinity for traditional thinking and are more rational and probably more academically educated than the average. For that reason, the most important features are:

Information level

  • ‘AGER_TYP’

Person

  • ‘GEBURTSJAHR’ (Birth year)
  • ‘PRAEGENDE_JUGENDJAHRE’ (Important teen years)
  • ‘ALTERSKATEGORIE_GROB’ (Rough age category)
  • ‘SEMIO_LUST’
  • ‘SEMIO_PFLICHT’
  • ‘SEMIO_RAT’
  • ‘SEMIO_TRADV’
  • ‘FINANZ_SPARER’ (Finance Saver)
  • ‘FINANZ_ANLEGER’ (Finance Investor)
  • ‘FINANZ_VORSORGER’ (Finance Provision)
  • ‘FINANZ_MINIMALIST’ (Finance Minimalist)

Household

  • ‘D19_KONSUMTYP_MAX’ (Consumer type)
  • ‘D19_VERSAND_OFFLINE_DATUM’ (Shipping Offline Date)
  • ‘D19_GESAMT_OFFLINE_DATUM’ (Total Offline Date)

125m x 125m Grid

  • ‘D19_SAMMELARTIKEL’ (Collecting articles)
  • ‘D19_KOSMETIK’ (Cosmetics)
  • ‘D19_BEKLEIDUNG_REST’ (Clothing)
  • ‘D19_BEKLEIDUNG_GEH’ (Clothing)
  • ‘D19_RATGEBER’ (Advisor)
  • ‘D19_SONSTIGE’ (other)
  • ‘D19_BUCH_CD’ (Book, CD)
  • ‘D19_REISEN’ (Journeys)

Supervised Learning Model

Based on these features, we now want to build a model to estimate which people will potentially respond to Arvato’s marketing campaign.

The Mailout Train dataset has an additional column for training the model, namely the RESPONSE feature, which is either 0 or 1.

The distribution in the dataset is:

  • 98.7% : 0
  • 1.3% : 1

So we can see that most people don’t reply. For training the model, I only used the above-mentioned features, 43 in total. I filled in missing birth years using the AGER_TYP feature when it was available, since it indicates the time period (the 40s, 50s, 60s) in which the customer grew up; conversely, if AGER_TYP was missing but the birth year was available, I filled it in accordingly.
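
As a hedged sketch of this mutual fill-in (the mapping values below are hypothetical stand-ins for illustration, not the dataset’s documented coding of AGER_TYP):

import numpy as np
import pandas as pd

# Hypothetical era codes -> representative birth years, and the reverse
era_to_year = {1: 1945, 2: 1955, 3: 1965}

def year_to_era(year):
    # hypothetical reverse mapping from birth year to era code
    return 1 if year < 1950 else (2 if year < 1960 else 3)

df = pd.DataFrame({'GEBURTSJAHR': [1950, np.nan], 'AGER_TYP': [np.nan, 2]})

# Fill birth year from era when only the era is known, and vice versa
m = df['GEBURTSJAHR'].isna() & df['AGER_TYP'].notna()
df.loc[m, 'GEBURTSJAHR'] = df.loc[m, 'AGER_TYP'].map(era_to_year)
m = df['AGER_TYP'].isna() & df['GEBURTSJAHR'].notna()
df.loc[m, 'AGER_TYP'] = df.loc[m, 'GEBURTSJAHR'].map(year_to_era)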

To build a supervised model, we need to split the data into training and test sets, train on the training set, and make a prediction on the test set. I evaluate on the training data by splitting it further into training and validation data, to get a first impression of what kind of result to expect when I submit the MAILOUT_TEST predictions to Kaggle.

  • I decided to use the basic train_test_split method from sklearn with a test data size of 20%.
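
In code, the split looks roughly like this (X and y are assumed to hold the 43 selected features and the RESPONSE labels; the stratify argument is my addition to keep the rare positives balanced across the split):

from sklearn.model_selection import train_test_split

# 80/20 split into training and validation data
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)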

For evaluating which classifier is the best one, I trained a Pipeline with Grid-Search:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor

# Placeholder estimator; GridSearchCV replaces it with each candidate below
pipe = Pipeline([('classifier', RandomForestClassifier())])

# Create space of candidate learning algorithms and their hyperparameters
search_space = [{'classifier': [LogisticRegression()],
                 'classifier__penalty': ['l2'],
                 'classifier__C': np.logspace(0, 5)},
                {'classifier': [RandomForestClassifier()],
                 'classifier__n_estimators': [10, 100],
                 'classifier__max_features': [1, 3]},
                {'classifier': [GradientBoostingRegressor(random_state=42)],
                 'classifier__n_estimators': [50, 100, 200],
                 'classifier__min_samples_split': [2, 3, 4]}]

I filled these values into a Grid-Search-CV model and ended up with the

GradientBoostingRegressor as the best estimator for this task
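
A minimal sketch of the search itself, reusing the names from the blocks above (the post does not state the exact call; scoring='roc_auc' is my assumption, chosen to match the project metric):

from sklearn.model_selection import GridSearchCV

# Cross-validated search over all candidate models, scored by ROC AUC
grid = GridSearchCV(pipe, search_space, scoring='roc_auc', cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_estimator_)   # prints the winning pipeline configuration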

Classifier Parametrization

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
init=None, learning_rate=0.1, loss='ls', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=50,
n_iter_no_change=None, presort='deprecated',
random_state=42, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0, warm_start=False)

Since the labels are to a very high percentage 0’s, mean accuracy is not a good metric, because an accuracy of 98% can easily be achieved by only predicting 0’s. That is why the AUC score is used instead.

My training score after splitting the data was 78.62%

I used the roc_auc_score method from sklearn to achieve this result.
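
For reference, the validation-set evaluation would look like this (variable names from the sketches above; the regressor’s raw predictions serve as ranking scores, which is exactly what ROC AUC needs):

from sklearn.metrics import roc_auc_score

# Continuous predictions are fine here: ROC AUC only cares about the ranking
y_scores = grid.best_estimator_.predict(X_val)
print(roc_auc_score(y_val, y_scores))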

Kaggle Competition

I then predicted on the MAILOUT_TEST dataset with this GradientBoostingRegressor. The ROC score computed by Kaggle was

Score after uploading it to Kaggle: 73.33%
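
A sketch of how such a submission can be produced (the preprocessed test DataFrame mailout_test, its LNR id column, and the features list of the 43 selected columns are assumed names; LNR/RESPONSE is the column pair the competition expects):

import pandas as pd

# Write predictions in the LNR / RESPONSE format expected by Kaggle
preds = grid.best_estimator_.predict(mailout_test[features])
submission = pd.DataFrame({'LNR': mailout_test['LNR'], 'RESPONSE': preds})
submission.to_csv('submission.csv', index=False)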

Conclusion

It was great fun doing the Data Science Nanodegree at Udacity, and it was great fun to do this Capstone project. I learned a lot during this project, and the ways in which it can be solved are endless. For this reason, I really like the Data Science field.

  • The first part included data exploration and data preprocessing. This was definitely one of the most difficult parts of the project, because the 366 features were sometimes confusing and a lot of the data was missing. I needed to decide which columns to include in the final model.
  • The Gradient Boosting Regressor turned out to be the best estimator. The parameters were chosen by the GridSearchCV Pipeline after running for approximately 25 minutes on my computer with a relatively old graphics card. My final result of 73.33% is not one of the best, but also not one of the worst results on the Kaggle Leaderboard.

Thank you for reading my way to tackle this task.
