Introduction

This report provides an in-depth analysis of customer purchase behaviors, with a focus on demographic features such as age and marital status. By analyzing customer data, we can uncover patterns and insights that drive marketing strategies and personalize customer experiences.

 

Objectvies

The primary goal of this analysis is to:

  • Conduct Univariate Analysis on customer demographics such as age, marital status, education, etc.
  • Analyze Multivariate relationships, focusing on how different factors like income, age, marital status, etc., relate to spending behaviors.
  • Understand the impact of having children, education level, and campaign interactions on customer spending.

 

Data Overview

Let’s begin by taking a general look at the structure and summary of the data.

csvData <- read_delim("marketing_campaign.csv", delim = "\t")
## Rows: 2240 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr  (3): Education, Marital_Status, Dt_Customer
## dbl (26): ID, Year_Birth, Income, Kidhome, Teenhome, Recency, MntWines, MntF...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
csvData <- csvData %>%
  mutate(Age = as.numeric(format(Sys.Date(), "%Y")) - Year_Birth)

cat("Shape of the DataFrame is:", dim(csvData), "\n\n")
## Shape of the DataFrame is: 2240 30
cat("Summary of the DataFrame:\n")
## Summary of the DataFrame:
str(csvData)
## tibble [2,240 × 30] (S3: tbl_df/tbl/data.frame)
##  $ ID                 : num [1:2240] 5524 2174 4141 6182 5324 ...
##  $ Year_Birth         : num [1:2240] 1957 1954 1965 1984 1981 ...
##  $ Education          : chr [1:2240] "Graduation" "Graduation" "Graduation" "Graduation" ...
##  $ Marital_Status     : chr [1:2240] "Single" "Single" "Together" "Together" ...
##  $ Income             : num [1:2240] 58138 46344 71613 26646 58293 ...
##  $ Kidhome            : num [1:2240] 0 1 0 1 1 0 0 1 1 1 ...
##  $ Teenhome           : num [1:2240] 0 1 0 0 0 1 1 0 0 1 ...
##  $ Dt_Customer        : chr [1:2240] "04-09-2012" "08-03-2014" "21-08-2013" "10-02-2014" ...
##  $ Recency            : num [1:2240] 58 38 26 26 94 16 34 32 19 68 ...
##  $ MntWines           : num [1:2240] 635 11 426 11 173 520 235 76 14 28 ...
##  $ MntFruits          : num [1:2240] 88 1 49 4 43 42 65 10 0 0 ...
##  $ MntMeatProducts    : num [1:2240] 546 6 127 20 118 98 164 56 24 6 ...
##  $ MntFishProducts    : num [1:2240] 172 2 111 10 46 0 50 3 3 1 ...
##  $ MntSweetProducts   : num [1:2240] 88 1 21 3 27 42 49 1 3 1 ...
##  $ MntGoldProds       : num [1:2240] 88 6 42 5 15 14 27 23 2 13 ...
##  $ NumDealsPurchases  : num [1:2240] 3 2 1 2 5 2 4 2 1 1 ...
##  $ NumWebPurchases    : num [1:2240] 8 1 8 2 5 6 7 4 3 1 ...
##  $ NumCatalogPurchases: num [1:2240] 10 1 2 0 3 4 3 0 0 0 ...
##  $ NumStorePurchases  : num [1:2240] 4 2 10 4 6 10 7 4 2 0 ...
##  $ NumWebVisitsMonth  : num [1:2240] 7 5 4 6 5 6 6 8 9 20 ...
##  $ AcceptedCmp3       : num [1:2240] 0 0 0 0 0 0 0 0 0 1 ...
##  $ AcceptedCmp4       : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp5       : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp1       : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp2       : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Complain           : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Z_CostContact      : num [1:2240] 3 3 3 3 3 3 3 3 3 3 ...
##  $ Z_Revenue          : num [1:2240] 11 11 11 11 11 11 11 11 11 11 ...
##  $ Response           : num [1:2240] 1 0 0 0 0 0 0 0 1 0 ...
##  $ Age                : num [1:2240] 66 69 58 39 42 56 52 38 49 73 ...
cat("\nColumns in DataFrame:\n")
## 
## Columns in DataFrame:
print(names(csvData))
##  [1] "ID"                  "Year_Birth"          "Education"          
##  [4] "Marital_Status"      "Income"              "Kidhome"            
##  [7] "Teenhome"            "Dt_Customer"         "Recency"            
## [10] "MntWines"            "MntFruits"           "MntMeatProducts"    
## [13] "MntFishProducts"     "MntSweetProducts"    "MntGoldProds"       
## [16] "NumDealsPurchases"   "NumWebPurchases"     "NumCatalogPurchases"
## [19] "NumStorePurchases"   "NumWebVisitsMonth"   "AcceptedCmp3"       
## [22] "AcceptedCmp4"        "AcceptedCmp5"        "AcceptedCmp1"       
## [25] "AcceptedCmp2"        "Complain"            "Z_CostContact"      
## [28] "Z_Revenue"           "Response"            "Age"

 

Numeric Summary

Now, we focus on summarizing the numeric attributes of the customers in the dataset.

csvData %>% 
  select_if(is.numeric) %>% 
  summary() %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F, position = "left") %>%
  scroll_box(width = "100%", height = "500px")
ID Year_Birth Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response Age
Min. : 0 Min. :1893 Min. : 1730 Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.000000 Min. :3 Min. :11 Min. :0.0000 Min. : 27.00
1st Qu.: 2828 1st Qu.:1959 1st Qu.: 35303 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:24.00 1st Qu.: 23.75 1st Qu.: 1.0 1st Qu.: 16 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 3.00 1st Qu.: 3.000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:3 1st Qu.:11 1st Qu.:0.0000 1st Qu.: 46.00
Median : 5458 Median :1970 Median : 51382 Median :0.0000 Median :0.0000 Median :49.00 Median : 173.50 Median : 8.0 Median : 67 Median : 12.00 Median : 8.00 Median : 24.00 Median : 2.000 Median : 4.000 Median : 2.000 Median : 5.00 Median : 6.000 Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000 Median :0.000000 Median :3 Median :11 Median :0.0000 Median : 53.00
Mean : 5592 Mean :1969 Mean : 52247 Mean :0.4442 Mean :0.5062 Mean :49.11 Mean : 303.94 Mean : 26.3 Mean : 167 Mean : 37.53 Mean : 27.06 Mean : 44.02 Mean : 2.325 Mean : 4.085 Mean : 2.662 Mean : 5.79 Mean : 5.317 Mean :0.07277 Mean :0.07455 Mean :0.07277 Mean :0.06429 Mean :0.01339 Mean :0.009375 Mean :3 Mean :11 Mean :0.1491 Mean : 54.19
3rd Qu.: 8428 3rd Qu.:1977 3rd Qu.: 68522 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:74.00 3rd Qu.: 504.25 3rd Qu.: 33.0 3rd Qu.: 232 3rd Qu.: 50.00 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 3.000 3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 8.00 3rd Qu.: 7.000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:3 3rd Qu.:11 3rd Qu.:0.0000 3rd Qu.: 64.00
Max. :11191 Max. :1996 Max. :666666 Max. :2.0000 Max. :2.0000 Max. :99.00 Max. :1493.00 Max. :199.0 Max. :1725 Max. :259.00 Max. :263.00 Max. :362.00 Max. :15.000 Max. :27.000 Max. :28.000 Max. :13.00 Max. :20.000 Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.000000 Max. :3 Max. :11 Max. :1.0000 Max. :130.00
NA NA NA’s :24 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

 

Missing Data Analysis

Understanding missing data is crucial. It helps us to decide whether we can ignore or we need to handle these missing values.

suppressPackageStartupMessages(library(scales))

missing_data <- function(data) {
  total <- sum(is.na(data))
  percentage <- mean(is.na(data)) * 100
  tibble(Total = total, Percentage = percentage)
}

missing_values <- csvData %>%
  summarise_all(~sum(is.na(.))) %>%
  gather(key = "Variable", value = "Total") %>%
  mutate(Percentage = Total / nrow(csvData) * 100)

missing_values %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F, position = "left") %>%
  scroll_box(width = "40%", height = "500px")
Variable Total Percentage
ID 0 0.000000
Year_Birth 0 0.000000
Education 0 0.000000
Marital_Status 0 0.000000
Income 24 1.071429
Kidhome 0 0.000000
Teenhome 0 0.000000
Dt_Customer 0 0.000000
Recency 0 0.000000
MntWines 0 0.000000
MntFruits 0 0.000000
MntMeatProducts 0 0.000000
MntFishProducts 0 0.000000
MntSweetProducts 0 0.000000
MntGoldProds 0 0.000000
NumDealsPurchases 0 0.000000
NumWebPurchases 0 0.000000
NumCatalogPurchases 0 0.000000
NumStorePurchases 0 0.000000
NumWebVisitsMonth 0 0.000000
AcceptedCmp3 0 0.000000
AcceptedCmp4 0 0.000000
AcceptedCmp5 0 0.000000
AcceptedCmp1 0 0.000000
AcceptedCmp2 0 0.000000
Complain 0 0.000000
Z_CostContact 0 0.000000
Z_Revenue 0 0.000000
Response 0 0.000000
Age 0 0.000000
ggplot(missing_values, aes(x = reorder(Variable, -Total), y = Total)) +
  geom_bar(stat = "identity", fill = "grey") +
  scale_y_continuous(labels = comma) +
  labs(x = "Variable", y = "Total Missing Values", title = "Missing Data in Each Column") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),  panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank(),
    plot.title = element_text(hjust = 0.5))

 

Income Data Handling

Here we handle missing data within the ‘Income’ column by imputing missing values with the median income.

names(csvData) <- trimws(names(csvData))

median_income <- median(csvData$Income, na.rm = TRUE)
  
print(median_income)
## [1] 51381.5
csvData$Income <- ifelse(is.na(csvData$Income), median_income, csvData$Income)

any_na_present <- anyNA(csvData)
print(any_na_present)
## [1] FALSE

 

Duplicate Records Analysis

Before diving deep into our dataset, it’s crucial to identify and handle duplicate entries. Duplicates can lead to skewed analysis results by over-representing certain data points.

duplicate_rows <- csvData[duplicated(csvData), ]

print(duplicate_rows)
## # A tibble: 0 × 30
## # ℹ 30 variables: ID <dbl>, Year_Birth <dbl>, Education <chr>,
## #   Marital_Status <chr>, Income <dbl>, Kidhome <dbl>, Teenhome <dbl>,
## #   Dt_Customer <chr>, Recency <dbl>, MntWines <dbl>, MntFruits <dbl>,
## #   MntMeatProducts <dbl>, MntFishProducts <dbl>, MntSweetProducts <dbl>,
## #   MntGoldProds <dbl>, NumDealsPurchases <dbl>, NumWebPurchases <dbl>,
## #   NumCatalogPurchases <dbl>, NumStorePurchases <dbl>,
## #   NumWebVisitsMonth <dbl>, AcceptedCmp3 <dbl>, AcceptedCmp4 <dbl>, …
num_duplicate_rows <- nrow(duplicate_rows)
cat("Number of duplicate rows: ", num_duplicate_rows, "\n")
## Number of duplicate rows:  0
num_unique_values <- sapply(csvData, function(x) length(unique(x)))

unique_values_df <- as.data.frame(num_unique_values)

names(unique_values_df) <- c("UniqueValues")

print(unique_values_df)
##                     UniqueValues
## ID                          2240
## Year_Birth                    59
## Education                      5
## Marital_Status                 8
## Income                      1975
## Kidhome                        3
## Teenhome                       3
## Dt_Customer                  663
## Recency                      100
## MntWines                     776
## MntFruits                    158
## MntMeatProducts              558
## MntFishProducts              182
## MntSweetProducts             177
## MntGoldProds                 213
## NumDealsPurchases             15
## NumWebPurchases               15
## NumCatalogPurchases           14
## NumStorePurchases             14
## NumWebVisitsMonth             16
## AcceptedCmp3                   2
## AcceptedCmp4                   2
## AcceptedCmp5                   2
## AcceptedCmp1                   2
## AcceptedCmp2                   2
## Complain                       2
## Z_CostContact                  1
## Z_Revenue                      1
## Response                       2
## Age                           59

 

Initial Data Overview

After cleaning the dataset, let’s take a look at the first few entries. This will provide a snapshot of the data we’re working with and help verify the structure and key fields after preprocessing steps.

csvData <- csvData %>% select(-Z_CostContact, -Z_Revenue)

head(csvData, 3) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F, position = "left") %>%
  scroll_box(width = "100%", height = "170")
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age
5524 1957 Graduation Single 58138 0 0 04-09-2012 58 635 88 546 172 88 88 3 8 10 4 7 0 0 0 0 0 0 1 66
2174 1954 Graduation Single 46344 1 1 08-03-2014 38 11 1 6 2 1 6 2 1 1 2 5 0 0 0 0 0 0 0 69
4141 1965 Graduation Together 71613 0 0 21-08-2013 26 426 49 127 111 21 42 1 8 2 10 4 0 0 0 0 0 0 0 58

 

Univariate Analysis

 

Age Distribution

suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(grid))

p1 <- ggplot(csvData, aes(y = Age)) + 
  geom_boxplot(fill = "skyblue") + 
  labs(title = "Age Distribution with Outliers", y = "Age") + 
  theme_minimal()+
  theme(
     panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank()
  )

csvData <- csvData %>% filter(Age <= 100)

p2 <- ggplot(csvData, aes(y = Age)) + 
  geom_boxplot(fill = "lightgreen") + 
  labs(title = "Age Distribution without Outliers", y = "Age") + 
  theme_minimal() + 
  theme (
     panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank()
  )

grid.arrange(p1, p2, ncol = 2)

 

Now, let’s calculate the age of each customer and create a bar plot that showcases the current overall distribution.

summary_without_outliers <- summary(csvData %>% filter(Age <= 100) %>% .$Age)

print(summary_without_outliers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    27.0    46.0    53.0    54.1    64.0    83.0
age_bins <- seq(from = floor(min(csvData$Age)/5)*5, to = ceiling(max(csvData$Age)/5)*5, by = 5)

csvData <- csvData %>%
  mutate(Age_Group = cut(Age, breaks = age_bins, include.lowest = TRUE, right = FALSE, labels = paste(age_bins[-length(age_bins)], age_bins[-1]-1, sep="-")))

age_distribution_grouped <- csvData %>%
  count(Age_Group) %>%
  arrange(Age_Group) 

ggplot(age_distribution_grouped, aes(x = Age_Group, y = n)) +
  geom_bar(aes(fill = n), stat = "identity", width = 0.7) +
  geom_text(aes(label = n), position = position_stack(vjust = 0.5), size = 3.5, color = "white") +
  labs(title = "Customer Age Distribution",
       x = "Age Group",
       y = "Count of Individuals",
       fill = "Frequency") +
  theme_minimal() +
  theme(plot.title = element_text(size = 20, face = "bold", hjust = 0.5),
        axis.title.x = element_text(size = 12, face = "bold"),
        axis.title.y = element_text(size = 12, face = "bold"),
        axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        legend.title = element_text(size = 12, face = "bold")) +
  scale_fill_gradient(low = "lightsteelblue", high = "blueviolet", name = "Frequency") 

 

Observation: The customer base presents a mature age profile with the median age at 53, indicating that the majority of customers are likely to be in a well-established stage of life. The age range is broad, extending from 27 to 83 years, with a concentration between 46 and 64, suggesting that products and services should be targeted towards the needs of middle-aged to senior adults.

 

Marital Status

csvData <- csvData %>%
  mutate(Marital_Status = case_when(
    Marital_Status %in% c('Married', 'Together') ~ 'Relationship',
    Marital_Status %in% c('Divorced', 'Widow', 'Alone', 'YOLO', 'Absurd') ~ 'Single',
    TRUE ~ Marital_Status
  ))

pie_chart <- csvData %>%
  count(Marital_Status) %>%
  mutate(Labels = paste0(Marital_Status, "\n", scales::percent(n/sum(n)))) %>%
  ggplot(aes(x = "", y = n, fill = Marital_Status, label = Labels)) +
  geom_bar(width = 1, stat = "identity") +
  geom_text(aes(label = Labels), position = position_stack(vjust = 0.5)) +
  coord_polar("y", start = 0) +
  labs(title = "Proportion by Marital Status", x = NULL, y = NULL) +
  theme_void() +
  scale_fill_manual(values = colorRampPalette(c("lightsteelblue", "blueviolet"))(length(unique(csvData$Marital_Status)))) +
  theme(legend.position = "none",  panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank(),
    plot.title = element_text(hjust = 0.5)

    )


grid.arrange(pie_chart, nrow = 1)

 

Observation: Customers are predominantly in a relationship, with 65% indicating a marital status of ‘Married’ or ‘Together’. This suggests a customer base that may have considerations for family or partners in their purchasing decisions. In contrast, the single customers, including ‘Divorced’, ‘Widow’, ‘Alone’, ‘YOLO’, and ‘Absurd’, make up 35%, which may indicate a significant market for individual-centered products and services

 

Kidhome & Teenhome

KidHome & TeenHome

children_home_summary <- csvData %>%
  mutate(Children_at_Home = if_else(Kidhome + Teenhome > 0, "Yes", "No")) %>%
  count(Children_at_Home) %>%
  mutate(Percentage = n / sum(n) * 100) 

children_home_chart <- ggplot(children_home_summary, aes(x = Children_at_Home, y = n, fill = Children_at_Home)) +
  geom_bar(stat = "identity") +
  labs(title = "Customers with Children at Home", x = "Children at Home", y = "Number of Customers") +
  theme_minimal() +
  scale_fill_manual(values = c("Yes" = "lightblue", "No" = "salmon")) +
  theme(legend.position = "none", 
        panel.grid.major = element_blank(),  
        panel.grid.minor = element_blank(),
        plot.title = element_text(hjust = 0.5))

print(as.data.frame(children_home_summary), row.names = FALSE)
##  Children_at_Home    n Percentage
##                No  637   28.47564
##               Yes 1600   71.52436
grid.arrange(children_home_chart) 

csvData <- csvData %>%
  mutate(TotalKids = Kidhome + Teenhome)

kids_freq_summary <- csvData %>%
  count(TotalKids) %>%
  mutate(Percentage = n / sum(n) * 100) 

kids_freq_chart <- ggplot(kids_freq_summary, aes(x = as.factor(TotalKids), y = n, fill = as.factor(TotalKids))) +
  geom_bar(stat = "identity") +
  scale_fill_brewer(palette = "Set3", direction = -1) + 
  labs(title = "Number of Children in Customers' Households",
       x = "Number of Children",
       y = "Number of Customers",
       fill = "Number of Children") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold"),
    axis.title = element_text(size = 14),
    axis.text.x = element_text(angle = 0, hjust = 1),
    panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank(),
    legend.position = "none"
  )



print(as.data.frame(kids_freq_summary), row.names = FALSE)
##  TotalKids    n Percentage
##          0  637  28.475637
##          1 1126  50.335270
##          2  421  18.819848
##          3   53   2.369245

 

Observation: The majority of customers (71.5%) have children at home, with half of these families having one child (50.3%). Families with two or more children represent a smaller portion of the market, indicating a higher concentration of smaller families within the customer base.

 

Education

Education

education_chart <- csvData %>%
  count(Education) %>%
  ggplot(aes(x = reorder(Education, -n), y = n, fill = Education)) + 
  geom_bar(stat = "identity") +
  labs(title = "Customer Education Levels", x = "Education Level", y = "Number of Customers") +
  theme_minimal() +
  scale_fill_brewer(palette = "Pastel1") +
  theme(plot.title = element_text(hjust = 0.5), 
         panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank(),
        axis.text.x = element_text(angle = 45, hjust = 1))

print(education_chart)

education_summary <- csvData %>%
  count(Education) %>%
  mutate(Percentage = n / sum(n) * 100)

print(as.data.frame(education_summary), row.names = FALSE)
##   Education    n Percentage
##    2n Cycle  201   8.985248
##       Basic   54   2.413947
##  Graduation 1127  50.379973
##      Master  370  16.540009
##         PhD  485  21.680823

 

Observation: The education level among customers is predominantly ‘Graduation’ at over 50%, indicating that the majority have completed a degree equivalent to a college education. Postgraduate degrees (Masters and PhD) are also significant, accounting for approximately 38% combined, which suggests a well-educated customer base. Only a small fraction have ‘Basic’ education at 2.4%, and ‘2n Cycle’ represents just under 9%  

Income

Income

common_theme <- theme(
  axis.title.x = element_blank(),
  axis.text.x = element_blank(),
  axis.ticks.x = element_blank(),
  axis.text.y = element_text(size = 15),  
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  panel.border = element_blank(),  
  panel.background = element_blank(),  
  plot.background = element_blank() ,
  plot.title = element_text(hjust = 0.5)

)

bp_before <- ggplot(csvData, aes(y = Income)) + 
  geom_boxplot(fill = "lightcoral", color = "black", width = 0.7) + 
  theme_minimal() +
  common_theme + 
  ylab("Income") + 
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Before Outliers") 

csvData <- csvData %>% 
  filter(Income < 120000)

bp_after <- ggplot(csvData, aes(y = Income)) +
  geom_boxplot(fill = "lightskyblue", color = "black", width = 0.7) +  
  theme_minimal() +
  common_theme +
  labs(y = "Income", title = "After Outliers")  

grid.arrange(bp_before, bp_after, ncol = 2)

ggplot(csvData, aes(x = Income)) +
  geom_density(fill = "blueviolet", alpha = 0.5) +
  labs(x = "Income", title = "Income Distribution Density") +
  theme_light() +
  theme(text = element_text(size = 15) , 
    panel.border = element_blank(),  
    panel.background = element_blank(),
    panel.grid.major = element_blank(), 
    plot.title = element_text(hjust = 0.5),
    panel.grid.minor = element_blank())

income_summary <- csvData %>%
  mutate(Income_Range = cut(Income, breaks = seq(0, 120000, by = 20000), include.lowest = TRUE)) %>%
  count(Income_Range) %>%
  mutate(Percentage = n / sum(n) * 100)

print(as.data.frame(income_summary), row.names = FALSE)
##     Income_Range   n Percentage
##        [0,2e+04] 127  5.6976223
##    (2e+04,4e+04] 604 27.0973531
##    (4e+04,6e+04] 667 29.9237326
##    (6e+04,8e+04] 623 27.9497533
##    (8e+04,1e+05] 203  9.1072230
##  (1e+05,1.2e+05]   5  0.2243158

 

Observation: The majority of customers have incomes between 40,000 to 80,000, encompassing approximately 58% of the customer base, indicative of a strong middle-class presence. The 20,000 to 40,000 income range is also significant, accounting for 27% of customers. Notably, high earners with incomes between 80,000 to 100,000 constitute 9% of the customer base, while those earning over 100,000 are relatively rare at 0.22%. This suggests that luxury or high-end products may cater to a smaller segment of the market.

 

Expenses

Expenses

suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(reshape2))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(scales))

data_long <- melt(csvData, id.vars = "ID", 
                  measure.vars = c("MntWines", "MntFruits", "MntMeatProducts", 
                                   "MntFishProducts", "MntSweetProducts", "MntGoldProds"),
                  variable.name = "Category", value.name = "Expenses")

category_summary <- data_long %>%
  group_by(Category) %>%
  summarise(Total_Expenses = sum(Expenses), .groups = 'drop') %>%
  mutate(Percentage = (Total_Expenses / sum(Total_Expenses)) * 100)

  print(as.data.frame(category_summary), row.names = FALSE)
##          Category Total_Expenses Percentage
##          MntWines         679826  50.366771
##         MntFruits          58731   4.351247
##   MntMeatProducts         368418  27.295257
##   MntFishProducts          83905   6.216332
##  MntSweetProducts          60543   4.485494
##      MntGoldProds          98328   7.284899
p <- ggplot(category_summary, aes(x = Category, y = Total_Expenses, fill = Category)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_x_discrete() +
  scale_y_continuous(labels = label_number(accuracy = 0.1)) +
  labs(x = "Product Category", y = "Total Expenses") +
  theme_minimal() +
  theme(
    text = element_text(size = 12), 
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), 
    axis.title.y = element_text(vjust = 2),
    axis.text.y = element_text(angle = 0),
    panel.grid.major = element_blank(), 
      panel.grid.minor = element_blank(),
  legend.position = "none"

  )
print(p) 

 

Observation: Wines are the leading category in customer expenses, accounting for over 50% of the total, which may suggest a customer base with a strong preference for this product. Meat products also hold a significant share at 27%, while other categories like fruits, fish, and sweets each represent less than 7% of the total expenses.

 

Campaigns

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))


csvData <- csvData %>%
  mutate(TotalAcceptedCmp = AcceptedCmp1 + AcceptedCmp2 + AcceptedCmp3 + 
                            AcceptedCmp4 + AcceptedCmp5 + Response)

accepted_freq <- csvData %>%
  count(TotalAcceptedCmp) %>%
  mutate(Percentage = n / sum(n) * 100)  

print(as.data.frame(accepted_freq), row.names = FALSE)
##  TotalAcceptedCmp    n Percentage
##                 0 1621 72.7231943
##                 1  369 16.5545087
##                 2  142  6.3705698
##                 3   51  2.2880215
##                 4   36  1.6150740
##                 5   10  0.4486317
ggplot(accepted_freq, aes(x = as.factor(TotalAcceptedCmp), y = n, fill = as.factor(TotalAcceptedCmp))) +
  geom_bar(stat = "identity") +
  scale_fill_viridis_d(direction = -1) +  
  labs(title = "Frequency of Accepted Offers per Customer",
       x = "Total Number of Accepted Offers",
       y = "Frequency") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust=0.5),
    axis.title = element_text(size = 12),
    panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank(),
    axis.text.x = element_text(hjust = 1),
    legend.position = "none"
  )

 

Observation: A vast majority of customers (approximately 73%) did not accept any campaign offers, which could indicate a challenge in the effectiveness of the campaigns or a generally low propensity to respond. However, there is a small segment (about 17%) that engaged with one campaign, potentially representing a more responsive or interested customer group.

 

Purchase Places

purchase_data_long <- csvData %>%
  select(NumWebPurchases, NumCatalogPurchases, NumStorePurchases) %>%
  pivot_longer(
    cols = everything(),
    names_to = "PurchasePlace",
    values_to = "NumberOfPurchases"
  ) %>%
  group_by(PurchasePlace) %>%
  summarise(TotalPurchases = sum(NumberOfPurchases), .groups = 'drop') %>%
  mutate(Percentage = TotalPurchases / sum(TotalPurchases) * 100)  

print(as.data.frame(purchase_data_long), row.names = FALSE)
##        PurchasePlace TotalPurchases Percentage
##  NumCatalogPurchases           5877   21.01030
##    NumStorePurchases          12956   46.31775
##      NumWebPurchases           9139   32.67196
ggplot(purchase_data_long, aes(x = PurchasePlace, y = TotalPurchases, fill = PurchasePlace)) +
  geom_bar(stat = "identity") +
  scale_fill_brewer(palette = "Dark2") + 
  labs(title = "Purchases Made From Different Places",
       x = "",
       y = "Total Number of Purchases") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust= 0.5),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 12),
    axis.text.x = element_text(angle = 0, vjust = 1),
    panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank(),

    legend.position = "none" 
  )

 

Observation: Store purchases dominate the shopping venues, with nearly half of the purchases (46%) being made in-store. Web purchases account for roughly a third of the transactions, suggesting a significant online engagement, while catalog purchases are the least preferred method at 21%. This could indicate that despite the rise of digital platforms, physical stores remain a crucial point of sale.

 

Multivariate Analysis

 

Income vs Spendings

  csvData <- csvData %>%
    mutate(Total_Spendings = MntWines + MntFruits + MntMeatProducts + MntFishProducts + MntSweetProducts + MntGoldProds)
  
  
  correlation_income_spendings <- cor(csvData$Income, csvData$Total_Spendings, use = "complete.obs")
  cat("Correlation between Income and Total Spendings:", correlation_income_spendings, "\n")
## Correlation between Income and Total Spendings: 0.8202215
  p_income_spendings <- ggplot(csvData, aes(x = Income, y = Total_Spendings)) +
    geom_point(color = "blueviolet", alpha = 0.5) +  
    labs(
      title = "Income vs Spendings",
      x = "Income",
      y = "Spendings"
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(size = 22, face = "bold", hjust = 0.5),
      axis.title.x = element_text(size = 18),
      axis.title.y = element_text(size = 18),
      axis.text = element_text(size = 12),
      panel.grid.major = element_blank(), 
      panel.grid.minor = element_blank(),
      axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
      legend.position = "none"
    ) +
    scale_x_continuous(labels = dollar_format()) +
    scale_y_continuous(labels = dollar_format())
  
  print(p_income_spendings)

 

Observation: There is a strong positive correlation (0.82) between income and spending, suggesting that customers with higher incomes tend to spend more, which is indicative of significant purchasing power and could inform targeted marketing strategies.

 

Age vs Spendings

  correlation_age_spendings <- cor(csvData$Age, csvData$Total_Spendings, use = "complete.obs")
  cat("Correlation between Age and Total Spendings:", correlation_age_spendings, "\n")
## Correlation between Age and Total Spendings: 0.1160903
  p_age_spendings <- ggplot(csvData, aes(x = Age, y = Total_Spendings)) +
    geom_point(aes(color = Age), alpha = 0.6) + 
    geom_smooth(method = "lm", se = FALSE, color = "blue") +  
    scale_color_gradient(low = "yellow", high = "red") + 
    labs(
      title = "Age vs. Total Spendings",
      x = "Age (Years)",
      y = "Total Spendings (Last 2 Years)"
    ) +
    theme_minimal() +
    theme(
      panel.grid.major = element_blank(), 
      panel.grid.minor = element_blank(),
      plot.title = element_text(size = 22, face = "bold", hjust = 0.5),
      axis.title = element_text(size = 16),
      legend.position = "none"
    )
  
  print(p_age_spendings)

 

Observation: The correlation (0.12) between age and spending is relatively weak, indicating that age is not a strong predictor of spending among customers, and marketing efforts may be better focused on other demographic factors.

 

Martial Status vs Spendings

avg_spendings_by_marital <- csvData %>%
  group_by(Marital_Status) %>%
  summarise(Average_Spendings = mean(Total_Spendings, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(Marital_Status = factor(Marital_Status, levels = Marital_Status[order(-Average_Spendings)])) 

 print(as.data.frame(avg_spendings_by_marital), row.names = FALSE)
##  Marital_Status Average_Spendings
##    Relationship          596.2214
##          Single          622.4174
  ggplot(avg_spendings_by_marital, aes(x = Marital_Status, y = Average_Spendings, fill = Marital_Status)) +
    geom_bar(stat = "identity") +
    scale_fill_brewer(palette = "Paired") + 
    labs(title = "Marital Status vs Expenses",
       x = "Marital Status",
       y = "Average Expenses") +
    theme_minimal() +
    theme(
       panel.grid.major = element_blank(),  
      panel.grid.minor = element_blank(),
      plot.title = element_text(size = 14, face = "bold", hjust=0.5),
      axis.title.x = element_text(size = 12),
      axis.title.y = element_text(size = 12),
      axis.text.x = element_text(angle = 45, hjust = 1),
      legend.position = "none"  
    ) +
    scale_y_continuous(labels = scales::dollar) 

 

Observation: Interestingly, individuals with a status of ‘Single’ exhibit higher average spending (approximately $622) compared to those in a Relationship (nearly $596). This could suggest that single customers have more discretionary spending or different purchasing habits that favor more spending on the categories considered, perhaps due to fewer household obligations or different lifestyle choices.

 

How Having Kids Effect Expenses

csvData <- csvData %>%
  mutate(TotalKids = factor(Kidhome + Teenhome))

avg_spendings_by_kids <- csvData %>%
  group_by(TotalKids) %>%
  summarise(Average_Spendings = mean(Total_Spendings, na.rm = TRUE)) %>%
  ungroup() %>%
  arrange(desc(Average_Spendings))

 print(as.data.frame(avg_spendings_by_kids), row.names = FALSE)
##  TotalKids Average_Spendings
##          0         1106.3712
##          1          473.2208
##          3          274.6038
##          2          246.2786
ggplot(avg_spendings_by_kids, aes(x = TotalKids, y = Average_Spendings, fill = TotalKids)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  scale_fill_brewer(palette = "Spectral") + 
  labs(title = "Average Spendings by Number of Kids in Household",
       x = "Total Number of Kids",
       y = "Average Spendings") +
  theme_minimal() +
  theme(
     panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank(),
    plot.title = element_text(size = 14, face = "bold", hjust =0.5),
    axis.title.x = element_text(size = 12),
    axis.title.y = element_text(size = 12),
    axis.text.x = element_text(angle = 0, hjust = 1),
    legend.position = "none"  

  ) +
  scale_y_continuous(labels = scales::dollar_format()) 

 

Observation: Customers without children lead with the highest average spending, over $1100, potentially reflecting greater disposable income and freedom to spend on personal indulgences. The presence of children correlates with a marked decrease in spending, where those with one child spend about $473, and it further declines as the number of children increases. This highlights a clear opportunity for child-centric marketing strategies and the potential to cater to the distinct needs of childless households.

 

Education vs Income & Expenses

education_summary <- csvData %>%
  group_by(Education) %>%
  summarise(Average_Income = mean(Income, na.rm = TRUE),
            Average_Expenses = mean(Total_Spendings, na.rm = TRUE)) %>%
  ungroup()

print(as.data.frame(education_summary), row.names = FALSE)
##   Education Average_Income Average_Expenses
##    2n Cycle       47681.40         501.0348
##       Basic       20306.26          81.7963
##  Graduation       51978.11         619.9537
##      Master       52612.67         613.2791
##         PhD       55180.67         668.3950
education_income_plot <- ggplot(csvData, aes(x = Education, y = Income, color = Education)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(title = "Income by Education Level", x = "", y = "Income") +
  theme_minimal() +
  theme(legend.position = "none", panel.grid.major = element_blank(), panel.grid.minor = element_blank(), plot.title = element_text(hjust = 0.5))

education_expenses_plot <- ggplot(csvData, aes(x = Education, y = Total_Spendings, color = Education)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(title = "Expenses by Education Level", x = "" , y = "Total Expenses") +
  theme_minimal() +
  theme(legend.position = "none", panel.grid.major = element_blank(), panel.grid.minor = element_blank(), plot.title = element_text(hjust = 0.5))

grid.arrange(education_income_plot, education_expenses_plot, ncol = 2)

 

Observation: Individuals with a PhD not only exhibit the highest average income at approximately $55,181 but also the greatest average expenses, around $668, suggesting a correlation between educational attainment and spending capacity. In contrast, those with basic education have significantly lower average income and expenses, indicating that educational level is a strong predictor of economic behavior and potential in the marketplace.

 

Campaigns vs Expenses

campaign_data_long <- csvData %>%
  select(Total_Spendings, starts_with("AcceptedCmp"), Response) %>%
  pivot_longer(cols = starts_with("AcceptedCmp"), names_to = "Campaign", values_to = "Accepted") %>%
  mutate(Campaign = factor(Campaign, levels = c("AcceptedCmp1", "AcceptedCmp2", 
                                                "AcceptedCmp3", "AcceptedCmp4", 
                                                "AcceptedCmp5", "Response")))

campaign_summary <- campaign_data_long %>%
  group_by(Campaign, Accepted) %>%
  summarise(Average_Expenses = mean(Total_Spendings, na.rm = TRUE)) %>%
  ungroup()

print(as.data.frame(campaign_summary), row.names = FALSE)
##      Campaign Accepted Average_Expenses
##  AcceptedCmp1        0         544.9933
##  AcceptedCmp1        1        1482.2222
##  AcceptedCmp2        0         595.9623
##  AcceptedCmp2        1        1307.6667
##  AcceptedCmp3        0         596.4681
##  AcceptedCmp3        1         720.5399
##  AcceptedCmp4        0         562.0024
##  AcceptedCmp4        1        1143.1257
##  AcceptedCmp5        0         526.4528
##  AcceptedCmp5        1        1614.6481
campaign_expenses_plot <- ggplot(campaign_data_long, aes(x = Campaign, y = Total_Spendings, fill = as.factor(Accepted))) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Effect of Campaign Acceptance on Customer Expenses", x = "", y = "Total Expenses", fill = "Accepted Offer") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5),  
        plot.title = element_text(size = 14, face = "bold", hjust=0.5),
        legend.title = element_text(size = 12),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 12)) +
    scale_x_discrete()  


print(campaign_expenses_plot)

 

Observation: The data reveals a striking insight: customers who accepted campaign offers, on average, spent markedly more than those who did not. For instance, those accepting the first campaign spent an average of $1,482 compared to $545 for those who did not. This pattern is consistent across campaigns, with AcceptedCmp5 showing the most significant difference — those who accepted this campaign spent over three times more than those who did not. This underscores the effectiveness of marketing campaigns in driving higher expenditure among responsive customers.

 

Age vs Product Type

csvData <- csvData %>%
  mutate(Age_Group = case_when(
    Age >= 0 & Age <= 30   ~ "0-30",
    Age > 30 & Age <= 60   ~ "31-60",
    Age > 60 & Age <= 100  ~ "61-100",
    TRUE                   ~ "Out of range"
  ))

melted_data <- csvData %>%
  select(ID, Age_Group, MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds) %>%
  pivot_longer(
    cols = starts_with("Mnt"),
    names_to = "Product_Type",
    values_to = "Amount_Spent"
  )

age_product_totals <- melted_data %>%
  group_by(Age_Group, Product_Type) %>%
  summarise(Total_Amount_Spent = sum(Amount_Spent), .groups = 'drop')

age_group_totals <- age_product_totals %>%
  group_by(Age_Group) %>%
  summarise(Total_Spent_by_Age_Group = sum(Total_Amount_Spent), .groups = 'drop')

age_product_percentages <- age_product_totals %>%
  left_join(age_group_totals, by = "Age_Group") %>%
  mutate(Percentage = Total_Amount_Spent / Total_Spent_by_Age_Group * 100) %>%
  select(Age_Group, Product_Type, Total_Amount_Spent, Total_Spent_by_Age_Group, Percentage)

print(age_product_percentages, row.names = FALSE)
## # A tibble: 18 × 5
##    Age_Group Product_Type   Total_Amount_Spent Total_Spent_by_Age_G…¹ Percentage
##    <chr>     <chr>                       <dbl>                  <dbl>      <dbl>
##  1 0-30      MntFishProduc…               1406                  14272       9.85
##  2 0-30      MntFruits                     649                  14272       4.55
##  3 0-30      MntGoldProds                 1042                  14272       7.30
##  4 0-30      MntMeatProduc…               5127                  14272      35.9 
##  5 0-30      MntSweetProdu…                691                  14272       4.84
##  6 0-30      MntWines                     5357                  14272      37.5 
##  7 31-60     MntFishProduc…              52515                 834177       6.30
##  8 31-60     MntFruits                   38408                 834177       4.60
##  9 31-60     MntGoldProds                63102                 834177       7.56
## 10 31-60     MntMeatProduc…             231124                 834177      27.7 
## 11 31-60     MntSweetProdu…              39275                 834177       4.71
## 12 31-60     MntWines                   409753                 834177      49.1 
## 13 61-100    MntFishProduc…              29984                 501302       5.98
## 14 61-100    MntFruits                   19674                 501302       3.92
## 15 61-100    MntGoldProds                34184                 501302       6.82
## 16 61-100    MntMeatProduc…             132167                 501302      26.4 
## 17 61-100    MntSweetProdu…              20577                 501302       4.10
## 18 61-100    MntWines                   264716                 501302      52.8 
## # ℹ abbreviated name: ¹​Total_Spent_by_Age_Group
ggplot(melted_data, aes(x = Age_Group, y = Amount_Spent, fill = Product_Type)) +
  geom_bar(stat = "summary", fun = "sum", position = "fill") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Product Spending by Age Group",
       x = "Age Group",
       y = "Percentage of Total Spending",
       fill = "Product Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
         panel.grid.major = element_blank(),  
      panel.grid.minor = element_blank(),
        plot.title = element_text(size = 16, face = "bold",  hjust = 0.5),
        legend.title = element_text(size = 12, face = "bold"),
        legend.position = "bottom") +
  scale_fill_brewer(palette = "Set3")

 

Observation: The youngest age group (0-30) shows a balanced expenditure across different product types, with the highest percentages spent on meat products (35.9%) and wines (37.5%). For the middle-aged group (31-60), there’s a notable preference for wines, which account for nearly half of their spending at 49.1%. Seniors (61-100) also demonstrate a similar pattern, dedicating a significant 52.8% of their spending to wines, highlighting a consistent trend across age groups favoring this category.

 

Martial Status vs Purchase Place

csvData <- csvData %>%
  mutate(Marital_Simplified = if_else(Marital_Status %in% c("Single", "Alone", "Divorced", "Widow"), "Single", "In Relationship"))

marital_online_stats <- csvData %>%
  group_by(Marital_Simplified) %>%
  summarise(Average_Online_Purchases = mean(NumWebPurchases, na.rm = TRUE)) %>%
  ungroup()

print(as.data.frame(marital_online_stats), row.names = FALSE)
##  Marital_Simplified Average_Online_Purchases
##     In Relationship                 4.102368
##              Single                 4.095839
ggplot(csvData, aes(x = Marital_Simplified, y = NumWebPurchases, fill = Marital_Simplified)) +
  geom_boxplot() +
  scale_fill_manual(values = c("Single" = "#1f77b4", "In Relationship" = "#ff7f0e")) +
  labs(title = "Web Purchases by Marital Status",
       x = "Marital Status",
       y = "Number of Web Purchases") +
  theme_minimal() +
  theme(legend.position = "none" , panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank(),
    plot.title = element_text(size = 14, face = "bold",  hjust = 0.5),

    )

 

Observation: When analyzing online purchasing habits, those in a relationship appear to shop marginally more online with an average of 4.10 purchases compared to their single counterparts at 4.09. This slight difference suggests that marital status has minimal impact on the frequency of online purchases, indicating similar digital shopping behaviors between these two demographic segments.

 

Having Kids vs Purchase Place

csvData <- csvData %>%
  mutate(Has_Kids = if_else(Kidhome + Teenhome > 0, "Has Kids", "No Kids"))

kids_online_stats <- csvData %>%
  group_by(Has_Kids) %>%
  summarise(Average_Online_Purchases = mean(NumWebPurchases, na.rm = TRUE)) %>%
  ungroup()

print(as.data.frame(kids_online_stats), row.names = FALSE)
##  Has_Kids Average_Online_Purchases
##  Has Kids                 3.972431
##   No Kids                 4.421801
ggplot(csvData, aes(x = Has_Kids, y = NumWebPurchases, fill = Has_Kids)) +
  geom_boxplot() +
  scale_fill_manual(values = c("No Kids" = "#2ca02c", "Has Kids" = "#d62728")) +
  labs(title = "Web Purchases by Parental Status",
       x = "Parental Status",
       y = "Number of Web Purchases") +
  theme_minimal() +
  theme(legend.position = "none",  panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank(),
    plot.title = element_text(size = 14, face = "bold",  hjust = 0.5),)

 

Observation: Interestingly, individuals without kids tend to make more online purchases (an average of 4.42) compared to those with kids (an average of 3.97). This could suggest that the time constraints and responsibilities of parenthood possibly influence the lower frequency of online shopping among parents.

 

Age Group vs Purchase Place

csvData <- csvData %>%
  mutate(Age = 2023 - Year_Birth,
         Age_Group = case_when(
           Age <= 30 ~ "0-30",
           Age > 30 & Age <= 60 ~ "31-60",
           Age > 60 ~ "61-90",
           TRUE ~ "Unknown"
         ))

age_online_stats <- csvData %>%
  group_by(Age_Group) %>%
  summarise(Average_Online_Purchases = mean(NumWebPurchases, na.rm = TRUE)) %>%
  ungroup()

print(as.data.frame(age_online_stats), row.names = FALSE)
##  Age_Group Average_Online_Purchases
##       0-30                 3.666667
##      31-60                 3.899276
##      61-90                 4.548201
ggplot(csvData, aes(x = Age_Group, y = NumWebPurchases, fill = Age_Group)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel1") +
  labs(title = "Online Purchases by Age Group",
       x = "Age Group",
       y = "Number of Web Purchases") +
  theme_minimal() +
  theme(legend.position = "none",  panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank(),
    plot.title = element_text(size = 14, face = "bold",  hjust = 0.5),

    )

 

Observation: Online purchasing trends increase with age, with the youngest group (0-30) averaging 3.67 online purchases, the middle-aged (31-60) at 3.90, and the older group (61-90) leading at 4.55. This progression indicates a greater inclination towards online shopping as customers age, possibly due to higher disposable income or a preference for the convenience of online platforms.

 

Expenses vs Web Activity

csvData <- csvData %>%
  filter(Total_Spendings >= 0 & NumWebVisitsMonth <= 10)

# Group data by number of web visits per month and calculate the average expenses
web_activity_stats <- csvData %>%
  group_by(NumWebVisitsMonth) %>%
  summarise(Average_Expenses = mean(Total_Spendings, na.rm = TRUE)) %>%
  ungroup()

# Print the data frame without row names
print(as.data.frame(web_activity_stats), row.names = FALSE)
##  NumWebVisitsMonth Average_Expenses
##                  0        1143.1429
##                  1        1251.1275
##                  2        1186.3465
##                  3         973.9463
##                  4         653.3548
##                  5         522.6929
##                  6         486.7168
##                  7         297.1679
##                  8         353.1228
##                  9         288.1205
##                 10         211.6667
# Create the plot with the filtered data
ggplot(csvData, aes(x = NumWebVisitsMonth, y = Total_Spendings, color = NumWebVisitsMonth)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  scale_color_gradient(low = "lightblue", high = "blue") +
  labs(title = "Expenses by Web Activity",
       x = "Number of Web Visits in Last Month",
       y = "Total Expenses") +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    plot.title = element_text(hjust = 0.5)
  )
## `geom_smooth()` using formula = 'y ~ x'

 

Observation: . It is observed that as the number of web visits increases up to 10 times a month, average expenses decrease. Customers with no web visits have the highest average spending of approximately $1,143. Conversely, those with 10 web visits per month have a lower average spending of roughly $212. This trend might suggest that customers who visit the website moderately are spending less on average than those with minimal to no online engagement. The highest average spending is associated with those who have the least online presence, which may imply that non-engaged customers are less exposed to online marketing efforts that could potentially drive higher spending.

 

Complaints vs Expenses

complaints_stats <- csvData %>%
  group_by(Complain) %>%
  summarise(Average_Expenses = mean(Total_Spendings, na.rm = TRUE)) %>%
  ungroup()

print(as.data.frame(complaints_stats), row.names = FALSE)
##  Complain Average_Expenses
##         0         609.5973
##         1         392.0000
ggplot(csvData, aes(x = factor(Complain), y = Total_Spendings, fill = factor(Complain))) +
  geom_boxplot() +
  scale_fill_manual(values = c("0" = "lightblue", "1" = "blueviolet")) +
  labs(title = "Expenses by Complaint Status",
       x = "Complaint in Last 2 Years",
       y = "Total Expenses") +
  theme_minimal() +
  theme(legend.position = "none",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(size = 14, face = "bold",  hjust = 0.5))

 

Observation: Data indicates that customers who have not registered complaints in the last two years spend more on average, with their expenses amounting to $607, compared to $392 for those who have complained. This suggests a potential correlation between customer satisfaction and spending habits, highlighting the importance of addressing customer grievances to maintain higher spending levels.

 

Recency vs Expenses

csvData <- csvData %>%
  mutate(Total_Spendings = rowSums(select(., starts_with("Mnt"))))

recency_expenses_stats <- csvData %>%
  group_by(Recency) %>%
  summarise(Average_Expenses = mean(Total_Spendings, na.rm = TRUE)) %>%
  ungroup()

print(recency_expenses_stats, row.names = FALSE)
## # A tibble: 100 × 2
##    Recency Average_Expenses
##      <dbl>            <dbl>
##  1       0             450.
##  2       1             661 
##  3       2             628.
##  4       3             637.
##  5       4             665.
##  6       5             470.
##  7       6             666.
##  8       7             605.
##  9       8             710.
## 10       9             569.
## # ℹ 90 more rows
ggplot(csvData, aes(x = Recency, y = Total_Spendings)) +
  geom_point(alpha = 0.6, color = "coral") +
  geom_smooth(method = "lm", se = FALSE, color = "darkslategray") +
  labs(title = "Expenses by Recency of Purchase",
       x = "Days Since Last Purchase (Recency)",
       y = "Total Expenses") +
  theme_minimal() +
  theme(
    legend.position = "none",
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    plot.title = element_text(hjust = 0.5)
  )
## `geom_smooth()` using formula = 'y ~ x'

 

Observation: There is a non-linear relationship between the recency of purchases and average expenses. Customers who have made a purchase very recently, like on day 0, have an average spending of $450. Interestingly, spending appears to fluctuate regardless of recency, with some peaks at various intervals — for instance, average spending is quite high at $865 for those who made a purchase 35 days ago. This pattern could suggest sporadic purchasing behavior influenced by factors other than just the time since the last purchase

 

Key Insights

 

  • Age Distribution: A mature customer base with a median age of 53 suggests targeting middle-aged to senior adults.

  • Marital Status: A majority in relationships indicates family-oriented purchasing decisions.

  • Kidhome & Teenhome: A prevalence of smaller families highlights opportunities for child-centric marketing.

  • Education: A highly educated customer base with over 50% having graduated suggests a focus on sophisticated products.

  • Income: A strong middle-class presence with incomes primarily between $40,000 to $80,000 suggests a market for quality, yet affordable products.

  • Expenses: A significant spend on wines and meats points to a customer preference for these categories.

  • Campaigns: Low campaign acceptance rates challenge the effectiveness of marketing strategies.

  • Purchase Places: Stores dominate purchases, but a significant one-third are online, showing the importance of a dual retail approach.

  • Income vs Spendings: High income correlates with higher spending, indicating the potential for luxury marketing.

  • Age vs Spendings: Age is a weaker predictor of spending, suggesting a diversified approach across age groups.

  • Marital Status vs Spendings: Single customers spend slightly more, potentially indicating more disposable income.

  • Having Kids vs Expenses: Childless customers have significantly higher average spending, indicating more discretionary spending.

  • Education vs Income & Expenses: Higher education levels correlate with higher income and spending, underscoring targeted marketing for higher education levels.

  • Campaigns vs Expenses: Customers who engage with campaigns tend to spend more, highlighting the success of targeted marketing campaigns.

  • Age vs Product Type: Preferences for wines across age groups, with younger customers also spending significantly on meats.

  • Marital Status vs Purchase Place: Minimal difference in online purchasing between singles and those in relationships.

  • Having Kids vs Purchase Place: Those without kids shop more online, suggesting the influence of parenting responsibilities.

  • Age Group vs Purchase Place: Older customers tend to make more online purchases, indicating a potential focus on online marketing for older demographics.

  • Expenses vs Web Activity: Higher web activity correlates with lower spending, indicating a more deal-savvy or selective customer.

  • Complaints vs Expenses: Customers who haven’t complained spend more, linking customer satisfaction to spending.

  • Recency vs Expenses: A complex relationship with no clear pattern, suggesting sporadic purchasing influenced by diverse factors.

In conclusion, the study reveals a customer profile that is mature, well-educated, and family-oriented with spending habits that gravitate towards quality products like wines and meats. Campaign responsiveness and customer satisfaction emerge as pivotal factors influencing spending, while web engagement and family dynamics demonstrate nuanced effects on purchasing behavior. These insights are crucial for tailoring marketing strategies to the distinct needs and preferences of the customer base.