library(ggplot2)
library(dplyr)
library(gt) # Table Formatting
df <- ggplot2::diamondsDiamond Dataset EDA by Kyle Maher
Setup
Check for Null Values:
colSums(is.na(df)) %>%
bind_rows() %>%
gt() %>%
tab_options(table.align = "left")| carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Checking Basic Relationships With Price
Carat vs Price
# Carat vs Price
df %>%
ggplot(aes(x = carat, y = price)) +
geom_point(color = "#456882") +
labs(
x = "Carat",
y = "Price ($)",
title = "Price by Carat"
)There appears to be a positive correlation between Carat and Price. There also appears to be more diamonds cut at particular weights. That is, there is not a continuous representation of different carats. Let’s look at this a little closer in a histogram.
# Histogram of Carat
df %>%
ggplot(aes(x = carat)) +
geom_histogram(binwidth = .03, fill = "#456882") +
labs(
x = "Carat",
y = "Count",
title = "Histogram of Carat"
)The histogram makes it a little easier to see that many diamonds appear to be cut to a round number of carats.
Cut vs Price
# Cut vs Price
df %>%
ggplot(aes(x = cut, y = price)) +
geom_boxplot(fill = "#456882") +
labs(
x = "Cut",
y = "Price ($)",
title = "Price Distribution by Cut"
)This is interesting. One would expect higher quality cuts to have a greater price in general. In fact, the ideal cut diamonds appear to have the smallest median price.
Color vs Price
# Color vs Price
df %>%
ggplot(aes(x = color, y = price)) +
geom_boxplot(fill = "#456882") +
labs(
y = "Price ($)",
x = "Color (Clear-Yellow)",
title = "Price Distribution by Color"
)This shows the price generally increaes as the color rating becomes more yellow. Lets look at a bar chart of the mean prices based on diamond color.
df %>%
ggplot(aes(x = color, y = price)) +
geom_bar(stat = "summary", fun = mean, fill = "#456882") +
labs(
y = "Mean Price ($)",
x = "Color (Clear-Yellow)",
title = "Mean Price by Color"
)The bar chart shows that the mean price increases for each color category change except from D to E. Lets see where the price distributions differ by color with a density plot.
df %>%
ggplot(aes(x = price, color = color)) +
geom_density(linewidth = .85) +
labs(
x = "Price ($)",
y = "Density",
color = "Color",
title = "Price Distribution by Color"
)The mean price differences seem to mostly stem from the count of diamonds sold between $0 and $2500.
Clarity vs Price
# Clarity vs price
df %>%
ggplot(aes(x = clarity, y = price)) +
geom_boxplot(fill = "#456882") +
labs(
x = "Clarity (Worse to Best)",
y = "Price ($)",
title = "Price Distribution by Clarity"
)This is also a similar case. As the diamond clarity increases in quality the median price generally decreases.
Investigating Color
Diamond Color Scale Background
Diamonds are ranked from D to Z where D is a colorless diamond and Z is a light yellow diamond.
The intial plots on color showed that the price was increasing as the color became progressively more yellow.
If we had data on very yellow diamonds this might make sense. However, the color scale for the dataset only ranges from D to J, which is a very slight yellow color.
Generally this slight yellow color result in the diamond being considered of a worse quality. Lets do some investigation into the color and see if this differnce in price is statistically significant.
Two Sample T-Test on Diffrence of Means
Test T-Test Assumptions
- Independent Observations
- Random Sampling
- Normality
- Constant Variance
Normality
df %>%
count(color, name = "count") %>%
gt() %>%
tab_header(title = "Row Count by Color") %>%
data_color(
columns = count,
fn = scales::col_numeric(
palette = c("#F9F3EF", "#D2C1B6"),
domain = NULL
)
) %>%
tab_options(table.align = "left")| Row Count by Color | |
| color | count |
|---|---|
| D | 6775 |
| E | 9797 |
| F | 9542 |
| G | 11292 |
| H | 8304 |
| I | 5422 |
| J | 2808 |
Large counts allow for non-normal distributions by Central Limit Theorem.
Variance Assumption
df %>%
ggplot(aes(x = color, y = price)) +
geom_boxplot(fill = "#456882") +
labs(
y = "Price ($)",
x = "Color (Clear-Yellow)",
title = "Price Distribution by Color"
)Since the variance varies by color we’ll use Welch’s T-Test which doesn’t assume a constant variance.
Welch’s T-Test on Successive Color Pairs:
Setup
Choose alpha = .05.
Test
color_pairs <- list(
c("D", "E"),
c("E", "F"),
c("F", "G"),
c("G", "H"),
c("H", "I"),
c("I", "J")
)
color_p_values <- list() # Blank list to store results
for (pair in color_pairs) {
less_yellow <- filter(df, color == pair[1])
more_yellow <- filter(df, color == pair[2])
# One sided T Test
result <- t.test(
less_yellow$price,
more_yellow$price,
alternative = "less"
)
color_p_values[[paste(pair[1], "vs", pair[2])]] <- result$p.value
}
# Table Output
bind_rows(color_p_values) %>%
gt() %>%
tab_header(
title = "T-Test Difference of Means on Color and Price"
) %>%
data_color(
rows = 1,
direction = "row",
fn = scales::col_numeric(
palette = c("#D2C1B6", "#F9F3EF"),
domain = NULL
)
) %>%
tab_options(table.align = "left")| T-Test Difference of Means on Color and Price | |||||
| D vs E | E vs F | F vs G | G vs H | H vs I | I vs J |
|---|---|---|---|---|---|
| 0.9607799 | 1.345527e-36 | 2.283516e-07 | 2.241127e-16 | 1.065141e-14 | 0.01396682 |
Each successive change in color rating results in a statistically significant increase in price, except for the difference between color rating D and E.
Considerations:
There are two posibilities to consider: Either the generalization that clearer diamonds are more valuable than slightly yellow ones is wrong, or there is a confounding parameter.
Lets investigate for a potential confounding parameter between price and color.
Investigating Potential Confounding Parameters
df %>%
ggplot(aes(x = carat, y = price, color = color)) +
geom_point(alpha = 0.5) +
labs(
x = "Carat",
y = "Price ($)",
title = "Price by Carat"
)We saw the relationship between price and carat before, but adding the color makes it possible to see that for each small subset of price the diamonds that weigh more seem to be more yellow in color.
Lets see if each increase in color results in a statisticaly significant increase in weight.
Two Sample T-Test on Diffrence of Means for Color and Carat
df %>%
ggplot(aes(x = color, y = carat)) +
geom_boxplot(fill = "#456882") +
labs(
x = "Color",
y = "Carat",
title = "Carat by Color"
)Variances differ again so we will use Welch’s.
Welch’s T-Test on Successive Color Pairs:
color_pairs <- list(
c("D", "E"),
c("E", "F"),
c("F", "G"),
c("G", "H"),
c("H", "I"),
c("I", "J")
)
color_p_values <- list() # Blank list to store results
for (pair in color_pairs) {
less_yellow <- filter(df, color == pair[1])
more_yellow <- filter(df, color == pair[2])
# One sided T Test
result <- t.test(
less_yellow$carat,
more_yellow$carat,
alternative = "less"
)
color_p_values[[paste(pair[1], "vs", pair[2])]] <- result$p.value
}
# Table Output
bind_rows(color_p_values) %>%
gt() %>%
tab_header(
title = "T-Test Difference of Means for Color and Carat"
) %>%
data_color(
rows = 1,
direction = "row",
fn = scales::col_numeric(
palette = c("#D2C1B6", "#F9F3EF"),
domain = NULL
)
) %>%
tab_options(table.align = "left")| T-Test Difference of Means for Color and Carat | |||||
| D vs E | E vs F | F vs G | G vs H | H vs I | I vs J |
|---|---|---|---|---|---|
| 0.4950059 | 3.284965e-46 | 1.295526e-09 | 2.744791e-87 | 1.971867e-32 | 5.072545e-23 |
Each jump to a more yellow color does result in a difference in the mean diamond weight. Also, the scatterplot from earlier hinted at a correlation between carat and price. This all suggests that Carat may be a confounding variable between color and price.
Linear Regression Between Carat and Price
df %>%
ggplot(aes(x = carat, y = price)) +
geom_point(aes(color = color)) +
geom_smooth(method = "lm") +
labs(
x = "Carat",
y = "Price ($)",
title = "Price by Carat",
color = "Color"
)# Find R Squared
lm_carat <- summary(lm(price ~ carat, data = df))
tibble(Adjusted_R_Squared = lm_carat$adj.r.squared) %>%
gt() %>%
tab_options(table.align = "left")| Adjusted_R_Squared |
|---|
| 0.8493277 |
With a basic linear model 85% of the variation in price can be explained by the carat of the diamond.
Adding Color
lm_carat_color <- summary(lm(price ~ carat + color, data = df))
tibble(Adjusted_R_Squared = lm_carat_color$adj.r.squared) %>%
gt() %>%
tab_options(table.align = "left")| Adjusted_R_Squared |
|---|
| 0.8639443 |
Adding color improves the correlaiton by about 1%.
Pearson and Spearman Correlations
Pearson gives the extent of linear correlation. Spearman uses ranking and can detect non-linear correlations.
# Pearsons
pearson <- cor.test(
df$price,
as.numeric(df$color),
method = "pearson"
)
pearson$estimate %>%
bind_rows() %>%
gt() %>%
tab_header(
title = "Color vs Price",
subtitle = "Pearson's Correlation"
) %>%
tab_options(table.align = "left")| Color vs Price |
| Pearson's Correlation |
| cor |
|---|
| 0.1725109 |
Color can explain only 17% of the variation in price in a linear relationship.
# Spearmans
spearman <- cor.test(
df$price,
as.numeric(df$color),
method = "spearman"
)Warning in cor.test.default(df$price, as.numeric(df$color), method =
"spearman"): Cannot compute exact p-value with ties
spearman$estimate %>%
bind_rows() %>%
gt() %>%
tab_header(
title = "Color vs Price",
subtitle = "Spearman's Correlation"
) %>%
tab_options(table.align = "left")| Color vs Price |
| Spearman's Correlation |
| rho |
|---|
| 0.1501422 |
That correlaiton does not increase when considering non-linear correlations (spearmans).
Conclusions
While the diamond color does predict about 17% of the variation in price, much of this is likely due to the confounding parameter carat which explains about 85% of the variation in price.
In general diamonds that weigh more, cost more, regardless of the color of the diamond (when considering diamond colors between D and J).