Diamond Dataset EDA by Kyle Maher

Setup

library(ggplot2)
library(dplyr)
library(gt)   # Table Formatting

df <- ggplot2::diamonds

Check for Null Values:

colSums(is.na(df)) %>%
  bind_rows() %>%
  gt() %>%
  tab_options(table.align = "left")

carat	cut	color	clarity	depth	table	price	x	y	z
0	0	0	0	0	0	0	0	0	0

Checking Basic Relationships With Price

Carat vs Price

# Carat vs Price
df %>%
  ggplot(aes(x = carat, y = price)) +
  geom_point(color = "#456882") +
  labs(
    x = "Carat",
    y = "Price ($)",
    title = "Price by Carat"
  )

There appears to be a positive correlation between Carat and Price. There also appears to be more diamonds cut at particular weights. That is, there is not a continuous representation of different carats. Let’s look at this a little closer in a histogram.

# Histogram of Carat
df %>%
  ggplot(aes(x = carat)) +
  geom_histogram(binwidth = .03, fill = "#456882") +
  labs(
    x = "Carat",
    y = "Count",
    title = "Histogram of Carat"
  )

The histogram makes it a little easier to see that many diamonds appear to be cut to a round number of carats.

Cut vs Price

# Cut vs Price
df %>%
  ggplot(aes(x = cut, y = price)) +
  geom_boxplot(fill = "#456882") + 
  labs(
    x = "Cut",
    y = "Price ($)",
    title = "Price Distribution by Cut"
  )

This is interesting. One would expect higher quality cuts to have a greater price in general. In fact, the ideal cut diamonds appear to have the smallest median price.

Color vs Price

# Color vs Price
df %>%
  ggplot(aes(x = color, y = price)) +
  geom_boxplot(fill = "#456882") +
  labs(
    y = "Price ($)", 
    x = "Color (Clear-Yellow)",
    title = "Price Distribution by Color"
  )

This shows the price generally increaes as the color rating becomes more yellow. Lets look at a bar chart of the mean prices based on diamond color.

df %>%
  ggplot(aes(x = color, y = price)) +
  geom_bar(stat = "summary", fun = mean, fill = "#456882") +
  labs(
    y = "Mean Price ($)", 
    x = "Color (Clear-Yellow)",
    title = "Mean Price by Color"
  )

The bar chart shows that the mean price increases for each color category change except from D to E. Lets see where the price distributions differ by color with a density plot.

df %>%
  ggplot(aes(x = price, color = color)) +
  geom_density(linewidth = .85) +
  labs(
    x = "Price ($)",
    y = "Density", 
    color = "Color",
    title = "Price Distribution by Color"
  )

The mean price differences seem to mostly stem from the count of diamonds sold between $0 and $2500.

Clarity vs Price

# Clarity vs price
df %>%
  ggplot(aes(x = clarity, y = price)) +
  geom_boxplot(fill = "#456882") +
  labs(
    x = "Clarity (Worse to Best)",
    y = "Price ($)",
    title = "Price Distribution by Clarity"
  )

This is also a similar case. As the diamond clarity increases in quality the median price generally decreases.

Investigating Color

Diamond Color Scale Background

Diamonds are ranked from D to Z where D is a colorless diamond and Z is a light yellow diamond.

The intial plots on color showed that the price was increasing as the color became progressively more yellow.

If we had data on very yellow diamonds this might make sense. However, the color scale for the dataset only ranges from D to J, which is a very slight yellow color.

Generally this slight yellow color result in the diamond being considered of a worse quality. Lets do some investigation into the color and see if this differnce in price is statistically significant.

Two Sample T-Test on Diffrence of Means

Test T-Test Assumptions

Independent Observations
Random Sampling
Normality
Constant Variance

Normality

df %>%
  count(color, name = "count") %>%
  gt() %>%
  tab_header(title = "Row Count by Color") %>%
  data_color(
    columns = count,
    fn = scales::col_numeric(
      palette = c("#F9F3EF", "#D2C1B6"),
      domain = NULL
    )
  ) %>%
  tab_options(table.align = "left")

color	count
Row Count by Color
D	6775
E	9797
F	9542
G	11292
H	8304
I	5422
J	2808

Large counts allow for non-normal distributions by Central Limit Theorem.

Variance Assumption

df %>%
  ggplot(aes(x = color, y = price)) +
  geom_boxplot(fill = "#456882") +
  labs(
    y = "Price ($)", 
    x = "Color (Clear-Yellow)",
    title = "Price Distribution by Color"
  )

Since the variance varies by color we’ll use Welch’s T-Test which doesn’t assume a constant variance.

Welch’s T-Test on Successive Color Pairs:

Setup

Choose alpha = .05.

Test

color_pairs <- list(
  c("D", "E"),
  c("E", "F"),
  c("F", "G"),
  c("G", "H"),
  c("H", "I"),
  c("I", "J")
)

color_p_values <- list()  # Blank list to store results
for (pair in color_pairs) {
  less_yellow <- filter(df, color == pair[1])
  more_yellow <- filter(df, color == pair[2])

  # One sided T Test
  result <- t.test(
    less_yellow$price,
    more_yellow$price,
    alternative = "less"
  )
  color_p_values[[paste(pair[1], "vs", pair[2])]] <- result$p.value
}
# Table Output
bind_rows(color_p_values) %>%
  gt() %>%
  tab_header(
    title = "T-Test Difference of Means on Color and Price"
  ) %>%
  data_color(
    rows = 1,
    direction = "row",
    fn = scales::col_numeric(
      palette = c("#D2C1B6", "#F9F3EF"),
      domain = NULL
    )
  ) %>%
  tab_options(table.align = "left")

D vs E	E vs F	F vs G	G vs H	H vs I	I vs J
T-Test Difference of Means on Color and Price
0.9607799	1.345527e-36	2.283516e-07	2.241127e-16	1.065141e-14	0.01396682

Each successive change in color rating results in a statistically significant increase in price, except for the difference between color rating D and E.

Considerations:

There are two posibilities to consider: Either the generalization that clearer diamonds are more valuable than slightly yellow ones is wrong, or there is a confounding parameter.

Lets investigate for a potential confounding parameter between price and color.

Investigating Potential Confounding Parameters

df %>%
  ggplot(aes(x = carat, y = price, color = color)) +
  geom_point(alpha = 0.5) +
  labs(
    x = "Carat",
    y = "Price ($)",
    title = "Price by Carat"
  )

We saw the relationship between price and carat before, but adding the color makes it possible to see that for each small subset of price the diamonds that weigh more seem to be more yellow in color.

Lets see if each increase in color results in a statisticaly significant increase in weight.

Two Sample T-Test on Diffrence of Means for Color and Carat

df %>%
  ggplot(aes(x = color, y = carat)) +
  geom_boxplot(fill = "#456882") +
  labs(
    x = "Color",
    y = "Carat",
    title = "Carat by Color"
  )

Variances differ again so we will use Welch’s.

Welch’s T-Test on Successive Color Pairs:

color_pairs <- list(
  c("D", "E"),
  c("E", "F"),
  c("F", "G"),
  c("G", "H"),
  c("H", "I"),
  c("I", "J")
)

color_p_values <- list()  # Blank list to store results
for (pair in color_pairs) {
  less_yellow <- filter(df, color == pair[1])
  more_yellow <- filter(df, color == pair[2])

  # One sided T Test
  result <- t.test(
    less_yellow$carat,
    more_yellow$carat,
    alternative = "less"
  )
  color_p_values[[paste(pair[1], "vs", pair[2])]] <- result$p.value
}
# Table Output
bind_rows(color_p_values) %>%
  gt() %>%
  tab_header(
    title = "T-Test Difference of Means for Color and Carat"
  ) %>%
  data_color(
    rows = 1,
    direction = "row",
    fn = scales::col_numeric(
      palette = c("#D2C1B6", "#F9F3EF"),
      domain = NULL
    )
  ) %>%
  tab_options(table.align = "left")

D vs E	E vs F	F vs G	G vs H	H vs I	I vs J
T-Test Difference of Means for Color and Carat
0.4950059	3.284965e-46	1.295526e-09	2.744791e-87	1.971867e-32	5.072545e-23

Each jump to a more yellow color does result in a difference in the mean diamond weight. Also, the scatterplot from earlier hinted at a correlation between carat and price. This all suggests that Carat may be a confounding variable between color and price.

Linear Regression Between Carat and Price

df %>%
  ggplot(aes(x = carat, y = price)) +
  geom_point(aes(color = color)) +
  geom_smooth(method = "lm") +
  labs(
    x = "Carat",
    y = "Price ($)",
    title = "Price by Carat",
    color = "Color"
  )

# Find R Squared
lm_carat <- summary(lm(price ~ carat, data = df))
tibble(Adjusted_R_Squared = lm_carat$adj.r.squared) %>%
  gt() %>%
  tab_options(table.align = "left")

Adjusted_R_Squared
0.8493277

With a basic linear model 85% of the variation in price can be explained by the carat of the diamond.

Adding Color

lm_carat_color <- summary(lm(price ~ carat + color, data = df))
tibble(Adjusted_R_Squared = lm_carat_color$adj.r.squared) %>%
  gt() %>%
  tab_options(table.align = "left")

Adjusted_R_Squared
0.8639443

Adding color improves the correlaiton by about 1%.

Pearson and Spearman Correlations

Pearson gives the extent of linear correlation. Spearman uses ranking and can detect non-linear correlations.

# Pearsons
pearson <- cor.test(
  df$price, 
  as.numeric(df$color), 
  method = "pearson"
  )
  
pearson$estimate %>%
  bind_rows() %>%
  gt() %>%
  tab_header(
    title = "Color vs Price",
    subtitle = "Pearson's Correlation"
    ) %>%
  tab_options(table.align = "left")

cor
Color vs Price
Pearson's Correlation
0.1725109

Color can explain only 17% of the variation in price in a linear relationship.

# Spearmans
spearman <- cor.test(
  df$price, 
  as.numeric(df$color), 
  method = "spearman"
  )

Warning in cor.test.default(df$price, as.numeric(df$color), method =
"spearman"): Cannot compute exact p-value with ties

spearman$estimate %>%
  bind_rows() %>%
  gt() %>%
  tab_header(
    title = "Color vs Price",
    subtitle = "Spearman's Correlation"
    ) %>%
  tab_options(table.align = "left")

rho
Color vs Price
Spearman's Correlation
0.1501422

That correlaiton does not increase when considering non-linear correlations (spearmans).

Conclusions

While the diamond color does predict about 17% of the variation in price, much of this is likely due to the confounding parameter carat which explains about 85% of the variation in price.

In general diamonds that weigh more, cost more, regardless of the color of the diamond (when considering diamond colors between D and J).