model <- lm(price ~ carat, data = df_train)
df_test$pred <- predict(model, newdata = df_test)

df_test %>%
  ggplot(aes(x = carat, y = price)) +
  geom_point(color = "#456882", alpha = .3) +
  geom_line(aes(y = pred), color = "#25b9c7", linewidth = 1.2) +
  labs(x = "Carat", y = "Price ($)",
       title = "Basic Linear Model Using Carat to Predict the Price")
This is obviously not the best fit. However, let's quantify exactly how poor it is.
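The metrics used throughout this post can be sketched as two small helpers (the names `rmse_of` and `r_squared_of` here are hypothetical; the post's own `show_rmse` presumably wraps something similar):

```r
# Sketch of the two error metrics used throughout this post.
# RMSE is the typical dollar error of a prediction; R-squared is the
# share of price variance the model explains.
rmse_of <- function(actual, pred) sqrt(mean((actual - pred)^2))

r_squared_of <- function(actual, pred) {
  1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
}

# e.g. rmse_of(df_test$price, df_test$pred)
```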
Well, it looks like there is a lot wrong here: most of the red reference lines are far from horizontal, the Q-Q plot has extreme tails, and the RMSE is large.
Model Version #2 (Log Scale)
model <- lm(log(price) ~ log(carat), data = df_train)

# See (1) for the formula to convert log-scale predictions back to dollars
df_test$pred <- exp(predict(model, df_test) + summary(model)$sigma^2 / 2)

df_test %>%
  ggplot(aes(x = carat, y = price)) +
  geom_point(color = "#456882", alpha = .3) +
  geom_line(aes(y = pred), color = "#25b9c7", linewidth = 1.2) +
  labs(x = "Carat", y = "Price ($)",
       title = "Linear Model Using the Natural Log of Carat to Predict the Natural Log of Price")
Fitting the linear model on the log scale produces a curved fit once the reverse transformation is applied, which lets the line follow the distribution of prices much more closely.
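The `exp(... + summary(model)$sigma^2/2)` term in the code is the lognormal bias correction from reference (1): if `log(price)` is normal with mean `mu` and variance `sigma^2`, the mean of `price` itself is `exp(mu + sigma^2/2)`, so exponentiating the prediction alone recovers the median and systematically under-predicts. A self-contained simulation illustrates this:

```r
# If log(y) ~ N(mu, sigma^2), then E[y] = exp(mu + sigma^2 / 2).
# Naively exponentiating the predicted log value recovers the median, not the mean.
set.seed(42)
mu <- 7
sigma <- 0.5
y <- exp(rnorm(1e6, mean = mu, sd = sigma))

mean(y)               # empirical mean of y
exp(mu)               # naive back-transform: too low (this is the median)
exp(mu + sigma^2 / 2) # corrected back-transform: matches mean(y) closely
```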
While the diagnostic plots look significantly better, the R-squared did not improve and the RMSE actually got worse. There are also several points with high leverage. It may make sense to split the model by carat to address that.
Splitting Up The Model
Where To Split
df %>%
  ggplot(aes(x = carat, y = price, color = clarity)) +
  geom_point() +
  scale_x_log10(breaks = scales::log_breaks(n = 10)) +
  scale_y_log10(breaks = scales::log_breaks(n = 10)) +
  geom_vline(xintercept = c(.5, 2), color = "red", linetype = "dashed") +
  labs(
    x = "Carat",
    y = "Price ($)",
    title = "Carat and Price by Clarity at a Log Scale"
  )
The split points of .5 and 2 carats were chosen based on the natural break points visible in the plot.
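The small, medium, and large subsets used below are assumed to come from filtering on those cut points; a hypothetical helper for the bucketing:

```r
# Hypothetical sketch: bucket diamonds by the cut points marked with
# the dashed lines above -- [0, .5), [.5, 2), and [2, Inf) carats.
size_bucket <- function(carat) {
  cut(carat,
      breaks = c(0, 0.5, 2, Inf),
      labels = c("small", "medium", "large"),
      right = FALSE)
}

# e.g. df_small <- df[size_bucket(df$carat) == "small", ]
```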
model <- lm(log(price) ~ log(carat), data = df_small_train)
df_small_test$pred <- exp(predict(model, df_small_test) + summary(model)$sigma^2 / 2)

df_small_test %>%
  ggplot(aes(x = carat, y = price)) +
  geom_point(color = "#456882", alpha = .3) +
  geom_line(aes(y = pred), color = "#25b9c7", linewidth = 1.2) +
  labs(x = "Carat", y = "Price ($)",
       title = "Using Carat to Predict Price for Small Diamonds")
Under .5 carats, carat almost acts as a categorical variable because of its very distinct levels. In theory, converting it to a factor might improve the results, but let's keep moving.
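To sketch why a factor might help (a hypothetical toy example, not the diamonds data): when a predictor takes only a few distinct values, `factor()` lets the model fit a separate mean per level instead of forcing a single slope, and the in-sample fit can only improve because the factor model nests the linear one:

```r
# Hypothetical toy data with three distinct "carat" levels.
set.seed(1)
toy <- data.frame(
  carat = rep(c(0.3, 0.4, 0.5), each = 20),
  price = rep(c(500, 800, 1400), each = 20) + rnorm(60, sd = 50)
)

m_numeric <- lm(log(price) ~ carat, data = toy)          # one slope
m_factor  <- lm(log(price) ~ factor(carat), data = toy)  # one mean per level

# In-sample, the factor model's R-squared is always >= the numeric model's.
summary(m_factor)$r.squared >= summary(m_numeric)$r.squared
```

The trade-off is that the factor model cannot extrapolate to carat values it has never seen.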
Even with this almost discrete structure, the model was able to predict the price of diamonds under .5 carats to within about $194 on average, using only carat.
Medium Diamonds
model <- lm(log(price) ~ log(carat), data = df_medium_train)
df_medium_test$pred <- exp(predict(model, df_medium_test) + summary(model)$sigma^2 / 2)

df_medium_test %>%
  ggplot(aes(x = carat, y = price)) +
  geom_point(color = "#456882", alpha = .3) +
  geom_line(aes(y = pred), color = "#25b9c7", linewidth = 1.2) +
  labs(x = "Carat", y = "Price ($)",
       title = "Linear Model Using the Natural Log of Carat to Predict the Natural Log of Price for Medium Diamonds")
This line looks pretty good; however, there is significant variation in price at the same carat, which may hinder the model's performance.
That variation results in a somewhat poor prediction for medium-sized diamonds. Since price varies so much at the same carat level, adding the categorical variables to the model may improve it significantly.
Large Diamonds
model <- lm(log(price) ~ log(carat), data = df_large_train)
df_large_test$pred <- exp(predict(model, df_large_test) + summary(model)$sigma^2 / 2)

df_large_test %>%
  ggplot(aes(x = carat, y = price)) +
  geom_point(color = "#456882", alpha = .3) +
  geom_line(aes(y = pred), color = "#25b9c7", linewidth = 1.2) +
  labs(x = "Carat", y = "Price ($)",
       title = "Linear Model Using the Natural Log of Carat to Predict the Natural Log of Price for Large Diamonds")
Whoa there. This does not look like a good case for a linear model.
The diagnostic plots confirm it: with less data and no visible linear relationship, this model is quite poor. Let's focus on the small and medium diamonds by limiting the size to under 2 carats.
Model Version #4 (Split, Log Scale & Categorical Parameters)
Small Diamonds
model <- lm(log(price) ~ log(carat) + clarity + color + cut, data = df_small_train)
df_small_test$pred <- exp(predict(model, df_small_test) + summary(model)$sigma^2 / 2)

df_small_test %>%
  ggplot(aes(x = carat, y = price)) +
  geom_point(color = "#456882", alpha = .3) +
  geom_point(aes(y = pred), color = "#25b9c7", alpha = .1) +
  labs(x = "Carat", y = "Price ($)",
       title = "Four C's to Predict Price for Small Diamonds")
Adding the categorical variables means this is no longer a simple linear model, so we would expect the fit to improve drastically.
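Under the hood, `lm()` expands each categorical variable into contrast columns of the design matrix. A small self-contained sketch with an unordered factor (in the diamonds data, clarity, color, and cut are ordered factors, so R actually uses polynomial contrasts by default, but the idea is the same):

```r
# Sketch: model.matrix() shows how a factor enters the regression.
# With default treatment contrasts, a 3-level factor becomes an
# intercept column plus 2 dummy columns (one per non-baseline level).
toy <- data.frame(clarity = factor(c("SI1", "VS2", "IF", "SI1")))
mm <- model.matrix(~ clarity, data = toy)
mm
```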
It did: we are able to predict the price of small diamonds to within about $106 on average.
Medium Diamonds
model <- lm(log(price) ~ log(carat) + clarity + color + cut, data = df_medium_train)
df_medium_test$pred <- exp(predict(model, df_medium_test) + summary(model)$sigma^2 / 2)

df_medium_test %>%
  ggplot(aes(x = carat, y = price)) +
  geom_point(color = "#456882", alpha = .3) +
  geom_point(aes(y = pred), color = "#25b9c7", alpha = .1) +
  labs(x = "Carat", y = "Price ($)",
       title = "Four C's to Predict Price for Medium Diamonds")
We would expect a better correlation here as well.
# Training RMSE to test for overfitting
df_train$pred <- exp(predict(model, df_train) + summary(model)$sigma^2 / 2)
rmse <- sqrt(mean((df_train$price - df_train$pred)^2))
show_rmse(rmse)
Root_Mean_Squared_Error
622.7129
Since the RMSE for the training and testing data are similar, it is unlikely that the model is overfit. This model can likely predict the diamond price to within around $610 on average.
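That comparison can be formalized as a simple ratio (a sketch; 622.7 is the training RMSE from the output above and ~610 is the test RMSE for this model):

```r
# Sketch: compare test RMSE against training RMSE. A ratio near 1 suggests
# the model generalizes; a ratio well above 1 suggests overfitting.
overfit_ratio <- function(rmse_train, rmse_test) rmse_test / rmse_train

overfit_ratio(622.7, 610)  # roughly 0.98 -- no sign of overfitting
```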
Conclusions
For diamonds under 2 carats, we reached a model with an R-squared of .98 that predicts the diamond price from Carat, Clarity, Color, and Cut with an RMSE of $610.
For diamonds under .5 carats, model version 4 was able to predict the price with an RMSE of $106.
Unanswered Questions:
There appears to be a gap around $1,500 where no diamonds are listed for sale.
References
(1) Reverse Transformation For The Mean For Log Transformed Data