preamble

Since the NOAA dataset includes both rainfall and temperature we can briefly explore the relationship between these variables.

Note that here I remove the first two recorded months for 1944 since they have no temperature data.

Import and setup Perth Airport weather data:

#import libraries we want
suppressMessages(library(data.table))
suppressMessages(library(tidyverse))
suppressMessages(library(ggdark))
suppressMessages(library(patchwork))
#read data
NOAA_data <- fread("../data/3184544.csv")
#filter for usability
#turns out the data attributes are useless lol
NOAA_data <- NOAA_data %>% select(!ends_with("_ATTRIBUTES"))
#add month and day cols for easier group-by summaries later
NOAA_data <- NOAA_data %>% 
  mutate(
    Year = format(DATE, "%Y") %>% as.numeric(),
    Month = month(DATE),
    Day = format(DATE, "%d") %>% as.numeric(),
    #bool for any rain
    did_rain = PRCP > 0,
    Month_name = month.name[Month] %>% fct_relevel(month.name)
  ) %>% 
  #remove months with no temp recordings
  filter(!((Month == 5 | Month == 6) & (Year == 1944)))
NOAA_data %>% head(2)

Summarising the rainfall and temperature by month and year, we can see that there is a non-linear relationship between temperature and rainfall. On this plot winter months are across the bottom of the plot while the summer months are concentrated into the top left. Log transformations of the variables here do not create a linear relationship.

NOAA_temp_rainfall_monthly <- NOAA_data %>%
  group_by(Year, Month) %>% 
  summarise(
    meanRainfall = mean(PRCP),
    meanTMAX = mean(TMAX),
    meanTMIN = mean(TMIN),
  )
NOAA_temp_rainfall_monthly %>% 
  ggplot(aes(meanRainfall, meanTMAX)) + 
  geom_point() + 
  geom_smooth() +
  geom_rug(alpha=0.1) +
  dark_theme_bw() +
  ggtitle("relationship between monthly mean maximum temperature and monthly mean rainfall") +
  xlab("mean rainfall per month (mm)") +
  ylab("mean maximum temperature per month (C)")

The rug plots in the above graph also illustrate that mean TMAX is unevenly distributed. Indeed we can see peaks around 20C and a smaller peak at 30C. Plotting each month separately results in approximately normally distributed data, indicating that our multimodal distribution over the year is related to the seasons.

all_months_dist <- NOAA_temp_rainfall_monthly %>% 
  ggplot(aes(meanTMAX)) + 
  geom_histogram(bins = 50) +
  dark_theme_bw() +
  ggtitle("Distribution of mean maximum temperature for each month at Perth Airport") +
  xlab("mean maximum temperature per month (C)")
seperate_months_dist <- NOAA_temp_rainfall_monthly %>% 
  ggplot(aes(meanTMAX)) + 
  geom_histogram() +
  facet_wrap(vars(Month), scales = "free") +
  dark_theme_bw() +
  ggtitle("Distribution of mean maximum temperature, each month seperately")
all_months_dist + seperate_months_dist

linear modelling

non summarised data

The above indicates we could get a robust estimation of the relationship between TMAX and rainfall with a linear model where we include Month as a variable.

Indeed, we can see that when we run a module on the data (not summarised) we get a robust Q-Q plot (indicating the assumption of normality holds, alhough with some inflation). The residuals are concerning as there are clear patterns indicating some irregularity in the data (i.e. the assumptions of residual normality doesn’t hold). While there are some outliers (“residuals vs leverage” plot) there is little to be concerned with considering the sample size here is large.

As we would expect, there is a significant negative relationship between rainfall (PRCP) and maximum temperature, which we can describe as a 1mm increase in rainfall relates to a decrease in temperature of 0.141 degrees celsius. Interestingly, we can see the model has defaulted to january as the compartor month. The coefficients clearly show how temperature (and rainfall) differ from january for other months.

#add month as factor to NOAA data
NOAA_data <- NOAA_data %>% mutate(Month_factor = as.factor(Month))
#run model
model_TMAX_rainfall_month <- lm(
  TMAX ~ PRCP + Month_factor,
  data = NOAA_data
)
summary(model_TMAX_rainfall_month)
## 
## Call:
## lm(formula = TMAX ~ PRCP + Month_factor, data = NOAA_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6659  -2.4918  -0.3597   2.1400  15.2454 
## 
## Coefficients:
##                  Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)     31.859973   0.077062  413.434   <2e-16 ***
## PRCP            -0.140623   0.004005  -35.109   <2e-16 ***
## Month_factor2    0.187212   0.111582    1.678   0.0934 .  
## Month_factor3   -2.027881   0.108956  -18.612   <2e-16 ***
## Month_factor4   -6.000284   0.109952  -54.572   <2e-16 ***
## Month_factor5   -9.603007   0.109527  -87.677   <2e-16 ***
## Month_factor6  -12.114067   0.111526 -108.621   <2e-16 ***
## Month_factor7  -13.169896   0.110206 -119.502   <2e-16 ***
## Month_factor8  -12.720515   0.109498 -116.171   <2e-16 ***
## Month_factor9  -11.251883   0.109818 -102.459   <2e-16 ***
## Month_factor10  -8.872458   0.108689  -81.632   <2e-16 ***
## Month_factor11  -5.668185   0.109523  -51.753   <2e-16 ***
## Month_factor12  -2.605357   0.108696  -23.969   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.788 on 28647 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.6385, Adjusted R-squared:  0.6383 
## F-statistic:  4216 on 12 and 28647 DF,  p-value: < 2.2e-16
#get diagnostic plots
plot(model_TMAX_rainfall_month)

We have previously established that rainfall and temperature trends are changing with each year. With increasing year, rainfall is decreasing, and temperature is increasing.

By including year in this model we can inspect the relationship between rainfall and maximum temperature while holding the yearly trends constant. From the model below we can see that holding the yearly trend constant does not impact the estimate of the relationship between TMAX and rainfall. We can see in this model that the coefficient describing the effect of year is positive (i.e increasing year means increasing temperature).

#run model
model_TMAX_rainfall_month_year <- lm(
  TMAX ~ PRCP + Month_factor + Year,
  data = NOAA_data
)
summary(model_TMAX_rainfall_month_year)
## 
## Call:
## lm(formula = TMAX ~ PRCP + Month_factor + Year, data = NOAA_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.432  -2.480  -0.387   2.107  15.217 
## 
## Coefficients:
##                  Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)    -1.323e+01  1.944e+00   -6.807 1.02e-11 ***
## PRCP           -1.376e-01  3.971e-03  -34.644  < 2e-16 ***
## Month_factor2   1.869e-01  1.105e-01    1.691   0.0909 .  
## Month_factor3  -2.028e+00  1.079e-01  -18.788  < 2e-16 ***
## Month_factor4  -6.004e+00  1.089e-01  -55.113  < 2e-16 ***
## Month_factor5  -9.611e+00  1.085e-01  -88.572  < 2e-16 ***
## Month_factor6  -1.213e+01  1.105e-01 -109.765  < 2e-16 ***
## Month_factor7  -1.317e+01  1.092e-01 -120.642  < 2e-16 ***
## Month_factor8  -1.272e+01  1.085e-01 -117.247  < 2e-16 ***
## Month_factor9  -1.125e+01  1.088e-01 -103.367  < 2e-16 ***
## Month_factor10 -8.864e+00  1.077e-01  -82.315  < 2e-16 ***
## Month_factor11 -5.658e+00  1.085e-01  -52.143  < 2e-16 ***
## Month_factor12 -2.591e+00  1.077e-01  -24.057  < 2e-16 ***
## Year            2.273e-02  9.792e-04   23.216  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.753 on 28646 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.6452, Adjusted R-squared:  0.645 
## F-statistic:  4007 on 13 and 28646 DF,  p-value: < 2.2e-16

summarised model

Using summarised data allows us to create a linear model that better fits the assumptions of linear regression. As is clear in the residuals diagnostic plot, groupings of residuals are each normally distributed (apparently from lack of structure in “blobs”, groupings relate to the seasonal nature of temperature, x-axis). Considering the coefficients, we see that the estimate for the relationship between mean TMAX and mean rainfall broadly reflects that of the non-summarised data above. Here, we can say that a 1mm increase in mean rainfall relates to a decrease in mean temperature of 0.264 degrees celsius. Once again, Year positively correlates with mean TMAX, and our comparator month (July) shows a predicatable relationship with other months and temperature.

NOAA_temp_rainfall_monthly <- NOAA_temp_rainfall_monthly %>% 
  mutate(Month_factor = as.factor(Month))

model_TMAX_rainfall_month_year_summary <- lm(
  meanTMAX ~ meanRainfall + Month_factor + Year,
  data = NOAA_temp_rainfall_monthly
)
summary(model_TMAX_rainfall_month_year_summary)
## 
## Call:
## lm(formula = meanTMAX ~ meanRainfall + Month_factor + Year, data = NOAA_temp_rainfall_monthly)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0532 -0.7446 -0.0679  0.8255  4.1997 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -23.807609   3.652235  -6.519 1.17e-10 ***
## meanRainfall    -0.264320   0.032238  -8.199 8.07e-16 ***
## Month_factor8    0.311367   0.202851   1.535    0.125    
## Month_factor9    1.606338   0.215337   7.460 2.00e-13 ***
## Month_factor10   3.858711   0.229750  16.795  < 2e-16 ***
## Month_factor11   6.997987   0.238568  29.333  < 2e-16 ***
## Month_factor12   9.997563   0.248591  40.217  < 2e-16 ***
## Month_factor1   12.609753   0.248979  50.646  < 2e-16 ***
## Month_factor2   12.797814   0.245121  52.210  < 2e-16 ***
## Month_factor3   10.585007   0.245223  43.165  < 2e-16 ***
## Month_factor4    6.674466   0.231802  28.794  < 2e-16 ***
## Month_factor5    3.333562   0.208360  15.999  < 2e-16 ***
## Month_factor6    1.070201   0.200578   5.336 1.20e-07 ***
## Year             0.021739   0.001827  11.899  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.248 on 923 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.9426, Adjusted R-squared:  0.9418 
## F-statistic:  1166 on 13 and 923 DF,  p-value: < 2.2e-16
plot(model_TMAX_rainfall_month_year_summary)

summary

In summary, temperature and rainfall exhibit a non-linear, inverse, relationship in Perth. As rainfall increases, temperature decreases. This is broadly due to the nature of rainfall occuring mostly in winter and very rarely in summer. Holding month and year variables constant establishes that this relationship still holds for temperature and rainfall. Using mean temperature and mean rainfall per month and per year allows us to build a linear model that satisfies the assumptions of linear regression, allowing us to be more confident in the results.