Since the NOAA dataset includes both rainfall and temperature we can briefly explore the relationship between these variables.
Note that here I remove the first two recorded months for 1944 since they have no temperature data.
Import and setup Perth Airport weather data:
#import libraries we want
suppressMessages(library(data.table))
suppressMessages(library(tidyverse))
suppressMessages(library(ggdark))
suppressMessages(library(patchwork))
#read data
<- fread("../data/3184544.csv")
NOAA_data #filter for usability
#turns out the data attributes are useless lol
<- NOAA_data %>% select(!ends_with("_ATTRIBUTES"))
NOAA_data #add month and day cols for easier group-by summaries later
<- NOAA_data %>%
NOAA_data mutate(
Year = format(DATE, "%Y") %>% as.numeric(),
Month = month(DATE),
Day = format(DATE, "%d") %>% as.numeric(),
#bool for any rain
did_rain = PRCP > 0,
Month_name = month.name[Month] %>% fct_relevel(month.name)
%>%
) #remove months with no temp recordings
filter(!((Month == 5 | Month == 6) & (Year == 1944)))
%>% head(2) NOAA_data
Summarising the rainfall and temperature by month and year, we can see that there is a non-linear relationship between temperature and rainfall. On this plot winter months are across the bottom of the plot while the summer months are concentrated into the top left. Log transformations of the variables here do not create a linear relationship.
<- NOAA_data %>%
NOAA_temp_rainfall_monthly group_by(Year, Month) %>%
summarise(
meanRainfall = mean(PRCP),
meanTMAX = mean(TMAX),
meanTMIN = mean(TMIN),
)%>%
NOAA_temp_rainfall_monthly ggplot(aes(meanRainfall, meanTMAX)) +
geom_point() +
geom_smooth() +
geom_rug(alpha=0.1) +
dark_theme_bw() +
ggtitle("relationship between monthly mean maximum temperature and monthly mean rainfall") +
xlab("mean rainfall per month (mm)") +
ylab("mean maximum temperature per month (C)")
The rug plots in the above graph also illustrate that mean TMAX is unevenly distributed. Indeed we can see peaks around 20C and a smaller peak at 30C. Plotting each month separately results in approximately normally distributed data, indicating that our multimodal distribution over the year is related to the seasons.
<- NOAA_temp_rainfall_monthly %>%
all_months_dist ggplot(aes(meanTMAX)) +
geom_histogram(bins = 50) +
dark_theme_bw() +
ggtitle("Distribution of mean maximum temperature for each month at Perth Airport") +
xlab("mean maximum temperature per month (C)")
<- NOAA_temp_rainfall_monthly %>%
seperate_months_dist ggplot(aes(meanTMAX)) +
geom_histogram() +
facet_wrap(vars(Month), scales = "free") +
dark_theme_bw() +
ggtitle("Distribution of mean maximum temperature, each month seperately")
+ seperate_months_dist all_months_dist
The above indicates we could get a robust estimation of the relationship between TMAX and rainfall with a linear model where we include Month as a variable.
Indeed, we can see that when we run a module on the data (not summarised) we get a robust Q-Q plot (indicating the assumption of normality holds, alhough with some inflation). The residuals are concerning as there are clear patterns indicating some irregularity in the data (i.e. the assumptions of residual normality doesn’t hold). While there are some outliers (“residuals vs leverage” plot) there is little to be concerned with considering the sample size here is large.
As we would expect, there is a significant negative relationship between rainfall (PRCP) and maximum temperature, which we can describe as a 1mm increase in rainfall relates to a decrease in temperature of 0.141 degrees celsius. Interestingly, we can see the model has defaulted to january as the compartor month. The coefficients clearly show how temperature (and rainfall) differ from january for other months.
#add month as factor to NOAA data
<- NOAA_data %>% mutate(Month_factor = as.factor(Month))
NOAA_data #run model
<- lm(
model_TMAX_rainfall_month ~ PRCP + Month_factor,
TMAX data = NOAA_data
)summary(model_TMAX_rainfall_month)
##
## Call:
## lm(formula = TMAX ~ PRCP + Month_factor, data = NOAA_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6659 -2.4918 -0.3597 2.1400 15.2454
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.859973 0.077062 413.434 <2e-16 ***
## PRCP -0.140623 0.004005 -35.109 <2e-16 ***
## Month_factor2 0.187212 0.111582 1.678 0.0934 .
## Month_factor3 -2.027881 0.108956 -18.612 <2e-16 ***
## Month_factor4 -6.000284 0.109952 -54.572 <2e-16 ***
## Month_factor5 -9.603007 0.109527 -87.677 <2e-16 ***
## Month_factor6 -12.114067 0.111526 -108.621 <2e-16 ***
## Month_factor7 -13.169896 0.110206 -119.502 <2e-16 ***
## Month_factor8 -12.720515 0.109498 -116.171 <2e-16 ***
## Month_factor9 -11.251883 0.109818 -102.459 <2e-16 ***
## Month_factor10 -8.872458 0.108689 -81.632 <2e-16 ***
## Month_factor11 -5.668185 0.109523 -51.753 <2e-16 ***
## Month_factor12 -2.605357 0.108696 -23.969 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.788 on 28647 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.6385, Adjusted R-squared: 0.6383
## F-statistic: 4216 on 12 and 28647 DF, p-value: < 2.2e-16
#get diagnostic plots
plot(model_TMAX_rainfall_month)
We have previously established that rainfall and temperature trends are changing with each year. With increasing year, rainfall is decreasing, and temperature is increasing.
By including year in this model we can inspect the relationship between rainfall and maximum temperature while holding the yearly trends constant. From the model below we can see that holding the yearly trend constant does not impact the estimate of the relationship between TMAX and rainfall. We can see in this model that the coefficient describing the effect of year is positive (i.e increasing year means increasing temperature).
#run model
<- lm(
model_TMAX_rainfall_month_year ~ PRCP + Month_factor + Year,
TMAX data = NOAA_data
)summary(model_TMAX_rainfall_month_year)
##
## Call:
## lm(formula = TMAX ~ PRCP + Month_factor + Year, data = NOAA_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.432 -2.480 -0.387 2.107 15.217
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.323e+01 1.944e+00 -6.807 1.02e-11 ***
## PRCP -1.376e-01 3.971e-03 -34.644 < 2e-16 ***
## Month_factor2 1.869e-01 1.105e-01 1.691 0.0909 .
## Month_factor3 -2.028e+00 1.079e-01 -18.788 < 2e-16 ***
## Month_factor4 -6.004e+00 1.089e-01 -55.113 < 2e-16 ***
## Month_factor5 -9.611e+00 1.085e-01 -88.572 < 2e-16 ***
## Month_factor6 -1.213e+01 1.105e-01 -109.765 < 2e-16 ***
## Month_factor7 -1.317e+01 1.092e-01 -120.642 < 2e-16 ***
## Month_factor8 -1.272e+01 1.085e-01 -117.247 < 2e-16 ***
## Month_factor9 -1.125e+01 1.088e-01 -103.367 < 2e-16 ***
## Month_factor10 -8.864e+00 1.077e-01 -82.315 < 2e-16 ***
## Month_factor11 -5.658e+00 1.085e-01 -52.143 < 2e-16 ***
## Month_factor12 -2.591e+00 1.077e-01 -24.057 < 2e-16 ***
## Year 2.273e-02 9.792e-04 23.216 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.753 on 28646 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.6452, Adjusted R-squared: 0.645
## F-statistic: 4007 on 13 and 28646 DF, p-value: < 2.2e-16
Using summarised data allows us to create a linear model that better fits the assumptions of linear regression. As is clear in the residuals diagnostic plot, groupings of residuals are each normally distributed (apparently from lack of structure in “blobs”, groupings relate to the seasonal nature of temperature, x-axis). Considering the coefficients, we see that the estimate for the relationship between mean TMAX and mean rainfall broadly reflects that of the non-summarised data above. Here, we can say that a 1mm increase in mean rainfall relates to a decrease in mean temperature of 0.264 degrees celsius. Once again, Year positively correlates with mean TMAX, and our comparator month (July) shows a predicatable relationship with other months and temperature.
<- NOAA_temp_rainfall_monthly %>%
NOAA_temp_rainfall_monthly mutate(Month_factor = as.factor(Month))
<- lm(
model_TMAX_rainfall_month_year_summary ~ meanRainfall + Month_factor + Year,
meanTMAX data = NOAA_temp_rainfall_monthly
)summary(model_TMAX_rainfall_month_year_summary)
##
## Call:
## lm(formula = meanTMAX ~ meanRainfall + Month_factor + Year, data = NOAA_temp_rainfall_monthly)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0532 -0.7446 -0.0679 0.8255 4.1997
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23.807609 3.652235 -6.519 1.17e-10 ***
## meanRainfall -0.264320 0.032238 -8.199 8.07e-16 ***
## Month_factor8 0.311367 0.202851 1.535 0.125
## Month_factor9 1.606338 0.215337 7.460 2.00e-13 ***
## Month_factor10 3.858711 0.229750 16.795 < 2e-16 ***
## Month_factor11 6.997987 0.238568 29.333 < 2e-16 ***
## Month_factor12 9.997563 0.248591 40.217 < 2e-16 ***
## Month_factor1 12.609753 0.248979 50.646 < 2e-16 ***
## Month_factor2 12.797814 0.245121 52.210 < 2e-16 ***
## Month_factor3 10.585007 0.245223 43.165 < 2e-16 ***
## Month_factor4 6.674466 0.231802 28.794 < 2e-16 ***
## Month_factor5 3.333562 0.208360 15.999 < 2e-16 ***
## Month_factor6 1.070201 0.200578 5.336 1.20e-07 ***
## Year 0.021739 0.001827 11.899 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.248 on 923 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.9426, Adjusted R-squared: 0.9418
## F-statistic: 1166 on 13 and 923 DF, p-value: < 2.2e-16
plot(model_TMAX_rainfall_month_year_summary)
In summary, temperature and rainfall exhibit a non-linear, inverse, relationship in Perth. As rainfall increases, temperature decreases. This is broadly due to the nature of rainfall occuring mostly in winter and very rarely in summer. Holding month and year variables constant establishes that this relationship still holds for temperature and rainfall. Using mean temperature and mean rainfall per month and per year allows us to build a linear model that satisfies the assumptions of linear regression, allowing us to be more confident in the results.