import and clean data

Import data and add some useful columns to our rain dataframe.

suppressMessages(library(data.table))
suppressMessages(library(tidyverse))
suppressMessages(library(ggdark))
suppressMessages(library(patchwork))
NOAA_data <- fread("../data/3184544.csv")
#filter for usability
#turns out the data attributes are useless lol
NOAA_data <- NOAA_data %>% select(!ends_with("_ATTRIBUTES"))
#add month and day cols for easier group-by summaries later
NOAA_data <- NOAA_data %>% 
  mutate(
    Year = format(DATE, "%Y") %>% as.numeric(),
    Month = month(DATE),
    Day = format(DATE, "%d") %>% as.numeric(),
    #bool for any rain
    did_rain = PRCP > 0,
    Month_name = month.name[Month] %>% fct_relevel(month.name)
  )
NOAA_data %>% head(2)

data properties

We should take a quick look at how the temperature min and max variables are distributed.

For temperature we can see that TMIN broadly follows a Normal distribution but appears to have a non standard shape and a slight skew.

TMAX follows a non-standard distribution, with the tail to the left of the mode (approximately 18 Celsius) quickly decreasing in frequency compared to the right hand side.

#quick histograms using ggplot so we can easily apply a dark theme
TMAX_hist <-ggplot(aes(TMAX), data=NOAA_data) + geom_histogram() + dark_theme_bw()
TMIN_hist <-ggplot(aes(TMIN), data=NOAA_data) + geom_histogram() + dark_theme_bw()
#plot using patchwork
TMAX_hist / TMIN_hist

Fortunately, we can see that taking mean values, such as mean yearly maximum and minimum temperature in a given month results in a more normally distributed variable. Although it is still somewhat irregular it definitely better fits the assumptions for linear regression.

TMAX_histogram <- NOAA_data %>% 
  group_by(Month_name, Year) %>% 
  summarise(meanTMAX = mean(TMAX, na.rm = T)) %>% 
  ggplot(aes(meanTMAX, after_stat(density))) +
  geom_histogram() +
  geom_density() +
  facet_wrap(vars(Month_name), scales="free") +
  dark_theme_bw() +
ggtitle("Mean maximum temperature per year")
TMIN_histogram <- NOAA_data %>% 
  group_by(Month_name, Year) %>% 
  summarise(meanTMIN = mean(TMIN, na.rm = T)) %>% 
  ggplot(aes(meanTMIN, after_stat(density))) +
  geom_histogram() +
  geom_density() +
  facet_wrap(vars(Month_name), scales="free") +
  dark_theme_bw() +
  ggtitle("Mean minimum temperature per year")
#plot using patchwork
TMAX_histogram / TMIN_histogram

Temperature extremes

Extremely hot days are interesting for 2 reasons: 1. they are apparent to the people who live in a location 2. they are catastrophic to wildlife and agriculture

As Perth warms up, we can expect to see more extreme temperatures. Indeed, we can visualise the days over 35 celsius and see that the number of days over 35C at Perth Airport is increasing each year.

daysOver35 <- NOAA_data %>% 
  filter(Year > 1944) %>% 
  #drop empty TMAX cols
  drop_na(TMAX) %>% 
  group_by(Year) %>% 
  count(daysOver35 = TMAX > 35) %>% 
  #pivot wider so that under and over 35 on the same row
  pivot_wider(id_cols=Year, names_from = daysOver35, values_from = n) %>% 
  #rename new cols to informative names
  rename("under35" = "FALSE", "over35" = "TRUE") %>% 
  #where over35 is NaN there are no days over 35
  mutate(over35 = replace_na(over35, 0)) %>% 
  #get proportion
  mutate(proportion_over35 = over35/(under35 + over35))
daysOver35 %>%  ggplot(aes(Year, proportion_over35)) +
  geom_point() +
  geom_smooth(method="lm") +
  dark_theme_bw() +
  ggtitle("Proportion of days over 35C each year at Perth Airport") +
  ylab("Proportion of days over 35C")

#run a simple lm for this
broom::tidy(summary(lm(proportion_over35 ~ Year, data = daysOver35)))

While we might expect this to translate to more days in a row over 35, the trend for this is weak, although it is nominally significant.

#get streaks of days over 35C
NOAA_data <- NOAA_data %>% 
    mutate(daysOver35 = TMAX > 35) %>% 
    #we use data.table::rleid to get group IDs for each streak
    group_by(IDstreakDaysOver35 = rleid(daysOver35)) %>% 
    #we use dplyr::mutate and base::ifelse to calculate streakDaysOver35
    mutate(streakDaysOver35 = ifelse(daysOver35, row_number(), row_number()-1))
#format a table to have each streak of days over 35 with relevant stats
meanStreakOver35_year <- NOAA_data %>% 
  filter(daysOver35) %>% 
  drop_na(TMAX) %>% 
  select(DATE, Month, Year, TMAX, streakDaysOver35, daysOver35) %>% 
  group_by(IDstreakDaysOver35) %>% 
  summarise(
    startDate = min(DATE),
    startYear = min(Year),
    startMonth = min(Month),
    maxDaysOver35 = max(streakDaysOver35),
  ) %>% 
  #get mean streak days for each start year
  group_by(startYear) %>% 
  summarise(meanStreakLength = mean(maxDaysOver35))
meanStreakOver35_year %>% 
  ggplot(aes(startYear, meanStreakLength)) +
  geom_point() +
  geom_smooth(method="lm") +
  dark_theme_bw() +
  ggtitle("mean number of days over 35 in a row each year at Perth Airport") +
  ylab("mean number of days over 35") +
  xlab("Year streak started")

broom::tidy(summary(lm(meanStreakLength ~ startYear, data = meanStreakOver35_year)))

summary

The temperatures in Perth are increasing! Using Perth Airport data we can adequately describe a linear relationship between temperature and time. While more complex models might fit better, they are less parsimonious and thus it as inadvisable to use them here.