Import data and add some useful columns to our rain dataframe. This includes a boolean for whether it rained. Really just allows me to have more comprehensible code and saves me adding an extra label to my ggplots.
#import libraries we want
suppressMessages(library(data.table))
suppressMessages(library(tidyverse))
suppressMessages(library(ggdark))
suppressMessages(library(patchwork))
#read data
<- fread("../data/3184544.csv")
NOAA_data #filter for usability
#turns out the data attributes are useless lol
<- NOAA_data %>% select(!ends_with("_ATTRIBUTES"))
NOAA_data #add month and day cols for easier group-by summaries later
<- NOAA_data %>%
NOAA_data mutate(
Year = format(DATE, "%Y") %>% as.numeric(),
Month = month(DATE),
Day = format(DATE, "%d") %>% as.numeric(),
#bool for any rain
did_rain = PRCP > 0,
Month_name = month.name[Month] %>% fct_relevel(month.name)
)%>% head(2) NOAA_data
We should take a quick look at how rainfall and temperature are distributed.
We can see that Precipitation is broadly exponentially distributed and that a vast majority of days have none to very little rain. A second graph shows that the pattern holds when 0 rain days are removed.
#we can use hist() with custom breaks to get a good feel for data shape here
hist(NOAA_data$PRCP, breaks = 50)
#same thing, but remove no rain days
hist(pull(filter(NOAA_data, PRCP != 0), PRCP), breaks = 50)
Often, taking mean values of non normal data will result in data that
tends to be normally distributed, and thus more tractable, data. To see
if that held here, I calculated the mean rainfall for each
month, year
pair and plotted the histograms. Unfortunately
this assumption does not hold here. My first impression, since rainfall
is restricted to \(x \ge 0\), is that
each month is roughly following a chi squared distribution with a
different parameter value (larger values of \(k\)). Interestingly, the graph helps
illustrate the highly seasonal nature of rainfall in Perth which I will
explore further below.
%>%
NOAA_data #get mean rainfall for month,year pairs
group_by(Month_name, Year) %>%
summarise(meanRainfall = mean(PRCP, na.rm = T)) %>%
#plot
ggplot(aes(meanRainfall, after_stat(density))) +
geom_histogram() +
geom_density() +
#use facet_wrap to split plot by month
facet_wrap(vars(Month_name), scales="free") +
dark_theme_bw() +
ggtitle("Histograms of mean rainfall per year for each month")
Probably the most striking feature of rain in Perth is just how seasonal it is. Looking at our first graph below we can see any given day in January has less than 10% chance of any rain while a given day in July has over 50% chance of any rain! The month to month change is dramatic, with the change in rainfall chance shifting rapidly in autumn and spring.
While summers are incredibly hot, and the UV index is dangerous, summer is a great time to plan outdoor activities that would be otherwise ruined by rain. In contrast, expect the opposite in winter.
#here we want a table where:
#each row = a day of the year
#and we want the probability of rainfall for that day
<- NOAA_data %>%
rainy_days group_by(Month, Day) %>%
summarise(
number_of_rainy_days = sum(did_rain), days_observed = n()
%>%
) #calculate proportion rainfall for a given day
mutate(proportion_rained = number_of_rainy_days/days_observed)
#we add an index so we can count the days
$index <- 1:length(rainy_days$Day)
rainy_days#we add custom date strings for displaying on xlabel
<- rainy_days %>%
rainy_days mutate(date_str = str_c(Day, Month, sep = "/"))
#and custom breaks for aesthetically showing these dates
= c(
custom_breaks 1,32,32+29,32+29+31,92+30,153,153+30,214,245,275,306,336,366
)%>%
rainy_days #multiply by 100 to get percentage
ggplot(aes(index, proportion_rained*100)) +
geom_point() +
geom_smooth(span=0.5) +
scale_y_continuous(breaks = seq(0,1,0.1)*100) +
ylab("Percentage of days with any rain") +
xlab("Date") +
scale_x_continuous(
breaks = custom_breaks, labels = rainy_days$date_str[custom_breaks]
+
) ggtitle("Should you plan to go outside? Percent chance of any rain for Perth Airport") +
dark_theme_classic()
Perth is experiencing huge changes to it’s rainfall. We can illustrate this in our next plot that shows the total rainfall each year. As we can there is steady trend of decreasing rainfall year-on-year.
<- NOAA_data %>%
rainfall_totals_yearly group_by(Year) %>% summarise(total.Yearly.Rainfall = sum(PRCP))
%>%
rainfall_totals_yearly ggplot(aes(Year, total.Yearly.Rainfall)) +
geom_point() +
geom_smooth(method="lm") +
#use expand in scale param to remove annoying scale padding
scale_x_continuous(breaks = c(1945,1960,1980,2000,2020), expand = c(0.01,0.01)) +
#we use the below to remove some whitespace on this plot
scale_y_continuous(
expand = c(0.01,0.01),
limits = c(0,max(rainfall_totals_yearly$total.Yearly.Rainfall))
+
) ggtitle("Total yearly rainfall for Perth Airport (1945 to 2022)") +
ylab("total yearly rainfall (mm)") +
dark_theme_classic()
To get some specific numbers on this, we can run a linear regression
and consider the intercept and slope. This will produce the same result
as the line through the graph above, as
ggplot::geom_smooth(method="lm")
is using the same function
as below (lm()
).
#to use our intercept we set 1944 to be "year zero"
= 1944
base_year #add a variable that uses 1944 as the base year
<- rainfall_totals_yearly %>% mutate(Year_zeroed = Year-base_year)
rainfall_totals_yearly #make a statistical model
<- lm("total.Yearly.Rainfall ~ Year_zeroed" , data = rainfall_totals_yearly)
rainfall_model_lm #print coefficients as a table
::tidy(summary(rainfall_model_lm)) broom
We can see that in 1944, the average rainfall is 866mm (i.e. the intercept value) and that rainfall decreases at a rate of 2.8mm a year. In 2020 we would expect the rainfall to be 652.54 mm (the actual total was 636mm). That is a decrease of 211.29mm in just 75 years.
Here is a table showing some predicted rainfall values compared to actual totals, and then extended into the future. We can see if trends continued then our model predicts that rainfall would dwindle to zero. Of course, we can’t expect trends to continue, and extending a relationship outside the measured boundaries is always
#some years we want to predict rainfall for
= c(1950, 1970, 2020, 2040, 2080, 2100, 2200)
Years #use predict() to make predictions for Years and stuff this into a table
<- broom::tidy(
rainfallPredictionTable predict(rainfall_model_lm, list(Year_zeroed=Years-base_year))
)#rename Years to Year
$Year <- Years
rainfallPredictionTable#and rename x to a meaningful name
$Predicted_Rainfall <- rainfallPredictionTable$x
rainfallPredictionTable#merge with rainfall yearly totals
<- merge(rainfallPredictionTable, rainfall_totals_yearly %>%
rainfallPredictionTable filter(Year %in% Years) %>% select(Year, total.Yearly.Rainfall) %>%
rename(Actual_rainfall = total.Yearly.Rainfall),
all.x = TRUE)
#we can then take the difference of the predicted and actual rainfall totals for years with that data included
#and print the table to our notebook
%>%
rainfallPredictionTable select(Year, Predicted_Rainfall, Actual_rainfall) %>%
mutate(Difference = Actual_rainfall - Predicted_Rainfall)
This dataset is large enough for us to consider a deeper dive into Perth’s changing rainfall. We can explore a few questions with graphs and statistical models: 1. in which months is rainfall increasing or decreasing? 2. is the time of peak rainfall each year changing or constant?
It is interesting to consider whether rainfall patterns are changing in the same way for all months and times of year. We can use linear models to approach this in two different ways:
Year = total_rainfall
separately for each monthYear = total_rainfall * month
While the latter is more elegant, the former is easier to illustrate on a graph. We can see the results on the graph below (“Perth rainfall monthly totals”). Here, we can see that the relationship between Total Rainfall and Year differs across the year. We can also plot in on the winter and summer months to get a better feel for this.
We can see that rainfall totals are decreasing in winter, but appear to be steady or possibly increasing in summer (although we can’t be certain of this as the p-values for the coefficients are not even nominally significance).
#find min and max year
= NOAA_data$Year %>% min
min_year = NOAA_data$Year %>% max
max_year #rainfall totals by Year and Month
<- NOAA_data %>%
rainfall_totals group_by(Year, Month) %>%
summarise(total_rainfall = sum(PRCP))
#now we make a series of plots
#allmonths does scatter plots of rainfall vs day facetted by each month
<- rainfall_totals %>%
allmonths filter(Year != 1944) %>%
ggplot(aes(Year, total_rainfall)) +
#facet by month
facet_grid(cols=vars(Month)) +
geom_point(alpha = 0.5) +
#linear OLS for each month
geom_smooth(method="lm", colour = "red") +
scale_x_continuous(breaks = c(min_year+1, max_year-1)) +
ggtitle("Perth rainfall monthly totals for each month between 1945 and 2022") +
dark_theme_classic() +
#we adjust labels to fit better here
theme(axis.text.x=element_text(angle=60,hjust=1))
#winter and summer are the same plots but only for the summer and winter months,
#for illustration
<- rainfall_totals %>%
winter filter(Year != 1944, Month %in% 6:8) %>%
ggplot(aes(Year, total_rainfall)) +
facet_grid(cols=vars(Month)) +
geom_point(alpha = 0.5) +
geom_smooth(method="lm", colour = "red") +
scale_x_continuous(breaks = c(min_year+1, max_year-1)) +
ggtitle("Winter Months") +
dark_theme_classic() +
theme(axis.text.x=element_text(angle=60,hjust=1))
<- rainfall_totals %>%
summer filter(Year != 1944, Month %in% c(1:2,12)) %>%
ggplot(aes(Year, total_rainfall)) +
facet_grid(cols=vars(Month)) +
geom_point(alpha = 0.5) +
geom_smooth(method="lm", colour = "red") +
scale_x_continuous(breaks = c(min_year+1, max_year-1)) +
ggtitle("Summer months") +
dark_theme_classic() +
theme(axis.text.x=element_text(angle=60,hjust=1))
/(winter | summer) allmonths
We can inspect the validity of this observation by looking at the
coefficients and p-values for the Year ~ Rainfall
relationship for each month. The table and plot below show the
coefficients, standard error estimates, and p-values for each month’s
model. While the p-values, for the coefficient, for almost all months
were nominally significant (indicating that the relationship was not
zero, i.e no change) only June was significant after correcting for
multiple testing.
Interestingly, the rate of change, as expressed by the absolute magnitude of the coefficient differs considerably between months. The winter months had by far the largest rates of change. Since the winter months have negative coefficients, we interpret this as saying that rainfall is decreasing in these months much faster than in the warmer months.
We can also see the summer months have very little estimated change in their year to year rainfall totals. But there are tentative, although non-significant, indications of increasing rainfall.
#here we take linear models of total_rainfall ~ Year for each month
#to do this we group by Month and use dplyr::group_modify to perform lm() group-wise
<- rainfall_totals %>%
models.year_vs_rainfall_per_month mutate(Year = Year - base_year) %>%
group_by(Month) %>%
#we use broom::tidy to stuff each month into a table that is
#then smoothly returned as a df
group_modify(~ broom::tidy(lm(total_rainfall ~ Year, data = .)))
#show the table after adding nominal and bonferroni filter variables
%>%
models.year_vs_rainfall_per_month mutate(nominal_significant = p.value < 0.05) %>%
mutate(passes_bonferroni = p.value < 0.05/12)
#plot the coefficients with SE lines for each month
%>%
models.year_vs_rainfall_per_month filter(term == "Year") %>%
ggplot(aes(Month, estimate, ymin=estimate-std.error, ymax=estimate+std.error, colour = -log10(p.value))) +
dark_theme_classic() +
geom_point() +
geom_pointrange() +
#we add a line at zero to illustrate the positive and negative coefficients better
geom_hline(yintercept = 0, linetype = 'dashed', alpha = 0.8, colour = "red") +
ggtitle("Estimated change in average total monthly rainfall per year") +
ylab("estimated change rainfall (mm)") +
#custom label breaks to get month integers
scale_x_continuous(breaks = 1:12)
We can also briefly pause to see how using Month as an interaction term can robustly support the hypothesis that the year-on-year change in rainfall total is influenced by the month.
In order for this to work effectively we need to modify our Month
variable so that it is either a factor
or so that the
winter months are the lowest or highest valued months.
While using month as a factor avoids any assumptions about months being in a specific order, it also dilutes our test by requiring each Month to act as it’s own variable.
Instead, we can reparameterise our Month variable so that summer months have larger absolute values and winter months have smaller absolute values. We do this by subtracting the Month integer by 6 and taking the absolute value of this, leaving us with 7 unique values from 0 and 6.
#add year baselined at 1944 and Month integer centred on July
#these help with the linear model
<- rainfall_totals %>%
rainfall_totals mutate(Year_zeroed = Year - base_year) %>%
mutate(Month_reparam = abs(Month-6))
#generate a linear model
<- lm(total_rainfall ~ Year_zeroed*Month_reparam, data=rainfall_totals)
year.by.month.model #and return coefficients as a table to the notebook
::tidy(summary(year.by.month.model)) broom
We can now run our model using rainfall totals for each month, year pair. The result shows approximately what we would expect; rainfall is estimated to be decreasing year-on-year while there is a very strong negative relationship between our reparamterised Month variable (i.e rainfall decreases with increasing values, which fits with summer (larger values) having very low rainfall). We can also see that our interaction term indicates a small, but statistically significant, interaction between Year and Month. The coefficient is positive, which supports the argument that rainfall is decreasing in winter while the relationship is weaker in the warmer months.
Another apparent trend in Perth’s weather pattern is that the period of peak rainfall is shifting to later in the year. Traditionally, June has been the month with the highest rainfall totals. But as we have seen above, June is experiencing a faster decline in rainfall with each year compared to any other month. This might lead us to think that another winter month might end up with greater rainfall in future years.
To test this in a visual way, we can inspect the rainfall difference for each year between pairs of months.
#we generate a table that:
# 1. takes rainfall totals by year,month pairs
# 2. pivots wider to get each month's total rainfall as a column variable
# 3. subtract month variables from each other to get difference between months
<- rainfall_totals %>%
rainfall_totals_bigger_month select(Year, Month, total_rainfall) %>%
pivot_wider(names_from = c(Month), values_from = total_rainfall) %>%
#do the subtraction here
mutate(
delta_6_5 = `6` - `5`,
delta_7_6 = `7` - `6`,
delta_8_7 = `8` - `7`,
delta_9_8 = `9` - `8`,
)#then we generate plots for each month in turn
<- rainfall_totals_bigger_month %>%
may ggplot(aes(Year, delta_6_5)) + geom_point() +
#add a line at 0 i.e. no difference to highlight positive negative spread
geom_hline(yintercept = 0, linetype = 'dashed', alpha = 0.8, colour = "red") +
geom_smooth(method="glm") +
ggtitle("Difference in total rainfall in\nJune compared to May") +
ylab("Difference (mm)") + dark_theme_bw() + ylim(-400,400) +
annotate("text", label = "May", x = 1980, y = 300, size = 8) +
annotate("text", label = "June", x = 1980, y = -300, size = 8)
<- rainfall_totals_bigger_month %>%
june ggplot(aes(Year, delta_7_6)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 'dashed', alpha = 0.8, colour = "red") +
geom_smooth(method="glm") +
ggtitle("Difference in total rainfall in\nJuly compared to June") +
ylab("Difference (mm)") +
dark_theme_bw() +
ylim(-400,400) +
annotate("text", label = "July", x = 1980, y = 300, size = 8) +
annotate("text", label = "June", x = 1980, y = -300, size = 8)
<- rainfall_totals_bigger_month %>%
july ggplot(aes(Year, delta_8_7)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 'dashed', alpha = 0.8, colour = "red") +
geom_smooth(method="glm") +
ggtitle("Difference in total rainfall in\nAugust compared to July") +
ylab("Difference (mm)") +
dark_theme_bw() +
ylim(-400,400) +
annotate("text", label = "August", x = 1980, y = 300, size = 8) +
annotate("text", label = "July", x = 1980, y = -300, size = 8)
<- rainfall_totals_bigger_month %>%
august ggplot(aes(Year, delta_9_8)) +
geom_point() +
geom_hline(yintercept = 0, linetype = 'dashed', alpha = 0.8, colour = "red") +
geom_smooth(method="glm") +
ggtitle("Difference in total rainfall in\nSeptember compared to August") +
ylab("Difference (mm)") +
dark_theme_bw() +
ylim(-400,400) +
annotate("text", label = "September", x = 1980, y = 300, size = 8) +
annotate("text", label = "August", x = 1980, y = -300, size = 8)
#use patchwork to layout the plots
+ june) / (july + august) (may
Selective cumsum for days without rain 🌧? This works!
Solution here inspired from here:
<- NOAA_data %>%
NOAA_data #we use data.table::rleid to get group IDs for each streak
group_by(NoRainStreakID = rleid(PRCP)) %>%
#we use dplyr::mutate and base::ifelse to calculate daysNoRain
mutate(daysNoRain = ifelse(PRCP == 0, row_number(), row_number()-1))
#we plot the summary of the daysNoRain variable for inspection, looks sane!
$daysNoRain %>% summary NOAA_data
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 2.000 5.949 7.000 122.000 1
When we plot the distribution of no rain streaks we can see that the no rain streaks are exponentially distributed.
The vast majority of no-rain streaks are short with a long tail of longer streaks.
#use ggplot for a histogram with density overlay
%>%
NOAA_data #we use after_stat to make geom_histogram use density for the y-axis
ggplot(aes(daysNoRain, after_stat(density))) +
geom_histogram(bins = 50) +
geom_density() +
dark_theme_bw()
The 122 days of no rain?! Apparently the summer of 2010. From the 21st of November until the 23rd of March. Wow.
#get max days no rain record (122)
= NOAA_data$daysNoRain %>% max(., na.rm = T)
maxDaysNoRain #get the date of the day with 122 days of no rain i.e the last one
= NOAA_data[which(NOAA_data$daysNoRain==maxDaysNoRain),]$DATE %>% as.Date()
lastDroughtDay
#filter and show the dates during the longest no rain period
%>%
NOAA_data mutate(DATE = as.Date(DATE)) %>%
filter(DATE < lastDroughtDay+1) %>%
filter(DATE > (lastDroughtDay-123))
#corr streak length vs TMAX
#inherent bias in precision by number of days...
#and same problem with time of year bias since heat cor with rainfall already
%>%
NOAA_data #group by rain streak ID
group_by(NoRainStreakID) %>%
#get statistics by group
summarise(
n = n(),
length = max(daysNoRain),
meanTMAX = mean(TMAX, na.rm = T),
%>% drop_na() %>% #nan values where no TMAX recorded
) ggplot(aes(length, meanTMAX, group = length)) +
geom_boxplot() + #boxplot of meanTMAX for each integer length
dark_theme_bw() +
ggtitle("boxplot of mean maximum temperature for each no-rain streak\nvs number of days without rain") +
ylab("mean maximum temperature") +
xlab("number of days without rain")
#longer streaks in summer, illustrate
#simplest is just groupby month and do boxplots
%>%
NOAA_data group_by(NoRainStreakID, Month=Month_name) %>%
summarise(daysNoRain) %>%
#because we are doing this by month we remove days over 30 as they bias towards March
# filter(daysNoRain <=29) %>%
ggplot(aes(Month, daysNoRain, group = Month, fill = Month)) +
geom_boxplot() +
dark_theme_bw() +
#adjust xlabels to fit
theme(axis.text.x=element_text(angle=60,hjust=1)) +
ggtitle("Boxplot of daysWithoutRain streaks ")
Here, I have grouped no rain streaks by month (regardless of year) and have then plotted box plots of the number of days of each streak. This gives us a distribution of the no rain streak lengths for each month. We can see that winter months rarely have periods of no rain longer than a few days.
%>%
NOAA_data #group by rain streak ID, we include Month_name here as a labelling variable
#Month_name here will mean we include the same streak multiple times if it spills over into the next month
group_by(NoRainStreakID, Month_name) %>%
#get statistics by group
summarise(
n = n(),
length = max(daysNoRain),
meanTMAX = mean(TMAX, na.rm = T),
%>%
) group_by(Month=Month_name) %>%
summarise(length) %>%
ggplot(aes(Month, length, group = Month, fill = Month)) +
geom_boxplot() +
dark_theme_bw() +
#adjust xlabels to fit
theme(axis.text.x=element_text(angle=60,hjust=1)) +
ggtitle("Boxplot of no rain streaks by month") +
ylab("Days without rain")
Oh boy. We can fairly easily group by Year, and streak ID to get the mean number of days each streak lasts for each year. Similar to rainfall over the years, we see a trend of increasing mean number of days with no rain in a row. Visually we can see this trend is linear. There is an outlier year in 2010 which corresponds to the very long streak of no rain over that year’s summer (starting in November 2009).
#streak by year are they getting longer?
<- NOAA_data %>%
noRainStreakByYear group_by(NoRainStreakID, Year) %>%
#get statistics by group
summarise(
n = n(),
length = max(daysNoRain),
meanTMAX = mean(TMAX, na.rm = T),
meanRainfall = mean(PRCP),
%>%
) group_by(Year) %>%
summarise(
longest_streak = max(length),
mean_streak = mean(length)
) %>% ggplot(aes(Year, mean_streak)) +
noRainStreakByYear geom_point() +
geom_smooth(method="lm") +
dark_theme_bw() +
ggtitle("Mean number of days with no rain in a row for each year at Perth Airport")
<- lm(mean_streak ~ Year, data = noRainStreakByYear)
noRainStreakByYear.Model ::tidy(summary(noRainStreakByYear.Model)) broom