Introduction
In MetCoOp there is a move towards spreading the load on the HPC(s) by running a small number of ensemble members for MEPS each hour. This contrasts with the normal method of running all members every 6 hours. The enesmeble system is called CMEPS (what does the C stand for?). To produce an ensemble forecast, the members are lagged in time so that a full ensemble is available every 3 hours, which is lagged further to produce a larger ensemble every 6 hours.
For verification, this means that the data have to be lagged before the scores can be computed. In the first instance, the model outputs are interpolated to station locations and written out to FCTABLE files - at this stage, no lead times or forecast dates are updated to reflect the lagging. However, in order to read to the correct members at the correct times, the lagging information is required.
For CMEPS, members 0, 1 and 2 are run every three hours at 0, 3, 6, …, 21 UTC. For the purpose of clarity we will refer to these members as the parent members and the lagged members as the child members. The child members 3, 4 and 7 are run 1 hour later than the parents, and the child members 5, 6 and 8 are run 2 hours later than the parents, i.e. at 2, 5, 8, …, 23 UTC. However, for a lagged forecast, we are interested in the child members that belong to the previous forecast, since the child members from the current parent will not be available yet.
The conversion to sqlite can easily be done using the read_eps_interpolate function from harpIO and specifying the lagging of each member in the lags argument.
library(harpIO)
read_eps_interpolate(
2019021312,
2019022421,
"CMEPS_prod",
"Pcp",
by = "3h",
lead_time = seq(0, 63, 3),
members_in = seq(0, 8),
lags = paste0(c(0, 0, 0, 2, 2, 1, 1, 2, 1), "h"),
file_path = "/lustre/storeB/users/andrewts/harp_test_data/vfld",
sqlite_path = "/lustre/storeB/users/andrewts/harp_test_data/harp_output/FCTABLE"
)
Using the standard sqlite file name template, an sqlite file is created for each lead time, containing only those members that were available for that lead time. Note that it’s quite a slow process and maybe provides motivation for re-implementing the old method for reading from vfld files - just to compare the speed, and also because support for vfld version 2 is needed.
For verification, the data are read in from the sqlite files using read_point_forecast from harpIO. The parent forecast times are basically as derived from the start_date, end_date and by arguments. All of the appropriate lagged members are read in for the relative times and lead times derived from the lags argument. Here we do not need to specifify what the lags are for each member as all data that can be found will be read in. As the lagging is done to 6 hours, need to make sure we adjust the start date to account for 6 hour lagging (take it back 3 hours, as the 0, 1, 2 lags are already there)
cmeps_pcp_3h <- read_point_forecast(
2019021321,
2019022421,
c("CMEPS_prod", "MEPS_prod"),
"EPS",
"AccPcp3h",
by = "3h",
lead_time = seq(0, 63, 3),
lags = list(CMEPS_prod = c("0h", "1h", "2h"), MEPS_prod = "0h"),
file_path = "/lustre/storeB/users/andrewts/harp_test_data/harp_output/FCTABLE"
)
The above could equally be done with by = "6h" and lags = list(CMEPS_prod = paste0(seq(0, 5), "h"), MEPS_prod = "0h") without changing anything, or indeed by = "1h" and lags not set.
Before we can verify, we must actually do the lagging. This is done using the lag_forecast function from harpPoint. In this function, the forecast model in the harp_fcst object you want to do the lagging for, and the parent forecast cycles for the lagged forecast need to be specified. All child members for the parents for each cycle will automatically be found - the child cycles are essentially all cycles between the parent cycles. The function is not able to include child cycles that are older than the next oldest parent cycle, but there is a way around this that will be shown later.
For CMEPS, the intention is to have an ensemble every 6 hours including all members from the current cycle and the cycles for the previous 5 hours. We can do this by lagging the forecast with parent cycles at 0, 6, 12 and 18 UTC.
library(harpPoint)
cmeps_pcp_3h_lag6 <- lag_forecast(
cmeps_pcp_3h,
"CMEPS_prod",
seq(0, 18, 6)
)
For a fair comparison of CMEPS and MEPS, the common cases between the two systems should be selected.
cmeps_pcp_3h_lag6 <- common_cases(cmeps_pcp_3h_lag6)
Now we can read in the observations and do the verification.
obs_pcp_3h <- read_point_obs(
first_validdate(cmeps_pcp_3h_lag6),
last_validdate(cmeps_pcp_3h_lag6),
"AccPcp3h",
obs_path = "/lustre/storeB/users/andrewts/harp_test_data/harp_output/OBSTABLE",
obsfile_template = "OBS_{YYYY}.sqlite"
)
cmeps_pcp_3h_lag6 <- join_to_fcst(
cmeps_pcp_3h_lag6,
obs_pcp_3h
)
verif_cmeps_pcp_3h_lag6 <- ens_verify(cmeps_pcp_3h_lag6, AccPcp3h, verify_members = FALSE)
And the plotting can be done using plot_point_verif from harpVis:
library(harpVis)
plot_point_verif(verif_cmeps_pcp_3h_lag6, spread_skill)

plot_point_verif(verif_cmeps_pcp_3h_lag6, mean_bias)

plot_point_verif(verif_cmeps_pcp_3h_lag6, crps)

The fall in number of cases with lead time is due to the observations only going to 21 UTC 24 Feb 2019.
We can also verify each cycle separately.
verif_cmeps_pcp_3h_lag6_cycles <- ens_verify(
cmeps_pcp_3h_lag6,
AccPcp3h,
groupings = c("leadtime", "fcst_cycle"),
verify_members = FALSE
)
plot_point_verif(verif_cmeps_pcp_3h_lag6_cycles, spread_skill, facet_by = vars(fcst_cycle), num_facet_cols = 2)

plot_point_verif(verif_cmeps_pcp_3h_lag6_cycles, mean_bias, facet_by = vars(fcst_cycle), num_facet_cols = 2)

plot_point_verif(verif_cmeps_pcp_3h_lag6_cycles, crps, facet_by = vars(fcst_cycle), num_facet_cols = 2)

Are these the correct cycles?
By using 0, 6, 12 and 18 as the parent cycles, we are not using the optimum combination of members - we use 3 members from the most recent cycle, 9 members from the previous cycle and 6 members from the cycle before that. To use the members related to the most recent output streams it would seem advantageous to use 2, 8, 14 and 20 as the parent cycles.
cmeps_pcp_3h_lag6_off2 <- lag_forecast(
cmeps_pcp_3h,
"CMEPS_prod",
seq(2, 20, 6)
)
Here it is difficult to select the common cases with MEPS since the run times are different, but we hope for the best!
cmeps_pcp_3h_lag6_off2 <- join_to_fcst(cmeps_pcp_3h_lag6_off2, obs_pcp_3h)
We can then bind the two together - there is no dedicated function to do this yet, so a work around is used using new_harp_fcst, which is not exported from harpPoint and map2 and bind_rows from the purrr and dplyr packages respectively. Proposal for new function: bind_fcst_models
library(purrr)
library(dplyr)
cmeps_pcp_3h_lag6_all <- map2(
cmeps_pcp_3h_lag6,
cmeps_pcp_3h_lag6_off2,
bind_rows
) %>%
harpPoint:::new_harp_fcst()
Now the verfication can be done for all of these cycles.
verif_cmeps_pcp_3h_lag6_cycles <- ens_verify(
cmeps_pcp_3h_lag6_all,
AccPcp3h,
groupings = c("leadtime", "fcst_cycle"),
verify_members = FALSE
)
plot_point_verif(verif_cmeps_pcp_3h_lag6_cycles, spread_skill, facet_by = vars(fcst_cycle), num_facet_cols = 2)

plot_point_verif(verif_cmeps_pcp_3h_lag6_cycles, mean_bias, facet_by = vars(fcst_cycle), num_facet_cols = 2)

plot_point_verif(verif_cmeps_pcp_3h_lag6_cycles, crps, facet_by = vars(fcst_cycle), num_facet_cols = 2)

It doesn’t appear to make a lot of difference for 3h precipitation. However, strange behaviour was seen previously with mean sea level pressure. How is this affected by using different cycles?
cmeps_pmsl <- read_point_forecast(
2019021321,
2019022421,
c("CMEPS_prod", "MEPS_prod"),
"EPS",
"Pmsl",
by = "3h",
lead_time = seq(0, 63, 3),
lags = list(CMEPS_prod = c("0h", "1h", "2h"), MEPS_prod = "0h"),
file_path = "/lustre/storeB/users/andrewts/harp_test_data/harp_output/FCTABLE"
)
Note that MEPS_prod pressure is in Pa - convert to hPa for consistency with CMEPS_prod.
cmeps_pmsl$MEPS_prod <- cmeps_pmsl$MEPS_prod %>%
mutate_at(vars(contains("_mbr")), funs(. / 100))
cmeps_pmsl_lag6 <- lag_forecast(
cmeps_pmsl,
"CMEPS_prod",
seq(0, 18, 6)
) %>%
common_cases()
cmeps_pmsl_lag6_off2 <- lag_forecast(
cmeps_pmsl,
"CMEPS_prod",
seq(0, 18, 6) + 2
)
cmeps_pmsl_lag6_all <- map2(
cmeps_pmsl_lag6["CMEPS_prod"],
cmeps_pmsl_lag6_off2["CMEPS_prod"],
bind_rows
)
cmeps_pmsl_lag6_all <- c(cmeps_pmsl_lag6_all, cmeps_pmsl["MEPS_prod"]) %>%
harpPoint:::new_harp_fcst()
Now we can read in the observations and do the verification.
obs_pmsl <- read_point_obs(
first_validdate(cmeps_pmsl_lag6_all),
last_validdate(cmeps_pmsl_lag6_all),
"Pmsl",
obs_path = "/lustre/storeB/users/andrewts/harp_test_data/harp_output/OBSTABLE",
obsfile_template = "OBS_{YYYY}.sqlite"
)
Note that the observations are in Pa and the forecasts are in hPa - so divide observations by 100 before joining.
obs_pmsl <- obs_pmsl %>%
mutate(Pmsl = Pmsl / 100)
cmeps_pmsl_lag6 <- join_to_fcst(cmeps_pmsl_lag6, obs_pmsl)
cmeps_pmsl_lag6_all <- join_to_fcst(cmeps_pmsl_lag6_all, obs_pmsl)
First we do the verification for all cycles together.
verif_cmeps_pmsl_lag6 <- ens_verify(cmeps_pmsl_lag6, Pmsl, verify_members = FALSE)
And plot some results.
plot_point_verif(verif_cmeps_pmsl_lag6, spread_skill)

plot_point_verif(verif_cmeps_pmsl_lag6, mean_bias)

plot_point_verif(verif_cmeps_pmsl_lag6, crps)

And now for the cycles individually:
verif_cmeps_pmsl_lag6_all <- ens_verify(cmeps_pmsl_lag6_all, Pmsl, groupings = c("leadtime", "fcst_cycle"), verify_members = FALSE)
And plot some results.
plot_point_verif(verif_cmeps_pmsl_lag6_all, spread_skill, facet_by = vars(fcst_cycle), num_facet_cols = 2)

plot_point_verif(verif_cmeps_pmsl_lag6_all, mean_bias, facet_by = vars(fcst_cycle), num_facet_cols = 2)

plot_point_verif(verif_cmeps_pmsl_lag6_all, crps, facet_by = vars(fcst_cycle), num_facet_cols = 2)

---
title: "EPS verification with a lagged ensemble using harp"
output: html_notebook
---

###Introduction
In MetCoOp there is a move towards spreading the load on the HPC(s) by running a small number of ensemble members for MEPS each hour. This contrasts with the normal method of running all members every 6 hours. The enesmeble system is called CMEPS (what does the C stand for?). To produce an ensemble forecast, the members are lagged in time so that a full ensemble is available every 3 hours, which is lagged further to produce a larger ensemble every 6 hours. 

For verification, this means that the data have to be lagged before the scores can be computed. In the first instance, the model outputs are interpolated to station locations and written out to FCTABLE files - at this stage, no lead times or forecast dates are updated to reflect the lagging. However, in order to read to the correct members at the correct times, the lagging information is required. 

For CMEPS, members 0, 1 and 2 are run every three hours at 0, 3, 6, ..., 21 UTC. For the purpose of clarity we will refer to these members as the *parent* members and the lagged members as the *child* members. The child members 3, 4 and 7 are run 1 hour later than the parents, and the child members 5, 6 and 8 are run 2 hours later than the parents, i.e. at 2, 5, 8, ..., 23 UTC. However, for a **lagged** forecast, we are interested in the child members that belong to the previous forecast, since the child members from the current parent will not be available yet. 

The conversion to sqlite can easily be done using the `read_eps_interpolate` function from harpIO and specifying the lagging of each member in the `lags` argument. 

```{r results='hide', eval=FALSE}
library(harpIO)
read_eps_interpolate(
  2019021312, 
  2019022421, 
  "CMEPS_prod", 
  "Pcp", 
  by          = "3h", 
  lead_time   = seq(0, 63, 3), 
  members_in  = seq(0, 8), 
  lags        = paste0(c(0, 0, 0, 2, 2, 1, 1, 2, 1), "h"), 
  file_path   = "/lustre/storeB/users/andrewts/harp_test_data/vfld",
  sqlite_path = "/lustre/storeB/users/andrewts/harp_test_data/harp_output/FCTABLE"
)
```

Using the standard sqlite file name template, an sqlite file is created for each lead time, containing only those members that were available for that lead time. Note that it's quite a slow process and maybe provides motivation for re-implementing the old method for reading from vfld files - just to compare the speed, and also because support for vfld version 2 is needed.  

For verification, the data are read in from the sqlite files using read_point_forecast from harpIO. The parent forecast times are basically as derived from the `start_date`, `end_date` and `by` arguments. All of the appropriate lagged members are read in for the relative times and lead times derived from the `lags` argument. Here we do not need to specifify what the lags are for each member as all data that can be found will be read in. As the lagging is done to 6 hours, need to make sure we adjust the start date to account for 6 hour lagging (take it back 3 hours, as the 0, 1, 2 lags are already there)

```{r results='hide'}
cmeps_pcp_3h <- read_point_forecast(
  2019021321, 
  2019022421, 
  c("CMEPS_prod", "MEPS_prod"), 
  "EPS", 
  "AccPcp3h", 
  by        = "3h", 
  lead_time = seq(0, 63, 3), 
  lags      = list(CMEPS_prod = c("0h", "1h", "2h"), MEPS_prod = "0h"), 
  file_path = "/lustre/storeB/users/andrewts/harp_test_data/harp_output/FCTABLE"
)
```

The above could equally be done with `by = "6h"` and `lags = list(CMEPS_prod = paste0(seq(0, 5), "h"), MEPS_prod = "0h")` without changing anything, or indeed `by = "1h"` and lags not set.

Before we can verify, we must actually do the lagging. This is done using the lag_forecast function from harpPoint. In this function, the forecast model in the harp_fcst object you want to do the lagging for, and the parent forecast cycles for the lagged forecast need to be specified. All child members for the parents for each cycle will automatically be found - the child cycles are essentially all cycles between the parent cycles. The function is not able to include child cycles that are older than the next oldest parent cycle, but there is a way around this that will be shown later. 

For CMEPS, the intention is to have an ensemble every 6 hours including all members from the current cycle and the cycles for the previous 5 hours. We can do this by lagging the forecast with parent cycles at 0, 6, 12 and 18 UTC. 

```{r}
library(harpPoint)
cmeps_pcp_3h_lag6 <- lag_forecast(
  cmeps_pcp_3h,
  "CMEPS_prod",
  seq(0, 18, 6)
)
```

For a fair comparison of CMEPS and MEPS, the common cases between the two systems should be selected.

```{r}
cmeps_pcp_3h_lag6 <- common_cases(cmeps_pcp_3h_lag6)
```

Now we can read in the observations and do the verification.
```{r results='hide'}
obs_pcp_3h <- read_point_obs(
  first_validdate(cmeps_pcp_3h_lag6), 
  last_validdate(cmeps_pcp_3h_lag6), 
  "AccPcp3h", 
  obs_path = "/lustre/storeB/users/andrewts/harp_test_data/harp_output/OBSTABLE", 
  obsfile_template = "OBS_{YYYY}.sqlite"
)
```

```{r message=FALSE}
cmeps_pcp_3h_lag6 <- join_to_fcst(
  cmeps_pcp_3h_lag6,
  obs_pcp_3h
)
```

```{r message=FALSE, results='hide'}
verif_cmeps_pcp_3h_lag6 <- ens_verify(cmeps_pcp_3h_lag6, AccPcp3h, verify_members = FALSE)
```

And the plotting can be done using plot_point_verif from harpVis:
```{r message=FALSE, fig.height=6, fig.width=8}
library(harpVis)
plot_point_verif(verif_cmeps_pcp_3h_lag6, spread_skill)
plot_point_verif(verif_cmeps_pcp_3h_lag6, mean_bias)
plot_point_verif(verif_cmeps_pcp_3h_lag6, crps)
```

The fall in number of cases with lead time is due to the observations only going to 21 UTC 24 Feb 2019.

We can also verify each cycle separately.
```{r, message=FALSE, results='hide'}
verif_cmeps_pcp_3h_lag6_cycles <- ens_verify(
  cmeps_pcp_3h_lag6, 
  AccPcp3h, 
  groupings = c("leadtime", "fcst_cycle"), 
  verify_members = FALSE
)
```

```{r warning=FALSE}
plot_point_verif(verif_cmeps_pcp_3h_lag6_cycles, spread_skill, facet_by = vars(fcst_cycle), num_facet_cols = 2)
plot_point_verif(verif_cmeps_pcp_3h_lag6_cycles, mean_bias, facet_by = vars(fcst_cycle), num_facet_cols = 2)
plot_point_verif(verif_cmeps_pcp_3h_lag6_cycles, crps, facet_by = vars(fcst_cycle), num_facet_cols = 2)
```

###Are these the correct cycles?
By using 0, 6, 12 and 18 as the parent cycles, we are not using the optimum combination of members - we use 3 members from the most recent cycle, 9 members from the previous cycle and 6 members from the cycle before that. To use the members related to the most recent output streams it would seem advantageous to use 2, 8, 14 and 20 as the parent cycles. 

```{r}
cmeps_pcp_3h_lag6_off2 <- lag_forecast(
  cmeps_pcp_3h, 
  "CMEPS_prod",
  seq(2, 20, 6)
)
```

Here it is difficult to select the common cases with MEPS since the run times are different, but we hope for the best!

```{r message=FALSE}
cmeps_pcp_3h_lag6_off2 <- join_to_fcst(cmeps_pcp_3h_lag6_off2, obs_pcp_3h)
```

We can then bind the two together - there is no dedicated function to do this yet, so a work around is used using new_harp_fcst, which is not exported from harpPoint and map2 and bind_rows from the purrr and dplyr packages respectively. **Proposal for new function: bind_fcst_models**

```{r}
library(purrr)
library(dplyr)
cmeps_pcp_3h_lag6_all <- map2(
  cmeps_pcp_3h_lag6,
  cmeps_pcp_3h_lag6_off2,
  bind_rows
) %>% 
harpPoint:::new_harp_fcst()
```

Now the verfication can be done for all of these cycles.
```{r, message=FALSE, results='hide'}
verif_cmeps_pcp_3h_lag6_cycles <- ens_verify(
  cmeps_pcp_3h_lag6_all, 
  AccPcp3h, 
  groupings = c("leadtime", "fcst_cycle"), 
  verify_members = FALSE
)
```

```{r warning=FALSE, fig.height=8, fig.width=8}
plot_point_verif(verif_cmeps_pcp_3h_lag6_cycles, spread_skill, facet_by = vars(fcst_cycle), num_facet_cols = 2)
plot_point_verif(verif_cmeps_pcp_3h_lag6_cycles, mean_bias, facet_by = vars(fcst_cycle), num_facet_cols = 2)
plot_point_verif(verif_cmeps_pcp_3h_lag6_cycles, crps, facet_by = vars(fcst_cycle), num_facet_cols = 2)
```

It doesn't appear to make a lot of difference for 3h precipitation. However, strange behaviour was seen previously with mean sea level pressure. How is this affected by using different cycles?

```{r results='hide'}
cmeps_pmsl <- read_point_forecast(
  2019021321, 
  2019022421, 
  c("CMEPS_prod", "MEPS_prod"), 
  "EPS", 
  "Pmsl", 
  by        = "3h", 
  lead_time = seq(0, 63, 3), 
  lags      = list(CMEPS_prod = c("0h", "1h", "2h"), MEPS_prod = "0h"), 
  file_path = "/lustre/storeB/users/andrewts/harp_test_data/harp_output/FCTABLE"
)
```

Note that MEPS_prod pressure is in Pa - convert to hPa for consistency with CMEPS_prod. 

```{r}
cmeps_pmsl$MEPS_prod <- cmeps_pmsl$MEPS_prod %>% 
  mutate_at(vars(contains("_mbr")), funs(. / 100))
```

```{r}
cmeps_pmsl_lag6 <- lag_forecast(
  cmeps_pmsl,
  "CMEPS_prod",
  seq(0, 18, 6)
) %>% 
  common_cases()

cmeps_pmsl_lag6_off2 <- lag_forecast(
  cmeps_pmsl,
  "CMEPS_prod",
  seq(0, 18, 6) + 2
)

cmeps_pmsl_lag6_all <- map2(
  cmeps_pmsl_lag6["CMEPS_prod"],
  cmeps_pmsl_lag6_off2["CMEPS_prod"],
  bind_rows
) 

cmeps_pmsl_lag6_all <- c(cmeps_pmsl_lag6_all, cmeps_pmsl["MEPS_prod"]) %>% 
  harpPoint:::new_harp_fcst()
```

Now we can read in the observations and do the verification.
```{r results='hide'}
obs_pmsl <- read_point_obs(
  first_validdate(cmeps_pmsl_lag6_all), 
  last_validdate(cmeps_pmsl_lag6_all), 
  "Pmsl", 
  obs_path = "/lustre/storeB/users/andrewts/harp_test_data/harp_output/OBSTABLE", 
  obsfile_template = "OBS_{YYYY}.sqlite"
)
```

Note that the observations are in Pa and the forecasts are in hPa - so divide observations by 100 before joining. 

```{r}
obs_pmsl <- obs_pmsl %>% 
  mutate(Pmsl = Pmsl / 100)
```


```{r message=FALSE}
cmeps_pmsl_lag6     <- join_to_fcst(cmeps_pmsl_lag6, obs_pmsl)
cmeps_pmsl_lag6_all <- join_to_fcst(cmeps_pmsl_lag6_all, obs_pmsl)
```

First we do the verification for all cycles together.
```{r results='hide'}
verif_cmeps_pmsl_lag6 <- ens_verify(cmeps_pmsl_lag6, Pmsl, verify_members = FALSE)
```

And plot some results.
```{r message=FALSE, fig.height=6, fig.width=8}
plot_point_verif(verif_cmeps_pmsl_lag6, spread_skill)
plot_point_verif(verif_cmeps_pmsl_lag6, mean_bias)
plot_point_verif(verif_cmeps_pmsl_lag6, crps)
```

And now for the cycles individually:
```{r results='hide'}
verif_cmeps_pmsl_lag6_all <- ens_verify(cmeps_pmsl_lag6_all, Pmsl, groupings = c("leadtime", "fcst_cycle"), verify_members = FALSE)
```

And plot some results.
```{r message=FALSE, warning=FALSE, fig.height=8, fig.width=8}
plot_point_verif(verif_cmeps_pmsl_lag6_all, spread_skill, facet_by = vars(fcst_cycle), num_facet_cols = 2)
plot_point_verif(verif_cmeps_pmsl_lag6_all, mean_bias, facet_by = vars(fcst_cycle), num_facet_cols = 2)
plot_point_verif(verif_cmeps_pmsl_lag6_all, crps, facet_by = vars(fcst_cycle), num_facet_cols = 2)
```
