This tutorial shows you how to read data in from SQLite files and run a verification process.
In Part 1 we converted the forecast data and observation data into SQLite format. The next step is to read the data back in and run the verification. Before we begin, let's make sure we have all the packages we need attached (we now need harpPoint as well as harpIO) and the metadata, i.e. paths etc.
library(harpIO)
library(harpPoint)
fctable_dir <- "/scratch/ms/no/fa1m/harp_output/FCTABLE"
obstable_dir <- "/scratch/ms/no/fa1m/harp_output/OBSTABLE"
first_fcst <- 2017053000
last_fcst <- 2017053100
fcst_freq <- "1d"
fcst_models <- c("model1", "model2", "model3")
param <- "T2m"
lead_times <- seq(0, 36, 3)
Now we can read in the forecast data using read_point_forecast from the harpIO package, and then print it to the screen by simply typing the variable name.
fcst_T2m <- read_point_forecast(
start_date = first_fcst,
end_date = last_fcst,
fcst_model = fcst_models,
fcst_type = "eps",
parameter = param,
lead_time = lead_times,
file_path = fctable_dir
)
fcst_T2m
## ● model1
## # A tibble: 17,444 x 10
## SID fcdate leadtime validdate model1_mbr000 model1_mbr001
## <int> <int> <int> <int> <dbl> <dbl>
## 1 1001 1.50e9 3 1.50e9 271. 272.
## 2 1010 1.50e9 3 1.50e9 276. 276.
## 3 1014 1.50e9 3 1.50e9 276. 276.
## 4 1015 1.50e9 3 1.50e9 277. 277.
## 5 1018 1.50e9 3 1.50e9 273. 273.
## 6 1023 1.50e9 3 1.50e9 275. 274.
## 7 1025 1.50e9 3 1.50e9 275. 275.
## 8 1026 1.50e9 3 1.50e9 274. 274.
## 9 1027 1.50e9 3 1.50e9 275. 275.
## 10 1033 1.50e9 3 1.50e9 276. 275.
## # … with 17,434 more rows, and 4 more variables: model1_mbr002 <dbl>,
## # model1_mbr003 <dbl>, model1_mbr004 <dbl>, fcst_cycle <chr>
##
## ● model2
## # A tibble: 32,396 x 13
## SID fcdate leadtime validdate model2_mbr000 model2_mbr001
## <int> <int> <int> <int> <dbl> <dbl>
## 1 1001 1.50e9 0 1.50e9 271. 272.
## 2 1010 1.50e9 0 1.50e9 276. 276.
## 3 1014 1.50e9 0 1.50e9 274. 274.
## 4 1015 1.50e9 0 1.50e9 276. 276.
## 5 1018 1.50e9 0 1.50e9 272. 272.
## 6 1023 1.50e9 0 1.50e9 273. 273.
## 7 1025 1.50e9 0 1.50e9 274. 274.
## 8 1026 1.50e9 0 1.50e9 274. 273.
## 9 1027 1.50e9 0 1.50e9 275. 275.
## 10 1033 1.50e9 0 1.50e9 276. 276.
## # … with 32,386 more rows, and 7 more variables: model2_mbr002 <dbl>,
## # model2_mbr003 <dbl>, model2_mbr004 <dbl>, model2_mbr005 <dbl>,
## # model2_mbr006 <dbl>, model2_mbr007 <dbl>, fcst_cycle <chr>
##
## ● model3
## # A tibble: 32,396 x 13
## SID fcdate leadtime validdate model3_mbr000 model3_mbr001
## <int> <int> <int> <int> <dbl> <dbl>
## 1 1001 1.50e9 0 1.50e9 271. 272.
## 2 1010 1.50e9 0 1.50e9 276. 277.
## 3 1014 1.50e9 0 1.50e9 274. 275.
## 4 1015 1.50e9 0 1.50e9 276. 277.
## 5 1018 1.50e9 0 1.50e9 272. 273.
## 6 1023 1.50e9 0 1.50e9 273. 273.
## 7 1025 1.50e9 0 1.50e9 274. 275.
## 8 1026 1.50e9 0 1.50e9 274. 274.
## 9 1027 1.50e9 0 1.50e9 275. 275.
## 10 1033 1.50e9 0 1.50e9 276. 276.
## # … with 32,386 more rows, and 7 more variables: model3_mbr002 <dbl>,
## # model3_mbr003 <dbl>, model3_mbr004 <dbl>, model3_mbr005 <dbl>,
## # model3_mbr006 <dbl>, model3_mbr007 <dbl>, fcst_cycle <chr>
The first thing you should notice is how much quicker that was than reading in the vfld files! As you can see, the variable we called fcst_T2m is a list of 3 tables - one for each model. You will see that both model2 and model3 have 32,396 rows, whereas model1 only has 17,444 rows. Remember those missing file warnings from Part 1? They are the reason for the different number of rows. When we verify different models we want the verification to be fair, using only the cases that are common to all of the models. Fortunately we can select those common cases with the common_cases function from harpPoint. We will also introduce %>% here, which is the pipe operator. Essentially it means: send what is on the left-hand side of %>% to the function on the right-hand side as its first argument. Any extra arguments to the function go inside the brackets.
fcst_T2m <- fcst_T2m %>%
common_cases()
fcst_T2m
## ● model1
## # A tibble: 17,444 x 10
## SID fcdate leadtime validdate model1_mbr000 model1_mbr001
## <int> <int> <int> <int> <dbl> <dbl>
## 1 1001 1.50e9 3 1.50e9 271. 272.
## 2 1010 1.50e9 3 1.50e9 276. 276.
## 3 1014 1.50e9 3 1.50e9 276. 276.
## 4 1015 1.50e9 3 1.50e9 277. 277.
## 5 1018 1.50e9 3 1.50e9 273. 273.
## 6 1023 1.50e9 3 1.50e9 275. 274.
## 7 1025 1.50e9 3 1.50e9 275. 275.
## 8 1026 1.50e9 3 1.50e9 274. 274.
## 9 1027 1.50e9 3 1.50e9 275. 275.
## 10 1033 1.50e9 3 1.50e9 276. 275.
## # … with 17,434 more rows, and 4 more variables: model1_mbr002 <dbl>,
## # model1_mbr003 <dbl>, model1_mbr004 <dbl>, fcst_cycle <chr>
##
## ● model2
## # A tibble: 17,444 x 13
## SID fcdate leadtime validdate model2_mbr000 model2_mbr001
## <int> <int> <int> <int> <dbl> <dbl>
## 1 1001 1.50e9 3 1.50e9 271. 272.
## 2 1010 1.50e9 3 1.50e9 276. 276.
## 3 1014 1.50e9 3 1.50e9 276. 276.
## 4 1015 1.50e9 3 1.50e9 277. 277.
## 5 1018 1.50e9 3 1.50e9 273. 273.
## 6 1023 1.50e9 3 1.50e9 275. 275.
## 7 1025 1.50e9 3 1.50e9 275. 275.
## 8 1026 1.50e9 3 1.50e9 274. 274.
## 9 1027 1.50e9 3 1.50e9 275. 275.
## 10 1033 1.50e9 3 1.50e9 276. 275.
## # … with 17,434 more rows, and 7 more variables: model2_mbr002 <dbl>,
## # model2_mbr003 <dbl>, model2_mbr004 <dbl>, model2_mbr005 <dbl>,
## # model2_mbr006 <dbl>, model2_mbr007 <dbl>, fcst_cycle <chr>
##
## ● model3
## # A tibble: 17,444 x 13
## SID fcdate leadtime validdate model3_mbr000 model3_mbr001
## <int> <int> <int> <int> <dbl> <dbl>
## 1 1001 1.50e9 3 1.50e9 271. 272.
## 2 1010 1.50e9 3 1.50e9 276. 276.
## 3 1014 1.50e9 3 1.50e9 276. 276.
## 4 1015 1.50e9 3 1.50e9 277. 277.
## 5 1018 1.50e9 3 1.50e9 273. 273.
## 6 1023 1.50e9 3 1.50e9 275. 275.
## 7 1025 1.50e9 3 1.50e9 275. 275.
## 8 1026 1.50e9 3 1.50e9 274. 274.
## 9 1027 1.50e9 3 1.50e9 275. 275.
## 10 1033 1.50e9 3 1.50e9 276. 276.
## # … with 17,434 more rows, and 7 more variables: model3_mbr002 <dbl>,
## # model3_mbr003 <dbl>, model3_mbr004 <dbl>, model3_mbr005 <dbl>,
## # model3_mbr006 <dbl>, model3_mbr007 <dbl>, fcst_cycle <chr>
Now you will see that all models have 17,444 rows of data, so any verification will be fair.
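The pipe used above is simply another way of writing an ordinary function call, with the left-hand side becoming the first argument. As a small illustrative sketch:

```r
# The pipe sends the left-hand side as the first argument to the
# function on the right, so this:
fcst_T2m <- fcst_T2m %>% common_cases()
# is equivalent to the ordinary function call:
fcst_T2m <- common_cases(fcst_T2m)
```

The pipe becomes more useful as workflows grow, since several steps can be chained together without nesting calls inside each other.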
Now that we have the forecast data read in, it's time to read in the observations with the read_point_obs function from harpIO. Typically OBSTABLE files hold one year of data, so to get the dates we need we can use the first_validdate and last_validdate functions from the harpPoint package to extract that information from the forecast data.
obs_T2m <- read_point_obs(
start_date = first_validdate(fcst_T2m),
end_date = last_validdate(fcst_T2m),
parameter = param,
obs_path = obstable_dir
)
## Getting T2m observations for 201705300300-201705311200
##
## Reading: /scratch/ms/no/fa1m/harp_output/OBSTABLE/OBSTABLE_2017.sqlite
obs_T2m
## # A tibble: 68,289 x 3
## validdate SID T2m
## <int> <int> <dbl>
## 1 1496113200 1001 272.
## 2 1496113200 1002 270.
## 3 1496113200 1003 273
## 4 1496113200 1006 271.
## 5 1496113200 1007 274.
## 6 1496113200 1008 273
## 7 1496113200 1009 270.
## 8 1496113200 1010 275.
## 9 1496113200 1011 272.
## 10 1496113200 1013 273.
## # … with 68,279 more rows
You will see that obs_T2m is a simple table with columns for validdate, SID (station ID) and T2m. By default a gross error check is done on the observations to remove unrealistic values. There are default values for the allowed range, but you can also set the range yourself using the min_obs_allowed and max_obs_allowed arguments. The removed observations are returned in an attribute called bad_obs, and we can use the attr function to see them.
attr(obs_T2m, "bad_obs")
## # A tibble: 3 x 3
## validdate SID T2m
## <int> <int> <dbl>
## 1 1496134800 40897 324.
## 2 1496145600 41287 323.
## 3 1496145600 41293 323.
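As a sketch of setting the allowed range yourself, using the min_obs_allowed and max_obs_allowed arguments mentioned above (the limits here are purely illustrative):

```r
# Sketch: only keep T2m observations between 223 K and 323 K.
# Values outside this illustrative range are removed and
# returned in the "bad_obs" attribute as before.
obs_T2m <- read_point_obs(
  start_date      = first_validdate(fcst_T2m),
  end_date        = last_validdate(fcst_T2m),
  parameter       = param,
  obs_path        = obstable_dir,
  min_obs_allowed = 223,
  max_obs_allowed = 323
)
```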
We now have to join this to the forecast data matching the valid dates and station IDs. This can be done using the join_to_fcst function from harpPoint.
fcst_T2m <- fcst_T2m %>%
join_to_fcst(obs_T2m)
## Joining, by = c("SID", "validdate")
## Joining, by = c("SID", "validdate")
## Joining, by = c("SID", "validdate")
fcst_T2m
## ● model1
## # A tibble: 11,365 x 11
## SID fcdate leadtime validdate model1_mbr000 model1_mbr001
## <int> <int> <int> <int> <dbl> <dbl>
## 1 1001 1.50e9 3 1.50e9 271. 272.
## 2 1010 1.50e9 3 1.50e9 276. 276.
## 3 1015 1.50e9 3 1.50e9 277. 277.
## 4 1018 1.50e9 3 1.50e9 273. 273.
## 5 1023 1.50e9 3 1.50e9 275. 274.
## 6 1025 1.50e9 3 1.50e9 275. 275.
## 7 1026 1.50e9 3 1.50e9 274. 274.
## 8 1027 1.50e9 3 1.50e9 275. 275.
## 9 1033 1.50e9 3 1.50e9 276. 275.
## 10 1035 1.50e9 3 1.50e9 270. 270.
## # … with 11,355 more rows, and 5 more variables: model1_mbr002 <dbl>,
## # model1_mbr003 <dbl>, model1_mbr004 <dbl>, fcst_cycle <chr>, T2m <dbl>
##
## ● model2
## # A tibble: 11,365 x 14
## SID fcdate leadtime validdate model2_mbr000 model2_mbr001
## <int> <int> <int> <int> <dbl> <dbl>
## 1 1001 1.50e9 3 1.50e9 271. 272.
## 2 1010 1.50e9 3 1.50e9 276. 276.
## 3 1015 1.50e9 3 1.50e9 277. 277.
## 4 1018 1.50e9 3 1.50e9 273. 273.
## 5 1023 1.50e9 3 1.50e9 275. 275.
## 6 1025 1.50e9 3 1.50e9 275. 275.
## 7 1026 1.50e9 3 1.50e9 274. 274.
## 8 1027 1.50e9 3 1.50e9 275. 275.
## 9 1033 1.50e9 3 1.50e9 276. 275.
## 10 1035 1.50e9 3 1.50e9 270. 270.
## # … with 11,355 more rows, and 8 more variables: model2_mbr002 <dbl>,
## # model2_mbr003 <dbl>, model2_mbr004 <dbl>, model2_mbr005 <dbl>,
## # model2_mbr006 <dbl>, model2_mbr007 <dbl>, fcst_cycle <chr>, T2m <dbl>
##
## ● model3
## # A tibble: 11,365 x 14
## SID fcdate leadtime validdate model3_mbr000 model3_mbr001
## <int> <int> <int> <int> <dbl> <dbl>
## 1 1001 1.50e9 3 1.50e9 271. 272.
## 2 1010 1.50e9 3 1.50e9 276. 276.
## 3 1015 1.50e9 3 1.50e9 277. 277.
## 4 1018 1.50e9 3 1.50e9 273. 273.
## 5 1023 1.50e9 3 1.50e9 275. 275.
## 6 1025 1.50e9 3 1.50e9 275. 275.
## 7 1026 1.50e9 3 1.50e9 274. 274.
## 8 1027 1.50e9 3 1.50e9 275. 275.
## 9 1033 1.50e9 3 1.50e9 276. 276.
## 10 1035 1.50e9 3 1.50e9 270. 270.
## # … with 11,355 more rows, and 8 more variables: model3_mbr002 <dbl>,
## # model3_mbr003 <dbl>, model3_mbr004 <dbl>, model3_mbr005 <dbl>,
## # model3_mbr006 <dbl>, model3_mbr007 <dbl>, fcst_cycle <chr>, T2m <dbl>
You will now see that there are 11,365 rows of data for each model, so we have lost about 6,000 cases. This is because observations were not available for those cases, so they are dropped.
We can now do a final check on the observations by comparing them with the forecasts. If an observation is more than a certain number of standard deviations from the closest ensemble member, it is dropped for being unrepresentative. This is done using the check_obs_against_fcst function. Given that we only have two days of data, the standard deviations might be too small for this to be sensible, but we will do it anyway to demonstrate what you would do for larger datasets.
fcst_T2m <- fcst_T2m %>%
check_obs_against_fcst(T2m)
In the call to check_obs_against_fcst we include the argument T2m. We do not quote the argument since it refers to a column in the data. Again, the removed cases are returned in an attribute, this time called removed_cases.
attr(fcst_T2m, "removed_cases")
## # A tibble: 200 x 3
## SID validdate T2m
## <int> <int> <dbl>
## 1 1015 1496221200 280.
## 2 1036 1496113200 269.
## 3 1036 1496124000 269.
## 4 1036 1496134800 270.
## 5 1036 1496145600 271.
## 6 1036 1496156400 271.
## 7 1036 1496167200 270.
## 8 1036 1496178000 270
## 9 1036 1496199600 270.
## 10 1036 1496210400 270.
## # … with 190 more rows
We have removed 200 cases, and probably a large number of them were in fact representative, but, as mentioned, the small amount of data means that this test might be a little too strict.
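If the default tolerance proves too strict, the number of standard deviations allowed can be changed. A sketch, assuming the argument is called num_sd_allowed (check the help page for check_obs_against_fcst to confirm the name and its default):

```r
# Sketch: allow observations to be up to 8 ensemble standard
# deviations from the closest member before being removed.
# The argument name num_sd_allowed is an assumption - consult
# ?check_obs_against_fcst before relying on it.
fcst_T2m <- fcst_T2m %>%
  check_obs_against_fcst(T2m, num_sd_allowed = 8)
```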
Now that we have all the data read in, we can run the verification. This is done using the ens_verify function from harpPoint. As a minimum, the summary scores will be calculated, but if we give it some thresholds, categorical scores for those thresholds will also be calculated. Here we will take the thresholds from the 25th, 50th, 75th, 90th and 95th percentiles of the observations. We can access the T2m column in obs_T2m using $.
verif_T2m <- ens_verify(
fcst_T2m,
T2m,
thresholds = quantile(obs_T2m$T2m, c(0.25, 0.5, 0.75, 0.9, 0.95)),
show_progress = FALSE
)
The first argument to ens_verify is always the data, and the second is the column in the data that we wish to verify against. Since it is a column name, it is unquoted. Here we set show_progress = FALSE - normally progress bars are shown, but they do not seem to behave that well on ecgate. You can experiment with removing that argument, as the default value is TRUE.
Now we can look at the verification data:
verif_T2m
## $ens_summary_scores
## # A tibble: 36 x 11
## mname leadtime num_cases mean_bias rmse stde spread crps
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 mode… 3 1566 -0.348 1.29 1.24 0.423 0.850
## 2 mode… 6 1649 -0.311 1.19 1.15 0.459 0.776
## 3 mode… 9 770 0.00976 1.68 1.68 0.813 1.05
## 4 mode… 12 797 -0.0880 1.91 1.91 0.989 1.21
## 5 mode… 15 770 -0.357 1.91 1.88 1.05 1.19
## 6 mode… 18 829 -0.392 1.67 1.62 0.848 1.00
## 7 mode… 21 774 -0.299 1.45 1.42 0.710 0.897
## 8 mode… 24 810 -0.451 1.55 1.48 0.781 0.971
## 9 mode… 27 789 -0.478 1.40 1.32 0.742 0.881
## 10 mode… 30 831 -0.456 1.25 1.17 0.782 0.777
## # … with 26 more rows, and 3 more variables: crps_potential <dbl>,
## # crps_reliability <dbl>, rank_histogram <list>
##
## $ens_threshold_scores
## # A tibble: 180 x 17
## mname threshold leadtime economic_value sample_climatol…
## <chr> <dbl> <int> <list> <dbl>
## 1 mode… 284. 3 <tibble [20 ×… 0.334
## 2 mode… 284. 6 <tibble [20 ×… 0.415
## 3 mode… 284. 9 <tibble [20 ×… 0.588
## 4 mode… 284. 12 <tibble [20 ×… 0.645
## 5 mode… 284. 15 <tibble [20 ×… 0.640
## 6 mode… 284. 18 <tibble [20 ×… 0.546
## 7 mode… 284. 21 <tibble [20 ×… 0.412
## 8 mode… 284. 24 <tibble [20 ×… 0.340
## 9 mode… 284. 27 <tibble [20 ×… 0.332
## 10 mode… 284. 30 <tibble [20 ×… 0.391
## # … with 170 more rows, and 12 more variables: bss_ref_climatology <dbl>,
## # num_cases_for_threshold_total <int>,
## # num_cases_for_threshold_observed <int>,
## # num_cases_for_threshold_forecast <int>, brier_score <dbl>,
## # brier_skill_score <dbl>, brier_score_reliability <dbl>,
## # brier_score_resolution <dbl>, brier_score_uncertainty <dbl>,
## # reliability <list>, roc <list>, roc_area <dbl>
##
## $det_summary_scores
## # A tibble: 252 x 8
## mname leadtime member num_cases bias rmse mae stde
## <chr> <int> <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 model1 3 mbr000 1566 -0.369 1.33 1.00 1.28
## 2 model1 6 mbr000 1649 -0.315 1.22 0.925 1.18
## 3 model1 9 mbr000 770 -0.00320 1.71 1.25 1.71
## 4 model1 12 mbr000 797 -0.0778 1.97 1.48 1.97
## 5 model1 15 mbr000 770 -0.348 1.93 1.45 1.90
## 6 model1 18 mbr000 829 -0.360 1.67 1.20 1.63
## 7 model1 21 mbr000 774 -0.316 1.45 1.07 1.42
## 8 model1 24 mbr000 810 -0.462 1.56 1.19 1.49
## 9 model1 27 mbr000 789 -0.549 1.45 1.10 1.34
## 10 model1 30 mbr000 831 -0.538 1.42 1.06 1.31
## # … with 242 more rows
##
## attr(,"parameter")
## [1] "T2m"
## attr(,"start_date")
## [1] "2017053000"
## attr(,"end_date")
## [1] "2017053100"
## attr(,"num_stations")
## [1] 842
There are three tables of data - one for the ensemble summary scores, one for the ensemble threshold scores, and one for the deterministic summary scores, which holds deterministic scores for each individual ensemble member.
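Since verif_T2m is a named list with attributes, the individual tables and metadata can be pulled out with standard R tools, for example:

```r
# Extract one of the three tables for further inspection
verif_T2m$ens_summary_scores

# The metadata printed at the bottom is stored in attributes
attr(verif_T2m, "parameter")
attr(verif_T2m, "num_stations")
```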
Finally, we save the data using the save_point_verif function from harpIO. This function generates file names depending on the data, so you only need to give it the data and a path. If you want, you can change the template for the file name, but this means that none of the harp shiny apps will work with your saved data.
save_point_verif(verif_T2m, verif_path = "/scratch/ms/no/fa1m/harp_output/verification")
## Saving point verification scores to:
## /scratch/ms/no/fa1m/harp_output/verification/harpPointVerif.harp.T2m.harp.2017053000-2017053100.harp.model1.model.model2.model.model3.rds
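The saved file is a standard .rds file, so it can be read back into a later R session with base R's readRDS. A sketch, using the path printed above:

```r
# Read the saved verification scores back in with base R.
# The path is the one reported by save_point_verif.
verif_T2m <- readRDS(
  "/scratch/ms/no/fa1m/harp_output/verification/harpPointVerif.harp.T2m.harp.2017053000-2017053100.harp.model1.model.model2.model.model3.rds"
)
```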
It is important to know the individual steps that go into the verification process, so that if something goes wrong you can inspect the data. Additionally, you may want to plot particular cases, so you need to know how to read the data in. However, in normal circumstances you would want to do this whole process in one go. harpPoint provides a function to do just that - ens_read_and_verify. So we can do everything that was described above in a single command:
verif_T2m <- ens_read_and_verify(
start_date = first_fcst,
end_date = last_fcst,
parameter = "T2m",
fcst_model = fcst_models,
fcst_path = fctable_dir,
obs_path = obstable_dir,
lead_time = lead_times,
thresholds = seq(285, 305, 5),
verif_path = "/scratch/ms/no/fa1m/harp_output/verification"
)
Note that in the above call, since we do not know the quantiles of the observations a priori, the thresholds are set explicitly.
In Part 3, we will make some plots of the verification scores.