Generalized Pareto Distribution (GPD)

Generalized Pareto Distribution (GPD)

Sometimes using only block maximum can be wasteful if it ignores much of the data. It is often more useful to look at exceedances over a given threshold instead of simply the maximum (or minimum) of the data. extRemes provides for fitting data to GPD models as well as some tools for threshold selection. For more information on the GPD see appendix.

Fitting Data to a GPD

The general procedure for fitting data to a GPD using extRemes is:
Example 1: Hurricane damage
For this example, load the extRemes dataset, damage.R and save it (in R) as damage. That is, Figure 5.1 shows the scatter plot of these data from 1925 to 1995. The data are economic damage of individual hurricanes in billions of U.S. dollars. These data correspond to the count data discussed in the section: Fitting data to a Poisson distribution. To learn more about these data, please see Pielke and Landsea (1998) or Katz (2002). The time series shows that there was a particularly large assessment of economic damage early on (in 1926) of over 70 billion dollars. After this time, assessments are much smaller than this value.


Figure 5.1: Scatter plot of U.S. hurricane damage (in billions $ U.S.).


Diagnostic plots for the GPD fit for these data with economic damage, Dam, as the response variable and a threshold of 6 billion dollars are shown in Figure 5.2. The fit looks pretty good considering the one rather large outlier from 1926 and only 18 values over the threshold. The histogram in Figure 5.2 appears to include all of the data, and not just data above the threshold. However, this is simply a result of the binning algorithm used; in this case the default Sturges algorithm. The same histogram can be plotted, with this or a choice of two other algorithms: Scott or Friedman-Diaconis in the following manner. The histogram shown in Figure 5.3 used the Friedman-Diaconis algorithm. Each choice of breaks algorithm is simply a different algorithm for binning the data for the histogram. The histogram of Figure 5.3 is still a little misleading in that it looks like the lower end point is at 5 billion dollars instead of 6 billion dollars and that it still does not appear to be a good fit to the GPD. In such a case, it is a good idea to play with the histogram in order to make sure that this appearance is not simply an artifact of the R function, hist, before concluding that it is a bad fit. In fact, the histogram shown in Figure 5.4 looks better. It is currently not possible to produce this histogram directly from extRemes. This histogram was produced in the following manner. From the R prompt:

max( damage$models$gpd.fit1$dat)
[1] 72.303
brks seq(6, 72.303, ,15)
hist( damage$models$gpd.fit1, breaks=brks)
See the help file for the R function hist for more details about plotting histograms in R. That is, from the R prompt type:
help( hist)


Figure 5.2: GPD fit for hurricane damage data using a threshold of 6 billion dollars.


For these data, 4.6 billion dollars (1.82 billion dollars) and 0.5 (0.340). The model has an associated negative log-likelihood of about 54.65.


Figure 5.3: Histogram for GPD fit for hurricane damage data using a threshold of 6 billion dollars and the Friedman-Diaconis algorithm for bin breaks.



Figure 5.4: Histogram for GPD fit for hurricane damage data using a threshold of 6 billion dollars and a specialized vector for the breaks. See text for more details.


Example 2: Fort Collins Precipitation Data
An example of a dataset where more information can be gathered using a threshold exceedance approach is the Fort Collins precipitation dataset. Read in the file FtCoPrec.R from the data directory in the extRemes library and assign it to an object called Fort--it may take a few seconds to load this relatively large dataset. This dataset has precipitation data for a single location in Fort Collins, C.O., USA for the time period 1900-1999. These data are of special interest because of a flood that occurred there on July 28, 1997. See Katz et al. (2002) for more information on these data. Figure 5.5 shows a scatter plot of the daily precipitation (by month) at this location. Using extRemes:


Figure 5.5: Scatter plot of observed daily precipitation (inches) values by month for a Fort Collins, C.O. rain gauge.


To fit a GPD model using the toolkit do the following. The threshold of 0.395 inches is used as in Katz et al. (2002). A plot similar to that of Figure 5.6 should appear along with summary statistics for the GPD fit in the main toolkit window. This fit yields MLE's of 0.32 inches (0.016 inches), 0.21 (0.038), and a negative log-likelihood of about 85. Note that we are ignoring, for now, the annual cycle that is evident in Figure 5.5.


Figure 5.6: Diagnostic plots for the GPD fit of the Fort Collins, C.O. Precipitation data using a threshold of 0.395 in.


Figure 5.6 can be reproduced at any time in the following way. Figure 5.7 shows a histogram of the data along with the model fit using the Friedman-Diaconis algorithm for binning (see the help file for hist in R for more details).


Figure 5.7: Histogram of GPD fit to Fort Collins precipitation (inches) data using the Friedman-Diaconis algorithm for determining the number of breakpoints.


The general procedure for plotting a histogram of a fitted GPD function using extRemes is (identical to that of the GEV):

Back to Top

Return level and shape parameter () (1-)% confidence bounds

Confidence intervals may be estimated using the toolkit for both the return level and shape parameter () of both the GEV and GP distributions. See Return level and shape parameter () (1-)% confidence limits (for the GEV) for more information on how the confidence intervals are obtained.
Example: Fort Collins precipitation data
To estimate the confidence limits for the GPD shape parameter using extRemes: Confidence intervals (in this case 95%) are shown in the main toolkit dialog. For the 100-year return level they are approximately (4.24, 6.82) inches and for the shape parameter about 0.12 to 0.27, consistent with the shape parameter being greater than zero. Visual inspection of the dashed vertical lines in Figure 5.8 act as a guide to the accuracy of the displayed confidence limits; here the estimates shown appear to be accurate because the dashed vertical lines (for both parameters) appear to intersect the profile likelihood in the same location as the (lower) horizontal line. Note that the confidence interval for the 100-year return level includes 4.63 inches, the amount recorded for the high precipitation event of July 1997.


Figure 5.8: Profile log-likelihood plots for GPD 100-year return level (inches) and shape parameter () for Fort Collins, C.O. precipitation data.


Back to Top

Threshold Selection

Threshold selection is an important topic, and still an area of active research. It is desired to find a threshold that is high enough that the underlying theoretical development is valid, but low enough that there is sufficient data with which to make an accurate fit. That is, selection of a threshold that is too low will give biased parameter estimates, but a threshold that is too high will result in large variance of the parameter estimates. Some useful descriptive tools for threshold selection are included with extRemes. Specifically, the mean excess, or mean residual life, plot and another method involving the fitting of data to a GPD several times using a range of different thresholds.

Back to Top

Threshold Selection: Mean Residual Life Plot

Mean residual life plots, also referred to as mean excess plots in statistical literature, can be plotted using extRemes. For more information on the mean residual life plot (and threshold selection) see appendix. The general procedure for plotting a mean residual life plot using extRemes is:
Example: Fort Collins precipitation
Figure 5.9 shows the mean residual life plot for the Fort Collins, C.O. precipitation dataset. Interpretation of a mean residual life plot is not always simple in practice. The idea is to find the lowest threshold where the plot is nearly linear; taking into account the 95% confidence bounds. For the Fort Collins data, it is especially difficult to interpret, which may be because of the annual cycle (seasonality) that is being ignored here. Nevertheless, the plot appears roughly linear from about 0.3 to 2.5 inches and is erratic above 2.5 inches, so 0.395 inches is a plausible choice of threshold.


Figure 5.9: Mean Residual Life Plot of Fort Collins precipitation data. Thresholds (u) vs Mean Excess precipitation (in inches).




To plot Figure 5.9 using extRemes:

Back to Top

Threshold Selection: Fitting data to a GPD Over a Range of Thresholds

The second method for trying to find a threshold requires fitting data to the GPD distribution several times, each time using a different threshold. The stability in the parameter estimates can then be checked. The general procedure for fitting threshold ranges to a GPD is:
Example: Fort Collins precipitation
Figure 5.10 shows plots from having fit the GPD model for a range of 50 thresholds from 0.01 inches to 1 inch for the Fort Collins precipitation data (see Fitting Data to a GPD section for more information on these data). Figure 5.10 suggests that, for the GPD model, a threshold of 0.395 inches is appropriate.


Figure 5.10: GPD fits for a range of 50 thresholds from 0.01 inches to 1 inch for the Fort Collins precipitation dataset.


To create the plot from Figure 5.10 using extRemes, do the following. Note that different values may be tried here as well, but the program will fail for certain choices. Keep trying different threshold ranges until it works.




8For the Fort Collins, C.O. precipitation data the MLE for the 100-year return level is near 5 inches and 0.19, so a good search range for the confidence limits would include 5 and be wide enough to capture the actual limits. If any of the search range fields are left blank, extRemes will try to find a reasonable search limit (for each field left blank) automatically. It is a good idea to check the plot profile likelihoods checkbutton when searching for ranges automatically. This way, the profile likelihoods with vertical dashed lines at estimated limits will be displayed; if dashed lines intersect profile at lower horizontal line, then the estimate is reasonably accurate. For this example, 4 to 7 inches are used for the 100-year return level and 0.1 to 0.3 for the shape parameter.




Back to Top

Back to Table of Contents