Final Ozone Forecast Report


Ozone Forecasting in Philadelphia – 2007

Final Report

Prepared for the

Delaware Valley Regional Planning Commission

William F. Ryan

Tiffany Wisniewski

Amy Huff

Department of Meteorology

The Pennsylvania State University

University Park, PA 16802

October 9, 2007

Executive Summary

1. Daily public ozone (O3) forecasts for the Philadelphia non-attainment area, including southern New Jersey, northern Delaware and extreme northeastern Maryland, were issued for the period May 1, 2007 to September 8, 2007. Five Code Red and 16 Code Orange O3 cases occurred during 2007. This is slightly higher than the average for the 2003-2006 period (18 days ≥ Code Orange) but well below the average for the earlier 1994-2002 period (36 days ≥ Code Orange).

2. The summer of 2007 was slightly warmer and drier than average in the Philadelphia metropolitan area. The number of hot (≥ 90°F) days, generally conducive to high O3, was near average compared to the preceding decade. Locations south and southwest of Philadelphia, however, were significantly hotter and drier than normal.

3. The correct AQI forecast color code for O3 was issued in 68% of all cases. Of the missed forecasts, the majority (60%) were errors between Code Green and Code Yellow. This performance was below the 2003-2006 average correct code percentage of 74%.

4. Forecasts were accurate for the high O3 cases (Code Orange or Code Red). Code Orange or Code Red was observed on 21 days, with 16 of these cases (76%) carrying forecasts of Code Orange and Air Quality Action Day warnings. All of the 10 most severe O3 days carried Air Quality Action Day warnings.

5. As in previous forecast seasons, Code Orange or higher forecasts (Ozone Action Days) were also accurate. In 2007, 23 Ozone Action Days were forecast, with 16 (70%) observing Code Orange or higher.

6. The median absolute error (MAE) of the numeric O3 forecasts was 7.0 parts per billion by volume (ppbv), or within 10.6% of observed O3. This was slightly worse than the average MAE for the previous four seasons of 6.2 ppbv. The forecasts showed a large increase in skill over standard benchmark measures – a 39% skill improvement over persistence (the previous day’s observations used as the current day forecast).

7. As part of a cooperative agreement, the National Weather Service and the U.S. EPA have developed a coupled chemistry-weather model to predict O3. This model has been continuously updated and, in 2007, performed well for the Philadelphia area. Research undertaken during 2007 suggests that a combination of numerical model forecasts and stand-alone statistical models can provide increased skill in high O3 cases.

Overview: Weather and Ozone 2007

Daily ozone forecasts for the Philadelphia non-attainment area, including southern New Jersey, northern Delaware and extreme northeastern Maryland, were provided for the period May 1, 2007 to September 8, 2007. A time series of daily peak O3 and forecast O3 for this period (Figure 1) shows the usual large oscillations in daily O3 along with intermittent multi-day periods of persistent high and low concentrations. These multi-day events are driven by synoptic scale (2-5 day) weather patterns. Five Code Red and 16 Code Orange cases occurred during 2007 (Figure 2).

The summer of 2007 was slightly warmer and drier than average. At Philadelphia International Airport (PHL), the summer season (June-August) average temperature was 0.6°F above normal and precipitation for the same period was 1.1” below normal. Differences from normal in temperature and precipitation across the region are shown in Figure 3 and Figure 4. Extremely dry and hot weather prevailed over most of the region to our south and southwest.

As noted in earlier reports, hot weather is necessary but not sufficient for high O3. In 2007, hot (≥ 90°F) days were frequent (23 days) but not significantly different from the longer term average frequency (1997-2006) of 25 days. Overall, for the period 1997-2006, 75% of all days with maximum temperature ≥ 90°F also reached the Code Orange (≥ 85 ppbv for an 8-hour average) threshold, although a much lower percentage (26%) reached Code Red. This time period straddles the onset of regional NOx emissions controls as part of the so-called “NOx SIP Rule”. These controls, primarily on large stationary power generation units, were phased in over the period 2003-2005. As seen in Table 1, there is evidence that these regional emissions reductions may influence the O3-temperature relationship, particularly in the high temperature extreme, with fewer hot cases also observing high O3. Theoretically, regional NOx controls serve to reduce the “background”, or regional scale, O3 concentrations on which the urban emissions build. For the period 1997-2002, 83% of hot days also observed high O3, while for the 2003-2007 period the frequency fell to only 58%. A longer time series is necessary, however, to reach a firm conclusion on this matter.
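The conditional frequencies in this paragraph (and in Table 1) are simple conditional counts over the daily record. A minimal sketch, using hypothetical daily values rather than the report's data:

```python
# Fraction of hot days (Tmax >= 90 F) whose peak 8-hour O3 also reaches
# a given threshold (85 ppbv = Code Orange, 105 ppbv = Code Red).
# The daily values below are hypothetical, for illustration only.
def hot_day_exceedance(tmax, o3, o3_thresh, t_thresh=90.0):
    hot = [o for t, o in zip(tmax, o3) if t >= t_thresh]
    return sum(o >= o3_thresh for o in hot) / len(hot) if hot else 0.0

tmax = [88, 91, 95, 93, 86, 97, 90]   # daily max temperature (F)
o3   = [70, 88, 96, 81, 60, 106, 85]  # daily peak 8-h O3 (ppbv)
print(hot_day_exceedance(tmax, o3, 85))   # Code Orange fraction -> 0.8
print(hot_day_exceedance(tmax, o3, 105))  # Code Red fraction -> 0.2
```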

Observed O3 and Forecast Performance

Observed O3

Daily mean peak O3 for the summer of 2007 was 66.0 ppbv. This is above average for the most recent period (2003-2006) of 63.6 ppbv but is below the longer term average (1994-2002) of 69.9 ppbv (Figure 5). There were 21 days above the Code Orange standard. As is typically the case, nearly three-quarters (71%) of the high O3 cases were organized into multi-day episodes: May 25-27, May 30-June 1, June 26-27, August 2-4, August 7-8 and August 15-17. Analyses of these multi-day episodes are ongoing, and preliminary data and discussions can be found at: http://www.meteo.psu.edu/~wfryan/reports/2007-episodes.htm.

The frequency of occurrence of the various color codes is shown in Figure 6. Conditions in 2007 were similar in most respects to the recent past (2003-2006), as can be seen by comparison to Figure 7. The most frequent color code was green (good air quality). While the total frequency of all cases above the Action Day threshold (Code Orange) was consistent with past years (~15%), there was a larger fraction of Code Red cases in 2007 – 4% compared to 2%. Compared to earlier years (1994-2002), however, the number of Code Orange or higher cases was lower (23% in 1994-2002) and the number of Code Green cases higher (46% in 1994-2002) (Figure 8).

Color Code Forecasts

During the 2007 O3 season, forecasts were issued daily from May to September. The forecasts were issued as a color code to the public, with numerical (ppbv) forecasts prepared internally as a tool for quantitative determination of forecast skill. The correct color code was issued in 68% of all cases (Table 2). Of the missed code forecasts, the majority (60%) were errors between Code Green and Code Yellow. This performance was below the 2003-2006 correct code percentage of 74%.

For the high O3 cases (Code Orange or Code Red), the forecasts were accurate. Code Orange or Code Red was observed on 21 cases, with 16 of these cases (76%) carrying forecasts of Code Orange or higher. This represents a slight improvement over the 2003-2006 average of 71%. As has been the case in previous forecast seasons, Code Orange or higher forecasts (Ozone Action Days) were accurate. In 2007, 23 Ozone Action Days were forecast, with 16 days (70%) observing Code Orange or higher. However, this was less than the recent (2003-2006) average of 78%. Thus, while bad O3 days were more likely to be identified, this came at the cost of a higher than normal number of “false alarms”. This trade-off of more hits for more false alarms is typical for categorical forecasts, but further off-season analyses will be undertaken to determine if the false alarm forecasts followed any particular pattern.

In addition to an increased frequency of false alarms of Code Orange conditions, another important shortcoming of the 2007 color code forecasts was the inability to forecast Code Red conditions. All of the observed Code Red cases were covered by Air Quality Action Day warnings, but all were forecast in the Code Orange range. Again, further analysis will be undertaken to determine the causes of under-prediction in these cases.

ppbv Forecasts

For the 2007 season, the numerical forecasts had a slight bias toward over-prediction (+2.9 ppbv) (Table 3). The median absolute error (MAE) for the consensus (public) forecasts was 7.0 ppbv. This was slightly higher than the previous four-season average of 6.2 ppbv. Forecasts prior to 2003 were issued for peak 1-hour O3 and so are not directly comparable. The root mean square (rms) error, a measure of forecast consistency, was 11.2 ppbv, which is higher than the 2003-2006 average of 10.5 ppbv. The forecasts showed a large increase in skill over standard benchmark measures – a 39% skill improvement over persistence (the previous day’s observations used as the current day forecast). Forecast results for 2007 were roughly consistent with preceding years (Table 4).
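The error measures quoted here (bias, median absolute error, rms error, and the skill score relative to persistence) can be computed directly from paired daily observations and forecasts. A minimal sketch with illustrative numbers, not the 2007 verification record:

```python
import statistics

def bias(obs, fcst):
    # Mean signed error; positive values indicate over-prediction.
    return statistics.mean(f - o for o, f in zip(obs, fcst))

def median_ae(obs, fcst):
    # Median absolute error (the report's "MAE").
    return statistics.median(abs(f - o) for o, f in zip(obs, fcst))

def rms_error(obs, fcst):
    return statistics.mean((f - o) ** 2 for o, f in zip(obs, fcst)) ** 0.5

def persistence_skill(obs, fcst):
    # Persistence forecast: yesterday's observation used for today.
    # Skill = (1 - MSE_forecast / MSE_persistence) * 100%.
    mse_f = statistics.mean((f - o) ** 2 for o, f in zip(obs[1:], fcst[1:]))
    mse_p = statistics.mean((p - o) ** 2 for o, p in zip(obs[1:], obs[:-1]))
    return (1.0 - mse_f / mse_p) * 100.0

obs  = [60.0, 75.0, 90.0, 82.0, 70.0]  # hypothetical daily peak O3 (ppbv)
fcst = [62.0, 73.0, 88.0, 85.0, 71.0]
print(median_ae(obs, fcst))  # -> 2.0
```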

Statistical Model Performance

The forecasts issued to the public are based on statistical forecast models that have been in use for several years. The statistical models are strongly influenced by forecasted temperature and cannot completely account for other meteorological effects (e.g., thunderstorms) that greatly affect O3 concentrations.

Currently two statistical models are in use in Philadelphia (R0302 and LN2001). Both models use multiple linear regression methods, with the key predictors typically including temperature, relative humidity, wind at the surface and aloft, along with previous day peak O3 and climatology. The R0302 model performs best overall and, in 2007, forecast the correct color code in 64% of all cases. As has been the case historically, this is slightly less than the public forecast skill. As with the public forecasts, statistical model skill in forecasting color codes was less in 2007 than in earlier years (70% correct for 2003-2006).
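The regression framework is straightforward to sketch. The example below fits and applies such a model to synthetic data; the predictors mirror those listed above, but the data and coefficients are hypothetical and are not the R0302 or LN2001 specifications:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
tmax = rng.uniform(70.0, 100.0, n)     # forecast max temperature (F)
rh = rng.uniform(20.0, 90.0, n)        # relative humidity (%)
o3_prev = rng.uniform(40.0, 100.0, n)  # previous-day peak O3 (ppbv)
# Synthetic "observed" O3, used here only to demonstrate the fit.
o3 = 1.2 * tmax - 0.3 * rh + 0.25 * o3_prev + rng.normal(0.0, 5.0, n)

# Ordinary least squares: O3 = b0 + b1*Tmax + b2*RH + b3*O3_prev.
X = np.column_stack([np.ones(n), tmax, rh, o3_prev])
coef, *_ = np.linalg.lstsq(X, o3, rcond=None)

def forecast_o3(tmax_f, rh_f, o3_yesterday):
    return float(coef @ [1.0, tmax_f, rh_f, o3_yesterday])
```

Because maximum temperature dominates a fit of this kind, such a model tends to over-predict on hot days that lack the other ingredients for O3 formation, which is the behavior noted later for the statistical guidance.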

Quantitative statistics for the consensus (public) forecast and statistical guidance are given in Table 3. The statistical models have a higher bias toward over-prediction than the public forecasts. The mean absolute error and rms error were also higher for the statistical guidance, suggesting inconsistent skill. The bias in the statistical models has been growing over the past several years (Table 5) and this may also reflect changes in the O3-temperature relationship discussed above.

Forecast Performance in High O3 Cases

The most important measure of forecast skill is performance in the higher end of the O3 distribution. These cases, involving observations and forecasts of O3 in the Code Orange (Air Quality Action Day) range, have the greatest implications for pollution control strategies and public health. In Table 6, a summary of forecast performance in observed cases of Code Orange or higher O3 is given. The statistical guidance showed little bias, while the NOAA-EPA forecast model (discussed in more detail below) and the public forecast showed a low bias. For all other quantitative error measures (median absolute error, mean absolute error and rms error), the statistical model guidance performed as well as or better than the public forecast. This is an unusual outcome, as the public forecasts typically perform better than the statistical models. The NOAA-EPA model also performed well in terms of overall statistics but did a poorer job identifying the Code Orange cases.

The complementary measure of forecast skill in high O3 cases is performance in cases of forecast high O3. These results are shown in Table 7. While the statistical models performed well in observed high O3 cases, they did so at the cost of an increased number of false alarms, with a rate of success of 56% compared to 70% for the public forecast. The public forecast added value in 2007 primarily by reducing the number of possible false alarms.

A variety of forecast skill measures for observed and forecast Code Orange cases are given in Table 8. These scores measure the skill of the forecasts at accurately predicting cases above a threshold concentration. In this case, the threshold is 85 ppbv for an 8-hour maximum (Code Orange, the Air Quality Action Day threshold). The skill scores are described in more detail in Appendix A. Overall, the public forecast was less conservative than in previous years. That is, more false alarms were issued (FAR = 0.30 compared to 0.19 in 2006) but fewer Code Orange cases were missed (POD = 0.76 compared to 0.68 in 2006). The bias measure is > 1.00 (1.10), which suggests a tendency to over-predict at this threshold. Overall skill is best measured by the Critical Success Index (CSI) and Heidke Skill Score (HSS) measures. As Code Orange O3 is a relatively rare event, the most frequent category is the “correct null” forecast, that is, low O3 both forecast and observed. These forecasts are typically easy to make and not of critical interest. The CSI calculation removes the influence of the correct null forecast. For 2007, the public forecast CSI of 0.57 (CSI has a range [0,1]) is good. The HSS score compares the forecasts to a randomly generated set of forecasts that are constrained to have the same number of cases in each color code as observed. Again, on the range of [-1,1], the HSS of 0.67 is a good result.
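The point about correct nulls can be seen directly: adding correct-null days inflates simple accuracy but leaves the CSI untouched. A sketch using the 2007 public forecast counts at the Code Orange threshold (16 hits, 7 false alarms, 5 misses), with the number of correct nulls varied; 103 is the value implied by the 131 forecast days:

```python
# a = hits, b = false alarms, c = misses, d = correct nulls
def accuracy(a, b, c, d):
    return (a + d) / (a + b + c + d)

def csi(a, b, c, d):
    return a / (a + b + c)  # d does not appear in the CSI

a, b, c = 16, 7, 5  # 2007 public forecast, Code Orange threshold
for d in (50, 103, 500):
    print(d, round(accuracy(a, b, c, d), 2), round(csi(a, b, c, d), 2))
# Accuracy climbs with d (0.85, 0.91, 0.98) while CSI stays at 0.57.
```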

Numerical Forecast Guidance

As part of a cooperative agreement, the National Weather Service and the U.S. EPA have developed a coupled chemistry-weather model to predict O3 (see http://www.nws.noaa.gov/ost/air_quality/index.htm). This model was tested in 2003 and 2004 and became fully operational in 2005. The NOAA/EPA model (NAQFS) has shown reasonable skill overall and was able to predict some of the high O3 cases in 2005. In 2006, we undertook a preliminary assessment of the numerical model forecast performance in the Philadelphia area and found that it showed some skill for local forecasting. In 2007, we took a closer look at model performance in the Philadelphia area.

Selected forecast results for the Philadelphia area are shown in Table 9. For the overall forecasts, the NAQFS performs slightly better than the statistical forecasts. This is the first time we have seen this result. In the hot weather cases, the NAQFS also performs better than the statistical forecasts, which tend to over-predict O3 due to the strong influence of maximum temperature as a predictor. The NAQFS still has problems resolving the high O3 cases. For observed cases of Code Orange or higher O3, the NAQFS under-predicts strongly and, while the MAE is consistent with the statistical forecasts, the mean and rms error are higher. This suggests a lack of consistency in the forecast or, in other words, a tendency to have occasional very bad forecasts. As with the statistical models, the NAQFS tends to over-predict the frequency of Code Orange cases, but at a slightly lower rate. In terms of the key skill score measures (Table 8), the NAQFS still lags the statistical models by ~14%. However, the skill of the NAQFS appears to be increasing as the model is updated and refined, and we will continue to use its output as part of the forecast analysis process.

One forecast approach that showed promise in 2007 was a combination of statistical and numerical forecast guidance. This “hybrid” model simply takes a weighted average of the statistical models and the NAQFS numerical model output. For this analysis, each of the two statistical models was given a 25% weight and the NAQFS a 50% weight. Selected results for the “hybrid” model are given in Table 10. The result of most interest is that the MAE is the same for both observed and forecast high O3 cases. The MAE for high forecast O3 is also much lower than for any of the models applied singly (compare Table 7). This result is shown graphically in Figure 9. Skill scores for the “hybrid” model in categorical forecasts of Code Orange or higher are shown in Table 11. Again, the “hybrid” model performs better on nearly all measures than its constituents and rivals the public forecast skill. This result is of interest to the forecasters and will be tested operationally during the 2008 season.
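The weighting scheme described above reduces to a one-line computation. A sketch, with hypothetical same-day guidance values:

```python
# "Hybrid" forecast: 25% weight to each statistical model (R0302,
# LN2001) and 50% to the NAQFS numerical guidance, as described above.
def hybrid_forecast(r0302, ln2001, naqfs):
    return 0.25 * r0302 + 0.25 * ln2001 + 0.50 * naqfs

# Hypothetical guidance values (ppbv):
print(hybrid_forecast(92.0, 88.0, 90.0))  # -> 90.0
```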

Conclusions

Daily public ozone (O3) forecasts for the Philadelphia non-attainment area, including southern New Jersey, northern Delaware and extreme northeastern Maryland, were issued for the period May 1, 2007 to September 8, 2007. Five Code Red and 16 Code Orange O3 cases occurred during 2007. This is slightly higher than the average for the 2003-2006 period (18 days ≥ Code Orange) but well below the average for the earlier 1994-2002 period (36 days ≥ Code Orange).

The summer of 2007 was slightly warmer and drier than average in the Philadelphia metropolitan area. The number of hot (≥ 90°F) days, generally conducive to high O3, was near average compared to the preceding decade. Locations south and southwest of Philadelphia, however, were significantly hotter and drier than normal.

The correct AQI forecast color code for O3 was issued in 68% of all cases. Of the missed forecasts, the majority (60%) were errors between Code Green and Code Yellow. This performance was below the 2003-2006 average correct code percentage of 74%. Forecasts were accurate for the high O3 cases (Code Orange or Code Red). Code Orange or Code Red was observed on 21 days, with 16 of these cases (76%) carrying forecasts of Code Orange (Air Quality Action Day). All of the 10 most severe O3 days carried Air Quality Action Day warnings. As in previous forecast seasons, Code Orange or higher forecasts (Ozone Action Days) were also accurate. In 2007, 23 Ozone Action Days were forecast, with 16 (70%) observing Code Orange or higher.

The median absolute error (MAE) of the numeric O3 forecasts was 7.0 parts per billion by volume (ppbv), or within 10.6% of observed O3. This was slightly worse than the average MAE for the previous four seasons of 6.2 ppbv. The forecasts showed a large increase in skill over standard benchmark measures – a 39% skill improvement over persistence (the previous day’s observations used as the current day forecast).

As part of a cooperative agreement, the National Weather Service and the U.S. EPA have developed a coupled chemistry-weather model to predict O3. This model has been continuously updated and, in 2007, performed well for the Philadelphia area. Research undertaken during 2007 suggests that a combination of numerical model forecasts and stand-alone statistical models can provide increased skill in high O3 cases.

Philadelphia Forecast Area
Days ≥ 90°F and O3 Concentration Threshold

                    ≥ 85 ppbv                  ≥ 105 ppbv
Period      Number of Cases  Percent    Number of Cases  Percent
1997-2002         134          83%             57          35%
2003-2006          53          60%              9          10%
2007               12          52%              5          22%

Table 1. Frequency of 8-hour peak O3 above the Code Orange (≥ 85 ppbv) and Code Red (≥ 105 ppbv) thresholds in the Philadelphia metropolitan area during hot (≥ 90°F) days. Percentages are relative to all hot days in each period.

Color Code Forecasts
Philadelphia Forecast - 2007

                          Observed
Forecast       Green   Yellow   Orange   Red
Green            49       6        0      0
Yellow           19      29        5      0
Orange            1       6       11      5
Red               0       0        0      0

Table 2. Color code forecasts and observations for the Philadelphia forecast area – 2007.

Summary Statistics
Forecasts and Statistical Guidance
Philadelphia Forecast - 2007
(all measures in ppbv)

                 Obs    Forecast   R0302   LN2001   Persistence
Mean             66.0     68.9      72.4    73.0       65.9
Bias               -      +2.9      +6.4    +7.0       +0.3
Mean AE            -       8.7      10.4    10.9       14.5
Median AE          -       7.0       8.8     8.6       11.5
rms error          -      11.2      13.3    13.8       18.7

Table 3. Summary statistics for the public forecast, two statistical models (R0302 and LN2001), and the reference persistence forecast.

Forecast Statistics
Philadelphia Metropolitan Forecast Area (2003-2007)

                      2007    2006    2005    2004    2003
Bias (ppbv)           +2.9    +2.6    +1.6    +1.6    +0.4
Mean AE (ppbv)         8.7     6.9     8.8     7.0     7.3
Median AE (ppbv)       7.0     5.0     7.0     6.0     7.0
rms error (ppbv)      11.2    11.3    12.1     8.8     9.0
Days ≥ 90°F (JJA)      22      23      28       9      23
Days ≥ 85 ppbv         21      20      20       8      15

Table 4. Public forecast results for the 2003-2007 seasons.

Statistical Forecast Model Results
Philadelphia Metropolitan Forecast Area (2003-2007)

                      2007    2006    2005    2004    2003
Bias (ppbv)           +6.4    +3.8    +3.9    +1.7    +1.3
Mean AE (ppbv)        10.4     9.1     9.5     7.6     7.6
Median AE (ppbv)       8.8     7.4     7.0     6.0     6.4
rms error (ppbv)      13.3    11.6    13.0    10.1     9.7

Table 5. As in Table 4 but for statistical forecast model guidance.

Philadelphia Forecast Area - 2007
Observed High O3 Cases (n = 21)

                   Observed   Forecast   R0302   LN2001   NAQ    Persistence
Median (ppbv)         94        90.0      94.3    93.1    88.0      84.0
Mean (ppbv)           97.4      89.6      91.7    91.8    89.1      84.2
Bias (ppbv)            -        -4.0      +0.3    -0.9    -8.3      -9.8
Median AE (ppbv)       -         9.0       8.0     4.6     6.0      14.0
Mean AE (ppbv)         -        10.1       9.0     8.0    10.1      18.3
rms error (ppbv)       -        12.9      12.3    11.4    13.6      22.6
Orange Fcst/Obs        -         76%       76%     81%     67%       38%

Table 6. Summary forecast statistics for public forecasts, statistical model guidance, numerical model guidance and the reference persistence forecast for cases in which observed peak O3 in the Philadelphia area reached or exceeded the Code Orange AQI threshold (85 ppbv for an 8-hour average). NAQ refers to the NOAA-EPA forecast model.

Philadelphia Forecast Area - 2007
Forecast High O3 Cases

                   Forecast   R0302   LN2001   NAQ    Persistence
Number of Days        23        28      30      27        21
Median (ppbv)        94.0      94.4    93.0    92.0      90.0
Mean (ppbv)          93.6      94.2    93.6    93.9      94.4
Bias (ppbv)          +0.6      +5.6    +5.0    +7.8     +14.1
Median AE (ppbv)     10.0       9.2     9.7     8.0      13.0
Mean AE (ppbv)       11.6      11.5    11.8    13.6      16.8
rms error (ppbv)     14.6      14.7    15.0    18.0      20.8
Correct Code          70%       57%     57%     52%       38%

Table 7. Summary forecast statistics for consensus and statistical model guidance for cases in which peak O3 in the Philadelphia area was forecast to reach or exceed the Code Orange AQI threshold (85 ppbv for an 8-hour average).

Skill Scores
Philadelphia Forecast Area - 2007

             Fcst    R0302   LN2001   Persistence   NAQ
POD          0.76    0.76     0.81       0.38       0.67
Miss         0.24    0.24     0.19       0.62       0.33
FAR          0.30    0.43     0.43       0.62       0.48
Accuracy     0.91    0.87     0.87       0.80       0.85
CNull        0.95    0.95     0.96       0.88       0.93
Bias         1.10    1.33     1.43       1.00       1.29
CSI          0.57    0.48     0.50       0.24       0.41
TSS          0.70    0.65     0.69       0.26       0.55
HSS          0.67    0.58     0.59       0.26       0.49

Table 8. Skill score measures for the consensus forecast, statistical guidance, reference (persistence) forecast and NOAA-EPA numerical model. All skill measures are detailed in Appendix A. POD = Probability of Detection; Miss = Miss Rate; FAR = False Alarm Rate; CNull = Correct Null Forecast; CSI = Critical Success Index, or Threat Score; TSS = True Skill Statistic, also known as the Peirce score or Hanssen-Kuipers discriminant; HSS = Heidke Skill Score.

Forecast Skill of NOAA-EPA Forecast Model
Philadelphia Metropolitan Area - 2007

                       N     Bias    Mean AE   Median AE   rms error
All Cases             131    +2.4     10.1        8.0         13.2
Observed ≥ 85 ppbv     21    -8.3     10.1        6.0         13.6
Forecast ≥ 85 ppbv     27    +7.8     13.6        8.0         18.0
Tmax ≥ 90°F            23    -0.5     12.0        9.0         14.3

Table 9. Forecast skill measures for the NOAA-EPA forecast model.

Forecast Skill of “Hybrid” Forecast Model
Philadelphia Metropolitan Area - 2007

                       N     Bias    Mean AE   Median AE   rms error
All Cases             131    +4.6      9.1        6.2         11.8
Observed ≥ 85 ppbv     21    -7.0      8.6        5.6         11.4
Forecast ≥ 85 ppbv     25    +1.9     11.1        5.6         14.9
Tmax ≥ 90°F            23    +1.2     10.8        8.9         13.6

Table 10. Forecast skill measures of the “hybrid” model combining a weighted average of statistical forecast models and the NOAA-EPA operational numerical forecast model.

Skill Scores
Philadelphia Forecast Area - 2007

             Fcst    R0302   LN2001   Hybrid   NAQ
POD          0.76    0.76     0.81     0.81    0.67
Miss         0.24    0.24     0.19     0.19    0.33
FAR          0.30    0.43     0.43     0.32    0.48
Accuracy     0.91    0.87     0.87     0.91    0.85
CNull        0.95    0.95     0.96     0.96    0.93
Bias         1.10    1.33     1.43     1.19    1.29
CSI          0.57    0.48     0.50     0.59    0.41
TSS          0.70    0.65     0.69     0.74    0.55
HSS          0.67    0.58     0.59     0.68    0.49

Table 11. Skill score measures, as in Table 8, but now including results from the “hybrid” forecast model (Hybrid column).

Figure 1. Time series of daily peak 8-hour O3 in the Philadelphia metropolitan area and forecast peak O3, May-September 2007.

Figure 2. Number of days in Philadelphia during the summer forecast season (1994-2007) exceeding the Code Orange threshold (85 ppbv, 8-hour average). For the period 1994-2002, the average number of days above this threshold is 36; for the more recent period, 2003-2007, the average is 18.

Figure 3. Mean temperature for June-August 2007, expressed as a percentile relative to the 1971-2000 period. Figure courtesy of NOAA Climate Diagnostics Center.

Figure 4. As in Figure 3 but for precipitation.

Figure 5. Mean peak daily O3 (ppbv) for the Philadelphia forecast area, 1994-2007. For the period 1994-2002, the average is 70.0 ppbv, and for the period 2003-2007 the average is 63.8 ppbv.

Figure 6. Frequency of occurrence of O3 concentrations in the Philadelphia forecast area by color code for 2007: Green 53%, Yellow 31%, Orange 12%, Red 4%.

Figure 7. As in Figure 6, but for the period 2003-2006: Green 53%, Yellow 32%, Orange 13%, Red 2%.

Figure 8. As in Figure 6, but for the period 1994-2002: Green 46%, Yellow 31%, Orange 17%, Red 6%.

Figure 9. Median absolute error forecast statistics (ppbv) for the “hybrid” forecast model, the NAQFS numerical model, and one of the statistical models (R0302) for 2007, shown for all cases, observed ≥ 85 ppbv cases, and forecast ≥ 85 ppbv cases.

Appendix A. Skill of Threshold Forecasts

The determination of the skill of a threshold forecast begins with the creation of a contingency table of the form:

Contingency Table for Threshold Forecasts

                      Observed
Forecast           Yes        No
  Yes               a          b
  No                c          d

For example, if Code Red O3 concentrations are both observed and forecast (“hit”), then one unit is added to “a”. A “false alarm”, Code Red forecast but not observed, is added to “b”; a “miss”, Code Red observed but not forecast, is added to “c”.

Basic Skill Measures:

A basic set of skill measures is determined and then used as the basis for further analysis.

Bias (B) = (a + b) / (a + c) = 0.69

Bias determines whether the same fraction of events are both forecast and observed. If B = 1, then the forecast is unbiased. If B < 1 there is a tendency to under-predict and if B > 1 there is a tendency to over-predict. In this case B = 0.69, which points to a tendency to under-predict Code Red cases.

False Alarm Rate (F) = b / (b + d) = 0.02

This is a measure of the rate at which false alarms (high O3 forecast but not observed) occur. For OAD forecasts, this rate is extremely low.

Hit Rate (H) = a / (a + c) = 0.78

The hit rate is often called the “probability of detection”.

Miss Rate = 1 – H = 0.22

Correct null forecasts:

Correct Null (CNull) = d / (c + d) = 0.95

Accuracy:

Accuracy (A) = (a + d) / (a + b + c + d) = 0.78

Other Measures:

Generalized skill scores (SSref) measure the improvement of forecasts over some given reference measure. Typically the reference is persistence (current conditions used as the forecast for tomorrow) or climatology (historical average conditions). Because O3 concentrations have a strong seasonal cycle and concentrations have decreased significantly from the late 1980’s, climatology in this case is based on a ten-day running average, centered on the day of interest, from the period 1994-2004.

Skill Score (SSref) = (1 – MSE / MSEref) × 100% = 53%

The skill score is typically reported as a percent improvement over the reference forecast.

Additional measures of skill can be determined. The Heidke skill score (HSS) compares the proportion of correct forecasts to a no-skill random forecast. That is, each event is forecast randomly but is constrained so that the marginal totals (a + c) and (a + b) are equal to those in the original verification table.

HSS = 2(ad – bc) / [(a + c)(c + d) + (a + b)(b + d)] = 0.61

For this measure, the range is [-1,1] with a random forecast equal to zero. The Code Red forecasts show skill by this measure.

Another alternative is the critical success index (CSI), or the Gilbert Skill Score (GSS), also called the “threat” score.

CSI = a / (a + b + c) = H / (1 + B – H) = 0.47

For this measure, the range is [0,1]. Since the correct null forecast is excluded, this type of measure is effective for situations like tornado forecasting where the occurrence is difficult to determine due to observing bias, i.e., tornados may occur but not be observed. This can also be the case for air quality forecasting when the monitor network is less dense. Note, however, that the random forecast will have a non-zero skill.

The Peirce skill score (PSS), also known as the “true skill statistic”, is a measure of skill obtained by the difference between the hit rate and the false alarm rate:

PSS = (ad – bc) / [(a + c)(b + d)] = H – F = 0.74

The range of this measure is [-1,1]. If the PSS is greater than zero, then the number of hits exceeds the false alarms and the forecast has some skill. Note, however, that if d is large, as it is in this case, the false alarm value (b) is relatively overwhelmed. The advantage of the PSS is that determining the standard error is relatively easy.
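The measures above can be collected into a single routine. As a check, applying it to the 2007 Code Orange verification implied by Table 2 (a = 16 hits, b = 7 false alarms, c = 5 misses, d = 103 correct nulls) reproduces the public-forecast scores in Table 8; note that the FAR quoted in Table 8 is the false alarm ratio b/(a + b), not the rate b/(b + d) defined above:

```python
def skill_scores(a, b, c, d):
    """Threshold-forecast skill from contingency counts:
    a = hits, b = false alarms, c = misses, d = correct nulls."""
    n = a + b + c + d
    hit_rate = a / (a + c)          # POD
    far_rate = b / (b + d)          # false alarm rate F
    return {
        "bias": (a + b) / (a + c),
        "pod": hit_rate,
        "miss": 1.0 - hit_rate,
        "far_ratio": b / (a + b),   # the "FAR" reported in Table 8
        "accuracy": (a + d) / n,
        "cnull": d / (c + d),
        "csi": a / (a + b + c),
        "pss": hit_rate - far_rate,  # true skill statistic (TSS)
        "hss": 2.0 * (a * d - b * c)
               / ((a + c) * (c + d) + (a + b) * (b + d)),
    }

s = skill_scores(16, 7, 5, 103)  # 2007 Code Orange verification
print(round(s["pod"], 2), round(s["csi"], 2),
      round(s["hss"], 2))  # -> 0.76 0.57 0.67
```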

