
JOURNAL OF CLIMATE

VOLUME 11

A Hidden Markov Model for Rainfall Using Breakpoint Data


JOHN SANSOM
National Institute of Water and Atmospheric Research, Wellington, New Zealand

(Manuscript received 24 September 1996, in final form 14 February 1997)

ABSTRACT

Pluviographs, which are rainfall accumulation-time plots, indicate a strong tendency for rainfall intensity to change abruptly from one steady rate of fall to another, with these steady rates persisting for some time. Digitizing from pluviographs the times of change from one steady rain rate to another yields breakpoint data, that is, a stream of data pairs consisting of the rainfall rate, which includes zero, and the duration of that rate. Breakpoints provide a complete record of rainfall with information on the rain rates and their durations during periods of continuous steady precipitation and on the durations of dry periods. In a hidden Markov model (HMM), the state of the process at a given time is not known; only the values of the observables, and the range of possible states, are known. For rainfall, there is a hierarchy of states: a precipitation event is either taking place, or not; if one is, then there are episodes when the mechanism is convection (showers) and when it is large-scale uplift (rain); and finally, the current rate of rainfall and its duration will have particular values, with periods of zero rate being the dry periods within an episode of a particular mechanism. Thus, there are five states: the time between events when no precipitation is possible, showery times when a shower is taking place, showery times when no shower is taking place, rain times with rain taking place, and dry intervals during a rainy time. Such a model was initially fitted using the expectation maximization (EM) algorithm, but the parameters were reestimated using HMM fitting procedures, which also provided estimated probabilities of the transition matrix. The Viterbi algorithm was used to classify the individual points in the data stream.
The rate and duration distributions' parameters, the state transition probabilities, and the classification of the data accord with the view that during widespread rain there may be many changes of rain rate but little dry time, while during showers, shorter periods of steady precipitation tend to be interspersed with longer dry periods. Discrepancies were found between the data and simulations made using the HMM's estimated parameters. The major one of these was that the simulated dwell times within an episode were shorter than in the data, and that the simulated number of episodes per event was greater. Merely restricting certain transitions did not increase the dwell times, but some indications were found that it might be necessary to change to a hidden semi-Markov model, to increase the number of states, or both.

1. Introduction

Pluviographs, which are rainfall accumulation-time plots, indicate a strong tendency for rainfall intensity to change abruptly from one steady rate of fall to another, with these steady rates persisting for some time. Digitizing from pluviographs the times of change from one steady rain rate to another yields breakpoint data, that is, a stream of data pairs consisting of the rainfall rate, which includes zero, and the duration of that rate. Sansom (1992) and Barring (1992) give full details on the breakpoint representation of rainfall, which is essentially different from the traditional one, in which the accumulated total over some fixed period is noted or, in the case of tipping bucket gauges, the time of accumulation for a fixed amount is noted. Breakpoints

Corresponding author address: Dr. John Sansom, National Institute of Water and Atmospheric Research Ltd., P.O. Box 14-901 Kilbirnie, Wellington, New Zealand. E-mail: john.sansom@niwa.cri.nz

provide a complete record of rainfall with information on intensities during periods of continuous precipitation, rather than merely mean rates during periods with generally a mixture of wet and dry times. The data examined in this paper were digitized from the daily pluviographs of a Dines tilting siphon automatic rain gauge sited at Invercargill, New Zealand (46°25′S, 168°20′E) for the 15-yr period January 1972 to December 1986. Sansom (1987) gives details of the digitization scheme, while Sansom (1988) has described some of the seasonal and diurnal features of this data. Sansom and Thomson (1992) showed that the breakpoint data could be statically modeled as a mixture of lognormal components, univariate for the dry periods and bivariate for the wet periods. They also proposed a dynamic model, suitable for use with breakpoint data, which was physically realistic since it recognized that more than one mechanism is responsible for rainfall generation, that the mechanisms operate over variable length periods, and that at any one place and time only one mechanism can be operating. Further grounding for this model was provided by Sansom (1995a).

© 1998 American Meteorological Society

JANUARY 1998

SANSOM


The proposed model is dependent upon breakpoint data since with such data a more detailed view of precipitation can be taken than is possible when only the common fixed-fall data is available. This view is laid out within the following definitions.

Event: A period of time during which the atmospheric conditions continuously give rise to a nonzero probability for the occurrence of precipitation. Within an event, dry times do occur, especially if the physical mechanism changes, but not to the extent of the interevent dry breaks when for a considerable period there is no chance of any precipitation.

Episode: A period of time within an event when the physical cause of the precipitation does not change, that is, the type of rain-generating mechanism during this time does not change. Dry breaks may occur within an episode.

Subepisode: Part of an episode in which there are no dry breaks.

Period or duration: Part or all of a subepisode during which the rate of accumulation of precipitation is constant. (These two terms are sometimes omitted, as when "wets" or "drys" are referred to instead of "wet periods," etc.)

Fixed fall: This is the common format for rainfall data and is the amount accumulated over a period of time that is of a fixed length and is also fixed with respect to the clock, for example, daily data. It should be noted that such periods could be a part of, or encompass all of, any of those periods defined above.

A hierarchy is implied within these definitions, with events consisting of episodes, and episodes of durations with steady (or zero) rain rates. Any particular observation can be assigned a position within this hierarchy or, equivalently, it can be assigned to a state. There are five states involved: I, the time between events when no precipitation is possible; Sw, showery times when a shower is taking place; Sd, showery times when no shower is taking place; Rw, rain times with rain taking place; and Rd, dry intervals during a rainy time.
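To make the contrast between breakpoint and fixed-fall data concrete, the following minimal Python sketch accumulates a stream of (rate, duration) pairs into fixed-length clock intervals. The function name, the units (mm/h and hours), and the sample values are assumptions made purely for illustration; none of them are taken from the Invercargill record.

```python
def fixed_fall(breakpoints, interval=1.0):
    """Accumulate breakpoint data into fixed-length clock intervals.

    breakpoints: list of (rate_mm_per_h, duration_h) pairs in time order,
                 with rate 0.0 marking a dry period (illustrative units).
    interval:    length of each fixed-fall period in hours.
    Returns a list of accumulated amounts (mm), one per interval.
    """
    totals = []
    t = 0.0  # start time of the part of the period not yet accumulated
    for rate, duration in breakpoints:
        end = t + duration
        while t < end - 1e-12:
            index = int(t / interval + 1e-9)
            while len(totals) <= index:
                totals.append(0.0)
            # Part of this period that falls inside the current interval.
            step = min(end, (index + 1) * interval) - t
            totals[index] += rate * step
            t += step
        t = end
    return totals


# Hypothetical sample: 3 h of record made up of four breakpoint periods.
sample = [(2.0, 0.5), (0.0, 1.0), (6.0, 0.25), (1.0, 1.25)]
hourly = fixed_fall(sample, interval=1.0)   # three hourly totals
daily = fixed_fall(sample, interval=24.0)   # one daily total
```

In this sample the second hourly total combines half an hour of dry time with two different rain rates, which is the kind of detail that the definitions above describe as lost in fixed-fall data.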
A Markov model is a natural choice in such a situation; however, each observation is of the rain rate and its duration and no direct information is available concerning which state the system was in when the observation was made. Thus, the data were fitted to a hidden Markov model (HMM) using procedures from Rabiner (1989), which includes details of the Viterbi algorithm, that is, a method for assigning each observation to a state. More recent work (e.g., Leroux 1992; Bickel and Ritov 1996) has confirmed some of the underlying assumptions of Rabiner (1989), which remains a practical exposition for the fitting of HMMs. The model allows for the distributions of rates and durations to differ from mechanism to mechanism, or state to state, and these distributions of rate and duration were taken to be lognormal, bivariate for the wet data and univariate for the dry. The fitting procedures are iterative and require initialization for both the transition

matrix and the parameters of the state distributions. Rabiner (1989) suggests that uniform probabilities are sufficient to initialize the transition matrix, but values close to the eventual estimates are needed for the distributions' parameters, and these were obtained by applying an extension of the EM (expectation maximization) algorithm suitable for truncated/censored datasets. The EM algorithm was used by Sansom and Thomson (1992) to decompose the breakpoint data into components that could be attributed to different precipitation mechanisms. Doubt over the classification of some short-duration, low-intensity periods as being wet or dry motivated Sansom and Thomson (1997, manuscript submitted to J. Amer. Stat. Assoc., hereafter ST97; also Sansom 1995b) to modify the EM algorithm for situations where doubtful data is dropped. Section 3 presents the application of the EM algorithm to acquire initial values for the HMM fitting procedures, and the results of applying these will be given in section 4, which is followed in section 5 by some discussion of the fit and of some simulations. However, initially, in the next section, a review of rainfall models will be given to show that the HMM is more physically based than other models, and some consideration will be given to the concept of rainfall rate.

2. Rainfall models

Rainfall models can basically be divided between those that attempt to model the daily rainfall observations directly and those that model rainfall events and use either monthly, daily, or hourly data as verification. The former line goes back as far as Newnham (1916) with reviews by Woolhiser and Roldan (1982), Stern and Coe (1984), and Hutchinson (1995) among others, while the latter line probably started with Le Cam (1961) and has recently been reviewed by Burlando and Rosso (1993).
The occurrence of rainfall events is acknowledged to a certain extent in the modeling of daily rainfalls by first modeling the occurrence of wet days and then modeling the amount of rain on the wet days. To account for the persistence that is seen in the record of wet days, a first-order Markov model was proposed by Gabriel and Neumann (1962) so that the probability that a particular day is wet depends solely on whether the previous day was wet or dry, and the lengths of dry and wet spells are geometrically distributed. This simple model has proved effective and can be extended to account for seasonal variations (Woolhiser and Pegram 1979). When the model has seemed less effective, either the order of the Markov chain has been increased (Dennett et al. 1983), or distributions other than the geometric have been fitted to the lengths of the wet and dry spells (Roldan and Woolhiser 1982). However, parameter estimates for higher-order Markov chains can be unreliable, especially in dry areas. The other method also suffers from poor parameter estimation unless 25


or more years of observations are available. Overall, the first-order two-state Markov model generally fits the data adequately and is simpler than more elaborate models. To model the amount of rain that falls on wet days, the common assumption has been that the amounts of rain on successive wet days are independent and fit a standard distribution. The ones that have been used include the lognormal, exponential, gamma, and Weibull, with the gamma being the most popular. However, small but significant correlations have been observed between the length of the wet period and the rainfall amount, and separate parameter estimates have been made after classifying days according to the wet/dry status of adjacent days (Katz 1977; Buishand 1978). Buishand defined three classes (i.e., solitary wet days, wet days between a wet and dry or a dry and wet, and wet days between wet days). He found significant differences in the mean rainfalls for each class and proposed a model in which these means depended on the wet day class. An alternative method of including the correlations has been the multistate first-order Markov model of Haan et al. (1976), where the transition probabilities are conditional on the rainfall amounts; Guzman and Torrez (1985) provided a simpler version. A model that encompasses both the dry and wet days is the truncated power of a normal model, and it has also been found to apply equally well to hourly and monthly data. In this model, the data are transformed by a power and then fitted to the upper tail of a normal distribution, which has been truncated at zero. Thus, it is a three-parameter model, that is, the mean of the normal (which may well be negative although the mean of the data is not), the normal's standard deviation, and the power of the transformation. Both square and cube roots have been used (Stidd 1973; Richardson 1977), and Hutchinson et al.
(1993) allowed for a spatially varying power and found that the goodness of fit improved if the truncation was set at a small positive value rather than zero where the fit remained acceptable. The model can be easily fitted for both spatial and seasonal variation and so is practical and useful, but it does lack a physical basis. The occurrence of rainfall in events needs to be explicitly recognized in order to establish a physical basis, and this cannot be achieved by directly modeling daily rainfalls or, indeed, any fixed falls. The second line mentioned above attempts to do this by assuming that the starting points of rainfall events are distributed randomly along the time continuum and that with each of these is associated a random amount and/or duration of rain. Of the point processes available (Cox and Isham 1980), the Poisson process provides the best compromise between simplicity and generality. In the independent Poisson marks (IPM) model (Eagleson 1972; Bacchi et al. 1989), events occur as Poisson arrivals, each with an associated random vector mark of two variables: the average intensity through the

event and the duration of the event, which is assumed to be short compared to the interarrival time of the events. The time variation of rain rate for this model consists of a series of rectangular approximations to the actual variation, and the model has been extended by forming a closer approximation to the actual variation by using several rectangles. This Poisson rectangular pulse (PRP) model was developed by Rodriguez-Iturbe et al. (1987) and, although its time variation of intensity is closer to reality, it is less realistic than the IPM since the pulses will generally overlap, implying that a new regime initiates at a point before the prior one ceases. Also, despite its extra complications, it performs no better than the IPM. The "events" modeled by the IPM and PRP models are referred to as such in the literature but would more properly be called episodes from the definitions offered in the introduction. To model the events as defined there, the clustering of episodes needs to be included. The Neyman-Scott (NS) and Bartlett-Lewis (BL) processes (see Cox and Isham 1980) both do this but in slightly different ways. For both, events occur as a Poisson process, each with an associated random number of episodes, which in the NS case have random starts in relation to the event origin with no episode starting at that point, whereas in the BL case, it is the interarrival times of the episodes that are random and the first episode occurs at the event origin (Burlando and Rosso 1993). At each episode origin, a random pulse is generated as in the IPM model, and since there are no constraints between the lengths of the pulses and the arrival times of the episodes, consecutive pulses will often overlap, leading to the PRP situation, but now with some physical basis for the pulses.
In these Poisson models (i.e., IPM, PRP, NS, and BL), the rate parameter can be estimated by counting the number of events that occur over the observation period, and the other model parameters can be fitted from the statistics of the events. However, the crucial task is the delineation of the events. There is no standard method, and results vary according to the duration of the fixed fall being used and the critical duration chosen to be that which separates adjacent events (Bonta and Rao 1988). To avoid such choices, an event model can be verified against data accumulated over a timescale longer than that of the events, that is, monthly data. Revfeim (1982) fitted a two-parameter model (i.e., rate of event occurrence and event size) to monthly data and later (Revfeim 1984) fitted a model that included the event duration. Another way of avoiding the delineation of events or episodes is to model the subepisodes, or continuously wet periods, and the intervening dry periods. Such a model is an alternating renewal continuous time (ARCT) model and was first proposed by Green (1964) with exponentially distributed durations for the wet and dry periods. When verified against daily data, it performed as well as the Markov model of Gabriel and


Neumann (1962), but Small and Morgan (1986) found difficulties when using hourly data. Hutchinson (1990) has extended the model by adding a transition state that is always dry and divides the absolutely dry spells from the wet spells, which only connect via the transition state. Hutchinson found the dwell times in the states to be mixed exponentials and that Green's model was a natural generalization for daily timescales. As for the IPM model, rainfall amounts can be associated with the wet periods, and Hutchinson (1991) replaced the constant, exponentially distributed rate by a serially correlated, gamma-distributed intensity process. It should be emphasized that the CT in ARCT refers to continuous time and, thus, continuous time data is required. With discrete, or fixed-fall, data the delineation of a continuously wet period depends heavily on the period length of the fixed fall: if the period were long enough, then no or few dry periods would be found; for daily data, the problem reduces to the modeling of wet and dry days; and for high time resolution data, the number of alternations between wet and dry increases as the fixed-fall period decreases. Thus, using fixed-fall data in ARCT models has some intrinsic difficulties, which would not be suffered by breakpoint data if it were used, since it is continuous with the wet and dry times available.

Apart from the purely descriptive ones such as the truncated power of normal, the models described above, as a minimum, cover the sequences of wet and dry days and the amounts of rain on wet days and, at most, recognize that rainfall episodes are clustered into events but have difficulty in delineating these events or episodes. This difficulty can be circumvented by modeling subepisodes, but these models, like those at the event-episode level, suffer from the discretization of rainfall data and lack any representation of the variation of rain rate through wet periods.
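The dependence of wet-period delineation on the fixed-fall period can be illustrated with a small simulation. The sketch below generates an alternating renewal process with exponentially distributed wet and dry durations, in the spirit of the ARCT model described above, and then counts the wet spells that remain visible through hourly and daily windows. The mean durations, the record length, the seed, and all function names are hypothetical choices for illustration only; they are not fitted values.

```python
import random


def simulate_arct(total_hours, mean_wet, mean_dry, seed=0):
    """Alternating dry/wet spells with exponentially distributed durations.

    Returns a list of (is_wet, start, end) tuples covering [0, total_hours].
    """
    rng = random.Random(seed)
    spells, t, wet = [], 0.0, False
    while t < total_hours:
        mean = mean_wet if wet else mean_dry
        end = min(t + rng.expovariate(1.0 / mean), total_hours)
        spells.append((wet, t, end))
        t, wet = end, not wet
    return spells


def wet_windows(spells, window):
    """Flag each fixed clock window as wet if a wet spell overlaps it."""
    n = int(spells[-1][2] / window)
    flags = [False] * n
    for wet, start, end in spells:
        if wet and end > start:
            first = int(start / window)
            last = min(int(end / window), n - 1)
            for i in range(first, last + 1):
                flags[i] = True
    return flags


def count_wet_runs(flags):
    """Count maximal runs of consecutive wet windows."""
    return sum(1 for i, f in enumerate(flags) if f and (i == 0 or not flags[i - 1]))


# Hypothetical parameters: 100 days of record, 0.5 h mean wet, 5 h mean dry.
spells = simulate_arct(total_hours=2400.0, mean_wet=0.5, mean_dry=5.0, seed=0)
true_wet = sum(1 for wet, start, end in spells if wet)
hourly_runs = count_wet_runs(wet_windows(spells, 1.0))
daily_runs = count_wet_runs(wet_windows(spells, 24.0))
```

With these assumed parameters, the hourly windows already merge some neighbouring wet spells, and the daily windows merge most of them, so the apparent number of "wet periods" is a property of the window length as much as of the rainfall.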
A physically based model must retain the ideas of subepisodes, within episodes, within events and should ensure that episodes do not overlap. Furthermore, unlike most of the models above, this model needs to clearly recognize that precipitation is not generated by a single process. The proposed model complies with all these requirements and the available breakpoint data is suitable to fit to such a model. Thus, unlike most models that attempt physical realism, it is unique in that the data to be used closely follow the short timescale variations and it is directly fitted to that data rather than being fitted to fixed falls from which much of the real behavior of rainfall is lost. Furthermore, the model will summarize the short-term variations as climatologically useful statistics such as the mean rain rate for convective precipitation, etc. Although the model would not appear to be easily extendable for spatial modeling, the summary statistics for a set of stations could be examined for spatial variation. It should also be noted that the primary variable that is to be fitted in the HMM is the rate of precipitation

rather than its accumulation as is usually the case. Generally, only accumulations are available since fixed-fall measurements, which include much dry time, are much easier to obtain than good estimates of rain rate. There are also some essential difficulties with the precision of rain-rate measurements since rainfall is a discrete process, and in the limit the rain rate during a period of steady rain will vary between zero, when a raindrop is not at the point of measurement, and a large value when a drop is present. However, this ambiguity can be resolved by considering the work of Marshall and Palmer (1948) and their successors (e.g., Joss and Waldvogel 1969; Torres et al. 1994) who showed that for a given process, the rainfall rate is dependent on the distribution of raindrop sizes. Thus, in much the same way that the temperature of a gas is a bulk measure of the movement of the molecules, the rain rate is a bulk measure of the numbers and sizes of raindrops. It is assumed that the breakpoint data provides a reliable measure of this ambient rain rate.

3. Initialization of the HMM

Within each state of the HMM for the breakpoint rainfall data, the observations that can be attributed to that state have a probability density whose parameters require initial values, which can subsequently be reestimated by the HMM-fitting procedure. According to Rabiner (1989), these initial values need to be close to the final estimates and are usually estimated from a training dataset in which the state of each observation is known. However, such a dataset is not available with the breakpoint data. This lack is circumvented by assuming that if the breakpoint data is statically¹ modeled as a finite mixture distribution, then the components (or subsets of components) of this will align with the HMM's states. Furthermore, these components' distributions will be close to those of the states of the dynamic² HMM model.
Sansom and Thomson (1992) decomposed the mixture distribution of the breakpoint data using the EM algorithm of Dempster et al. (1977) described by Redner and Walker (1984). They found that the wet periods in the breakpoint data were composed of two major components representing contributions from the rain-generating mechanisms (i.e., the Rw and Sw states or modes) and two minor components of which one, designated as mode E, was shown by simulation to be due to the inherent imperfection of manual digitization and the other, designated as mode D, due to occasional confusion over whether a particular period was wet or dry. A component representing this confusion was also found in the dry periods which, in addition, had a component

¹ In the sense that the temporal order of the data is ignored.
² In the sense that the temporal order of the data is taken into consideration.


for each rain-generating mechanism (i.e., Rd and Sd). The interevent drys required two components designated as I and M, where the first of these is just a single dry period between precipitation events, while the second (i.e., M) was explained by Sansom (1995a) as due to some events being weak, not giving precipitation at the observation site, not being detected there, and thus giving rise to a multiple interevent dry period. Sansom and Thomson's (1992) data were similar to that shown at the top of Fig. 1 and their results similar to the middle panel of Fig. 1 and the bottom of Fig. 2, with component E small and located away from the Rw component such that E and Sw could be taken together to represent the Sw state. On the other hand, D was collocated with significant mass of both the Rw and Sw components in the region of low rates and short periods and might well have given rise to erroneous estimates for both the Rw's and Sw's distributions' parameters. In an attempt to refine the Rw and Sw estimates, Sansom (1995a) analyzed a recompiled breakpoint dataset in which a new criterion to differentiate within the manually digitized data between wet and dry periods had been applied; this was done despite the inherent difficulty, due to rainfall being a discontinuous process, that within any rainfall measurement system ambiguity exists over the presence or absence of precipitation. In this paper, a further attempt to refine the Rw and Sw estimates is made by dispensing with that part of the dataset where doubtful differentiation between wet and dry exists. The rest of this section details the result of applying the EM algorithm, with modifications as presented in Sansom (1995b) and ST97, to the truncated or censored breakpoint dataset. The wet data that was discarded was for the lower rates at all durations, but rather than discard any dry periods, all periods, both wet and dry, were analyzed as a whole.
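As an illustration of the kind of decomposition referred to here, the following Python sketch runs a plain EM iteration for a univariate lognormal mixture, working on the logarithms of the data. It is only a sketch under stated assumptions: it is not the truncated or censored EM of Sansom (1995b) and ST97, the synthetic "durations" and starting values are invented, and the `fixed` argument is a hypothetical way of showing components whose locations and scales are held constant while only their fractions are re-estimated.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_lognormal_mixture(values, weights, mus, sigmas, fixed=(), n_iter=100):
    """Plain EM for a univariate lognormal mixture, fitted on log(values).

    weights, mus, sigmas: initial guesses per component (log scale).
    fixed: indices of components whose mu and sigma are held constant,
           so that only their mixing fraction is re-estimated.
    """
    logs = [math.log(v) for v in values]
    weights, mus, sigmas = list(weights), list(mus), list(sigmas)
    k = len(weights)
    for _ in range(n_iter):
        # E step: posterior probability of each component for each datum.
        resp = []
        for x in logs:
            dens = [weights[j] * normal_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M step: update fractions, and locations/scales unless held fixed.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(logs)
            if j in fixed:
                continue
            mus[j] = sum(r[j] * x for r, x in zip(resp, logs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, logs)) / nj
            sigmas[j] = math.sqrt(var)
    return weights, mus, sigmas


# Synthetic two-component "durations" (invented values, not the paper's data).
rng = random.Random(1)
data = [rng.lognormvariate(0.0, 0.5) for _ in range(600)]
data += [rng.lognormvariate(3.0, 0.7) for _ in range(400)]

free_fit = em_lognormal_mixture(data, [0.5, 0.5], [0.5, 2.5], [1.0, 1.0])
# Same fit, but with the second component's location and scale held fixed.
fixed_fit = em_lognormal_mixture(data, [0.5, 0.5], [0.5, 3.0], [1.0, 0.7], fixed=(1,))
```

Holding a component fixed in this way mirrors, in a much simplified form, the reanalysis step described in the text that follows, where previously estimated locations and scales are treated as given constants.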
FIG. 1. (Top) A scatterplot of the 40 570 wet breakpoint data points, each of which has a rain rate and a duration for which that rate persisted. The plot is on log-log axes and the center part has been contoured. (Middle) Four-component EM fit to the wet data where a small amount of truncation has been used such that data below the sloping line across the plot has been discarded. The lower of these lines shows the least amount of truncation that was used, and the upper the greatest amount before the pattern shown ceased to be that of the most probable fit. (Bottom) Similar to the middle but for higher truncations, i.e., levels between those shown by the sloping lines. In the top panel the truncation lines are also shown. In both the middle and bottom panels the same contours are used from mode to mode except where a mode is small and the lowest contour for the other modes cannot be used, in which case the mode is shown by a dashed contour set at 95% of that mode's maximum and the value of the contour as a percentage of the lowest contour.

FIG. 2. (Top) Histogram of the 16 112 dry periods' lengths. (Middle) Histogram of all 56 682 period lengths, i.e., both the drys and the wets. (Bottom) Six-component EM fit to all the periods with the location with respect to duration of the Sw and Rw components of the bottom panel of Fig. 1 shown by vertical dashed lines.

To finalize parameter estimates for the wet distributions, some of the all-periods results were used as fixed values in a reanalysis. For those fixed components, their locations and scales were not updated to a new estimate with each iteration of the EM algorithm, but rather those parameters were treated as given constants and only their fractional representations were estimated. But first, it should be noted that a sufficiently long period with no rain can easily be recognized as dry, while for short periods, some threshold rate of rain needs to be detected if the period is to be treated as wet. Furthermore, the shorter the period, the higher the threshold, and the placement of the truncation-censoring line over the rate versus period plane was set accordingly (see Fig. 1). The difference between truncation and censoring should also be noted: in the former all information about certain data is lost, while in the latter, the count of the number of data being ignored is retained. It should also be noted that this count can be estimated for the truncation case; that is, the size of the dataset before truncation, here designated N, can be estimated. Since it is not known how many periods have been misclassified with respect to their wet or dry status, truncation rather than censoring seemed more appropriate.

Figure 1 shows the wet data at the top and the results of the decomposition of the truncated wet periods, with the modal pattern for low truncation in the middle panel and for higher truncation in the bottom panel. The low truncation pattern is as described above with an N of about 42 000, which is only a little larger than the 40 570 observations of the dataset. For higher truncation, the E mode disappears or moves to a location (i.e., ? in the bottom panel of Fig. 1) that is not supported by those simulations that earlier suggested that E was due to digitizing effects. Also, in the higher truncation pattern the D mode becomes very prominent, representing 30% of the estimated N, which is close to 50 000; thus, from the 40 570 observations, the N of 50 000, and the 0.3 share of D, about 5500 dry periods were misclassified as wet. This number represents about 14% of the wet data, which is much higher than would be expected given the known performance of the gauge-digitizer system that is the source of the data. A similar calculation for the low truncation case also yields around 5000 drys misclassified as wets; thus, to avoid this area of doubt in an analysis of the dry periods, truncation, so that only those periods longer than about 1 h remain, is required. Such truncation would be severe, and from the top panel of Fig.
2, which shows a histogram of the dry data, it can be seen that the mode of the dry periods' distribution is at about 1 h. However, even if there is doubt at times over whether a period is wet or dry, it can be assumed that all period lengths have been digitized sufficiently correctly, and all periods, both wet and dry rather than only the dry periods, can be decomposed as a whole to give the dry components and the marginal duration distributions for the wet periods. A histogram of all the period lengths is given in the middle panel of Fig. 2. The result of fitting all the periods is shown in the bottom panel of Fig. 2, in which the locations found for Rw and Sw in the more highly truncated wet data are shown by vertical lines; no components that might have aligned with the wet E and D modes could be found. Thus, the estimates found in this all-periods analysis for Rw and Sw are close to those derived from the truncated wet data. Therefore, the all-periods fit can be repeated with the locations and scales of Rw and Sw fixed at the mean of the truncated-wet-data estimate and the all-periods estimate. The fit differed


little from that at the bottom of Fig. 2, and the other four components, which align with previous estimates for Rd, Sd, I, and M, represented 18 726 observations, which should be compared with the 16 112 that were classified as dry. Thus, about 2600 drys appear to have been misclassified as wet periods, in which case the actual number of wet periods that should be in the dataset is about 38 000, and this can be used as N in an EM fit to the censored wet data. A three-mode (i.e., Rw, Sw, and E) censored-wet-data fit with N = 38 000 was obtained, but it bore little resemblance to the other fits and it was necessary to fix the locations of Rw and Sw to the mean values used before. Also, their relative sizes were fixed in the ratio found in the all-periods-fixed fit, and a range of values from 1% to 10% for the representation of E was tried. It was found that with E's representation at 5%, the scale parameters for Rw and Sw were close to the fixed values previously used. Also, the rate variate parameters for Rw and Sw were similar to the estimates from the truncated-wet-data fit, and E was located in the area suggested by the simulation of the manual digitization process.

The final step in finding initial values for the states' distributions' parameters is to find the parameters for a component in the wet dataset that, given the Rw, Sw, and E modes, will represent those data that, although classified as wet, were really dry times. This can be achieved by finding a four-component (i.e., Rw, Sw, E, and D) fit to the full wet dataset but with all the parameters of the Rw, Sw, and E modes fixed to the estimates made by the censored-wet-data fit. The bottom panel of Fig. 3 shows the result of such a fit, in which the D component represents 2434 of the 40 570 wet data and with respect to duration is located close to the Rd mode in the dry data. The top panel of Fig.
3 shows the components from the all-periods fit with fixed Rw and Sw that are attributable to dry periods, but Rd's representation has been reduced from 7264 data by the 2434 of these that had been classified as wet, that is, the D mode of the bottom panel.

4. Fitting the HMM

The procedures detailed in Rabiner (1989) were used to fit the breakpoint dataset to an HMM with the transition probability matrix, P, initialized to the same probability for all transitions. The state distributions were initialized using the values illustrated in Fig. 3 with some of the components paired to form mixture distributions for some of the states, that is, Rd with D together composed the Rd state, Sw with E the Sw state, and I with M the I state. The Rw and Sd states had single-component distributions. The fitting resulted in new estimates for all these distributions, which are illustrated in Fig. 4, and it also gave an estimate for P, that is,

FIG. 3. (Top) The four dry components of the EM fit to all periods repeated from the bottom panel of Fig. 2 after allowing for those Rd that were misclassified as wet. (Bottom) In the style of the lower panels of Fig. 1, showing the final four-component fit to the wet data after fixing E, Sw, and Rw from a censored fit in which the location and sizes of Sw and Rw were fixed from a mean of the all-periods fit and the more highly truncated wet-data fit.

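$$
P =
\begin{array}{c|ccccc}
 & \text{Rw} & \text{Rd} & \text{Sw} & \text{Sd} & \text{I} \\
\hline
\text{Rw} & 0.742 & 0.113 & 0.115 & 0.023 & 0.006 \\
\text{Rd} & 0.549 & 0.000 & 0.451 & 0.000 & 0.000 \\
\text{Sw} & 0.044 & 0.068 & 0.349 & 0.433 & 0.106 \\
\text{Sd} & 0.037 & 0.000 & 0.963 & 0.000 & 0.000 \\
\text{I} & 0.194 & 0.000 & 0.806 & 0.000 & 0.000
\end{array}.
$$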
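Each row of the printed matrix sums to one (to rounding), so the rows can be read as the "from" state and the columns as the "to" state. The following is a minimal Python sketch under that reading, using NumPy rather than the author's software. It holds P as an array and solves for the long-run proportion of breakpoints in each state, the quantity that section 5 compares with the observed counts. The least-squares solve and the function name `stationary_distribution` are illustrative choices, not the method reported in the paper.

```python
import numpy as np

# State order assumed here: Rw, Rd, Sw, Sd, I (rows = "from", columns = "to").
STATES = ["Rw", "Rd", "Sw", "Sd", "I"]

# Transition probabilities as printed (three decimals, so rows sum to 1 only
# to within rounding).
P = np.array([
    [0.742, 0.113, 0.115, 0.023, 0.006],
    [0.549, 0.000, 0.451, 0.000, 0.000],
    [0.044, 0.068, 0.349, 0.433, 0.106],
    [0.037, 0.000, 0.963, 0.000, 0.000],
    [0.194, 0.000, 0.806, 0.000, 0.000],
])


def stationary_distribution(P):
    """Solve pi @ P = pi subject to sum(pi) = 1 by least squares."""
    P = P / P.sum(axis=1, keepdims=True)  # remove rounding drift in the rows
    m = P.shape[0]
    # Stack the balance equations (P^T - I) pi = 0 with the normalisation row.
    A = np.vstack([P.T - np.eye(m), np.ones((1, m))])
    b = np.concatenate([np.zeros(m), [1.0]])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi


pi = stationary_distribution(P)
for name, p in zip(STATES, pi):
    print(f"{name}: {p:.3f}")
```

Multiplying the resulting proportions by the total number of breakpoints gives expected state counts that can be set against the counts produced by the classification.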

FIG. 4. In the style of Fig. 3, showing the distributions of the HMM as reestimated using the values of Fig. 3 as initial values in the HMM fitting procedures.

The Viterbi algorithm was then used to determine the sequence of states³ that maximizes the probability of the observation sequence, and the value of this probability is also available. Figure 4 resembles Fig. 3 in broad outline, but the Rw and Rd modes in the HMM are located as for the slightly truncated EM algorithm fit (i.e., the middle panel of Fig. 1), only Rw is now larger and Rd is smaller than D. With regard to the components for the dry periods, their locations in Figs. 3 and 4 are similar, but the Sd and I modes appear larger in the HMM and the Rd and M smaller. Except for the E and D modes, the labels used on the various wet and dry components of Figs. 1–4 have not as yet been justified in any way but merely selected to conform with anecdotal expectations. The association between the wet and dry components was fixed by minimizing the number of episodes, which is equivalent to maximizing rain-generating mechanism persistence.

³ Hence the proportional representation of each state.
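The Viterbi classification described above can be sketched generically. The code below is an illustrative log-space implementation, not the author's program. It assumes the log density of every observation under every state has already been evaluated and stored in `log_b`; in the paper these densities would come from the fitted rate and duration distributions. The two-state numbers in the toy example are invented purely to exercise the function.

```python
import numpy as np


def viterbi(log_pi, log_P, log_b):
    """Most probable state path for an HMM.

    log_pi : (M,)   log initial-state probabilities
    log_P  : (M, M) log transition probabilities, rows = from, columns = to
    log_b  : (T, M) log density of observation t under each state
    Returns (path, log probability of the best path and the observations).
    """
    T, M = log_b.shape
    delta = np.empty((T, M))            # best log probability ending in each state
    back = np.zeros((T, M), dtype=int)  # best predecessor of each state
    delta[0] = log_pi + log_b[0]
    for t in range(1, T):
        cand = delta[t - 1][:, None] + log_P  # cand[i, j]: arrive in j from i
        back[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0) + log_b[t]
    # Trace the best path backwards from the most probable final state.
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path, float(delta[-1].max())


# Toy two-state example with made-up numbers.
log_pi = np.log([0.5, 0.5])
log_P = np.log([[0.9, 0.1], [0.1, 0.9]])
log_b = np.log([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
path, logp = viterbi(log_pi, log_P, log_b)
print(path, np.exp(logp))
```

Working in logarithms avoids numerical underflow on long sequences, which matters for a data stream of tens of thousands of breakpoints.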

The decision as to which pair of wet and dry modes can be assigned to rain and which to showers was made through a comparison with contemporary hourly manual weather observations. Both these methods were used in Sansom (1995a), where further details are given.

An episode has been defined as a period of time during which the precipitation mechanism remains constant but during which dry intervals can take place; thus, it is a time with a sequence of period labels like Rw . . . RwRdRw . . . RwRdRw, or like Sw . . . SwSdSw . . . SwSdSw. Of the periods, 16 441 were labeled Rw or Rd, and these were grouped into 1808 episodes, while 37 858 periods were labeled Sw or Sd in 3846 episodes. A random mixture of 16 441 things of one kind with 37 858 things of another kind would on average produce 22 926 runs rather than 1808 + 3846 = 5654, which is equivalent to 176 standard deviations too few. On the other hand, if the Rd and Sd labels are swapped, then 26 213 are labeled Rw or Rd in 14 628 runs, and 28 085 are labeled Sw or Sd in 16 663 runs, but a total of 27 117 runs might be expected instead of 14 628 + 16 663 = 31 291, which is equivalent to 36 standard deviations too many. Thus, the chosen labeling minimizes the number of episodes, and that number is far smaller than would be expected by chance, which shows that much persistence exists in the data.

With regard to the comparison with manual weather observation, it was found that in the 15-yr period of the data, 75% of the hours labeled as I were also judged by the human observer to be a time of no precipitation. The remaining 25%, when the observer suggested some precipitation, were mainly times when adjacent showers were reported and so corresponded to those parts of the M mode that were really Sds. For those hours labeled as S, the manual observation agreed 88% of the time, and for R the agreement was 66%; however, with the HMM S and R labels switched, the agreements both dropped to 35%.
It should be noted here that, despite the shortcomings of the HMM, which will be mentioned later, these levels of agreement for the adopted labeling and disagreement when the R, S labels were switched indicate some improvement in the HMM over the static model of Sansom (1995a). The histograms of Fig. 5 show the distributions of episode lengths in terms of both the number of breakpoints and duration in hours; the means and standard deviations of the distributions are also shown. According to the definitions given in the introduction, events are composed of episodes and, if now R is used to denote the series of Rws and Rds within a rain episode and similarly for S, then, between any two interevent drys (i.e., Is) there will be at least one R or S episode and possibly a string of RS episode pairs. The structure of events in terms of the number of episodes is given in the left-hand column of Table 1, where in the fourth row, the n is the number of rain-episode–shower-episode (RS) pairs that occurred within an event.
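The run counts used earlier in this section to justify the pairing of wet and dry modes can be checked numerically. The sketch below assumes the standard two-category runs formulas (the Wald–Wolfowitz runs test); the paper does not name the formulas it used, but under this assumption the quoted expectations and standard-deviation distances are reproduced closely. The function name `runs_summary` is hypothetical.

```python
import math


def runs_summary(n1, n2, observed_runs):
    """Expected runs, standard deviation, and z-score for a random mixture
    of n1 items of one kind and n2 items of another."""
    n = n1 + n2
    mean = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1.0))
    sd = math.sqrt(var)
    return mean, sd, (observed_runs - mean) / sd


# Adopted labeling: 16 441 rain periods and 37 858 shower periods.
adopted = runs_summary(16441, 37858, 1808 + 3846)
# Rd and Sd labels swapped: 26 213 and 28 085 periods.
swapped = runs_summary(26213, 28085, 14628 + 16663)

for label, (mean, sd, z) in [("adopted", adopted), ("swapped", swapped)]:
    print(f"{label}: expected runs {mean:.0f}, sd {sd:.1f}, z {z:.1f}")
```

A large negative z-score means far fewer runs than chance would give, that is, strong persistence of the rain-generating mechanism; the swapped labeling gives a positive score instead.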


TABLE 1. Some statistics of events in terms of the number of episodes within the events.

Statistic                           No. in data   Mean no. in simulations
Total no. of events                     2572              2694
No. of events with a single S           1441               986
No. of events with a single R             22                17
No. of events like (S)(RS)^n(R)         1109              1691
No. with an initial S                    647              1195
No. followed by an R                      32                81
No. with n = 0 (i.e., SR pairs)           23                31
Maximum of n                               8                12
Mean of n                               1.61              2.17

5. Discussion and simulations

To a certain extent, the reversion of the modal pattern in Fig. 4 from that of Fig. 3 to that of the middle panel of Fig. 1, which is similar to that used in Sansom (1995a), vindicates the discriminant analysis presented in that paper. Similarly, with Rd of Fig. 4 now suggesting that only 714 of the 40 570 wet periods should have been classified as dry, the performance of the gauge–digitizer system as a source for breakpoint data is also vindicated. It should also be noted that the relative sizes of I and M in Fig. 4 are similar to those found in Sansom (1995a). Figures 4 and 5, as well as P and Table 1, show that the HMM conforms with anecdotal expectations that during widespread rain there may be changes of rain rate but little dry time, while during showers shorter periods of steady precipitation tend to be interspersed with longer dry periods.

The model indicates that, for the location concerned, about 170 precipitation events occurred every year with 56% of these being a single shower episode and 43% a succession of RS episode pairs with an average of 1.61 such pairs in each event. However, 58% of these RS pairs actually started with an S episode and a few ended with an R, with most of these consisting of an SR pair. The events covered 23% of the available time, which was divided between rain and shower episodes as 3.7% and 19.3%, respectively, but 20% of the rain time was dry as was 77% of the shower time. Rain episodes yielded about 63% of the total precipitation although there were over twice as many shower episodes with an average duration of 6.6 h, while rain episodes only averaged 2.7 h. The mean duration of the interevent drys was about 39 h.

For a given transition matrix, the expected proportional representation for each state, $\pi$, can be found by solving $\pi P = \pi$ with the constraint that, where M is the number of states, $\sum_{i=1}^{M} \pi_i = 1$. For the P estimated from the breakpoint data, the expected count for the Is was approximately as observed, but the Rw and Rd states were less represented in the data than might have been expected (by 1000 each), while the Sw and Sd were overrepresented (also by 1000 each). A number of simulations were run using the estimated P and state

FIG. 5. Histograms of episode lengths in terms of the number of breakpoints in the top two panels and in terms of hours in the next lower two panels. The bottom panel is for the interevent dry periods and is in terms of hours; an equivalent in terms of breakpoints is not shown since all such periods are just one breakpoint long. In the top right-hand corner of each panel the number of episodes and the mean and standard deviation of their lengths are given.


TABLE 2. Some statistics of episodes in terms of the number of breakpoints (brkpts.) and hours within the episodes (NB: No. of runs = No. of episodes).

                                      Data                         Simulations
Episode type                 No. of brkpts.  No. of runs   No. of brkpts.   No. of runs
Rain episodes                    16 441         1808        18 433 ± 500     3702 ± 50
Shower episodes                  37 858         3846        35 740 ± 500     5783 ± 100

                                      Data                         Simulations
Episode type                     Mean        Std dev          Mean            Std dev
Rain episodes (brkpts.)           9.1          8.4         5.0 ± 0.05       4.9 ± 0.15
Shower episodes (brkpts.)         9.8         12.1         6.2 ± 0.10       5.9 ± 0.15
Rain episodes (h)                 2.7          2.9         1.5 ± 0.02       1.6 ± 0.05
Shower episodes (h)               6.6          8.5         4.7 ± 0.10       6.9 ± 0.50
Interevent dry periods (h)       39.1         41.1        36.5 ± 1.00      41.6 ± 2.00

distributions to assess how these differences would affect the episode and event statistics. In doing this, it was found that the observed counts for the Rw, Rd, Sw, Sd, and I states were equivalent to distances of 3.8, 21.8, 6.4, 8.7, and 1.8 standard deviations, respectively, from the expected counts. Table 1 presented the statistics of both the actual and the simulated events, and Table 2 presents the statistics for both actual (repeated from Fig. 5) and simulated episodes. It can be seen from these tables that both the rain and shower episodes found in the data are longer on average, in terms of both the number of breakpoints and the temporal extension, than suggested by the simulations. Also, there are fewer episodes per event in the data than in the simulations, in which the standard error of the mean of n was 0.03; thus, the mean number in the data is about 18 standard deviations from the expected mean number. Overall, it appears that the HMM allows easier exit from an episode than is found in the data, and some adjustment to the model is required.

In the above fitting procedures no allowance for seasonality had been made, and an initial adjustment could be to fit the model on a month-by-month basis. However, when this was done for the initialization values, little variation was found, and when these were compared to fits for individual years, the interannual variability was also small but larger than the intraannual variation. Thus, it seems unlikely that allowing for seasonality would be sufficient adjustment to the HMM. Essentially, the required change would be to the dwell times, in terms of the number of breakpoints, which are too short, and since these times in a Markov model are geometrically distributed, the most direct means of increasing the dwell times is to adopt a distribution other than the geometric, in particular, one with its mode greater than unity. Such a model, which also requires the exclusion of self-transitions, would be a hidden semi-Markov model (HSMM) and fitting procedures are available (Rabiner 1989).

The HSMM fitting procedures are significantly more complex than those for the HMM, and before turning to the HSMM, some adjustments to the HMM should be considered. These are of two kinds: first, by explicitly disallowing some transitions and thus restricting the connectivity of the states, and second, by increasing the number of states and thus allowing the HMM to find finer structure in the data. The former type of adjustment was attempted and some details are given below, but while pursuing this, some suggestion was found of a seven-state model with a greater likelihood than the five-state model. However, the physical interpretation of the states proved difficult and the second type of adjustment was not attempted any further.

The degree of persistence found in the data exceeded that of the HMM and any restrictions within P were, therefore, aimed at correcting this by reducing the options available for changes between episode type. In the estimated P, the transitions from or to Rd and Sd or I were not exactly zero but of the order $10^{-40}$ or less; however, these might have become significant if other elements of P were set to zero, and as a first restriction, they were set to zero so that all transitions between dry states were forbidden. A second restriction was to disallow changes between Rw and Sw so that episodes must change through a dry period, and a third was to insist that an Rd can only change to an Rw so that an R episode would always start and end with Rw states. Taking the three restrictions singly, in pairs, and altogether gave seven other HMMs, which were compared to the HMM with no restrictions through the probabilities of the observation sequence given the particular model and through their expected state proportions, π. With regard to the latter, for all models, the expected and observed population sizes of the states were significantly different and usually in the same sense as the unrestricted model. Thus, by this measure, there was no improvement through imposing transition restrictions, and by the measure through the observation sequence probability, in only one instance was the unrestricted HMM value exceeded.
This was with just the second restriction, when changes between Rw and Sw were forbidden, but what in other models was taken to be the Rd state no longer seemed to fill that role. Instead, over half the outward transitions from Rd were to Sw/Sd and a quarter to I, while for inward transitions to Rd half were from I and a quarter from Sw/Sd. Also, the size of the state had grown at the expense of Sw's population size, suggesting that the state was more concerned with light showers, considering its location in the duration–rate plane, than with dry intervals in rain episodes.

6. Conclusions

The putative light shower state alluded to at the end of the last section suggests that there may be finer structure in the rainfall process than that modeled by the five states Rw, etc., in which case, if further rain-generating mechanisms are excluded, then one if not both of the R and S mechanisms will need to be divided into subclasses. This could certainly be handled within an HMM by allowing further states and would be acceptable physically since more than one class of synoptic situation gives rise to showers and similarly for rain. Alternatively, further states could be introduced to allow for second-order effects where a transition between states may be influenced by the prior state and, for example, a state denoted by RRw would indicate that the current state is Rw and the prior one was either Rw or Rd. There is no particular indication, apart from the excessive dwell times in the data compared to the model, that states like RRw may be required, but some indication of subclasses within showers can be seen in the top panel of Fig. 5. In that figure, the shape of the histogram is such that it might represent a mixture with modes at one and three breakpoints, that is, one for single light showers and another for longer showery episodes. However, in all of the nine simulations in which a mixture distribution⁴ was not used, the resulting equivalent histograms were of a similar shape with the second class smaller than the first and third. On the other hand, the rain breakpoint distribution in the second panel from the top of Fig. 5, where the mode is clearly not in the first class, implies that a distribution other than geometric for these is required, and hence an HSMM rather than HMM should be fitted.
Furthermore, in all the simulations the equivalent distribution's mode was distinctly in the first class, and it should also be noted that to achieve with a mixture a mode away from the origin requires one of the components to be other than geometric. Thus, since merely restricting some transitions was insufficient to enable the HMM to adequately model the observations, the impetus for advancing to an HSMM appears stronger than just including additional states to the five used in this paper.

Despite the deficiencies of the HMM, it did, as noted earlier, give a closer match to manual hourly weather observations than the static model of Sansom (1995a), and the description given at the beginning of section 5 and the implications of Fig. 4 accord with the general anecdotal view of rainfall. Furthermore, it appears that there is more agreement between the data and this view than with the simulations, which suggested shorter episodes and more episodes per event than in the data. Thus, in a well-fitting HSMM with possibly more than five states and suitable transition restrictions it is possible that episodes may be yet longer and the number of episodes per event smaller, in which case even greater accord with the anecdotal view might be claimed.

It is unfortunate that, apart from the manual weather observations, no independent dataset exists that gives at a high temporal resolution an assessment of the ambient state of the atmosphere with regard to precipitation, that is, whether at a particular time it is R, S, or I. It is also unfortunate that much manual effort is required to produce the breakpoint data and that even with the greatest care there are inherent errors in the digitizing process. Both of these issues are currently being addressed: the first by locating breakpoint gauges in the vicinity of a weather radar from whose images it should be possible to assess which of R, S, or I is ambient; and the second with the development of processes to automatically yield breakpoint data from high temporal resolution gauges. However, the immediate future thrust will be to fit the currently available data to an HSMM with five or more states.
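The dwell-time argument behind the move to an HSMM can be illustrated numerically. In a Markov chain the number of consecutive breakpoints spent in a state with self-transition probability p is geometric, so its most likely value is always one, and a mixture of geometrics keeps that property. The sketch below uses the Rw and Sw self-transition probabilities from the estimated P; the mixture weights and parameters (0.5, 0.3, 0.9) are invented for illustration, and the calculation refers to single states rather than to whole episodes.

```python
import numpy as np


def geometric_pmf(p_stay, d):
    """P(dwell = d breakpoints) for a state with self-transition probability p_stay."""
    d = np.asarray(d)
    return (p_stay ** (d - 1)) * (1.0 - p_stay)


d = np.arange(1, 41)

# Self-transition probabilities of the two wet states in the estimated P.
mean_dwell = {}
for name, p_stay in [("Rw", 0.742), ("Sw", 0.349)]:
    pmf = geometric_pmf(p_stay, d)
    mean_dwell[name] = 1.0 / (1.0 - p_stay)  # mean of a geometric dwell time
    print(name, "mode:", int(d[pmf.argmax()]), "mean:", round(mean_dwell[name], 2))

# Hypothetical mixture of two geometrics: still strictly decreasing in d,
# so its mode also stays at one breakpoint.
mixture = 0.5 * geometric_pmf(0.3, d) + 0.5 * geometric_pmf(0.9, d)
mixture_mode = int(d[mixture.argmax()])
print("mixture mode:", mixture_mode)
```

A dwell-time distribution whose mode exceeds one therefore cannot be produced by self-transitions alone, which is why an explicit duration distribution, as in an HSMM, is needed.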
⁴ It should be noted that a mixture of geometrics would still have a mode in the first class.

REFERENCES

Bacchi, B., P. Burlando, and R. Rosso, 1989: Extreme value analysis of stochastic models of point rainfall. Third Scientific Assembly of IAHS, Baltimore, MD, IAHS.
Barring, L., 1992: Comments on "Breakpoint representation of rainfall." J. Appl. Meteor., 31, 1520–1524.
Bickel, P. J., and Y. Ritov, 1996: Inference in hidden Markov models I: Local asymptotic normality in the stationary case. Bernoulli, 2, 199–228.
Bonta, J. V., and A. R. Rao, 1988: Factors affecting the identification of independent rainstorm events. J. Hydrol., 98, 275–293.
Buishand, T. A., 1978: Some remarks on the use of daily rainfall models. J. Hydrol., 36, 295–308.
Burlando, P., and R. Rosso, 1993: Stochastic models of temporal rainfall: Reproducibility, estimation and prediction of extreme events. Stochastic Hydrology and Its Use in Water Resources Systems Simulation and Optimization, J. B. Marco, Ed., Kluwer Academic, 137–173.
Cox, D. R., and V. Isham, 1980: Point Processes. Chapman and Hall, 188 pp.
Dempster, A. P., N. M. Laird, and D. B. Rubin, 1977: Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc., Ser. B, 39, 1–38.
Dennett, M. D., J. A. Rodgers, and J. D. H. Keatinge, 1983: Simulation of a rainfall record for a new site of a new agricultural development: An example from northern Syria. Agric. Meteor., 29, 247–258.
Eagleson, P. S., 1972: Dynamics of flood frequency. Water Resour. Res., 8, 878–898.
Gabriel, K. R., and J. Neumann, 1962: A Markov chain model for daily rainfall occurrence at Tel Aviv. Quart. J. Roy. Meteor. Soc., 88, 90–95.
Green, J. R., 1964: A model for rainfall occurrence. J. Roy. Stat. Soc., Ser. B, 26, 345–353.
Guzman, A. G., and W. C. Torrez, 1985: Daily rainfall probabilities: Conditional on prior occurrence and amount of rain. J. Climate Appl. Meteor., 24, 1009–1014.
Haan, C. T., D. M. Allen, and J. O. Street, 1976: A Markov chain model of daily rainfall. Water Resour. Res., 12, 443–449.
Hutchinson, M. F., 1990: A point rainfall model based on a three-state continuous Markov occurrence process. J. Hydrol., 114, 125–148.
Hutchinson, M. F., 1991: Climatic analysis in data sparse regions. Climatic Risk in Crop Production, R. C. Muchow and J. A. Bellamy, Eds., CAB International, 55–71.
Hutchinson, M. F., 1995: Stochastic space–time weather models from ground-based data. Agric. Forest Meteor., 73, 237–264.
Hutchinson, M. F., C. W. Richardson, and P. T. Dyke, 1993: Normalization of rainfall across different time steps. Management of Irrigation and Drainage Systems, Park City, UT, Irrigation and Drainage Division, ASCE, U.S. Dept. of Agriculture, 432–439.
Joss, J., and A. Waldvogel, 1969: Raindrop size distribution and sampling size errors. J. Atmos. Sci., 26, 566–569.
Katz, R. W., 1977: Precipitation as a chain dependent process. J. Appl. Meteor., 16, 671–676.
Le Cam, L., 1961: A stochastic description of precipitation. Proc. Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, Office of Ordnance Research, U.S. Army, 165–186.
Leroux, B. G., 1992: Maximum-likelihood estimation for hidden Markov models. Stochastic Processes and their Applications, 40, 127–143.
Marshall, J. S., and W. M. Palmer, 1948: Relation of raindrop size to intensity. J. Meteor., 5, 165–166.
Newnham, E. V., 1916: The persistence of wet and dry weather. Quart. J. Roy. Meteor. Soc., 42, 153–162.
Rabiner, L. R., 1989: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77, 257–285.
Redner, R. A., and H. F. Walker, 1984: Mixture densities, maximum likelihood, and the EM algorithm. Soc. Ind. Appl. Math., Rev., 26, 192–239.
Revfeim, K. J. A., 1982: Comments "On the study of a probability distribution for precipitation total." J. Appl. Meteor., 21, 1942–1945; Corrigendum, 22, 502.
Revfeim, K. J. A., 1984: An initial model of the relationship between rainfall events and daily rainfalls. J. Hydrol., 75, 357–364.
Richardson, C. W., 1977: A model of stochastic structure of daily precipitation over an area. Colorado State University, Fort Collins Hydrology Paper 91.
Rodriguez-Iturbe, I., D. R. Cox, and V. Isham, 1987: Some models for rainfall based on stochastic point process. Proc. Roy. Soc. London, Ser. A, 410, 269–288.
Roldan, J., and D. A. Woolhiser, 1982: Stochastic daily precipitation models. 1. A comparison of occurrence processes. Water Resour. Res., 18, 1451–1459.
Sansom, J., 1987: Digitising pluviographs. J. Hydrol. N.Z., 26, 197–209.
Sansom, J., 1988: Rainfall variation at Invercargill, New Zealand. N.Z. J. Geol. Geophys., 31, 247–256.
Sansom, J., 1992: Breakpoint representation of rainfall. J. Appl. Meteor., 31, 1514–1519.
Sansom, J., 1995a: Rainfall discrimination and spatial variation using breakpoint data. J. Climate, 8, 624–636.
Sansom, J., 1995b: The breakpoint representation of rainfall. Proc. Sixth Int. Meeting on Statistical Climatology, Galway, Ireland, University College Galway, 355–358.
Sansom, J., and P. J. Thomson, 1992: Rainfall classification using breakpoint pluviograph data. J. Climate, 5, 755–764.
Small, M. J., and D. J. Morgan, 1986: The relationship between a continuous-time renewal model and a discrete Markov chain model of precipitation occurrence. Water Resour. Res., 22, 1422–1430.
Stern, R. D., and R. Coe, 1984: A model fitting analysis of daily rainfall data. J. Roy. Stat. Soc., Ser. A, 147, 1–34.
Stidd, C. K., 1973: Estimating the precipitation climate. Water Resour. Res., 9, 1235–1241.
Torres, D. S., J. M. Porra, and J. Creutin, 1994: A general formulation for raindrop size distribution. J. Appl. Meteor., 33, 1494–1502.
Woolhiser, D. A., and G. G. S. Pegram, 1979: Maximum likelihood estimation of Fourier coefficients to describe seasonal variations of parameters in stochastic daily precipitation models. J. Appl. Meteor., 18, 34–42.
Woolhiser, D. A., and J. Roldan, 1982: Stochastic daily precipitation models. 2. A comparison of distributions of amounts. Water Resour. Res., 18, 1461–1468.
