[Figure: Time series: random data plus trend, with best-fit line and different smoothings]

In statistics, signal processing, econometrics and mathematical finance, a time series is a sequence of data points, typically measured at successive times spaced at uniform time intervals. Examples of time series are the daily closing value of the Dow Jones index and the annual flow volume of the Nile River at Aswan. Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to forecast future events based on known past events: to predict data points before they are measured. An example of time series forecasting in econometrics is predicting the opening price of a stock based on its past performance. Time series are very frequently plotted via line charts.

Time series data have a natural temporal ordering. This makes time series analysis distinct from other common data analysis problems, in which there is no natural ordering of the observations (e.g. explaining people's wages by reference to their education level, where the individuals' data could be entered in any order). Time series analysis is also distinct from spatial data analysis, where the observations typically relate to geographical locations (e.g. accounting for house prices by the location as well as the intrinsic characteristics of the houses). A time series model will generally reflect the fact that observations close together in time will be more closely related than observations further apart. In addition, time series models will often make use of the natural one-way ordering of time, so that values for a given period are expressed as deriving in some way from past values, rather than from future values (see time reversibility).

Methods for time series analysis may be divided into two classes: frequency-domain methods and time-domain methods. The former include auto-correlation, cross-correlation analysis, spectral analysis and, recently, wavelet analysis; auto-correlation and cross-correlation analysis can also be completed in the time domain.
Contents

1 Analysis
  1.1 General exploration
  1.2 Description
  1.3 Prediction and forecasting
2 Models
  2.1 Notation
  2.2 Conditions
  2.3 Models
3 Related tools
Analysis
There are several types of data analysis available for time series which are appropriate for different purposes.
General exploration

- Graphical examination of data series
- Autocorrelation analysis to examine serial dependence
- Spectral analysis to examine cyclic behaviour which need not be related to seasonality. For example, sunspot activity varies over 11-year cycles.[1][2] Other common examples include celestial phenomena, weather patterns, neural activity, commodity prices, and economic activity.
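To make the autocorrelation idea concrete, here is a minimal sketch (not from the original article; the function name and trend data are our own) that computes the sample autocorrelation of a series at a given lag:

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    # total squared deviation of the whole series (n times the variance)
    denom = sum((x - mean) ** 2 for x in series)
    # cross-product between the series and its lag-shifted copy
    num = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return num / denom

# A strongly trending series is highly autocorrelated at lag 1.
trend = [float(t) for t in range(50)]
r1 = autocorrelation(trend, 1)
```

Serial dependence shows up as r1 close to 1 here; for white noise it would be near 0.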
Description
- Separation into components representing trend, seasonality, slow and fast variation, and cyclical irregular behaviour: see decomposition of time series
- Simple properties of marginal distributions

Prediction and forecasting

- Fully formed statistical models for stochastic simulation purposes, so as to generate alternative versions of the time series, representing what might happen over non-specific time periods in the future
- Simple or fully formed statistical models to describe the likely outcome of the time series in the immediate future, given knowledge of the most recent outcomes (forecasting)
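As a sketch of the decomposition idea (our own illustration, not from the article), a centred moving average estimates the trend component; subtracting it would leave the seasonal and irregular parts:

```python
def moving_average(series, window):
    """Centred moving average with an odd window, used to estimate trend."""
    half = window // 2
    return [sum(series[t - half:t + half + 1]) / window
            for t in range(half, len(series) - half)]

# Linear trend plus a period-4 seasonal pattern that sums to zero.
series = [t + [0, 2, 0, -2][t % 4] for t in range(20)]
trend = moving_average(series, 5)   # close to the underlying line t
```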
Models
Models for time series data can have many forms and represent different stochastic processes. When modeling variations in the level of a process, three broad classes of practical importance are the autoregressive (AR) models, the integrated (I) models, and the moving average (MA) models. These three classes depend linearly[3] on previous data points. Combinations of these ideas produce autoregressive moving average (ARMA) and autoregressive integrated moving average (ARIMA) models. The autoregressive fractionally integrated moving average (ARFIMA) model generalizes the former three. Extensions of these classes to deal with vector-valued data are available under the heading of multivariate time-series models, and sometimes the preceding acronyms are extended by including an initial "V" for "vector". An additional set of extensions of these models is available for use where the observed time series is driven by some "forcing" time series (which may not have a causal effect on the observed series): the distinction from the multivariate case is that the forcing series may be deterministic or under the experimenter's control. For these models, the acronyms are extended with a final "X" for "exogenous".

Non-linear dependence of the level of a series on previous data points is of interest, partly because of the possibility of producing a chaotic time series. More importantly, however, empirical investigations can indicate the advantage of using predictions derived from non-linear models over those from linear models.

Among other types of non-linear time series models, there are models to represent the changes of variance along time (heteroskedasticity). These models are called autoregressive conditional heteroskedasticity (ARCH), and the collection comprises a wide variety of representations (GARCH, TARCH, EGARCH, FIGARCH, CGARCH, etc.). Here, changes in variability are related to, or predicted by, recent past values of the observed series. This is in contrast to other possible representations of locally varying variability, where the variability might be modelled as being driven by a separate time-varying process, as in a doubly stochastic model.

In recent work on model-free analyses, wavelet transform based methods (for example locally stationary wavelets and wavelet decomposed neural networks) have gained favor. Multiscale (often referred to as multiresolution) techniques decompose a given time series, attempting to illustrate time dependence at multiple scales.
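As an illustration of the ARCH idea (a minimal sketch with made-up parameters, not code from the article): in an ARCH(1) model the conditional variance of each observation depends on the previous squared observation, producing bursts of volatility.

```python
import random

def simulate_arch1(n, a0=0.2, a1=0.5, seed=42):
    """Simulate ARCH(1): x_t = sigma_t * z_t with
    sigma_t^2 = a0 + a1 * x_{t-1}^2 and z_t standard normal."""
    rng = random.Random(seed)
    x, xs = 0.0, []
    for _ in range(n):
        sigma2 = a0 + a1 * x * x      # variance predicted from the last value
        x = rng.gauss(0.0, sigma2 ** 0.5)
        xs.append(x)
    return xs

series = simulate_arch1(5000)
# Unconditional variance should be near a0 / (1 - a1) = 0.4.
```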
Notation
A number of different notations are in use for time-series analysis. A common notation specifying a time series X that is indexed by the natural numbers is written X = {X1, X2, ...}. Another common notation is Y = {Yt : t ∈ T}, where T is the index set.
Conditions
There are two sets of conditions under which much of the theory is built:
- stationary process
- ergodic process
However, ideas of stationarity must be expanded to consider two important ideas: strict stationarity and second-order stationarity. Both models and applications can be developed under each of these conditions, although the models in the latter case might be considered as only partly specified.

In addition, time-series analysis can be applied where the series are seasonally stationary or non-stationary. Situations where the amplitudes of frequency components change with time can be dealt with in time-frequency analysis, which makes use of a time-frequency representation of a time series or signal.[4]
Models
Main article: Autoregressive model

The general representation of an autoregressive model, well known as AR(p), is

X_t = c + φ1 X_{t−1} + φ2 X_{t−2} + ... + φp X_{t−p} + ε_t

where the term ε_t is the source of randomness and is called white noise. It is assumed to have the following characteristics:

1. E[ε_t] = 0
2. E[ε_t²] = σ²
3. E[ε_t ε_s] = 0 for all t ≠ s

With these assumptions, the process is specified up to second-order moments and, subject to conditions on the coefficients, may be second-order stationary. If the noise also has a normal distribution, it is called normal white noise (denoted here by Normal-WN):

ε_t ~ i.i.d. N(0, σ²)

In this case, the AR process may be strictly stationary, again subject to conditions on the coefficients.
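The AR(p) recursion above can be simulated directly; this is a hedged sketch (the function name and parameter choices are our own) using Gaussian white noise:

```python
import random

def simulate_ar(coeffs, n, c=0.0, sigma=1.0, seed=0):
    """Simulate X_t = c + sum_i coeffs[i] * X_{t-1-i} + eps_t,
    where eps_t is i.i.d. Normal(0, sigma^2) white noise."""
    rng = random.Random(seed)
    p = len(coeffs)
    xs = [0.0] * p                      # zero initial conditions
    for _ in range(n):
        recent = xs[::-1][:p]           # the p most recent values, newest first
        x = c + sum(a * v for a, v in zip(coeffs, recent)) + rng.gauss(0.0, sigma)
        xs.append(x)
    return xs[p:]

# A stationary AR(2): lag-1 autocorrelation should be phi1/(1-phi2) = 0.75.
series = simulate_ar([0.6, 0.2], 2000)
```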
Mean
From Wikipedia, the free encyclopedia
In statistics, mean has two related meanings:

- the arithmetic mean (which is distinguished from the geometric mean or harmonic mean), and
- the expected value of a random variable, which is also called the population mean.
There are other statistical measures that use samples and that some people confuse with averages, including the 'median' and 'mode'. Other simple statistical analyses use measures of spread, such as the range, interquartile range, or standard deviation. For a real-valued random variable X, the mean is the expectation of X. Note that not every probability distribution has a defined mean (or variance); see the Cauchy distribution for an example.

For a data set, the mean is the sum of the values divided by the number of values. The mean of a set of numbers x1, x2, ..., xn is typically denoted by x̄, pronounced "x bar". This mean is a type of arithmetic mean. If the data set is based on a series of observations obtained by sampling a statistical population, this mean is termed the "sample mean" to distinguish it from the "population mean". The mean is often quoted along with the standard deviation: the mean describes the central location of the data, and the standard deviation describes the spread. An alternative measure of dispersion is the mean deviation, equivalent to the average absolute deviation from the mean. It is less sensitive to outliers, but less mathematically tractable.

If a series of observations is sampled from a larger population (measuring the heights of a sample of adults drawn from the entire world population, for example), or from a probability distribution which gives the probabilities of each possible result, then the larger population or probability distribution can be used to construct a "population mean", which is also the expected value for a sample drawn from this population or probability distribution. For a finite population, this would simply be the arithmetic mean of the given property for every member of the population. For a probability distribution, this would be a sum or integral over every possible value weighted by the probability of that value.
It is a universal convention to represent the population mean by the symbol μ.[1] In the case of a discrete probability distribution, the mean of a discrete random variable x is given by taking the product of each possible value of x and its probability P(x), and then adding all these products together, giving μ = Σ x P(x).[2]

The sample mean may differ from the population mean, especially for small samples, but the law of large numbers dictates that the larger the size of the sample, the more likely it is that the sample mean will be close to the population mean.[3]

As well as in statistics, means are often used in geometry and analysis; a wide range of means have been developed for these purposes which are not much used in statistics. These are listed below.
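The law of large numbers mentioned above is easy to see numerically (this illustration is ours, not from the article): the mean of many rolls of a fair die approaches the population mean of 3.5.

```python
import random

rng = random.Random(1)

def sample_mean(n):
    """Mean of n rolls of a fair six-sided die (population mean is 3.5)."""
    rolls = [rng.randint(1, 6) for _ in range(n)]
    return sum(rolls) / n

small, large = sample_mean(10), sample_mean(100_000)
# `large` should lie very close to 3.5; `small` can wander much further.
```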
Contents

1 Examples of means
  1.1 Arithmetic mean (AM)
  1.2 Geometric mean (GM)
  1.3 Harmonic mean (HM)
  1.4 Relationship between AM, GM, and HM
  1.5 Generalized means
    1.5.1 Power mean
    1.5.2 f-mean
  1.6 Weighted arithmetic mean
  1.7 Truncated mean
  1.8 Interquartile mean
  1.9 Mean of a function
  1.10 Mean of a probability distribution
  1.11 Mean of angles
  1.12 Fréchet mean
  1.13 Other means
2 Properties
  2.1 Weighted mean
  2.2 Unweighted mean
  2.3 Convert unweighted mean to weighted mean
  2.4 Means of tuples of different sizes
3 Population and sample means
4 See also
5 References
6 External links
Arithmetic mean (AM)

The arithmetic mean is the "standard" average, often simply called the "mean". It is often confused with the median, mode or range. The mean is the arithmetic average of a set of values, or distribution; however, for skewed distributions, the mean is not necessarily the same as the middle value (median) or the most likely value (mode). For example, mean income is skewed upwards by a small number of people with very large incomes, so that the majority have an income lower than the mean. By contrast, the median income is the level at which half the population is below and half is above. The mode income is the most likely income, and favors the larger number of people with lower incomes. The median or mode are often more intuitive measures of such data. Nevertheless, many skewed distributions are best described by their mean, such as the exponential and Poisson distributions.

For example, the arithmetic mean of the six values 34, 27, 45, 55, 22, 34 is

(34 + 27 + 45 + 55 + 22 + 34) / 6 = 217/6 ≈ 36.167
Geometric mean (GM)

The geometric mean is an average that is useful for sets of positive numbers that are interpreted according to their product and not their sum (as is the case with the arithmetic mean), e.g. rates of growth. For example, the geometric mean of the six values 34, 27, 45, 55, 22, 34 is

(34 × 27 × 45 × 55 × 22 × 34)^(1/6) ≈ 34.5
Harmonic mean (HM)

The harmonic mean is an average which is useful for sets of numbers which are defined in relation to some unit, for example speed (distance per unit of time). For example, the harmonic mean of the six values 34, 27, 45, 55, 22, and 34 is

6 / (1/34 + 1/27 + 1/45 + 1/55 + 1/22 + 1/34) ≈ 33.018
Relationship between AM, GM, and HM

AM, GM, and HM satisfy these inequalities:

AM ≥ GM ≥ HM

Equality holds only when all the elements of the given sample are equal.
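The three means of the six example values, and their ordering, can be checked directly (a small verification sketch of ours):

```python
from math import prod

values = [34, 27, 45, 55, 22, 34]
n = len(values)

am = sum(values) / n                  # arithmetic mean, about 36.2
gm = prod(values) ** (1 / n)          # geometric mean, about 34.5
hm = n / sum(1 / x for x in values)   # harmonic mean, about 33.0
```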
Generalized means

Power mean

The generalized mean, also known as the power mean or Hölder mean, is an abstraction of the quadratic, arithmetic, geometric and harmonic means. It is defined for a set of n positive numbers xi by

M(m) = ( (1/n) Σ i=1..n xi^m )^(1/m)

By choosing the appropriate value for the parameter m we get:

m → ∞: maximum
m = 2: quadratic mean
m = 1: arithmetic mean
m → 0: geometric mean (as a limit)
m = −1: harmonic mean
m → −∞: minimum
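The power-mean definition and its special cases can be sketched as follows (the function name is our own; the m = 0 case is handled as the geometric-mean limit):

```python
from math import exp, log

def power_mean(values, m):
    """Generalized (power/Hölder) mean with exponent m."""
    n = len(values)
    if m == 0:
        # the limit m -> 0 is the geometric mean
        return exp(sum(log(x) for x in values) / n)
    return (sum(x ** m for x in values) / n) ** (1 / m)

data = [34, 27, 45, 55, 22, 34]
# power_mean is non-decreasing in m:
# quadratic >= arithmetic >= geometric >= harmonic
```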
f-mean
This can be generalized further as the generalized f-mean

M = f^(−1)( (1/n) Σ i=1..n f(xi) )

and again a suitable choice of an invertible f will give particular means:

f(x) = x: arithmetic mean
f(x) = 1/x: harmonic mean
f(x) = log x: geometric mean
f(x) = x^m: power mean
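A sketch of the generalized f-mean (the helper is our own): apply an invertible f, average, and map back; particular choices of f recover the familiar means.

```python
from math import exp, log

def f_mean(values, f, f_inv):
    """Generalized f-mean: f_inv of the average of f(x_i)."""
    return f_inv(sum(f(x) for x in values) / len(values))

data = [2.0, 8.0]
geometric = f_mean(data, log, exp)                         # sqrt(2 * 8) = 4
harmonic = f_mean(data, lambda x: 1 / x, lambda y: 1 / y)  # 2/(1/2 + 1/8) = 3.2
```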
Mean of a function

In calculus, and especially multivariable calculus, the mean of a function is loosely defined as the average value of the function over its domain. In one variable, the mean of a function f(x) over the interval (a, b) is defined by

f̄ = (1/(b − a)) ∫ a..b f(x) dx

(See also the mean value theorem.) In several variables, the mean over a relatively compact domain U in a Euclidean space is defined by

f̄ = (1/Vol(U)) ∫ U f

This generalizes the arithmetic mean. On the other hand, it is also possible to generalize the geometric mean to functions by defining the geometric mean of f to be

exp( (1/(b − a)) ∫ a..b log f(x) dx )

More generally, in measure theory and probability theory, either sort of mean plays an important role. In this context, Jensen's inequality places sharp estimates on the relationship between these two different notions of the mean of a function. There is also a harmonic average of functions and a quadratic average (or root mean square) of functions.
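The mean of a function over an interval can be approximated numerically; this sketch (our own) uses a midpoint Riemann sum:

```python
def function_mean(f, a, b, steps=100_000):
    """Approximate (1/(b-a)) * integral of f over (a, b) via the midpoint rule."""
    h = (b - a) / steps
    # average of the function sampled at the midpoint of each subinterval
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) / steps

# The mean of f(x) = x^2 over (0, 1) is exactly 1/3.
m = function_mean(lambda x: x * x, 0.0, 1.0)
```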
Other means

- Arithmetic-geometric mean
- Arithmetic-harmonic mean
- Cesàro mean
- Chisini mean
- Contraharmonic mean
- Elementary symmetric mean
- Geometric-harmonic mean
- Heinz mean
- Heronian mean
- Identric mean
- Lehmer mean
- Logarithmic mean
- Median
- Moving average
- Root mean square
- Stolarsky mean
- Weighted geometric mean
- Weighted harmonic mean
- Rényi's entropy (a generalized f-mean)
Properties
All means share some properties and additional properties are shared by the most common means. Some of these properties are collected here.
- "Fixed point": M(1, 1, ..., 1) = 1
- Homogeneity: M(λx1, ..., λxn) = λ M(x1, ..., xn) for all λ and xi. In vector notation: M(λx) = λ Mx for all n-vectors x.
- Monotonicity: if xi ≤ yi for each i, then Mx ≤ My.

From monotonicity it follows:

- Boundedness: min x ≤ Mx ≤ max x
- Continuity is not guaranteed: there are means which are not differentiable. For instance, the maximum of a tuple is considered a mean (as an extreme case of the power mean, or as a special case of a median), but is not differentiable.

All means listed above, with the exception of most of the generalized f-means, satisfy the presented properties:

- If f is bijective, then the generalized f-mean satisfies the fixed point property.
- If f is strictly monotonic, then the generalized f-mean also satisfies the monotonicity property.
- In general, a generalized f-mean will not be homogeneous.
Weighted mean

The above properties imply techniques to construct more complex means: if C, M1, ..., Mm are weighted means and p is a positive real number, then A and B defined by

A x = C(M1 x, ..., Mm x)
B x = ( C(x1^p, ..., xn^p) )^(1/p)

are also weighted means.
Unweighted mean

Intuitively speaking, an unweighted mean is a weighted mean with equal weights. Since our definition of weighted mean above does not expose particular weights, equal weights must be asserted in a different way. One view of homogeneous weighting is that the inputs can be swapped without altering the result. Thus we define M to be an unweighted mean if it is a weighted mean and, for each permutation of inputs, the result is the same.
Symmetry: Mx = M(P x) for all n-tuples x and permutations P on n-tuples.
Convert unweighted mean to weighted mean

Analogously to the weighted means, if C is a weighted mean and M1, ..., Mm are unweighted means and p is a positive real number, then A and B defined by

A x = C(M1 x, ..., Mm x)
B x = ( C(x1^p, ..., xn^p) )^(1/p)

are also unweighted means.
Means of tuples of different sizes

Given an arbitrary tuple x which is partitioned into y1, ..., yk, then

M x ∈ Conv( M y1, ..., M yk )

(See Convex hull.)
Median
In probability theory and statistics, a median is described as the numeric value separating the higher half of a sample, a population, or a probability distribution from the lower half. The median of a finite list of numbers can be found by arranging all the observations from lowest value to highest value and picking the middle one. If there is an even number of observations, then there is no single middle value; the median is then usually defined to be the mean of the two middle values.[1][2]

In a sample of data, or a finite population, there may be no member of the sample whose value is identical to the median (in the case of an even sample size), and, if there is such a member, there may be more than one, so that the median may not uniquely identify a sample member. Nonetheless, the value of the median is uniquely determined with the usual definition. A related concept, in which the outcome is forced to correspond to a member of the sample, is the medoid.

At most half the population have values less than the median, and at most half have values greater than the median. If both groups contain less than half the population, then some of the population is exactly equal to the median. For example, if a < b < c, then the median of the list {a, b, c} is b, and if a < b < c < d, then the median of the list {a, b, c, d} is the mean of b and c; i.e., it is (b + c)/2.

The median can be used as a measure of location when a distribution is skewed, when end-values are not known, or when one requires reduced importance to be attached to outliers, e.g., because they may be measurement errors. A disadvantage of the median is the difficulty of handling it theoretically.[citation needed]
Contents

1 Notation
2 Measures of statistical dispersion
3 Medians of probability distributions
  3.1 Medians of particular distributions
4 Medians in descriptive statistics
5 Theoretical properties
  5.1 An optimality property
  5.2 An inequality relating means and medians
6 The sample median
  6.1 Efficient computation of the sample median
  6.2 Easy explanation of the sample median
    6.2.1 For an odd number of values
    6.2.2 For an even number of values
7 Other estimates of the median
8 Median-unbiased estimators, and bias with respect to loss functions
9 In image processing
10 In multidimensional statistical inference
11 History
12 See also
13 References
14 External links
Notation
The median of some variable x is denoted either as x̃ or as med(x).[3]
Medians of particular distributions

The medians of certain types of distributions can be easily calculated from their parameters:

- The median of a Cauchy distribution with location parameter x0 and scale parameter y is x0, the location parameter.
- The median of an exponential distribution with rate parameter λ is the natural logarithm of 2 divided by the rate parameter: λ^(−1) ln 2.
- The median of a Weibull distribution with shape parameter k and scale parameter λ is λ(ln 2)^(1/k).
Efficient computation of the sample median

Even though sorting n items requires O(n log n) operations, selection algorithms can compute the kth smallest of n items (e.g., the median) with only O(n) operations.[4]
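A standard selection algorithm is quickselect, which runs in expected (average-case) linear time; a guaranteed worst-case O(n) bound needs the more involved median-of-medians pivot rule. A minimal sketch (our own implementation):

```python
import random

def quickselect(items, k, rng=random.Random(0)):
    """Return the kth smallest element of items (k is 0-based),
    in expected O(n) time."""
    pivot = rng.choice(items)
    lows = [x for x in items if x < pivot]      # strictly below the pivot
    pivots = [x for x in items if x == pivot]   # equal to the pivot
    highs = [x for x in items if x > pivot]     # strictly above the pivot
    if k < len(lows):
        return quickselect(lows, k, rng)
    if k < len(lows) + len(pivots):
        return pivot
    return quickselect(highs, k - len(lows) - len(pivots), rng)

# Median of an odd-length list is the (n // 2)th smallest element.
data = [1, 5, 2, 8, 7]
med = quickselect(data, len(data) // 2)   # 5
```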
Easy explanation of the sample median

For an odd number of values

As an example, we will calculate the sample median for the following set of observations: 1, 5, 2, 8, 7. Start by sorting the values: 1, 2, 5, 7, 8. In this case, the median is 5, since it is the middle observation in the ordered list. The median is the ((n + 1)/2)th item, where n is the number of values. For example, for the list {1, 2, 5, 7, 8}, we have n = 5, so the median is the ((5 + 1)/2)th item:
median = (6/2)th item
median = 3rd item
median = 5

For an even number of values
As an example, we will calculate the sample median for the following set of observations: 1, 5, 2, 8, 7, 2. Start by sorting the values: 1, 2, 2, 5, 7, 8. In this case, the median is the average of the two middle terms: (2 + 5)/2 = 3.5.

We can also use the formula median = ((n + 1)/2)th item, where n is the number of values. For the example 1, 2, 2, 5, 7, 8 we have n = 6, so the median is the ((6 + 1)/2)th item = 3.5th item. The 3rd item is 2 and the 4th item is 5, so we interpolate halfway between them:

median = 2 + 0.5 × (4th item − 3rd item)
median = 2 + 0.5 × (5 − 2)
median = 2 + 0.5 × 3
median = 2 + 1.5
median = 3.5
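Both worked examples above can be reproduced with a short function (our own) that handles the odd and even cases:

```python
def sample_median(values):
    """Middle value for odd n; mean of the two middle values for even n."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(sample_median([1, 5, 2, 8, 7]))      # odd case from the text: 5
print(sample_median([1, 5, 2, 8, 7, 2]))   # even case from the text: 3.5
```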
Other estimates of the median

The median can also be estimated by fitting parametric distributions to the data and calculating the theoretical median of the fitted distribution. See, for example, Pareto interpolation.
In image processing

In monochrome raster images there is a type of noise, known as salt-and-pepper noise, in which each pixel independently becomes black (with some small probability) or white (with some small probability), and is otherwise unchanged (with probability close to 1). An image constructed of the median values of neighborhoods (like a 3×3 square) can effectively reduce noise in this case.
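A 3×3 median filter of the kind described can be sketched as follows (our own minimal implementation; border pixels are left untouched for simplicity):

```python
def median_filter(image):
    """Replace each interior pixel with the median of its 3x3 neighbourhood."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            window = sorted(image[i + di][j + dj]
                            for di in (-1, 0, 1) for dj in (-1, 0, 1))
            out[i][j] = window[4]        # median of the 9 values
    return out

# A flat grey image with one "salt" pixel; the filter removes the outlier.
img = [[100] * 5 for _ in range(5)]
img[2][2] = 255
cleaned = median_filter(img)             # cleaned[2][2] is back to 100
```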
Mode

In statistics, the mode is the value that occurs most frequently in a data set or a probability distribution.[1] In some fields, notably education, sample data are often called scores, and the sample mode is known as the modal score.[2]

Like the statistical mean and the median, the mode is a way of capturing important information about a random variable or a population in a single quantity. The mode is in general different from the mean and median, and may be very different for strongly skewed distributions. The mode is not necessarily unique, since the same maximum frequency may be attained at different values. The most ambiguous case occurs in uniform distributions, wherein all values are equally likely.
Contents

1 Mode of a probability distribution
2 Mode of a sample
3 Comparison of mean, median and mode
  3.1 When do these measures make sense?
  3.2 Uniqueness and definedness
  3.3 Properties
  3.4 Example for a skewed distribution
4 See also
5 References
6 External links
In symmetric unimodal distributions, such as the normal (or Gaussian) distribution (the distribution whose density function, when graphed, gives the famous "bell curve"), the mean (if defined), median and mode all coincide. For samples, if it is known that they are drawn from a symmetric distribution, the sample mean can be used as an estimate of the population mode.
The algorithm first sorts the sample in ascending order. It then computes the discrete derivative of the sorted list and finds the indices where this derivative is positive. Next it computes the discrete derivative of this set of indices, locates the maximum of this derivative of indices, and finally evaluates the sorted sample at the point where that maximum occurs, which corresponds to the last member of the stretch of repeated values.
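The run-detection idea described above can be sketched like this (our own implementation; when several values tie for most frequent, it returns just one of them):

```python
def sample_mode(sample):
    """Mode via sorted runs: sort, mark where the value changes, and take
    the value that ends the longest run of repeats."""
    s = sorted(sample)
    # positions where the discrete derivative of the sorted list is positive,
    # padded with sentinels for both ends
    changes = [-1] + [i for i in range(len(s) - 1) if s[i + 1] > s[i]]
    changes.append(len(s) - 1)
    # run lengths are the gaps between successive change positions
    run_len, end = max((changes[j + 1] - changes[j], changes[j + 1])
                       for j in range(len(changes) - 1))
    return s[end]

print(sample_mode([1, 2, 2, 3, 4, 7, 9]))   # 2
```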
Mean    Total sum divided by number of values                          (1+2+2+3+4+7+9)/7      4
Median  Middle value that separates the greater and lesser halves      1, 2, 2, 3, 4, 7, 9    3
Mode    Most frequent number in a data set                             1, 2, 2, 3, 4, 7, 9    2
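The three results for the example data set 1, 2, 2, 3, 4, 7, 9 can be verified with Python's standard library:

```python
from statistics import mean, median, mode

data = [1, 2, 2, 3, 4, 7, 9]
print(mean(data), median(data), mode(data))
```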
Properties
Assuming definedness, and for simplicity uniqueness, the following are some of the most interesting properties.
- All three measures have the following property: if the random variable (or each value from the sample) is subjected to the linear or affine transformation which replaces X by aX + b, then the mean, median and mode are transformed in the same way.
- However, if there is an arbitrary monotonic transformation, only the median follows; for example, if X is replaced by exp(X), the median changes from m to exp(m), but the mean and mode won't.
- Except for extremely small samples, the mode is insensitive to "outliers" (such as occasional, rare, false experimental readings). The median is also very robust in the presence of outliers, while the mean is rather sensitive.
- In continuous unimodal distributions the median lies, as a rule of thumb, between the mean and the mode, about one third of the way going from mean to mode. In a formula, median ≈ (2 × mean + mode)/3. This rule, due to Karl Pearson, often applies to slightly non-symmetric distributions that resemble a normal distribution, but it is not always true and in general the three statistics can appear in any order.[3][4]
- For unimodal distributions, the mode is within √3 standard deviations of the mean, and the root mean square deviation about the mode is between the standard deviation and twice the standard deviation.[5]
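Pearson's rule of thumb can be checked on a concrete skewed distribution; for the exponential distribution with rate 1 the mean, median and mode are known exactly (this check is ours, not from the article):

```python
from math import log

# Exponential distribution with rate 1: mean = 1, median = ln 2, mode = 0.
mean_, median_, mode_ = 1.0, log(2), 0.0

# Rule of thumb: median is approximately (2 * mean + mode) / 3.
pearson_estimate = (2 * mean_ + mode_) / 3   # about 0.667 vs the true 0.693
```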
Indeed, the median is about one third of the way from the mean to the mode.