Statistics

Research ideas under development

 

A new graded concept of randomness and probability: We introduce a new concept of statistical randomness. Randomness in statistics depends on the observer: it is due to the lack of information available to the cognitive observer. If we were in the place of the creator we might not need the concept of randomness and statistical probability at all. But we observe from the point of view of a human mind, and much of the information about the causes of a phenomenon is missing when we observe it. Traditionally in statistics the concept of randomness is defined through a large sample of independent, identically distributed repeated events. The sample defines in its turn the histogram and the probability distribution, and therefore the statistical frequencies or probabilities of the various events. It is often criticized that there is no such thing as independent and identically distributed events in physical reality. But sometimes a simplification is a fair start. We nevertheless introduce a graded concept of randomness and statistical probability that accounts for the whole range between strongly non-independent, non-identically distributed events in physical reality and the simplification of independent and identically distributed events. The key is that any sample is always a stratified or layered sample, from the single event up to the largest available number of repetitions of events. In other words we have a sample of samples, and each element of the latter sample is again a sample, down to the final one-element samples or events. At the bottom layer of samples of single events we get, for each such sample, a probability distribution, which is nevertheless bounded by the permitted size of samples. For larger samples we have to account for samples of samples (2nd layer), and therefore already the 1st-layer statistical frequencies or probabilities become random variables, which permits the initial probability distribution to vary randomly; and so on up to the largest available sample of samples. It is obvious that if, for example, we have a largest possible sampling of N repetitions of events, then the above scheme is definitely different from the flat, straight-jacket concept that for this large N there are always fixed statistical frequencies (probabilities) of occurrence. The stratification or layering of samples requires of course some key sizes for the maximum size of samples in each layer, which have to be introduced by universal concepts of time and space where such statistical experiments take place, e.g. on the surface of the planet Earth. There also holds a principle of "synergy" or "non-totalitarian distribution of information": no level can contain all the randomness or all the probabilistic and statistical information of all levels.
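
A minimal numerical sketch of the "sample of samples" idea may make it concrete; it is an illustration only, and the particular distributions used (Beta for the layer above, Bernoulli for the single events) are assumptions, not part of the formal construction.

```python
# Illustrative sketch of a two-layer (graded) sampling scheme: each bottom-layer
# sample has its own event probability, and that probability is itself a random
# variable at the layer above.  All distributions here are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)

n_layer2 = 200   # number of bottom-layer samples (size of the 2nd-layer sample)
n_layer1 = 50    # size of each bottom-layer sample (the permitted maximum size)

# Layer 2: the 1st-layer statistical frequency is itself random.
p_per_sample = rng.beta(2.0, 2.0, size=n_layer2)

# Layer 1: the single events inside each bottom-layer sample.
events = rng.binomial(n=1, p=p_per_sample[:, None], size=(n_layer2, n_layer1))

freq_per_sample = events.mean(axis=1)   # varies from sample to sample
overall_freq = events.mean()            # the "flat", single-layer estimate

print("spread of 1st-layer frequencies:", round(freq_per_sample.std(), 3))
print("overall frequency (flat view)  :", round(overall_freq, 3))
# Under the flat i.i.d. picture the spread would be about sqrt(p(1-p)/n_layer1);
# here it is visibly larger, because the layer above injects extra randomness.
```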

In the next paragraphs we apply this idea to stochastic processes and time series, and we obtain new concepts of statistical forecasting.

 

 

 

In the next paper we formulate the idea that forecasting, e.g. with a time series, should be carried out in more than one scale, so as to include short-, mid- and long-term scales, levels or layers. Since each scale gives by estimation a different model, it is not possible to assign a single model to the phenomenon, unless we resort to the concept of random variables of higher stochastic order, that is, random variables whose expected value (1st moment) and other moments are themselves another (hidden) stochastic variable, and so on. Both the top-down (hidden) influence of coarser resolutions and the bottom-up (hidden) influence of finer resolutions are included in the forecasting at a particular resolution through the old technique of Bayes estimators. Both techniques coexist in a middle-out technique, or in a top-and-bottom-to-middle (middle-in) technique. In both cases the higher-order moments of the joint densities are defined by the evolution at coarser resolutions (top-down approach) and by the evolution at finer resolutions (bottom-up approach). We show that with this multi-scale and, in a sense, multi-model technique with level-wise equations, multi-resolution analysis (MRA) and abstract self-similarity relative to statistical properties, the forecasting at a particular focus scale has less error than the classical single-resolution Box-Jenkins method estimated at the highest resolution and then propagated to the focus resolution.

We define the joint probability densities of two different levels or resolutions relative to a third, observational level. This defines the vertical stochastic causality or interaction among levels, which depends on the observation level too. Of course the vertical causality includes how the sequence of resolutions is chosen (which MRA usually defines as a multiplicative progression of bin sizes). This stochastic causality is different from the horizontal stochastic causality at a single focus level, which is the usual one in stochastic processes. The vertical stochastic causality splits into a) how the "law" of the process varies among resolution levels (e.g. see HLM below) and b) what does not change among the resolution levels (self-similarity; see MRA below). The whole situation is both more abstract and more sophisticated than simple self-similarity and intermittency. The two causalities coexist and have interesting lines of interpretation.

Although such a quantitative conceptualization may seem new in statistics, in its qualitative version it is today a widespread concept. These concepts, in their qualitative rather than quantitative version, have already been introduced in logic and formal languages by Russell and others since the beginning of the 20th century, but they were known only to a small population of experts. They have also become known to a wider population, in an informal way, during the last five decades and at the end of the 20th century, in the sciences of management and in the recent object-oriented programming languages of computer science. There is an interesting article and small philosophical treatise by A. Koestler (1944), with the title "The Yogi and the Commissar II", in his book with the title "The Yogi and the Commissar", where he discusses the concept of level-wise world causality and of freedom in the events described by the sciences (physics, chemistry, biology, social sciences, politics, psychology, metaphysics, etc.).
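
A minimal sketch of a random variable of higher stochastic order, and of the Bayes-estimator coupling between a coarser and a finer level, may help fix the ideas; the normal-normal conjugate model and all numerical values below are assumptions chosen only for illustration, not the construction of the paper.

```python
# Sketch of a "higher stochastic order" random variable and of the Bayes
# coupling between levels: the mean of the fine-level observations is itself
# a random variable governed by a coarser level.
import numpy as np

rng = np.random.default_rng(1)

mu_top, tau = 10.0, 2.0      # coarse level: distribution of the hidden mean
sigma, n = 5.0, 8            # fine level: noise standard deviation, sample size

theta = rng.normal(mu_top, tau)          # hidden mean (2nd stochastic order)
x = rng.normal(theta, sigma, size=n)     # fine-level observations

# Bayes (posterior mean) estimator: top-down prior plus bottom-up data.
w = (n / sigma**2) / (n / sigma**2 + 1 / tau**2)
theta_bayes = w * x.mean() + (1 - w) * mu_top

print("hidden mean       :", round(theta, 3))
print("fine-level mean   :", round(x.mean(), 3))
print("Bayes combination :", round(theta_bayes, 3))
```
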
Here nevertheless we are concerned with statistical causality, which is a more modest form of causality for human knowledge. In addition, the present approach is based on new statistical techniques. The modern distinction between real-time weather forecasting and seasonal weather forecasting is a clear example where the sampling of the data at different time scales requires, as causality, a different system of equations. The principles we are led to are more sophisticated and involve less reductionism, even compared with those conceived by Koestler in his time. At each level it is assumed that the description can be done with the standard classical approach, which is always under some "principle of simplicity" (e.g. for ARMA(p,q) models, the Box-Jenkins parsimony of low orders p, q = 0, 1, 2). This principle of simplicity, in the context of a functional linear space of paths, is as if we projected a large space of paths onto a smaller linear space. In the context of spectral analysis it would correspond to a band-pass filter that reduces the overall process to a simpler narrow-band process centered at a particular range of frequencies.

Another very important principle in stochastic estimation that we introduce is the "principle of a sufficiently large sample of paths of sufficiently large length". By this we mean that we accept a stochastic estimation about a stochastic process only if it is derived from a sufficiently large sample of paths. E.g. for economic stochastic estimations, many hundreds of paths of at least 20 years are required. The usual technique in most academic research is unfortunately to use only one path (a one-element sample of sample paths). Only one path is used because only one path is available, e.g. when the prices of some commodity, or the level of the gross national product, etc., are recorded. But this leads to highly vague and essentially risky estimations. Many different models can be fitted to a single path, all with adequate goodness of fit, and then completely different methods would have to be devised to discriminate the realistic from the unrealistic. Academic researchers, being usually interested mainly in their career development, find it convenient to have the freedom and ease of fitting many different models to the same data, just for reasons of publication, and they are not much concerned about how real and true their models might be. This unfortunately supports the accusation that with statistics one can present lies in very sophisticated and scientific clothes. The methodology of using only one path in the estimation of a model is analogous to trying to fit the distribution of a random variable on a sample of only one observation: many claims can be supported, that it is a normal random variable, that it is exponential, that it is Weibull, etc., simply because the data are not enough!

In addition, I have found in many books a common and classical mistake: when estimating the autocovariance sequence of a time series which is, e.g., assumed to be ARMA(p,q), they use the standard formula of the sample covariance and apply it in a moving way (on observations n, n-1, n-2, ..., then on n-1, n-2, n-3, ..., then on n-2, n-3, n-4, etc.), forgetting that there is dependence with memory of order p, so that the sample is not an independent i.i.d. sample and the formula of the sample covariance is not really valid. The correct approach is to use either partitions into intervals of size larger than p, and/or many independent parallel paths.
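
The estimation point about the autocovariance can be checked numerically. The sketch below uses an AR(1) process purely as an illustrative example, and compares the naive overlapping single-path estimate with an estimate taken across many independent parallel paths.

```python
# Lag-1 autocovariance: overlapping products on one dependent path versus
# products across many independent parallel paths (a genuine i.i.d. sample).
import numpy as np

rng = np.random.default_rng(2)
phi, sigma, T, n_paths = 0.7, 1.0, 60, 400

def ar1_path(T):
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = phi * x[t - 1] + rng.normal(0.0, sigma)
    return x

paths = np.array([ar1_path(T) for _ in range(n_paths)])

# (a) Naive single-path estimate from overlapping lag-1 products.
x = paths[0]
gamma1_single = np.mean((x[1:] - x.mean()) * (x[:-1] - x.mean()))

# (b) Estimate across independent parallel paths at two fixed times.
t0 = T - 1
gamma1_parallel = np.mean(
    (paths[:, t0] - paths[:, t0].mean()) * (paths[:, t0 - 1] - paths[:, t0 - 1].mean())
)

true_gamma1 = phi * sigma**2 / (1 - phi**2)
print("true lag-1 autocovariance:", round(true_gamma1, 3))
print("single overlapping path  :", round(gamma1_single, 3))
print("many independent paths   :", round(gamma1_parallel, 3))
# The parallel-path figure rests on a genuine i.i.d. sample of pairs; the
# single-path figure does not, so i.i.d. formulas cannot judge its variability.
```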

The lines of quantitative interpretation include the following principles:

Based on a set of data for a phenomenon:

0) Events are never defined at one level only, although we may attribute to them a focus level and a spectrum of significant neighboring levels on which the event has a trace or counterpart. (Multi-level events: no event is single-leveled.)

1) Information across all levels can be projected as information on a single focus level only, which can be the bottom, the top, or a middle level, always without violating the principle of simplicity. Nevertheless this is only the partial, classical description. The standard one, in this approach, is that the distribution of the data-information among the levels is defined so that no level has the information of all the other levels, without violating the principle of simplicity. It is possible, nevertheless, to make an inclusive ordering of the data-information, which can be bottom-up, top-down or middle-out, by violating the principle of simplicity. If the reduction is done without violating the principle of simplicity, it does not give an adequate description of the horizontal and vertical stochastic causalities, which remain non-reducible. Forecasting under the principle of simplicity at a single level (e.g. with ARMA(p,q) models and the Box-Jenkins parsimony of low orders p, q = 0, 1, 2) would lead to rather clumsy forecasting with a much higher forecasting error. (Law of fair or "non-totalitarian" distribution of information across levels. I am told that Buckminster Fuller repeatedly used and coined the term "SYNERGY" for systems thinking with this "democratic" or non-reductionist meaning: no part (here, level) can contain the information for a forecasting of the whole (here, all levels).)

2) Horizontal stochastic causality and innovation exist and enter, at each level, in a basically separate and different way from that of the other levels, describable within the principle of simplicity. (Horizontal stochastic causality is always level-wise. E.g. with ARMA(p,q) models we would require different coefficients and different p, q for different levels; a small numerical sketch follows after this list.)

3) The inter-level stochastic causality relates and gives the interplay of the different horizontal stochastic causalities, and is always defined on a reference observation level. (There is also a vertical inter-level dependence of the horizontal level-wise causalities, over a reference binding level. With ARMA(p,q) models of p, q = 0, 1, 2, for example, this would give a recursive way to integrate all levels into a single stochastic process. If the integration is a whole class of stochastic processes, in other words something more abstract, it is called a stochastic scheme.)

4) Horizontal stochastic causalities of levels other than the focus level, smaller or larger, have a corresponding horizontal causality of higher stochastic order on the focus level, after an appropriate nested enhancement of the information of the focus level. (Orders of hidden horizontal causalities may correspond to the horizontal causalities of other levels as projected to the focus level. E.g. with ARMA(p,q), p, q = 0, 1, 2 models, if we are to project all the information of the other levels onto a single level we must violate the principle of simplicity (p, q <= 2) and we would create a higher-order ARMA.)

5) The horizontal causality of a focus level is not derivable from the horizontal causality of other levels, larger or smaller, together with the inter-level interplay. (Law of irreducibility of a horizontal, level-wise, 1st-stochastic-order causality to smaller- or larger-scale causality and inter-level interplay. E.g. with ARMA(p,q), p, q = 0, 1, 2 models at different levels, if we integrate at a single focus level by a higher-order ARMA(p,q), the latter cannot be derived from the ARMA(p,q) models of each level alone, without some binding inter-scale law, which is extra information.)
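
As noted in principle (2), here is a small numerical sketch of the level-wise point: the same data, aggregated to coarser sampling levels, is best fitted by simple models with different coefficients at each level. The AR(1) fitted by lag-1 least squares is an illustrative stand-in for the ARMA(p,q) models mentioned in the principles.

```python
# Fit a simple one-coefficient model at each aggregation (sampling) level and
# observe that the estimated coefficient differs from level to level.
import numpy as np

rng = np.random.default_rng(3)
T, phi = 4096, 0.9
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi * x[t - 1] + rng.normal()

def ar1_coefficient(y):
    """Least-squares estimate of the lag-1 coefficient."""
    y = y - y.mean()
    return float(np.dot(y[1:], y[:-1]) / np.dot(y[:-1], y[:-1]))

for level, bin_size in enumerate([1, 4, 16, 64]):
    # Aggregate (average) over non-overlapping bins: one sampling level.
    coarse = x[: T - T % bin_size].reshape(-1, bin_size).mean(axis=1)
    print(f"level {level} (bin {bin_size:3d}): AR(1) coefficient "
          f"= {ar1_coefficient(coarse):.3f}")
```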

 

 

 

The representation of the phenomenon

This technique, in its main ideas but with all its details, could be combined with pattern-recognition statistical forecasting or robust non-parametric statistical forecasting. This means that although we may accept a finite memory in the process, we need not assume stationarity, nor even time invariance of the partial correlations. We assume only time invariance of the conditional distributions (conditional on the values of the memory horizon), which may be different for different patterns of values. Such a class of processes requires a particular sampling technique for its statistics, resembling the matching technique in sampling; it is known that matching increases the power of hypothesis tests. We can of course use a parametric approach with multilevel statistical models, multilevel random coefficient models (MRCM) and hierarchical linear models (HLM). The technique of creating the next sampling-granulation level and the next equation (causality) by varying parameters that were constant within the previous sampling-granulation level of the previous equation (or by summarizing varying parameters into a random constant), with a new equation sampled on the next granulation level, can be used bottom-up, top-down, or both at once (middle-out, middle-in), to enhance the single-level model. There is easily accessible software (statistical packages) on the Internet for MRCM, HLM and multilevel models (e.g. HLM, BIRAM, BMDP, BUGS, EGRET, GENSTAT, ML3, VARCL, SABRE, SAS; these should be used together with SPSS, Minitab, Lisrel, etc. For relevant books see H. Goldstein, "Multilevel Statistical Models", Arnold, 1995, and N. T. Longford, "Random Coefficient Models", Oxford University Press, 1993). We may enhance the traditional HLM, which contains only horizontal linear equations, with vertical linear equations as well and an appropriate further sampling technique on the same sequence of data. In multi-resolution analysis this is taken care of by the self-similarity property. The linear equations have the exploratory meaning of a partial-correlation structure, which always exists in random variables, rather than an ad hoc confirmatory meaning. Resorting also to structural equation modeling (SEM) and factor analysis, for the vertical equations only, we may explain inter-scale causality and the clustering of different scales as factors. Hierarchical linear models permit the combination of the best-fit simple linear models for forecasting at different sampling-granulation scales (resolution levels) into a single vector stochastic process. Although at each level the equations are linear, overall the system is not a linear system of equations but rather a system of multivariate polynomials of higher order, the polynomial order increasing with the number of levels entering. It can be proved that with such a way of combining linear models, the forecasting error at each level is less than the error that would result if we extended the best-fit linear forecasting from the finest granulation to the other horizon levels. The author has designed and supervised an MSc dissertation especially so as to prove the above. This paradox is very relevant to Simpson's paradox: a single granulation level or resolution may lead to biased forecasting. The effect of one granulation level on another can also be handled with the Mantel-Haenszel method for confounding; the confounding factor here is the effect of one granulation level, or stratum, on another.
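
A minimal random-coefficient (multilevel) sketch in the spirit of the MRCM/HLM models just cited is given below. The statsmodels MixedLM estimator used here is a modern substitute for the packages listed above (HLM, ML3, VARCL, etc.), and the simulated two-level data are purely illustrative.

```python
# Two-level random-coefficient model: each group has its own intercept and
# slope (level-2 equations) around population values, and observations follow
# a level-1 linear equation within each group.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n_groups, n_per_group = 30, 20

rows = []
for g in range(n_groups):
    a_g = 1.0 + rng.normal(0.0, 0.5)    # level-2: group-specific intercept
    b_g = 2.0 + rng.normal(0.0, 0.3)    # level-2: group-specific slope
    x = rng.uniform(0, 10, n_per_group)
    y = a_g + b_g * x + rng.normal(0.0, 1.0, n_per_group)   # level-1 equation
    rows += [{"group": g, "x": xi, "y": yi} for xi, yi in zip(x, y)]

df = pd.DataFrame(rows)

# Random intercept and random slope for x, grouped by "group".
model = smf.mixedlm("y ~ x", df, groups=df["group"], re_formula="~x")
result = model.fit()
print(result.summary())
```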

The above analysis is also very relevant to the recent developments of wavelet analysis and multi-resolution analysis (MRA) in signal processing and, more generally, in numerical harmonic analysis (see e.g. G. Kaiser, "A Friendly Guide to Wavelets", Birkhäuser, 1994; L. Debnath, "Wavelets and Signal Processing", Birkhäuser, 2003; and in particular the article of A. Benassi, S. Cohen, S. Deguy and J. Istas, "Self-similarity and Intermittency", in the last-mentioned book). Multi-resolution analysis and the resulting wavelet bases form a remarkable new system of techniques (with an algebra of up-sampling and down-sampling operators resembling the above analysis) with impressive success across different scientific disciplines. Although wavelet analysis is concerned mainly with the analysis and composition of a single path (signal), in the above discussion we are interested in stochastic processes, that is, in the statistical properties of a group (sample) of parallel paths. Of course, in particular special cases such groups of paths could be produced by a deterministic dynamical system. The concept of self-similar stochastic processes (e.g. among resolution levels) is a very restrictive condition, as we see it, and in the above discussion we adopt a more flexible approach: that of (partial) self-similarity of a stochastic process with respect to a statistical property, among a sequence of different resolution levels. This statistical property may or may not be a characteristic and defining property of the stochastic process; if it is, then we get the classical definition of a self-similar stochastic process. The sequence of different resolutions over which self-similarity holds is part of what can be called the "vertical law" of the process (see also C. A. Cabrelli and U. M. Molter, "Generalized Self-Similarity", J. Math. Anal. Appl. 230 (1999), 251-260). While the technique of HLM gives ways in which the coefficients of the "law" of the process may change among the resolution levels, the partial self-similarity shows what does not change among the resolution levels.
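
The following sketch tracks one statistical property across a sequence of resolution levels, in the spirit of the partial self-similarity just described. The Haar-style averaging and differencing is an assumed, simplest choice of MRA; a full wavelet analysis (e.g. with PyWavelets) would refine it, and the random-walk path is only a toy signal.

```python
# Track one statistical property (variance of the detail coefficients) across
# a sequence of resolution levels produced by Haar averaging/differencing.
import numpy as np

rng = np.random.default_rng(5)
x = np.cumsum(rng.normal(size=2**12))    # a random-walk path as a toy signal

level = 0
while x.size >= 4:
    pairs = x.reshape(-1, 2)
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)   # Haar detail
    x = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)        # Haar approximation
    level += 1
    print(f"level {level:2d}: detail variance = {detail.var():10.2f}")
# For an exactly self-similar process the detail variance scales by a fixed
# factor per level; "partial" self-similarity asks this of some property only.
```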

 

Such substantial improvements in forecasting are of course based on a much more sophisticated multiple analysis of the data, and may have interesting consequences in statistical forecasting of weather, earthquakes, medical phenomena, ecological phenomena and social phenomena, and in combining their real-time forecasting with seasonal probabilistic forecasting. E.g. for the forecasting of earthquakes we may assume event processes whose hazard rate follows a hierarchical linear model with random-coefficient level-wise equations. Each equation is like a linear difference equation whose time steps or space-granulation steps are at different time and space scales. Thus, for example, although at a small time scale the hazard rate may be uniform, at a different time scale it may have cycles. Such cycles (e.g. depending on the cycles of motion of pieces of the earth's solid surface, or on the rotation of the moon and its tidal effects on the earth) in their turn, and at a different scale (a new level-wise equation on the coefficients of the previous one), have fluctuations in their period or amplitude. (These fluctuations might be due to the sun's 11- or 22-year cycles, or to the alignment of many planets in the same direction, sun or moon eclipses, the times of 21 June / 22 December when the earth's orbital motion changes direction toward or away from the sun, etc.) Earthquake forecasting, or more exactly forecasting of the probability or hazard rate of earthquakes, may seem to have a very negative emotional effect on many, but the real goal is simply an intelligent use of the information available hitherto. Of course we should not think of the celestial bodies as a main source of the earthquake events, as geology considers the stresses of the earth's solid surface to be the main source. In addition, the cycles of celestial bodies may very well smooth out the intensity of earthquakes, as they trigger many earthquakes of smaller intensity that reduce the stresses. Still, the triggering of earthquake events may be correlated with the cycles of celestial bodies. Similar reasoning may apply to weather formations and events, and to daily or yearly seasonal cycles. Again, similar reasoning may apply to medical and health events in populations: a hierarchical linear model with level-wise random-coefficient equations may describe the effects and contingencies of health events at different scales of social groups of the population. Here the different sizes of the samples, or of the active population, give rise to different equations, which are the level-wise equations of an HLM at an overall larger population scale. In the same way the HLM model could be about fertility or water-level ecological events, which again may depend on the period of the moon, the earth's daily and seasonal cycles and, furthermore, the sun's 11- or 22-year cycles. In addition, cycles of the alignment of planets may affect the solar wind that reaches the earth, and thus again fertility and ecological cycles. Similar forecasting (with HLM) may apply to social events and events of interest to sociology; the levels in the sampling, the statistical causality and the level-wise equations shall correspond to levels of population, of social-organization scale and of time, e.g. businesses, the domestic economy, unions of societies (e.g. the European Union) or all societies on the planet, yearly cycles, political cycles, etc.
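
A toy sketch of such a level-wise hazard rate is given below; it is not a claim about real earthquake or weather data, and the rates, periods and the thinning construction are illustrative assumptions only: locally the rate looks uniform, at a larger scale it has a cycle, and the amplitude of that cycle itself varies at a still larger scale.

```python
# Event process whose hazard rate is built level-wise, simulated by thinning.
import numpy as np

rng = np.random.default_rng(6)
T = 1000.0                       # total observation window (arbitrary units)
base = 0.5                       # level 1: locally uniform hazard rate

def amplitude(t):                # level 3: slow drift of the cycle's amplitude
    return 0.2 + 0.1 * np.sin(2 * np.pi * t / 400.0)

def rate(t):                     # level 2: cycle of period 50 riding on the base
    return base * (1.0 + amplitude(t) * np.sin(2 * np.pi * t / 50.0))

# Thinning: propose at the maximum rate, accept with probability rate(t)/max.
rate_max = base * 1.4
t, events = 0.0, []
while True:
    t += rng.exponential(1.0 / rate_max)
    if t > T:
        break
    if rng.uniform() < rate(t) / rate_max:
        events.append(t)

print("number of simulated events:", len(events))
```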

 

 

0) The Rainbow Stochastic Growth Model.   By Dr Costas Kyritsis (2006)

We give an example of a new concept of stochastic process that has some of the above properties and represents stochastic growth (of clusters of cells, of tree or animal populations, of clusters of measurable human activities, etc.). We call this stochastic process the Rainbow Stochastic Growth Model. The stochastic process has the features of multi-level causality (12 layers) as described above. If we wanted to approximate this process with a linear ARMA(p,q), ARIMA or SARIMA model, or another familiar type that reduces to linear systems, we would need a memory of at least 144 terms. But the chosen formulation is much simpler than a linear time series and reflects a stochastic growth with an innovation that is not "white noise" or "pink noise" but rather a "multi-color noise"; this is also the inspiration for the chosen term: Rainbow. We have tuned the "spectral colors" to 12 basic cycles that the sciences of astronomy, meteorology and ecology have detected for normal conditions of events on this planet.
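
The sketch below is a toy illustration only of the "multi-color noise" idea behind the name, and not the author's actual formulation of the Rainbow model; the twelve periods used are arbitrary placeholders rather than the tuned cycles mentioned above.

```python
# "Multi-color" innovation: a superposition of 12 cyclical components with
# random amplitudes and phases, driving a multiplicative growth process.
import numpy as np

rng = np.random.default_rng(7)
T = 1200
periods = np.array([7, 14, 30, 91, 182, 365, 730, 1460, 2920, 4015, 5840, 8030])

t = np.arange(T)
amps = rng.uniform(0.2, 1.0, size=12)
phases = rng.uniform(0, 2 * np.pi, size=12)

innovation = sum(a * np.sin(2 * np.pi * t / p + ph)
                 for a, p, ph in zip(amps, periods, phases))
innovation = 0.01 * innovation + rng.normal(0.0, 0.005, size=T)

# Multiplicative growth driven by the multi-color innovation.
growth = 100.0 * np.cumprod(1.0 + 0.001 + innovation)
print("final size of the grown quantity:", round(growth[-1], 2))
```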

 

 

1) "Multi-resolution system of stochastic processes and forecasting, with higher stochastic order random variables and Bayes estimators." By Dr Costas Kyritsis, 1999

In this paper we define stochastic differential equations and calculi over finite resolutions. We compare them with the usual stochastic differential equations with limits in the sense of Itô or Stratonovich, or via generalised functions (distributions), and discuss their tremendous advantages and simplicity of definition and solution. While very few Itô stochastic differential equations have been solved, practically all stochastic differential equations over finite resolutions are easily solved. The choice of the rounding relative to the accuracy level is a key point in the stochastic differential equations over finite resolutions. As a first simple approach, only random variables of 1st stochastic order are used.
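
A minimal sketch follows, under the assumption that "finite resolution" can be read as advancing in finite time steps and rounding the state to a finite accuracy grid (the rounding choice highlighted above as a key point); this is our reading for illustration, not necessarily the exact construction of the paper. Geometric Brownian motion is used only as a familiar test equation.

```python
# One-step evolution of dX = mu*X dt + sigma*X dW with the state projected
# back to a finite accuracy grid after every step.
import numpy as np

rng = np.random.default_rng(8)

mu, sigma = 0.05, 0.2            # drift and volatility
x0, T, dt = 1.0, 1.0, 1.0 / 250.0
resolution = 1e-3                # accuracy level: states rounded to this grid

def round_to(value, grid):
    return np.round(value / grid) * grid

x = x0
for _ in range(int(T / dt)):
    dw = rng.normal(0.0, np.sqrt(dt))
    x = x + mu * x * dt + sigma * x * dw     # explicit one-step update
    x = round_to(x, resolution)              # project back to the finite grid

print("finite-resolution value at T:", x)
# For comparison with the Ito limit: E[X_T] = x0 * exp(mu*T).
print("Ito expectation exp(mu*T)   :", round(np.exp(mu * T), 3))
```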

2) Multi-resolution stochastic differential calculi

By Dr Costas Kyritsis 2000

This paper is a direct application of the previous one, and a simple qualitative analysis of the solutions of a random-coefficient linear system of two first-order stochastic differential equations.

3) Application of the solution of stochastic differential equations over multi-resolution systems to the creation of a statistical method for estimating the probability of the random formation of hurricanes (tornadoes). By Dr Costas Kyritsis, 2001

 

4) How to choose the time scales in statistical time series forecasting

To appear in the Journal "Archives of Economic History"

By Dr Costas Kyritsis

University of Portsmouth UK

Department of Mathematics and Computer Science

(Franchise in Athens)

and

Software Laboratory

National Technical University of Athens

 

Comments and Interpretation:

 

In this paper we analyze how the practice of time series forecasting for a phenomenon depends on the time scale that we must choose. We discuss the possibility of accepting different models for the same phenomenon and data, in particular at different time scales, as well as structural equation modeling and hierarchical linear models (HLM). In general, if forecasting at a horizon h is required, then it is optimal to fit models at the same scale, with time bin h, rather than at the densest bins of the data, where we might consider that there is more information. We suggest a new method based on the spectrum of the time series. The time scales of best forecasting are defined by the multiple maxima of the spectrum, if they exist; for strongly discrete spectra this method works even better. The bins of each scale are again defined from the maxima of the spectrum. We expand the time series into a superposition of independent component time series that have a narrow spectrum only around the maxima of the original time series. We discuss how the standard forecasting (at the shortest time scale) can be represented as a superposition of forecasting terms, each corresponding to one of the previously mentioned maxima of the spectrum of the time series. Among the many maxima, only one, selected by specific features, gives the forecasting with minimum error compared to other scales. This unique peak defines the best forecasting scale, and the relevant expansion term in the series defines the best-fit model of the time series at this optimal scale. We also suggest a different algorithm, based on the time domain and least squares rather than the frequency domain, to find the best time scale for forecasting. We also prove a new theorem on stratified sampling that comes directly from the law of large numbers. We introduce new measures for the forecasting error, other than the usual goodness of fit of least-squares estimation, and we give examples to show how the above analysis leads to time series forecasting with less error on the same data. If we want to compare this approach with the popular Box-Jenkins philosophy, it is essentially a technique to find a best-fit model without the popular parsimony constraint of small ARMA(p,q) orders p, q, thus models with long-term memory that nevertheless have better out-of-sample forecasting performance than low-order (e.g. p, q = 1, 2 or 3) models. Obviously, situations of persistent hidden periodicities with many (e.g. 5-7) different periods are such cases and cannot be dealt with by low-order ARMA(p,q) models. The idea of isolated forecasting at a single time scale permits low-order (up to 2) ARMA models, so that per time scale we may apply the Box-Jenkins parsimony, although not in the overall model. Another point of deviation from the Box-Jenkins philosophy is that the method works also for non-stationary time series, with non-stationarities not covered by the popular unit-root methods. The main idea of using discrete features of the spectrum to find an expansion of the forecasting on different time scales was implemented in statistical studies by the author during 1999, for phenomena that show hidden periodicities at various time scales. During 2002 he was the supervisor of a master's dissertation at the University of Portsmouth on the topic of how to choose the best time scale in forecasting.
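
A minimal sketch of the spectrum-based idea is given below: locate the dominant maxima of the periodogram and rebuild one narrow-band component per maximum, so that each component can be modeled and forecast at its own time scale. The peak-selection rule and the bandwidth used here are illustrative choices, not the criteria of the paper.

```python
# Decompose a series into narrow-band components around the largest maxima of
# its periodogram; each component corresponds to one candidate time scale.
import numpy as np

rng = np.random.default_rng(9)
T = 1024
t = np.arange(T)
# Toy series with hidden periodicities at three different time scales.
x = (np.sin(2 * np.pi * t / 16) + 0.7 * np.sin(2 * np.pi * t / 64)
     + 0.5 * np.sin(2 * np.pi * t / 256) + 0.3 * rng.normal(size=T))

fft_x = np.fft.rfft(x - x.mean())
spec = np.abs(fft_x) ** 2                     # periodogram
freqs = np.fft.rfftfreq(T, d=1.0)

peaks = np.argsort(spec)[-3:]                 # the three largest maxima
half_width = 2                                # narrow band around each peak

components = []
for k in sorted(peaks):
    mask = np.zeros_like(spec)
    mask[max(k - half_width, 0):k + half_width + 1] = 1.0
    band = np.fft.irfft(fft_x * mask, n=T)    # narrow-band reconstruction
    components.append(band)
    print(f"peak at frequency {freqs[k]:.4f}  ->  time scale ~ {1 / freqs[k]:.1f}")

# The centred series is approximately the superposition of the narrow-band
# components plus a residual; each component is then forecast at its own scale.
residual = (x - x.mean()) - sum(components)
print("residual share of variance:", round(residual.var() / x.var(), 3))
```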

 

Key words

Statistical forecasting, time series