From: Warren Sarle
Subject: Re: Neural Network Modeling Q's

Below is an excerpt that I wrote for the comp.ai.neural-nets FAQ, a
discussion of neural nets and structural equation models, a compilation
of neural net and statistical jargon, and directions for obtaining more
information via ftp.

______________________________________________________________________

Q: How are neural networks related to statistical methods?

A: There is considerable overlap between the fields of neural networks
and statistics. Statistics is concerned with data analysis. In neural
network terminology, statistical inference means learning to generalize
from noisy data. Some neural networks are not concerned with data
analysis (e.g., those intended to model biological systems) and
therefore have little to do with statistics. Some neural networks do
not learn (e.g., Hopfield nets) and therefore have little to do with
statistics. Some neural networks can learn successfully only from
noise-free data (e.g., ART or the perceptron rule) and therefore would
not be considered statistical methods. But most neural networks that
can learn to generalize effectively from noisy data are similar or
identical to statistical methods. For example:

* Feedforward nets with no hidden layer (including functional-link
  neural nets and higher-order neural nets) are basically generalized
  linear models.
* Feedforward nets with one hidden layer are closely related to
  projection pursuit regression.
* Probabilistic neural nets are identical to kernel discriminant
  analysis.
* General regression neural nets are identical to Nadaraya-Watson
  kernel regression.
* Kohonen nets for adaptive vector quantization are very similar to
  k-means cluster analysis.
* Hebbian learning is closely related to principal component analysis.

Some neural network areas that appear to have no close relatives in the
existing statistical literature are:

* Kohonen's self-organizing maps.
* Reinforcement learning (although this is treated in the operations
  research literature as Markov decision processes).
* Stopped training (the purpose and effect of stopped training are
  similar to shrinkage estimation, but the method is quite different).

Feedforward nets are a subset of the class of nonlinear regression and
discrimination models. Statisticians have studied the properties of
this general class but had not considered the specific case of
feedforward neural nets before such networks were popularized in the
neural network field. Still, many results from the statistical theory
of nonlinear models apply directly to feedforward nets, and the methods
that are commonly used for fitting nonlinear models, such as various
Levenberg-Marquardt and conjugate gradient algorithms, can be used to
train feedforward nets.

While neural nets are often defined in terms of their algorithms or
implementations, statistical methods are usually defined in terms of
their results. The arithmetic mean, for example, can be computed by a
(very simple) backprop net, by applying the usual formula SUM(x_i)/n,
or by various other methods. What you get is still an arithmetic mean
regardless of how you compute it. So a statistician would regard
standard backprop, Quickprop, and Levenberg-Marquardt as different
algorithms for implementing the same statistical model, such as a
feedforward net. On the other hand, different training criteria, such
as least squares and cross entropy, are viewed by statisticians as
fundamentally different estimation methods with different statistical
properties.
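
To illustrate that last point, here is a minimal sketch (not part of
the FAQ; it assumes Python with numpy and made-up data). A feedforward
net with no hidden layer and an identity activation is just a linear
model, so training it by gradient descent on squared error and solving
the least-squares problem in closed form are merely two algorithms for
the same estimates.

    # Sketch only: same statistical model, two different algorithms.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # bias + 2 inputs
    y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.5, size=n)

    # Closed-form ordinary least squares (the "statistical" route).
    w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Batch gradient descent on the same squared-error criterion
    # (the "neural network" route).
    w = np.zeros(3)
    for _ in range(5000):
        w -= 0.01 * (2.0 / n) * X.T @ (X @ w - y)

    print(np.allclose(w, w_ols, atol=1e-4))  # True: same estimates

Quickprop or Levenberg-Marquardt would arrive at the same least-squares
estimates by different routes; switching the criterion to cross
entropy, by contrast, would change the estimation method itself.
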
It is sometimes claimed that neural networks, unlike statistical
models, require no distributional assumptions. In fact, neural networks
involve exactly the same sort of distributional assumptions as
statistical models, but statisticians study the consequences and
importance of these assumptions while most neural networkers ignore
them. For example, least-squares training methods are widely used by
statisticians and neural networkers. Statisticians realize that
least-squares training involves implicit distributional assumptions in
that least-squares estimates have certain optimality properties for
noise that is normally distributed with equal variance for all training
cases and that is independent between different cases. These optimality
properties are consequences of the fact that least-squares estimation
is maximum likelihood under those conditions. Similarly, cross-entropy
is maximum likelihood for noise with a Bernoulli distribution. If you
study the distributional assumptions, then you can recognize and deal
with violations of the assumptions. For example, if you have normally
distributed noise but some training cases have greater noise variance
than others, then you may be able to use weighted least squares instead
of ordinary least squares to obtain more efficient estimates.
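
As a minimal sketch of that last remark (again not part of the FAQ; it
assumes Python with numpy, made-up data, and noise variances that are
known rather than estimated), weighting each training case by the
inverse of its noise variance typically yields estimates closer to the
true coefficients than ordinary least squares when the noise is
heteroscedastic.

    # Sketch only: weighted vs. ordinary least squares under unequal noise.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 500
    X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=n)])
    sigma = np.where(np.arange(n) < n // 2, 0.1, 2.0)  # half the cases are far noisier
    y = X @ np.array([0.5, 2.0]) + rng.normal(scale=sigma)

    # Ordinary least squares ignores the unequal variances.
    w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Weighted least squares: weight each case by 1/variance (assumed known).
    w = 1.0 / sigma**2
    w_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

    print("OLS:", w_ols, "WLS:", w_wls)  # WLS is usually closer to (0.5, 2.0)

With the variances taken as known, the weighted fit downweights the
noisy half of the sample, which is exactly the adjustment described in
the paragraph above.
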
Neural Network and Statistical Jargon
=====================================

Warren S. Sarle   saswss@unx.sas.com

May 12, 1995

The neural network (NN) and statistical literatures contain many of the
same concepts but usually with different terminology. Sometimes the
same term or acronym is used in both literatures but with different
meanings. Only in very rare cases is the same term used with the same
meaning, although some cross-fertilization is beginning to happen.
Below is a list of such corresponding terms or definitions.
Particularly loose correspondences are marked by a ~ between the two
columns. A < indicates that the term on the left is roughly a subset of
the term on the right, and a > indicates the reverse. Terminology in
both fields is often vague, so precise equivalences are not always
possible.

The list starts with some basic definitions. There is disagreement in
the NN literature on how to count layers. Some people count inputs as a
layer and some don't. I specify the number of hidden layers instead.
This is awkward but unambiguous.

Definition                            Statistical Jargon
==========                            ==================

generalizing from noisy data          Statistical inference
  and assessment of the
  accuracy thereof

the set of all cases one              Population
  wants to be able to
  generalize to

a function of the values in           Parameter
  a population, such as the mean
  or a globally optimal synaptic
  weight

a function of the values in           Statistic
  a sample, such as the mean or
  a learned synaptic weight

Neural Network Jargon                 Definition
=====================                 ==========

Neuron, neurode, unit,                a simple linear or nonlinear
  node, processing element              computing element that accepts one
                                        or more inputs, computes a function
                                        thereof, and may direct the result
                                        to one or more other neurons

Neural networks                       a class of flexible nonlinear
                                        regression and discriminant models,
                                        data reduction models, and nonlinear
                                        dynamical systems consisting of an
                                        often large number of neurons
                                        interconnected in often complex ways
                                        and often organized into layers

Neural Network Jargon                 Statistical Jargon
=====================                 ==================

Statistical methods                   Linear regression and discriminant
                                        analysis, simulated annealing,
                                        random search

Architecture                          Model

Training, Learning,                   Estimation, Model fitting,
  Adaptation                            Optimization

Classification                        Discriminant analysis

Mapping, Function approximation       Regression

Supervised learning                   Regression, Discriminant analysis

Unsupervised learning,                Principal components, Cluster
  Self-organization                     analysis, Data reduction

Competitive learning                  Cluster analysis

Hebbian learning,                     Principal components
  Cottrell/Munro/Zipser technique

Training set                          Sample, Construction sample

Test set, Validation set              Hold-out sample

Pattern, Vector, Case                 Observation, Case

Reflectance pattern                   an observation normalized to sum to 1

Binary(0/1),                          Binary, Dichotomous
  Bivalent or Bipolar(-1/1)

Input                                 Independent variables, Predictors,
                                        Regressors, Explanatory variables,
                                        Carriers

Output                                Predicted values

Training values,                      Dependent variables, Responses,
  Target values                         Observed values

Training pair                         Observation containing both inputs
                                        and target values

Shift register,                       Lagged variable
  (Tapped) (time) delay (line),
  Input window

Errors                                Residuals

Noise                                 Error term

Generalization                        Interpolation, Extrapolation,
                                        Prediction

Error bars                            Confidence interval

Prediction                            Forecasting

Adaline                               Linear two-group discriminant
  (ADAptive LInear NEuron)              analysis (not Fisher's but generic)

(No-hidden-layer) perceptron        ~ Generalized linear model (GLIM)

Activation function,                > Inverse link function in GLIM
  Signal function,
  Transfer function

Softmax                               Multiple logistic function

Squashing function                    bounded function with infinite domain

Semilinear function                   differentiable nondecreasing function

Phi-machine                           Linear model

Linear 1-hidden-layer perceptron      Maximum redundancy analysis,
                                        Principal components of instrumental
                                        variables

1-hidden-layer perceptron           ~ Projection pursuit regression

Weights, Synaptic weights           < (Regression) coefficients,
                                        Parameter estimates

Bias                                ~ Intercept

the difference between the            Bias
  expected value of a statistic
  and the corresponding true
  value (parameter)

Shortcuts, Jumpers,                 ~ Main effects
  Bypass connections,
  direct linear feedthrough
  (direct connections from
  input to output)

Functional links                      Interaction terms or transformations

Second-order network                  Quadratic regression,
                                        Response-surface model

Higher-order network                  Polynomial regression, Linear model
                                        with interaction terms

Instar, Outstar                       iterative algorithms of doubtful
                                        convergence for approximating an
                                        arithmetic mean or centroid

Delta rule, adaline rule,             iterative algorithm of doubtful
  Widrow-Hoff rule,                     convergence for training a linear
  LMS (Least Mean Squares) rule         perceptron by least squares,
                                        similar to stochastic approximation

LMS (Least Median of Squares)         training by minimizing the median
                                        of the squared errors

Generalized delta rule                iterative algorithm of doubtful
                                        convergence for training a nonlinear
                                        perceptron by least squares, similar
                                        to stochastic approximation

Backpropagation                       Computation of derivatives for a
                                        multilayer perceptron and various
                                        algorithms such as the generalized
                                        delta rule based thereon

Weight decay, Regularization        > Shrinkage estimation, Ridge regression

Jitter                                random noise added to the inputs
                                        to shrink the estimates

Growing, Pruning, Brain damage,       Subset selection, Model selection,
  Self-structuring, Ontogeny            Pre-test estimation

Optimal brain surgeon                 Wald test

LMS (Least mean squares)              OLS (Ordinary least squares)
  (see also "LMS rule" above)

Relative entropy, Cross entropy       Kullback-Leibler divergence

Evidence framework                    Empirical Bayes estimation

OLS (Orthogonal least squares)        Forward stepwise regression

Probabilistic neural network          Kernel discriminant analysis

General regression neural network     Kernel regression

Topologically distributed encoding  < (Generalized) Additive model

Adaptive vector quantization          iterative algorithms of doubtful
                                        convergence for K-means cluster
                                        analysis

Adaptive Resonance Theory 2a        ~ Hartigan's leader algorithm

Learning vector quantization          a form of piecewise linear
                                        discriminant analysis using a
                                        preliminary cluster analysis

Counterpropagation                    Regressogram based on k-means
                                        clusters

Encoding, Autoassociation             Dimensionality reduction
  (Independent and dependent
  variables are the same)

Heteroassociation                     Regression, Discriminant analysis
  (Independent and dependent
  variables are different)

Epoch                                 Iteration

Continuous training,                  Iteratively updating estimates one
  Incremental training,                 observation at a time via difference
  On-line training,                     equations, as in stochastic
  Instantaneous training                approximation

Batch training,                       Iteratively updating estimates after
  Off-line training                     each complete pass over the data as
                                        in most nonlinear regression
                                        algorithms
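
To make one of the correspondences above concrete, here is a minimal
sketch (not part of the original compilation; it assumes Python with
numpy and made-up data): the prediction of a general regression neural
network is a Nadaraya-Watson kernel regression estimate, a
kernel-weighted average of the training targets.

    # Sketch only: GRNN prediction = Nadaraya-Watson kernel regression.
    import numpy as np

    def grnn_predict(x_train, y_train, x_new, bandwidth=0.5):
        # Gaussian kernel weights between each new point and each training point.
        d2 = (x_new[:, None] - x_train[None, :]) ** 2
        w = np.exp(-d2 / (2.0 * bandwidth**2))
        # Kernel-weighted average of the training targets.
        return (w @ y_train) / w.sum(axis=1)

    rng = np.random.default_rng(2)
    x = rng.uniform(0.0, 2.0 * np.pi, size=100)
    y = np.sin(x) + rng.normal(scale=0.2, size=100)
    print(grnn_predict(x, y, np.linspace(0.0, 2.0 * np.pi, 5)))  # roughly sin() values

The "network" here amounts to the stored training sample plus a
smoothing bandwidth, which is why the two names in the table refer to
the same estimator.
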
=======================================================================

References:

Balakrishnan, P.V., Cooper, M.C., Jacob, V.S., and Lewis, P.A. (1994),
  "A study of the classification capabilities of neural networks using
  unsupervised learning: A comparison with k-means clustering",
  Psychometrika, 59, 509-525.

Chatfield, C. (1993), "Neural networks: Forecasting breakthrough or
  passing fad", International Journal of Forecasting, 9, 1-3.

Cheng, B. and Titterington, D.M. (1994), "Neural Networks: A Review
  from a Statistical Perspective", Statistical Science, 9, 2-54.

Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and
  the Bias/Variance Dilemma", Neural Computation, 4, 1-58.

Kuan, C.-M. and White, H. (1994), "Artificial Neural Networks: An
  Econometric Perspective", Econometric Reviews, 13, 1-91.

Kushner, H. and Clark, D. (1978), _Stochastic Approximation Methods for
  Constrained and Unconstrained Systems_, Springer-Verlag.

Michie, D., Spiegelhalter, D.J. and Taylor, C.C. (1994), _Machine
  Learning, Neural and Statistical Classification_, Ellis Horwood.

Ripley, B.D. (1993), "Statistical Aspects of Neural Networks", in O.E.
  Barndorff-Nielsen, J.L. Jensen and W.S. Kendall, eds., _Networks and
  Chaos: Statistical and Probabilistic Aspects_, Chapman & Hall, ISBN
  0 412 46530 2.

Ripley, B.D. (1994), "Neural Networks and Related Methods for
  Classification", Journal of the Royal Statistical Society, Series B,
  56, 409-456.

Sarle, W.S. (1994), "Neural Networks and Statistical Models",
  Proceedings of the Nineteenth Annual SAS Users Group International
  Conference, Cary, NC: SAS Institute, pp. 1538-1550.

White, H. (1989), "Learning in Artificial Neural Networks: A
  Statistical Perspective", Neural Computation, 1, 425-464.

White, H. (1989), "Some Asymptotic Results for Learning in Single
  Hidden Layer Feedforward Network Models", Journal of the American
  Statistical Association, 84, 1008-1013.

White, H. (1992), _Artificial Neural Networks: Approximation and
  Learning Theory_, Blackwell.