- Instructor
- Course Catalog Description
- Overview
- Goals
- Prerequisites
- Other Requirements
- Availability
- Syllabus
- Download zipped files
- Tailoring the Matlab Scripts
- Accessing files of notes and assignments

Home | Vita | Course | Toolbox | Contact |

Laboratory of Tree-Ring Research, Room 417, Bryant Bannister Tree-Ring Building (Bldg #45B)

Email: dmeko@LTRR.arizona.edu

Phone: (520) 621-3457

Fax: (520) 621-8229

Office hours Friday, 1:00-6:00 PM (please email to schedule meeting)

The course is 3 credits for students on campus at the University of Arizona in Tucson, and 1-3 credits for online students.

Any time series with a constant time increment (e.g., day, month, year) is a candidate for use in the course. Examples are daily precipitation measurements, seasonal total streamflow, summer mean air temperature, annual indices of tree growth, indices of sea-surface temperature, and the daily height increment of a shrub.

- understand basic time series concepts and terminology
- be able to select time series methods appropriate to goals
- be able to critically evaluate scientific literature applying the time series methods covered
- have improved understanding of time series properties of your own dataset
- be able to concisely summarize results of time series analysis in writing

- An introductory statistics course
- Access to a laptop computer capable of having Matlab installed on it
- Permission of the instructor (undergraduates and online students)

- If you are a University of Arizona (UA) student on campus in Tucson, you have access to Matlab and the required toolboxes at no cost through a UA site license. No previous experience with Matlab is required, and computer programming is not part of the course.
- If you are an online student, not on campus at the UA, you will be able to take the course in the Spring 2019 semester as an "iCourse". You must make sure that you have access to Matlab and the required toolboxes (see below) at your location.
- Access to the internet. There is no paper exchange in the course. Notes and assignments are exchanged electronically and completed assignments are submitted electronically through the University of Arizona Desire2Learn (D2L) system.

*Matlab version.* I update scripts and functions now and then using the current site-license release of Matlab. For 2019, I am using MATLAB Version 9.5.0.944444 (R2018b). The updates might include changes I wanted to incorporate, but are also sometimes needed to deal with changes Matlab makes in how its built-in functions work. Beware that some of my current scripts or functions may act quirky or bomb out when run under previous Matlab versions.

Install the whole Matlab package (includes all toolboxes) when installing from the U of A site license. Not all of the toolboxes are needed, but this is the easiest installation. If you are not using the site license, keep in mind that my scripts and functions make extensive use of four toolboxes: Statistics, Signal Processing, System Identification, and Curve Fitting. Functions that used to be in the Spline Toolbox (Matlab releases before 2010b) are now in the Curve Fitting Toolbox.

A small number of online students has also usually been accommodated by offering the course in various ways. The way now is the "iCourse" venue described above.

- Introduction to time series; organizing data for analysis
- Probability distribution
- Autocorrelation
- Spectrum
- Autoregressive-Moving Average (ARMA) modeling
- Spectral analysis -- smoothed periodogram method
- Detrending
- Filtering
- Correlation
- Lagged Correlation
- Multiple linear regression
- Validating the regression model

The first 1/2 hour of that Tuesday's class is used for guided self-assessment and grading of the assignment and uploading of assessed (graded) assignments to D2L. The remaining 45 minutes are used to introduce the next topic. You must bring your laptop to class on Tuesdays. The 12 lessons or topics covered in the course are listed in the class outline.

Once we are into the assignments on data analysis, Tuesday's class will start with a very brief (3 minute maximum) presentation by a student chosen randomly the previous Thursday. Here, with figures from their submitted assignment put on the screen, the student will describe the time series analyzed and at least one finding from the analysis. Goals of this activity, new in spring 2019, are to 1) introduce the various types of time series being used to the whole class, and 2) give practice in the increasingly popular "lightning talk" format of concisely describing research at conferences.

Online students are expected to follow the same schedule of submitting assignments as the resident students, but do not have access to the lectures. Submitted assignments of online students are not self-assessed, but are graded by me. Online students should have access to D2L for submitting assignments.

*Spring 2019 semester.* Class meets twice a week for 75 minute sessions, 9:00-10:15 AM T/Th, in room 424 (Conference Room) of Bryant Bannister Tree-Ring Building (building 45B). The first day of class is Jan 10 (Thurs). The last day of class is Apr 30 (Tues). There is no class during the week of Spring Break (Mar 2-10). To allow students to fully participate in U of A Earth Week activities, there is also no class during the week of March 25-29.

You submit assignments by uploading them to D2L before the Tuesday class when the next topic is introduced. The first half hour of that Tuesday class is used for guided self-assessment of the assignment, including uploading of self-graded pdfs to D2L. I check one or more of the self-graded assignments each week (by random selection), and may change the grade. To find out how to access assignments, click assignment files.

Assignments, given in class on Thursday, will be due (uploaded to D2L by you) before the start of class the following Tuesday. The first half hour of Tuesday's meeting period will be dedicated to presentation of a grading rubric, self-assessment of completed assignments, and uploading of self-graded assignments to D2L. This schedule gives you 4 days to complete and upload the assignment to D2L before 9:00 am Tuesday. D2L keeps track of the time the assignment was uploaded, and no penalty is assessed as long as it is uploaded before 9:00 AM on Tuesday of the due date.

A late penalty of 3 points is assessed if the assignment is not submitted to D2L by 9 AM Tuesday. A late penalty of 1 point is assessed if the graded assignment is not uploaded to D2L by 5 AM Wednesday, which is when I begin looking over your self-graded assignments. If you have some scheduled need to be away from class (e.g., attendance at a conference), you are responsible for uploading your assignment before 9:00 AM the Tuesday it is due, and for uploading the self-graded version by 10:15 AM the same day. In other words, the schedule is the same as for the students who are in class. If an emergency comes up (e.g., you get the flu) and you cannot do the assignment or assessment on schedule, please send me an email and we will reach some accommodation. Otherwise, the late penalties described above will apply.

**Introduction to time series; organizing data for analysis**

A time series is broadly defined as any series of measurements taken at different times. Some basic descriptive categories of time series are 1) long vs short, 2) even time-step vs uneven time-step, 3) discrete vs continuous, 4) periodic vs aperiodic, 5) stationary vs nonstationary, and 6) univariate vs multivariate. These properties, as well as the temporal overlap of multiple series, must be considered in selecting a dataset for analysis in this course. You will analyze your own time series in the course. The first steps are to select those series and to store them in structures in a mat file. Uniformity in storage at the outset is convenient for this class so that attention can then be focused on understanding time series methods rather than debugging computer code to ready the data for analysis. A structure is a Matlab variable similar to a database in that the contents are accessed by textual field designators. A structure can store data of different forms. For example, one field might be a numeric time series matrix, another might be text describing the source of data, etc. In the first assignment you will run a Matlab script that reads your time series and metadata from ascii text files you prepare beforehand and stores the data in Matlab structures in a single mat file. In subsequent assignments you will apply time series methods to the data by running Matlab scripts and functions that load the mat file and operate on those structures.

**Assignments**

**Select sample data to be used for assignments during the course**

**Read:** (1) Notes_1.pdf, (2) "Getting Started", accessible from the MATLAB help menu

**Answer:** Run script geosa1.m and answer questions listed in the file a1.pdf

**What to Know**

- How to distinguish the categories of time series
- How to start and quit MATLAB
- How to enter MATLAB commands at command prompt
- How to create figures in figure window
- How to export figures to your word processor
- Difference between MATLAB scripts and functions
- How to run scripts and functions
- The form of a MATLAB "structure" variable
- How to apply the script geosa1.m to get a set of time series and metadata into MATLAB structures

**Probability distribution**

The probability distribution of a time series describes the probability that an observation falls into a specified range of values. An empirical probability distribution for a time series can be arrived at by sorting and ranking the values of the series. Quantiles and percentiles are useful statistics that can be taken directly from the empirical probability distribution. Many parametric statistical tests assume the time series is a sample from a population with a particular population probability distribution. Often the population is assumed to be normal. This chapter presents some basic definitions, statistics and plots related to the probability distribution. In addition, a test (Lilliefors test) is introduced for testing whether a sample comes from a normal distribution with unspecified mean and variance.

**Assignments**

**Read:** Notes_2.pdf

**Answer:** Run script geosa2.m and answer questions listed in the file a2.pdf

**What to Know**

- Definitions of terms: time series, stationarity, probability density, distribution function, quantile, spread, location, mean, standard deviation, and skew
- How to interpret the most valuable graphic in time series analysis -- the time series plot
- How to interpret the box plot, histogram and normal probability plot
- Parameters and shape of the normal distribution
- Lilliefors test for normality: graphical description, assumptions, null and alternative hypotheses
- Caveat on interpretation of significance levels of statistical tests when the time series is not "random" in time
- How to apply geosa2.m to check the distribution properties of a time series and test the series for normality
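
As a rough illustration of how sorting and ranking lead to an empirical distribution and its quantiles, here is a minimal Python sketch. It is not part of the course's Matlab scripts; the function names are hypothetical, and the plotting position rank/(n+1) and the linear interpolation rule are assumed conventions among several in common use.

```python
def empirical_cdf(values):
    """Sort the values and pair each with a plotting position rank/(n+1)."""
    ordered = sorted(values)
    n = len(ordered)
    positions = [(rank + 1) / (n + 1) for rank in range(n)]
    return ordered, positions


def quantile(values, p):
    """Quantile for 0 <= p <= 1 by linear interpolation between order statistics."""
    ordered = sorted(values)
    n = len(ordered)
    h = (n - 1) * p          # fractional index into the sorted sample
    lo = int(h)              # floor, since h is non-negative
    hi = min(lo + 1, n - 1)
    return ordered[lo] + (h - lo) * (ordered[hi] - ordered[lo])


if __name__ == "__main__":
    sample = [12.0, 7.5, 9.1, 15.3, 8.8]   # made-up values
    print(empirical_cdf(sample))
    print(quantile(sample, 0.5))           # sample median
```

Other interpolation rules give slightly different quantiles for small samples, so results may not match a given software package exactly.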

**Autocorrelation**

Autocorrelation refers to the correlation of a time series with its own past and future values. Autocorrelation is also sometimes called *lagged correlation* or *serial correlation*, which refers to the correlation between members of a series of numbers arranged in time. Positive autocorrelation might be considered a specific form of *persistence*, a tendency for a system to remain in the same state from one observation to the next. For example, the likelihood of tomorrow being rainy is greater if today is rainy than if today is dry. Geophysical time series are frequently autocorrelated because of inertia or carryover processes in the physical system. For example, the slowly evolving and moving low pressure systems in the atmosphere might impart persistence to daily rainfall. Or the slow drainage of groundwater reserves might impart correlation to successive annual flows of a river. Or stored photosynthates might impart correlation to successive annual values of tree-ring indices. Autocorrelation complicates the application of statistical tests by reducing the number of independent observations. Autocorrelation can also complicate the identification of significant covariance or correlation between time series (e.g., precipitation with a tree-ring series). Autocorrelation can be exploited for predictions: an autocorrelated time series is predictable, probabilistically, because future values depend on current and past values. Three tools for assessing the autocorrelation of a time series are (1) the time series plot, (2) the lagged scatterplot, and (3) the autocorrelation function.

**Assignments**

**Read:** Notes_3.pdf

**Answer:** Run script geosa3.m and answer questions listed in the file a3.pdf

**What to Know**

- Definitions: autocorrelation, persistence, serial correlation, autocorrelation function (acf), autocovariance function (acvf), effective sample size
- How to recognize autocorrelation in the time series plot
- How to use lagged scatterplots to assess autocorrelation
- How to interpret the plotted acf
- How to adjust the sample size for autocorrelation
- Mathematical definition of the autocorrelation function
- Terms affecting the width of the computed confidence band of the acf
- The difference between a one-sided and two-sided test of significant lag-1 autocorrelation
- How to apply geosa3.m to study the autocorrelation of a time series
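
The sample autocorrelation function and a sample-size adjustment can be sketched in a few lines of Python. This is an illustrative sketch only, not the course's Matlab code: the function names are hypothetical, and the adjustment N' = N(1 - r1)/(1 + r1) is one commonly used first-order formula, assumed here rather than taken from this page.

```python
def acf(x, max_lag):
    """Sample autocorrelation r_k for k = 0..max_lag.

    r_k = sum_{t}(x_t - mean)(x_{t+k} - mean) / sum_{t}(x_t - mean)^2
    """
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(max_lag + 1)]


def effective_sample_size(n, r1):
    """First-order adjustment of sample size for lag-1 autocorrelation r1."""
    return n * (1 - r1) / (1 + r1)


if __name__ == "__main__":
    series = [1.0, 2.0, 3.0, 4.0, 5.0]      # made-up, strongly trending values
    r = acf(series, 2)
    print(r)                                 # r[0] is always 1
    print(effective_sample_size(len(series), r[1]))
```

With positive lag-1 autocorrelation the effective sample size is smaller than the actual sample size, which is the sense in which autocorrelation reduces the number of independent observations.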

**Spectrum**

The spectrum of a time series is the distribution of variance of the series as a function of frequency. The object of spectral analysis is to estimate and study the spectrum. The spectrum contains no new information beyond that in the autocovariance function (acvf), and in fact the spectrum can be computed mathematically by transformation of the acvf. But the spectrum and acvf present the information on the variance of the time series from complementary viewpoints. The acf summarizes information in the time domain and the spectrum in the frequency domain.

**Assignments**

**Read:** Notes_4.pdf

**Answer:** Run script geosa4.m and answer questions listed in the file a4.pdf

**What to Know**

- Definitions: frequency, period, wavelength, spectrum, Nyquist frequency, Fourier frequencies, bandwidth
- Reasons for analyzing a spectrum
- How to interpret a plotted spectrum in terms of distribution of variance
- The difference between a spectrum and a normalized spectrum
- Definition of the "lag window" as used in estimating the spectrum by the Blackman-Tukey method
- How the choice of lag window affects the bandwidth and variance of the estimated spectrum
- How to define a "white noise" spectrum and "autoregressive" spectrum
- How to sketch some typical spectral shapes: white noise, autoregressive, quasi-periodic, low-frequency, high-frequency
- How to apply geosa4.m to analyze the spectrum of a time series by the Blackman-Tukey method
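
To make the link between the acvf and the spectrum concrete, the following Python sketch estimates a spectrum by transforming a windowed, truncated autocovariance function. It is a simplified illustration under stated assumptions, not the geosa4.m implementation: the Tukey (cosine) lag window, the scaling of the estimate, and all function names are choices made for this example.

```python
import math


def autocovariance(x, max_lag):
    """Sample autocovariance c_k for k = 0..max_lag (divisor n)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / n
            for k in range(max_lag + 1)]


def blackman_tukey(x, max_lag, n_freq=51):
    """Spectrum estimate at n_freq frequencies from 0 to 0.5 cycles per time step.

    S(f) = c_0 + 2 * sum_{k=1}^{M} w_k c_k cos(2 pi f k), with Tukey lag window
    w_k = 0.5 * (1 + cos(pi k / M)) and M = max_lag.
    """
    c = autocovariance(x, max_lag)
    w = [0.5 * (1 + math.cos(math.pi * k / max_lag)) for k in range(max_lag + 1)]
    freqs = [0.5 * j / (n_freq - 1) for j in range(n_freq)]
    spec = [c[0] + 2 * sum(w[k] * c[k] * math.cos(2 * math.pi * f * k)
                           for k in range(1, max_lag + 1))
            for f in freqs]
    return freqs, spec


if __name__ == "__main__":
    series = [1.0, -1.0] * 16          # made-up series alternating every time step
    freqs, spec = blackman_tukey(series, max_lag=8)
    peak = max(range(len(spec)), key=spec.__getitem__)
    print(freqs[peak])                 # variance concentrates at the highest frequency
```

A smaller `max_lag` gives a smoother, lower-variance estimate with a wider bandwidth; a larger `max_lag` gives finer resolution at the cost of higher variance.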

**Autoregressive-Moving Average (ARMA) modeling**

Autoregressive-moving-average (ARMA) models are mathematical models of the persistence, or autocorrelation, in a time series. ARMA models are widely used in hydrology, dendrochronology, econometrics, and other fields. There are several possible reasons for fitting ARMA models to data. Modeling can contribute to understanding the physical system by revealing something about the physical process that builds persistence into the series. For example, a simple physical water-balance model consisting of terms for precipitation input, evaporation, infiltration, and groundwater storage can be shown to yield a streamflow series that follows a particular form of ARMA model. ARMA models can also be used to predict behavior of a time series from past values alone. Such a prediction can be used as a baseline to evaluate possible importance of other variables to the system. ARMA models are widely used for prediction of economic and industrial time series. ARMA models can also be used to remove persistence. In dendrochronology, for example, ARMA modeling is applied routinely to generate residual chronologies – time series of ring-width index with no dependence on past values. This operation, called prewhitening, is meant to remove biologically-related persistence from the series so that the residual may be more suitable for studying the influence of climate and other outside environmental factors on tree growth.

**Assignments**

**Read:** Notes_5.pdf

**Answer:** Run script geosa5.m and answer questions listed in the file a5.pdf

**What to Know**

- The functional form of the simplest AR and ARMA models
- Why such models are referred to as *autoregressive* or *moving average*
- The three steps in ARMA modeling
- The diagnostic patterns of the autocorrelation and partial autocorrelation functions for an AR(1) time series
- Definition of the final prediction error (FPE) and how the FPE is used to select a "best" ARMA model
- Definition of the Portmanteau statistic, and how it and the acf of residuals can be used to assess whether an ARMA model effectively models the persistence in a series
- How the principle of parsimony is applied in ARMA modeling
- Definition of prewhitening
- How prewhitening affects (1) the appearance of a time series, and (2) the spectrum of a time series
- How to apply geosa5.m to ARMA-model a time series
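
The simplest case, an AR(1) model, can be fit and used for prewhitening with very little code. The Python sketch below is only an illustration and not the course's Matlab procedure: it assumes the Yule-Walker estimate (the AR coefficient equals the lag-1 autocorrelation), skips order selection and diagnostic checking, and uses hypothetical function names.

```python
def fit_ar1(x):
    """Return (mean, phi) for the model x_t - mean = phi * (x_{t-1} - mean) + e_t.

    phi is estimated as the lag-1 sample autocorrelation (Yule-Walker).
    """
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    c1 = sum(dev[t] * dev[t - 1] for t in range(1, n))
    return mean, c1 / c0


def prewhiten_ar1(x):
    """Residuals e_t of the fitted AR(1) model; the first observation is lost."""
    mean, phi = fit_ar1(x)
    dev = [v - mean for v in x]
    return [dev[t] - phi * dev[t - 1] for t in range(1, len(x))]


if __name__ == "__main__":
    series = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up values
    print(fit_ar1(series))
    print(prewhiten_ar1(series))
```

In practice the residuals would be checked (for example with their acf) before concluding that the model has removed the persistence.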

**Spectral analysis -- smoothed periodogram method**

There are many available methods for estimating the spectrum of a time series. In lesson 4 we looked at the Blackman-Tukey method, which is based on Fourier transformation of the smoothed, truncated autocovariance function. The smoothed periodogram method circumvents the transformation of the acf by direct Fourier transformation of the time series and computation of the raw periodogram, a function first introduced in the 1800s for study of time series. The raw periodogram is smoothed by applying combinations or *spans* of one or more filters to produce the estimated spectrum. The smoothness, resolution and variance of the spectral estimates are controlled by the choice of filters. A more accentuated smoothing of the raw periodogram produces an underlying smoothly varying spectrum, or null continuum, against which spectral peaks can be tested for significance. This approach is an alternative to the specification of a functional form of the null continuum (e.g., AR spectrum).

**Assignments**

**Read:** Notes_6.pdf

**Answer:** Run script geosa6.m and answer questions listed in the file a6.pdf

**What to Know**

- Definitions: raw periodogram, Daniell filter, span of filter, null continuum; smoothness, stability and resolution of spectrum; tapering, padding, leakage
- The four main steps in estimating the spectrum by the smoothed periodogram
- How the choice of filter spans affects the smoothness, stability and resolution of the spectrum
- How the null continuum is used in testing for significance of spectral peaks
- How to apply geosa6.m to estimate the spectrum of a time series by the smoothed periodogram method and test for periodicity at a specified frequency
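
The core of the smoothed periodogram method can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the geosa6.m code: it omits tapering and padding, uses a plain equal-weight (rectangular) Daniell filter that is truncated at the ends of the frequency axis, and the scaling and function names are hypothetical choices for this example.

```python
import cmath
import math


def raw_periodogram(x):
    """Raw periodogram at the Fourier frequencies k/n, k = 0..n//2.

    I_k = |sum_t (x_t - mean) * exp(-2 pi i k t / n)|^2 / n
    """
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    pgram = []
    for k in range(n // 2 + 1):
        s = sum(d * cmath.exp(-2j * math.pi * k * t / n) for t, d in enumerate(dev))
        pgram.append(abs(s) ** 2 / n)
    return pgram


def daniell_smooth(pgram, span):
    """Equal-weight moving average of odd length `span` over periodogram ordinates."""
    half = span // 2
    smoothed = []
    for k in range(len(pgram)):
        window = pgram[max(0, k - half): k + half + 1]   # shorter near the ends
        smoothed.append(sum(window) / len(window))
    return smoothed


if __name__ == "__main__":
    n = 32
    series = [math.cos(2 * math.pi * 4 * t / n) for t in range(n)]  # 4 cycles in 32 steps
    raw = raw_periodogram(series)
    print(max(range(len(raw)), key=raw.__getitem__))   # index of the largest ordinate
    print(daniell_smooth(raw, 3)[3:6])                 # peak spread over neighbors
```

Applying the filter more than once, or with a wider span, gives a smoother and more stable estimate but blurs nearby peaks together.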

**Detrending**

Trend in a time series is a slow, gradual change in some property of the series over the whole interval under investigation. Trend is sometimes loosely defined as a long term change in the mean (Figure 7.1), but can also refer to change in other statistical properties. For example, tree-ring series of measured ring width frequently have a trend in variance as well as mean (Figure 7.2). In traditional time series analysis, a time series was decomposed into trend, seasonal or periodic components, and irregular fluctuations, and the various parts were studied separately. Modern analysis techniques frequently treat the series without such routine decomposition, but separate consideration of trend is still often required. Detrending is the statistical or mathematical operation of removing trend from the series. Detrending is often applied to remove a feature thought to distort or obscure the relationships of interest. In climatology, for example, a temperature trend due to urban warming might obscure a relationship between cloudiness and air temperature. Detrending is also sometimes used as a preprocessing step to prepare time series for analysis by methods that assume stationarity. Many alternative methods are available for detrending. Simple linear trend in mean can be removed by subtracting a least-squares-fit straight line. More complicated trends might require different procedures. For example, the cubic smoothing spline is commonly used in dendrochronology to fit and remove ring-width trend that might not be linear, or not even monotonically increasing or decreasing over time. In studying and removing trend, it is important to understand the effect of detrending on the spectral properties of the time series. This effect can be summarized by the frequency response of the detrending function.

**Assignments**

**Read:** Notes_7.pdf

**Answer:** Run script geosa7.m and answer questions listed in the file a7.pdf

**What to Know**

- Definitions: frequency response, spline, cubic smoothing spline
- Pros and cons of ratio vs difference detrending
- Interpretation of terms in the equation for the "spline parameter"
- How to choose a spline interactively from desired frequency response
- How the spectrum is affected by detrending
- How to measure the importance of the trend component in a time series
- How to apply geosa7.m to interactively choose a spline detrending function and detrend a time series
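
The simplest detrending case mentioned above, removal of a least-squares straight line, is sketched below in Python, together with the difference and ratio forms of detrending. This is an illustration only: the course's scripts use a cubic smoothing spline rather than a straight line, the function names are hypothetical, and the ratio form assumes the fitted trend is positive everywhere.

```python
def fit_linear_trend(x):
    """Least-squares straight line through x against time t = 0, 1, ..., n-1."""
    n = len(x)
    t_mean = (n - 1) / 2
    x_mean = sum(x) / n
    slope = (sum((t - t_mean) * (v - x_mean) for t, v in enumerate(x))
             / sum((t - t_mean) ** 2 for t in range(n)))
    intercept = x_mean - slope * t_mean
    return [intercept + slope * t for t in range(n)]


def detrend_difference(x):
    """Departures from the fitted line (observed minus trend)."""
    trend = fit_linear_trend(x)
    return [v - g for v, g in zip(x, trend)]


def detrend_ratio(x):
    """Dimensionless index (observed divided by trend); requires a positive trend."""
    trend = fit_linear_trend(x)
    return [v / g for v, g in zip(x, trend)]


if __name__ == "__main__":
    widths = [2.1, 1.8, 1.9, 1.4, 1.5, 1.1]   # made-up declining ring widths
    print(detrend_difference(widths))
    print(detrend_ratio(widths))
```

Ratio detrending also scales the departures by the local level of the trend, which is why it tends to reduce a trend in variance that accompanies a trend in mean.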

**Filtering**

The estimated spectrum of a time series gives the distribution of variance as a function of frequency. Depending on the purpose of analysis, some frequencies may be of greater interest than others, and it may be helpful to reduce the amplitude of variations at other frequencies by statistically filtering them out before viewing and analyzing the series. For example, the high-frequency (year-to-year) variations in a gauged discharge record of a watershed may be relatively unimportant to water supply in a basin with large reservoirs that can store several years of mean annual runoff. Where low-frequency variations are of main interest, it is desirable to smooth the discharge record to eliminate or reduce the short-period fluctuations before using the discharge record to study the importance of climatic variations to water supply. Smoothing is a form of filtering which produces a time series in which the importance of the spectral components at high frequencies is reduced. Electrical engineers call this type of filter a low-pass filter, because the low-frequency variations are allowed to *pass through* the filter. In a low-pass filter, the low frequency (long-period) waves are barely affected by the smoothing. It is also possible to filter a series such that the low-frequency variations are reduced and the high-frequency variations unaffected. This type of filter is called a high-pass filter. Detrending is a form of high-pass filtering: the fitted trend line tracks the lowest frequencies, and the residuals from the trend line have had those low frequencies removed. A third type of filtering, called band-pass filtering, reduces or filters out both high and low frequencies, and leaves some intermediate frequency band relatively unaffected. In this lesson, we cover several methods of smoothing, or low-pass filtering. We have already discussed how the cubic smoothing spline might be useful for this purpose. Four other types of filters are discussed here: 1) simple moving average, 2) binomial, 3) Gaussian, and 4) windowing (Hamming method). Considerations in choosing a type of low-pass filter are the desired frequency response and the span, or width, of the filter.

**Assignments**

**Read:** Notes_8.pdf

**Answer:** Run script geosa8.m and answer questions listed in the file a8.pdf

**What to Know**

- Definitions: filter, filter weights, filter span, low-pass filter, high-pass filter, band-pass filter; frequency response of a filter
- How the Gaussian filter is related to the Gaussian distribution
- How to build a simple binomial filter manually (without the computer)
- How to describe the frequency response function in terms of a system with sinusoidal input and output
- How to apply geosa8.m to interactively design a Gaussian, binomial or Hamming-window lowpass filter for a time series
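
A binomial low-pass filter and its frequency response can be sketched in Python as below. This is a minimal illustration, not the geosa8.m design tool: the function names are hypothetical, the filter is applied only where the full span fits (so the ends of the series are lost), and the response formula assumes symmetric weights.

```python
import math


def binomial_weights(span):
    """Weights from row span-1 of Pascal's triangle, scaled to sum to 1 (odd span)."""
    order = span - 1
    return [math.comb(order, k) / 2 ** order for k in range(span)]


def apply_filter(x, weights):
    """Centered weighted moving average; output is shorter than x by span - 1."""
    span = len(weights)
    return [sum(w * x[t + k] for k, w in enumerate(weights))
            for t in range(len(x) - span + 1)]


def frequency_response(weights, f):
    """Amplitude response of a symmetric filter at frequency f (cycles per time step)."""
    mid = (len(weights) - 1) // 2
    return sum(w * math.cos(2 * math.pi * f * (k - mid)) for k, w in enumerate(weights))


if __name__ == "__main__":
    w = binomial_weights(3)                      # the 1-2-1 filter divided by 4
    print(w)
    print(apply_filter([0, 0, 4, 0, 0], w))      # a single spike is spread out
    print(frequency_response(w, 0.0), frequency_response(w, 0.5))
```

The response is 1 at zero frequency and falls to 0 at the highest frequency (0.5 cycles per time step) for the 3-weight binomial filter, which is the low-pass behavior described above.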

**Correlation**

The Pearson product-moment correlation coefficient is probably the single most widely used statistic for summarizing the relationship between two variables. Statistical significance and caveats of interpretation of the correlation coefficient as applied to time series are topics of this lesson. Under certain assumptions, the statistical significance of a correlation coefficient depends on just the sample size, defined as the number of independent observations. If time series are autocorrelated, an *effective* sample size, lower than the actual sample size, should be used when evaluating significance. Transient or spurious relationships can yield significant correlation for some periods and not for others. The time variation of strength of linear correlation can be examined with plots of correlation computed for a sliding window. But if many correlation coefficients are evaluated simultaneously, confidence intervals should be adjusted (*Bonferroni adjustment*) to compensate for the increased likelihood of observing some high correlations where no relationship exists. Interpretation of sliding correlations can also be complicated by time variations of mean and variance of the series, as the sliding correlation reflects covariation in terms of standardized departures from means in the time window of interest, which may differ from the long-term means. Finally, it should be emphasized that the Pearson correlation coefficient measures strength of linear relationship. Scatterplots are useful for checking whether the relationship is linear.

**Assignments**

**Read:** Notes_9.pdf

**Answer:** Run script geosa9.m and answer questions listed in the file a9.pdf

**What to Know**

- Mathematical definition of the correlation coefficient
- Assumptions and hypothesis for significance testing of correlation coefficient
- How to compute significance level of correlation coefficient and to adjust the significance level for autocorrelation in the individual time series
- Caveats to interpretation of correlation coefficient
- Bonferroni adjustment to significance level of correlation under multiple comparisons
- Inflation of variance of estimated correlation coefficient when time series are autocorrelated
- Possible effects of data transformation on correlation
- How to interpret plots of sliding correlations
- How to apply geosa9.m to analyze correlations and sliding correlations between pairs of time series
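
The pieces of this lesson (correlation coefficient, effective sample size, and Bonferroni adjustment) can be sketched in Python as below. The sketch is illustrative only and not the geosa9.m code: function names are hypothetical, the adjustment N' = N(1 - r1x*r1y)/(1 + r1x*r1y) is one commonly used formula assumed here, and the t statistic is computed without looking up its p-value.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def effective_n(n, r1x, r1y):
    """Sample size adjusted for the lag-1 autocorrelations of both series."""
    return n * (1 - r1x * r1y) / (1 + r1x * r1y)


def t_statistic(r, n_eff):
    """t statistic for testing zero correlation with n_eff - 2 degrees of freedom."""
    return r * math.sqrt((n_eff - 2) / (1 - r * r))


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level when n_tests correlations are examined together."""
    return alpha / n_tests


if __name__ == "__main__":
    r = 0.45                                   # made-up correlation
    n_eff = effective_n(100, 0.5, 0.5)
    print(n_eff, t_statistic(r, n_eff))
    print(bonferroni_alpha(0.05, 10))
```

Because the effective sample size is smaller than the actual one when both series are positively autocorrelated, the same correlation yields a smaller t statistic and is harder to call significant.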

**Lagged correlation**

Lagged relationships are characteristic of many natural physical systems. Lagged correlation refers to the correlation between two time series shifted in time relative to one another. Lagged correlation is important in studying the relationship between time series for two reasons. First, one series may have a delayed response to the other series, or perhaps a delayed response to a common stimulus that affects both series. Second, the response of one series to the other series or an outside stimulus may be *smeared* in time, such that a stimulus restricted to one observation elicits a response at multiple observations. For example, because of storage in reservoirs, glaciers, etc., the volume discharge of a river in one year may depend on precipitation in the several preceding years. Or because of changes in crown density and photosynthate storage, the width of a tree-ring in one year may depend on climate of several preceding years. The simple correlation coefficient between the two series properly aligned in time is inadequate to characterize the relationship in such situations. Useful functions we will examine as alternative to the simple correlation coefficient are the cross-correlation function and the impulse response function. The cross-correlation function is the correlation between the series shifted against one another as a function of number of observations of the offset. If the individual series are autocorrelated, the estimated cross-correlation function may be distorted and misleading as a measure of the lagged relationship. We will look at two approaches to clarifying the pattern of cross-correlations. One is to individually remove the persistence from, or prewhiten, the series before cross-correlation estimation. In this approach, the two series are essentially regarded on *equal footing*. An alternative is the *systems* approach: view the series as a dynamic linear system -- one series the input and the other the output -- and estimate the impulse response function. The impulse response function is the response of the output at current and future times to a hypothetical *pulse* of input restricted to the current time.

**Assignments**

**Read:** Notes_10.pdf

**Answer:** Run script geosa10.m and answer questions listed in the file a10.pdf

**What to Know**

- Definitions: cross-covariance function, cross-correlation function, impulse response function, lagged correlation, causal, linear
- How autocorrelation can distort the pattern of cross-correlations and how prewhitening is used to clarify the pattern
- The distinction between the 'equal footing' and 'systems' approaches to lagged bivariate relationships
- The types of situations for which the impulse response function (irf) is an appropriate tool
- How to represent the causal system treated by the irf in a flow diagram
- How to apply geosa10.m to analyze the lagged cross-correlation structure of a pair of time series
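
A sample cross-correlation function can be sketched in Python as below. This is an illustration only, not the geosa10.m code: the function name is hypothetical, no prewhitening is applied, and the sign convention (a positive lag k pairs x at time t with y at time t + k, so a peak at positive k means y responds after x) is an assumption of this example.

```python
import math


def cross_correlation(x, y, max_lag):
    """Cross-correlation r_xy(k) for k = -max_lag..max_lag, returned as a dict.

    r_xy(k) = sum_t (x_t - mean_x)(y_{t+k} - mean_y)
              / sqrt(sum_t (x_t - mean_x)^2 * sum_t (y_t - mean_y)^2)
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    denom = math.sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))
    ccf = {}
    for k in range(-max_lag, max_lag + 1):
        valid_t = range(max(0, -k), min(n, n - k))   # keep t and t + k inside the series
        ccf[k] = sum(dx[t] * dy[t + k] for t in valid_t) / denom
    return ccf


if __name__ == "__main__":
    rain = [0, 1, 0, 0, 0, 0]     # made-up input with a single pulse
    flow = [0, 0, 1, 0, 0, 0]     # made-up output responding one step later
    ccf = cross_correlation(rain, flow, 2)
    print(max(ccf, key=ccf.get))  # lag with the largest cross-correlation
```

With autocorrelated series the same function would typically be applied to prewhitened residuals rather than to the original series.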

**Multiple linear regression**

Multiple linear regression (MLR) is a method used to model the linear relationship between a dependent variable and one or more independent variables. The dependent variable is sometimes also called the predictand, and the independent variables the predictors. MLR is based on least squares: the model is fit such that the sum-of-squares of differences of observed and predicted values is minimized. MLR is probably the most widely used method in dendroclimatology for developing models to reconstruct climate variables from tree-ring series. Typically, a climatic variable is defined as the predictand and tree-ring variables from one or more sites are defined as predictors. The model is fit to a period -- the calibration period -- for which climatic and tree-ring data overlap. In the process of fitting, or estimating, the model, statistics are computed that summarize the accuracy of the regression model for the calibration period. The performance of the model on data not used to fit the model is usually checked in some way by a process called validation. Finally, tree-ring data from before the calibration period are substituted into the prediction equation to get a reconstruction of the predictand. The reconstruction is a "prediction" in the sense that the regression model is applied to generate estimates of the predictand variable outside the period used to fit the data. The uncertainty in the reconstruction is summarized by confidence intervals, which can be computed in various alternative ways.

**Assignments**

**Read:** Notes_11.pdf

**Answer:** Run script geosa11.m (Part 1) and answer questions listed in the file a11.pdf

**What to Know**

- The equation for the MLR model
- Assumptions for the MLR model
- Definitions of MLR statistics: coefficient of determination, sums-of-squares terms, overall-F for the regression equation, standard error of the estimate, adjusted R-squared, pool of potential predictors
- The steps in an analysis of residuals
- How to apply geosa11.m (part 1) to fit an MLR model to predict one variable from a set of several predictor variables
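
A least-squares MLR fit and two of the calibration statistics listed above can be sketched in plain Python as follows. This is an illustration only, not the geosa11.m code: the function names are hypothetical, and the coefficients are obtained by solving the normal equations with Gaussian elimination, which is simple to read but less numerically stable than the decompositions used by statistical software.

```python
def fit_mlr(X, y):
    """Least-squares coefficients [b0, b1, ..., bk] for y = b0 + b1*x1 + ... + bk*xk.

    X is a list of rows, one row of k predictor values per observation.
    """
    rows = [[1.0] + list(r) for r in X]            # prepend the intercept column
    p = len(rows[0])
    # Augmented normal equations [X'X | X'y]
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    for col in range(p):                            # forward elimination, partial pivoting
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, p):
            factor = M[r][col] / M[col][col]
            for j in range(col, p + 1):
                M[r][j] -= factor * M[col][j]
    b = [0.0] * p
    for i in reversed(range(p)):                    # back substitution
        b[i] = (M[i][p] - sum(M[i][j] * b[j] for j in range(i + 1, p))) / M[i][i]
    return b


def r_squared(X, y, b):
    """Return (R-squared, adjusted R-squared) for coefficients b."""
    pred = [b[0] + sum(bj * xj for bj, xj in zip(b[1:], row)) for row in X]
    y_mean = sum(y) / len(y)
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    sst = sum((yi - y_mean) ** 2 for yi in y)
    r2 = 1 - sse / sst
    n, k = len(y), len(b) - 1
    return r2, 1 - (1 - r2) * (n - 1) / (n - k - 1)


if __name__ == "__main__":
    predictors = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]   # made-up predictor values
    predictand = [1.2, 2.9, 4.1, 6.0, 7.8]                  # made-up predictand
    coef = fit_mlr(predictors, predictand)
    print(coef, r_squared(predictors, predictand, coef))
```

Adjusted R-squared penalizes the number of predictors, so it can fall when a predictor that adds little is included.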

**Validating the regression model**

Regression R-squared, even if adjusted for loss of degrees of freedom due to the number of predictors in the model, can give a misleading, overly optimistic view of accuracy of prediction when the model is applied outside the calibration period. Application outside the calibration period is the rule rather than the exception in dendroclimatology. The calibration-period statistics are typically biased because the model is "tuned" for maximum agreement in the calibration period. Sometimes too large a pool of potential predictors is used in automated procedures to select final predictors. Another possible problem is that the calibration period itself may be anomalous in terms of the relationships between the variables: modeled relationships may hold up for some periods of time but not for others. It is advisable therefore to "validate" the regression model by testing the model on data not used to fit the model. Several approaches to validation are available. Among these are cross-validation and split-sample validation. In cross-validation, a series of regression models is fit, each time deleting a different observation from the calibration set and using the model to predict the predictand for the deleted observation. The merged series of predictions for deleted observations is then checked for accuracy against the observed data. In split-sample validation, the model is fit to some portion of the data (say, the second half), and accuracy is measured on the predictions for the other half of the data. The calibration and validation periods are then exchanged and the process repeated. In any regression problem it is also important to keep in mind that modeled relationships may not be valid for periods when the predictors are outside their ranges for the calibration period: the multivariate distribution of the predictors for some observations outside the calibration period may have no analog in the calibration period. The distinction of predictions as extrapolations versus interpolations is useful in flagging such occurrences.

**Assignments**

**Read:** Notes_12.pdf

**Answer:** Run script geosa11.m (Part 2) and answer questions listed in the file a12.pdf

**What to Know**

- Definitions: validation, cross-validation, split-sample validation, mean square error (MSE), root-mean-square error (RMSE); standard error of prediction, PRESS statistic, "hat" matrix, extrapolation vs interpolation
- Advantages of cross-validation over alternative validation methods
- How to apply geosa11.m (part 2) for cross-validated MLR modeling of the relationship between a predictand and predictors, including generation of a reconstruction and confidence bands
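
Leave-one-out cross-validation as described above can be sketched in Python for the simplest case of a single predictor. This is an illustration only, not the geosa11.m code: the function names are hypothetical, the model is refit from scratch for each deleted observation rather than using the "hat" matrix shortcut, and the PRESS statistic is taken here as the sum of squared deleted-observation errors.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def leave_one_out(x, y):
    """Return (errors, PRESS, validation RMSE) from leave-one-out cross-validation."""
    errors = []
    for i in range(len(x)):
        x_cal = x[:i] + x[i + 1:]        # calibration set without observation i
        y_cal = y[:i] + y[i + 1:]
        b0, b1 = fit_line(x_cal, y_cal)
        errors.append(y[i] - (b0 + b1 * x[i]))   # error for the deleted observation
    press = sum(e * e for e in errors)
    return errors, press, math.sqrt(press / len(errors))


if __name__ == "__main__":
    tree_ring = [0.8, 1.1, 0.9, 1.3, 1.0, 1.2]     # made-up predictor values
    climate = [10.0, 14.5, 11.0, 16.0, 12.5, 15.0]  # made-up predictand values
    errors, press, rmse_v = leave_one_out(tree_ring, climate)
    print(press, rmse_v)
```

The validation RMSE is typically larger than the calibration-period RMSE, which is the optimism that validation is meant to expose.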

*Powerpoint lecture outlines & miscellaneous files.* Downloadable file other_Stale.zip has miscellaneous files used in lectures from the previous offering of the course. Included are Matlab demo scripts, sample data files, user-written functions used by demo scripts, and powerpoint presentations, as pdfs (lect1a.pdf, lect1b.pdf, etc.) used in on-campus lectures. Students taking the course this semester should not use other_Stale.zip, but instead get the file "other.zip" from D2L contents. I update other.zip over the semester, and add the presentation for the current lecture within a couple of days after that lecture is given. File other.zip for this semester does not exist till after the first lecture, and then is augmented after each lecture. At the end of the semester I revise the online-available other_Stale.zip.


I am happy to share my notes, and anyone not taking the course is welcome to download them and modify them for their own purposes. No attribution or acknowledgement of source is requested. Enjoy! Click on the zip file that contains the pdfs of notes for each lecture from the previous semester I taught the course.
