Task: What is the linear relationship between the two time series in the C7.dat data file? The first column is time (days), while the second and third columns are measurements of two different quantities at the same location. Write a script to read this file and calculate the correlation between the two measurements. Plot the two time series on one graph. Place the correlation value somewhere on the figure.
A cross-correlation coefficient measures the relationship between two sets of numbers. Traditionally, this is applied to two time series of measurements of similar quantities at two locations, or to measurements of different quantities at the same location. In all cases the measurements must be simultaneous; that is, the first measurement in each series must be taken at the same time, and so forth.
The correlation calculation is done by subtracting the mean from each of the two time series and then calculating the average of the product of the corresponding values in the two demeaned time series. This average is divided by the product of the standard deviations of the two time series, which scales the coefficient to lie between -1 and +1.
If both time series increase above the mean (and decrease below) at the same times, then the coefficient will be positive, meaning that the variations are in the same direction (compared to the mean) at the same times. The coefficient is negative if the variations are in opposite directions.
Since the coefficient is scaled by the standard deviations, the overall amplitude of each time series is not relevant; only the pattern of the variations about the mean (their direction and relative size) matters.
Another way to think of this is that each of the time series is scaled to have zero mean and a variance of one. Then the correlation is the average of the product of the corresponding values of the two scaled time series.
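The calculation described above can be sketched by hand with base functions. The data values below are made up for illustration (they are not from C7.dat), and corrcoef() is used only as an independent cross-check on the manual result.

```matlab
% Synthetic example data (assumed for illustration; not from C7.dat)
A = [2; 4; 5; 7; 9; 12];
B = [1; 3; 2; 6; 8; 9];

% Remove the mean from each series
Ad = A - mean(A);
Bd = B - mean(B);

% Scale each demeaned series to unit variance.
% std(x,1) normalizes by N, matching the plain average used below.
As = Ad / std(A,1);
Bs = Bd / std(B,1);

% The correlation is the average product of the scaled series
r_manual = mean(As .* Bs);

% Cross-check against corrcoef, which returns a 2x2 matrix
R = corrcoef(A,B);
r_builtin = R(1,2);

fprintf('manual: %.4f  built-in: %.4f\n', r_manual, r_builtin);
```

The two printed values should agree to within rounding error, since both follow the same definition.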
The function corr() calculates the correlation coefficient from input arrays, which can be provided in two ways. If there are two individual time series of the same length, stored as column vectors, then the command
Cab=corr(A,B);
will return the correlation (a single number) between the two time series. The function removes the means and scales the input arrays itself, so no preprocessing is needed.
A second method requires that the time series be provided as a matrix, where each column is one of the time series. The corr() function calculates the correlation between all possible pairs of columns of the input matrix.
If there are three time series, then
% A, B, C are column vectors of data of the same length.
M=[A B C]; % combine data into a matrix
Cxy=corr(M); % Cxy is a 3x3 matrix of correlation coefficients
The function calculates all possible correlation coefficients: A vs A, A vs B, A vs C, B vs A, etc. The diagonal entries are correlations of the vectors with themselves, which are always 1. The matrix is symmetric because the correlation of A vs B, Cxy(1,2), is the same as that of B vs A, Cxy(2,1).
The corr() function will also report a test of significance for the correlation through a second output variable. This value is from a Student's t-test, which is valid if the input series have normally distributed variations about the mean.
[cxy p]=corr(M);
The help page says that if p < 0.05, then the correlation is likely significant.
If there is a trend in the time series, then the correlation coefficient can be large, even if the smaller variations are unrelated. In this case, it would be useful to remove the trend before calculating the correlation, depending on your interest in the relationship between slow or fast variations in the time series.
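As a minimal sketch of this effect, the synthetic series below (assumed purely for illustration) share a linear trend but have unrelated oscillations. The base function detrend() removes the best-fit straight line from each series before the correlation is recalculated.

```matlab
% Synthetic data (assumed for illustration): a shared linear trend plus
% unrelated small oscillations of different periods
t = (0:99)';
A = 0.5*t + sin(2*pi*t/10);
B = 0.5*t + cos(2*pi*t/7);

% Correlation of the raw series is dominated by the common trend
R_raw = corrcoef(A,B);

% detrend removes the best-fit straight line from each series
Ad = detrend(A);
Bd = detrend(B);

% Correlation of the remaining (fast) variations
R_det = corrcoef(Ad,Bd);

fprintf('raw: %.3f  detrended: %.3f\n', R_raw(1,2), R_det(1,2));
```

The raw correlation is close to 1 because of the trend alone, while the detrended value is much smaller in magnitude.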
A second method of correlation (Spearman's or ranked) avoids the difficulty of data with large ranges of values or with outliers. This correlation uses the rank of the data (the order that the values appear in a list if the data is sorted by size). This method reduces problems with non-normal distributions of variations. This method is invoked with
[cxy p]=corr(M,'type','Spearman'); % rank correlation
See help corr for more information.
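To illustrate how an outlier affects the two methods, the sketch below uses made-up data in which y increases steadily with x except for one very large final value. It assumes corr() is available (it is part of the Statistics and Machine Learning Toolbox).

```matlab
% Synthetic data (assumed for illustration): monotonic, with one outlier
x = (1:10)';
y = x;
y(10) = 1000;   % a single extreme value

% Standard (Pearson) correlation is pulled down by the outlier
rp = corr(x,y);

% Rank (Spearman) correlation depends only on the ordering of the values
rs = corr(x,y,'type','Spearman');

fprintf('Pearson: %.3f  Spearman: %.3f\n', rp, rs);
```

Because y still increases whenever x increases, the ranks of the two series are identical and the rank correlation is 1, while the standard coefficient is noticeably lower.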
The correlation coefficient estimates a linear relationship between two time series of the form Y = a X + b. That is, how well can we reconstruct Y directly from X?
The curve fitting routines from before are useful to do the following:
p=polyfit(X,Y,1); % linear fit
Yf=polyval(p,X); % reconstruct Y from the fit
% compare original and reconstructed series (t is the time vector)
plot(t,Y,'r',t,Yf,'k');
title('Original (r) Reconstructed (k)')
The slope of the fitted line (p(1)) equals the correlation coefficient multiplied by std(Y)/std(X), so it is close to the correlation coefficient only when the two series have similar standard deviations (for example, after both are scaled to unit variance). The reconstructed curve is similar to the original, but has differences. (Give it a try!)
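The slope of a least-squares line and the correlation coefficient are linked through the standard deviations of the two series. A short check, using made-up X and Y values (not from the data file), might look like this:

```matlab
% Synthetic data (assumed for illustration)
X = [1; 2; 3; 4; 5; 6; 7; 8];
Y = [2.1; 3.9; 6.2; 7.8; 10.1; 12.2; 13.8; 16.1];

p = polyfit(X,Y,1);   % p(1) is the slope, p(2) is the intercept
R = corrcoef(X,Y);
r = R(1,2);

% Slope predicted from the correlation and the two standard deviations
slope_from_r = r * std(Y) / std(X);

fprintf('slope: %.4f  r*std(Y)/std(X): %.4f  r: %.4f\n', ...
        p(1), slope_from_r, r);
```

The first two printed numbers should match, while r itself differs from the slope whenever std(Y) and std(X) are not equal.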
The flow chart for the task is
%%% read the data file
%%% separate the variables
%%% calculate the correlation coefficient
%%% plot the two time series
%%% add title and labels
%%% put the correlation coefficient on the figure
The script to solve the task is
%%% read the data file
data=load('data/C7.dat');
%%% separate the variables
time=data(:,1);
A=data(:,2);
B=data(:,3);
%%% calculate the correlation coefficient
C=corr(A,B);
%%% plot the two time series
figure
plot(time,A,'r',time,B,'k')
%%% add title and labels
title('Time series A (r) and B (k)')
xlabel('time (days)')
ylabel('values')
%%% put the correlation coefficient on the figure
text(5,.8,['correlation is ' num2str(C)])
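The text() call in the script places the label at data coordinates (5, 0.8), which may fall outside the visible axes for other data sets. One possible alternative, sketched here with synthetic stand-ins for the C7.dat columns (assumed for illustration), is to position the label in normalized axes units so that it does not depend on the data range.

```matlab
% Synthetic stand-ins for the C7.dat columns (assumed for illustration)
time = (0:49)';
A = sin(2*pi*time/25);
B = 0.8*sin(2*pi*time/25) + 0.1*cos(2*pi*time/5);

% Correlation from base MATLAB; corrcoef returns a 2x2 matrix
R = corrcoef(A,B);
C = R(1,2);
label = sprintf('correlation = %.2f', C);

figure
plot(time,A,'r',time,B,'k')
xlabel('time (days)')
ylabel('values')
legend('A','B')

% Normalized units: (0,0) is the lower left of the axes, (1,1) the upper right
text(0.05,0.95,label,'Units','normalized','VerticalAlignment','top')
```

With normalized units the label stays in the upper left corner of the axes regardless of the time span or the size of the measurements.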