CONTINUOUS DATA ANALYSIS

Initial Approaches to Continuous Data: Means, Univariate, Boxplot

Review I

Normality

Central Limit Theorem

Statistics

difference between \(\sigma\) and s.e.

Useful percentages for the Normal distribution

PROC MEANS (review)

PROC MEANS data =  <options>;
VAR  variables;
RUN;
/* for example...*/
proc means data= your.data maxdec=4 n mean median std var q1 q3 clm stderr;
  var num_var;
  run;

Review II

skewness

kurtosis

quantiles

probability plots

PROC UNIVARIATE

PROC UNIVARIATE data = ;
       VAR num_var;          /* if no VAR statement, all numeric variables are analyzed */
       ID var;               /* optional: identifies the 5 lowest and 5 highest obs */
       HISTOGRAM num_var;
       PROBPLOT  num_var;
                goptions reset=all fontres=presentation ftext=swissb htext=1.5;
         /* options for the graph */
RUN;

/* e.g. */

proc univariate data = your.data mu0=n;
       var num_var;
        id idnumber;
        probplot num_var / normal (mu=est  sigma= est color=blue w=1);
        title;
run;

PROC BOXPLOT

PROC BOXPLOT Data = ;
PLOT analysis_var*group_var / options;
/* NB- data must be sorted by grouping var*/
RUN;


data work;
set work;
dummy = '1';      /* creating dummy variable to do boxplot on just one group */
run;


proc boxplot data = work;
       plot cont_var*dummy / boxstyle=schematic  cboxes=black;
run;


symbol color = salmon; 
   title 'Nice Format Boxplot';
   proc boxplot data=work;
plot cont_var*cat_var / cframe   = vligb 
                       cboxes   = dagr
                       cboxfill = ywh;
run;

ANOVA AND LINEAR REGRESSION

ANOVA Using  PROC GLM

Review

ANOVA

Sums of Squares

“Classic” ANOVA

ANOVA Table

assumptions

are your data consistent with assumptions?

proc univariate data = your.data;
     class cat_var;
     var cont_var;
     probplot cont_var / normal  (mu=est sigma=est color=blue w=1);
     title 'univariate analysis by categorical variable';
run;



proc sort data=your.data out=sorted_data; /* need to sort data by your categorical variable */
    by cat_var;
run;

proc boxplot data=sorted_data;
       plot cont_var*cat_var  / cboxes=black boxstyle=schematic;
run;

Using error terms to test assumptions

another view of ANOVA

error terms

PROC GLM 

PROC GLM  data =    ;
      CLASS  grouping variables;
      MODEL  response variable = predictors;  /* specifies the model */
      MEANS  effects;                         /* computes means for each group */
      LSMEANS effects;                        /* to test specific group differences */
      OUTPUT out= ;                           /* creates output data set of residuals */
RUN;
QUIT;  /* note: like GPLOT, GLM remains invoked until you end it with QUIT */


/* SYNTAX TEMPLATE */
options ls=75 ps=45;  /* line size and page size options */
PROC GLM data = your.data;
       CLASS cat_var;
       MODEL num_var=cat_var;
       MEANS cat_var / hovtest;
       OUTPUT out=check r=resid   p=pred;  /* creates an output data set
called 'check'; 'r' is the keyword SAS recognizes for residuals and
'p' the keyword for predicted values */
title 'testing for equality of means with GLM';
RUN;
QUIT;



/* now run gplot on the ‘check’ dataset created above*/

PROC GPLOT data=check;
         PLOT resid*pred / haxis=axis1 vaxis=axis2 vref=0; /* can leave out the options if ok with defaults */
         axis1 w=2 major=(w=2) minor=none offset=(10pct);
         axis2 w=2 major=(w=2) minor=none;
         TITLE 'plot residuals vs predicted values for cereal';
RUN;
QUIT;

/* now run proc univariate to get histogram, normal plot, kurtosis and skewness on residuals*/

PROC UNIVARIATE data = check normal;
      VAR resid;
      HISTOGRAM / normal;
      PROBPLOT / normal(mu=est sigma=est color=blue w=1);
      TITLE;
RUN;

GLM ANOVA Output

Finding the Group(s) that differ(s)

Caution: experiment-wise vs. comparison-wise error rate
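
A rough worked illustration of why the distinction matters. It assumes, as a simplification, that the \(c\) pairwise comparisons are independent and that each is tested at the comparison-wise level \(\alpha\); the symbols \(c\) and EER (experiment-wise error rate) are introduced here only for illustration:

\[ \text{EER} = 1 - (1 - \alpha)^c \]

With 3 groups there are \(c = 3\) pairwise comparisons, so at \(\alpha = 0.05\) the chance of at least one false positive is \(1 - 0.95^3 \approx 0.143\) rather than 0.05. Real pairwise comparisons share data and are not independent, so this is only an approximation; the Bonferroni bound \(c \alpha = 0.15\) holds regardless.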

Requesting GLM Multiple Comparisons

PROC GLM data=your.data;
   CLASS cat_var;
   MODEL cont_var=cat_var;
   LSMEANS cat_var / pdiff=all adjust=tukey; /*adjust=bon*/
   /* pdiff=all requests all pairwise p values */  
TITLE 'Data: Multiple Comparisons';
RUN;
QUIT;

Accounting for More than 1 Categorical Variable: n-Way ANOVA and Interaction Effects

Example: do different levels of a drug have differing effect on different diseases?

\[ y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha \beta)_{ij} + \epsilon_{ijk} \]

where

\(y_{ijk}\) is the observed blood pressure for each subject
\(\mu\) is the overall population mean blood pressure
\(\alpha_i\) is the effect of disease \(i\)
\(\beta_j\) is the effect of drug dose \(j\)
\((\alpha \beta)_{ij}\) is the interaction term between disease \(i\) and dose \(j\)
\(\epsilon_{ijk}\) is the residual or error term for each subject

Syntax for n-Way ANOVA

/* Pre-ANOVA PROC MEANS */

proc means data=sasuser.b_drug mean var std;
   class disease drug; 
   var bloodp;
   title 'Selected Descriptive Statistics for drug-disease combinations';
run;


/* Means Plot To Illustrate Results*/

proc gplot data=sasuser.b_drug;
   symbol c=blue w=2 interpol=std1mtj line=1; /* interpolation method gives s.e. bars */
   symbol2 c=green w=2 interpol=std1mtj line=2;
   symbol3 c=red w=2 interpol=std1mtj line=3;
   plot bloodp*drug=disease; /* vertical by horizontal */
   title 'Illustrating the Interaction Between Disease and Drug';
run;
quit;

/*Syntax for Interaction Term*/

proc glm data=sasuser.b_drug;
   class disease drug;
   model bloodp=disease drug disease*drug; /* note interaction term */
   title 'Analyze the Effects of Drug and Disease';
   title2 'Including Interaction';
run;
quit;



/* LSMEANS Syntax with Interaction */
/* must look at all combinations of drug and disease*/

proc glm data=sasuser.b_drug;
   class disease drug;
   model bloodp=drug disease drug*disease;
   lsmeans disease*drug / adjust=tukey pdiff=all; /* note looking at combinations */
   title 'Multiple Comparisons Tests for Drug and Disease';
run;
quit;

A Look at Interaction in Epidemiology

(See DiMaggio Chapter 11, section 6…)

CORRELATION with PROC CORR

Pearson correlation coefficient

\(r = \Sigma (x_i - \bar{x})(y_i - \bar{y}) / \sqrt{\Sigma(x_i - \bar{x})^2 \, \Sigma(y_i - \bar{y})^2}\)

(1) Non-Linear Relationship, (2) Influential Observation

PROC CORR Syntax

PROC CORR DATA=work rank;  /* 'rank' orders
                                correlations from high to low */
        VAR predictor1 predictor2 predictor3 predictor4;
        WITH outcome_var;
        TITLE 'correlation of outcome with predictors';
RUN;

/*to get a correlation matrix, 
  omit the ‘with’ statement */

SIMPLE LINEAR REGRESSION  with PROC REG

Regression

\[ y=\beta_0 + \beta_1 x + \epsilon \]

Where,

\(\beta_0\) is the intercept, the expected outcome when the predictor \(x\) is 0
\(\beta_1\) is the slope (the 'rise over the run'), or amount of change in the outcome \(y\) per unit change in the predictor \(x\)
\(\epsilon\) is the error term (defined as it was with ANOVA)

Method of Least Squares

\(\hat{\beta_0} = \bar{y} - \hat{\beta_1} \bar{x}\)

\(\hat{\beta_1} = \Sigma(x_i-\bar{x})(y_i-\bar{y}) / \Sigma(x_i-\bar{x})^2\)

Analysis of Variance Perspective

Total Variability (TSS) Distance from Data Points to Mean Line: \(\Sigma (y_i-\bar{y})^2\)
Explained Variability (SSM) Distance from Regression Line to Mean Line: \(\Sigma(\hat{y}-\bar{y})^2\)
Unexplained Variability (SSE) Distance from Data Points to Regression Line: \(\Sigma(y_i-\hat{y})^2\)

As in ANOVA, if it’s a good model, we expect more explained than unexplained variability, and the proportion of the total variability explained by our regression model, \(R^2 = SSM / TSS\), to be closer to 1.

Assumptions

Residuals

PROC REG

PROC REG DATA=work;
          MODEL cont_outcome_var = cont_pred_var;
          TITLE 'simple linear regression';
RUN;
QUIT; /* need to quit out of the procedure */

Other SAS procedures will fit regressions, but PROC REG is convenient and comes with a rich set of tools

At its most basic: a call to PROC REG and a MODEL statement

Proc Reg Output

Producing predicted values:

  1. create a temporary data set with the values you want predictions for
  2. append those values to the data set you used to create the model
  3. calculate the values by specifying the 'p' (for 'predict') option in the model statement
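
The three steps above can be sketched as follows. This is a minimal sketch rather than a definitive recipe: the data set names need_predictions and predict and the values 10, 20 and 30 are hypothetical, while work, cont_outcome_var and cont_pred_var are the placeholder names from the PROC REG template.

```sas
/* 1. temporary data set holding the predictor values of interest
      (the values 10, 20, 30 are purely illustrative) */
data need_predictions;
   input cont_pred_var @@;
   datalines;
10 20 30
;
run;

/* 2. append those rows to the data used to build the model; the outcome
      is missing for the new rows, so they do not influence the fit */
data predict;
   set work need_predictions;
run;

/* 3. the P option prints predicted values for every row,
      including the appended ones */
proc reg data=predict;
   model cont_outcome_var = cont_pred_var / p;
   id cont_pred_var;
run;
quit;
```

The ID statement is optional; it simply labels each row of the predicted-values listing with the predictor value it belongs to.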

Multiple Regression in PROC REG

Dummy Variables

REGRESSION DIAGNOSTICS

Common Problems in Linear Regression

Residual Plots

Outliers and Influential Values 

Studentized Residuals

Jackknife Residual (RSTUDENT)

Cook’s D Statistic

Requesting Residuals Plots in SAS

plot r.*p.;
/* SAS recognizes r. and p. as referring to residuals and predicted values */

plot student.*obs.;
/* SAS recognizes student. as the studentized residual and obs. as the observation number */

plot student.*nqq.;
/* nqq. refers to normal quantile values; this plot is a normal probability plot */

/* example syntax*/

options ps=50 ls=97;

goptions reset=all fontres=presentation ftext=swissb htext=1.5;

 

PROC REG data=fitness;
MODEL oxygen_consumption = runtime age run_pulse maximum_pulse;
   PLOT r.*(p. runtime age run_pulse maximum_pulse); /* plot residuals vs. predicted values and each predictor */
   PLOT student.*obs. / vref=3 2 -2 -3     /* studentized residuals by obs. #, to ID observations */
                        haxis=0 to 32 by 1;
   PLOT student.*nqq.; /* nqq. gives a normal probability plot */
   symbol v=dot;
   TITLE 'Plots of Diagnostic Statistics';
RUN;
QUIT;

Requesting Outlier Statistics in SAS
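
The DATA step that follows reads a data set called ck4outliers containing the variables rstud and cooksd, but the notes do not show how it is created. One plausible way, offered as an assumption rather than the author's own code, is an OUTPUT statement in PROC REG using the fitness model from the diagnostics example:

```sas
/* Assumed step: the names ck4outliers, rstud and cooksd are taken from the
   DATA step that follows; the model is the fitness example */
PROC REG data=fitness;
   MODEL oxygen_consumption = runtime age run_pulse maximum_pulse;
   OUTPUT out=ck4outliers      /* output data set: input variables plus statistics */
          rstudent=rstud       /* jackknife (RSTUDENT) residuals */
          cookd=cooksd;        /* Cook's D statistic */
RUN;
QUIT;
```

The output data set keeps all the original variables, so an identification variable such as the one named in idvars remains available for printing.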

/* MACRO FOR OUTLIERS */

/*  set the values of these macro variables, */
/*  based on your data and model.            */
%let numparms=5;  /* # of predictor variables + 1 */ 
%let numobs=31;   /* # of observations */
%let idvars=name; /* relevant identification variable(s) */

data influential;
   set ck4outliers; 
   cutcookd=4/&numobs;

   rstud_i=(abs(rstud)>3);
   cookd_i=(cooksd>cutcookd);
   sum_i=rstud_i + cookd_i;
   if sum_i > 0;
run;

/* then print out the list of influential observations */

proc print data=influential;
   var sum_i &idvars cooksd rstud cutcookd 
       cookd_i rstud_i;
   title 'Observations that Exceed Suggested Cutoffs';
run;

How to handle influential observations:

Collinearity: Variance Inflation Factor
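
For predictor \(j\), the variance inflation factor is \(VIF_j = 1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on all the other predictors. A minimal sketch of requesting it in PROC REG, reusing the fitness model from the diagnostics example purely for illustration:

```sas
/* Sketch only: the VIF option on the MODEL statement adds a
   Variance Inflation column to the parameter estimates table */
PROC REG data=fitness;
   MODEL oxygen_consumption = runtime age run_pulse maximum_pulse / vif;
   TITLE 'Collinearity Diagnostics: Variance Inflation Factors';
RUN;
QUIT;
```

A commonly cited rule of thumb treats a VIF above 10 as a sign of problematic collinearity, though the cutoff is a convention rather than a formal test.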

About Model Building

SAS Automated Methods

Problems with automated approach

Sample Automated Selection Syntax

PROC REG data=work.data;

  Forward:  model outcome=pred1 pred2 pred3 / SELECTION=FORWARD SLENTRY=0.001;

  /* backward elimination is controlled by SLSTAY (level to stay) */
  Backward:  model outcome=pred1 pred2 pred3 / SELECTION=BACKWARD SLSTAY=0.001;

  Stepwise:  model outcome=pred1 pred2 pred3 / SELECTION=STEPWISE SLENTRY=0.001;
RUN;
QUIT;

Other (Better) Approaches to Model Building

Partial F-test

Using partial F-tests in SAS

model  y = a b c d;
TEST c=0, d=0;   /* joint (partial F) test that c and d add nothing to the model */

Introducing Logistic Regression

Why Logistic Regression

PROC MEANS data=lbw;
CLASS smoke;
VAR low;
RUN;

The Logistic Transformation

PROC LOGISTIC

PROC LOGISTIC DATA = lbw DESCENDING;
    MODEL low = smoke / RL;
    RUN; 
    QUIT;

why exponentiation results in odds ratios
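
A short derivation, sketched for the low birth weight model above with smoke coded 0/1 (the coding and the symbols \(p\), \(\beta_0\) and \(\beta_1\) are assumptions introduced for illustration). The model is linear on the log-odds scale:

\[ \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \, smoke \]

so the log odds are \(\beta_0 + \beta_1\) for smokers and \(\beta_0\) for non-smokers. Subtracting,

\[ \beta_1 = \ln(odds_{smoke=1}) - \ln(odds_{smoke=0}) = \ln\left(\frac{odds_{smoke=1}}{odds_{smoke=0}}\right) \]

and exponentiating both sides gives

\[ e^{\beta_1} = \frac{odds_{smoke=1}}{odds_{smoke=0}} = OR \]

A difference of log odds is the log of a ratio of odds, which is why exponentiating a coefficient yields an odds ratio.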

some options for proc logistic

PROC LOGISTIC DATA = work.data DESCENDING;
    CLASS categorical_var / param=ref ref=first;
    *CLASS categorical_var (ref="rich") / param=ref ref=first;
    MODEL dichot_outcome = categorical_var num_var / RL;
    UNITS num_var=SD;
RUN; QUIT;