Area Under the Curve

Calculating the Area Under the Curve is used in many different fields of research and departments here at the University of Guelph.  The example we will work through here, is one from Food Science.  A PDF version of the Powerpoint presentation used for this session can be viewed here.

The presentation explains the process of calculating the AUC (Area Under the Curve), so I will concentrate on the SAS coding here.

Step 1:  entering the data into SAS.  The sample data used in this session is the value of adhesiveness for a food sample, and it was measured at 6 time points.  The dataset is called adhesive, and has 2 variables:  X – which is the time point at which the measure was taken, and ADHESIVE – which is the value of adhesiveness at the matching X timepoint.

Proc print – was used to view the data and to ensure that it was entered correctly into SAS.

Step2:  Creates the plot.  It’s often easier to understand the process of any statistical analysis if you can visualize it.  In this example, we will plot the data to show you the individual areas that SAS will calculate the area of the trapezoid – as explained in the Powerpoint presentation.

SAS graphing features can be tricky and a lot of fun!  So let’s start with the Proc Gplot and work our way backwards.

Proc gplot data=adhesive;   Calling on the GPLOT procedure and using the dataset we created earlier called adhesive

plot   – now we are telling SAS what plots we want to create.  Notice that in this line we are telling SAS to create the plot of Adhesive*X  (Y by X) and we’re doing it twice.  NO, this is not a typo – we are telling SAS that we want to plot the data twice.  I’ll explain this a bit more in a moment.

/ overlay  frame cvref=blue vminor=w hminor=0;  These are all options for the plot.

overlay:  put the 2 plots on top of each other
frame:  put a frame or box around the entire graph
cvref=blue: create a horizontal reference line (by default this will be set to 0) and make it blue
vminor and hminor = 0:  We do not want minor ticks on either the vertical or horizontal axes

So, back to the 2 Adhesive*X plots.  Notice before the Proc GPLOT is called – we have two lines of code, one referring to symbol1 and a second referring to symbol2.  Each of these lines refers to one of the plots.  SYMBOL1 sets out all the characteristics for the first Adhesive*X plot – so we have:

v=none:  value = none – which means we do not want anything displayed for the datapoint
c=green: We want the colour associated with this first plot to be green
i = needle:  we want a needle or a straight vertical line from the x-axis to the datapoint – note that these will be green.

For SYMBOL2

v=dot:  We now want a dot at each datapoint for the second Adhesive*X plot
c=red: We want the dots to be red
i = join:  draw a straight line joining each of the datapoints.

Now take a few moments to review the resulting graph.  Can you find all the features listed above?

Step 3: Now the fun part – the calculation piece.

Remember that whenever we are working with the data in SAS, we need to work within a DATA Step.

Data adhesive (keep= X Adhesive TrapezoidAdhesive SumTrapezoidAdhesive);
   set adhesive;
   lagtime=lag(X);
   lagvalue=lag(Adhesive);
   if X = 0 then do;
      lagtime = 0;
      lagvalue = 0;
   end;
   TrapezoidAdhesive = (X – lagtime) * (Adhesive + Lagvalue)/2;
   SumTrapezoidAdhesive + TrapezoidAdhesive;
Run;

Data adhesive (keep= X Adhesive TrapezoidAdhesive SumTrapezoidAdhesive);  – Remember SAS likes us to save the dataset before we make any changes – so we are calling our revised dataset by the same name – adhesive – and we are only keeping the 4 variables listed in the keep part.

   set adhesive;  We already read the data in, so we are telling SAS to use a dataset called Adhesive

   lagtime=lag(X);
   lagvalue=lag(Adhesive);  In a previous SASsyFridays session we talked about the lag() function.  WE are creating a new variable called lagtime – which has the previous value of X that SAS has read in.  Same with lagvalue.

 if X = 0 then do;
      lagtime = 0;
      lagvalue = 0;
   end;    This piece of code only pertains to the first time we run through the program.  When we read the first value of X which would be 1, then we ask it for the one previous to that, which is 0, SAS will complain when we start doing calculations for all the variables, so we set out starting values of 0 for our lagtime and for our lagvalue.

   TrapezoidAdhesive = (X – lagtime) * (Adhesive + Lagvalue)/2;  Here is the calculation for the Trapezoid

   SumTrapezoidAdhesive + TrapezoidAdhesive;    Calculating the sum of all the parts.

Then we finish the DATA step with a Run; and we print off the results.  You will notice that one column shows you the area fore each trapezoid while the last column is the running total.

Macro

This is a really slick piece of coding, but what if you have several variables you want to use this code with.  Your options are to copy and paste as many times as you have variables, change the name of the variable and run over and over, or create a macro to do this for you.

We talked about creating macros for repetitive operations in the past blog post, but I wanted to show you how we did this for this coding as well.

/* This is the macro code – you run this once to ensure that the macro is stored
in the SAS session that you are running */

%macro Auc(dataset, out, FoodAttribute);   Name of the macro is AUC and it needs 3 pieces of information to run:  a dataset name, the name of the output dataset, and the name of the variable being tested.

symbol1 v=none c=green i=needle;
symbol2 v=dot c=red i=join;

title “Results for &FoodAttribute”;  the &FoodAttribute – is whatever value you provided when you called on the macro to run

Proc gplot data=&dataset;    same here you have &dataset – so this is the name of the dataset you listed when you called on the macro to run

plot &FoodAttribute*X &FoodAttribute*X / overlay
frame
cvref=blue
vminor= 0 hminor=0;
Run;
Quit;

Data &out (keep= X &FoodAttribute Trapezoid&FoodAttribute SumTrapezoid&FoodAttribute );
set &dataset;
lagtime = lag(X);
lagvalue = lag(&FoodAttribute);
if X = 0 then do;
lagtime = 0;
lagvalue = 0;
end;
trapezoid&FoodAttribute = (X – lagtime) *(&FoodAttribute + Lagvalue)/2;
SumTrapezoid&FoodAttribute + Trapezoid&FoodAttribute;
Run;

Proc print data=&out;
Run;
%mend;

Notice that all the code we talked about in this blog post is all listed inside the macro.  We close the macro with the %mend statement.

To run the macro we do the following:

%AUC(sensory, Adhesive_auc, Adhesive);  %AUC is telling SAS we are using the macro called AUC (which we just created and ran above),  Inside the brackets we match the pieces of information that it needs to run:  name of the incoming dataset (sensory), name of the output dataset (Adhesive_auc), and the name of the attribute or variable we are studying in this example.

We have several to run – so you can see the changes we make to this line of code that will run the macro.  To run it, simply highlight the line and click your friendly running dude.

%AUC(sensory, Cohesive_auc, Cohesive);
%AUC(sensory, Dense_auc, Dense);
%AUC(sensory, Grainy_auc, Grainy);

Conclusion

I would like to sincerely thank Madhu Sharma for bringing this question to me and for willing to put together the presentation for the SASsyFridays blog.

Screen Shot 2013-11-18 at 7.33.07 PM

Path Analysis

We talked about Path Analysis a couple of years ago, and today we’ll work through an example in SAS using PROC CALIS.  To date, we’ve had a couple of researchers using this PROC to review their data, which is very exciting!

Path analysis is a really cool analysis that I suspect may be underused.  Why?  folks are not aware of it, and it just sounds too far beyond the scope of some projects.  I am going to walk through the classic education example for this session/post, but at the end I will talk briefly about 2 research projects that are using Path analysis and briefly talk about their approaches, in an attempt to show you how this can be used in our agricultural fields.

Let’s start with our dataset.  We are working with a dataset that contains grades of 200 students, for writing, reading, math, and science.  We want to investigate the relationships among these variables.  We have heard anecdotally, that students who do well in MATH, will do well in SCIENCE.  With a path analysis, we can dig into this a bit.  The data we will be using was obtained from the UCLA Institute for Digital Research and Education.

Let’s review the powerpoint slides I created for this session and the annotated PDF output file.

Thank-you to the researchers that provided images of their path diagrams to use in this presentation.

Screen Shot 2013-11-18 at 7.33.07 PM

More than just Statistics – Can we Map in SAS?

PDF copy of the SAS syntax file used for this session.

We say SAS – we think statistics!  We think statistics – we say SAS!  We think maps and GIS – SAS is probably the last program you think to use.  There are so many other options out there, why would you use SAS?  Well…  believe it or not, but you can create basic maps with values in SAS.  Unfortunately, if you are using University Edition – SAS Studio – you do not have access to the PROCedures needed to create maps in SAS.

This blog post is based on a workshop delivered at the SUGI conference in 2004.  The data has been updated to use the 2016 and 2011 Canadian Census of Population.  Let’s get started and create 3 maps to demonstrate some of the options available in the PROC GMAP.

First thing that you need to know, is that SAS has a large selection of maps built-in.  If you browse the SAS system, you will see a library called MAPS.  If you browse in this library – you will see a large selection of maps from around the world.  We will take advantage of these in our session.

First map – using the choropleth

We will use the population of each province – with the exception of Nunavut, and create a choropleth map showing the distribution of population in 2016.

/* Reset all graphics options to default values */
goptions reset=all;

/*  Create a choropleth map with a tite  */
Proc gmap data=canada_pop map=maps.canada2;
  id province;
  choro pop_2016;
  title “Canadian Population by Province, 2016”;
  title2 “Using the Choropleth”;
Run;
quit;

PRoc GMAP – calls on the GMAP procedure, also note that you need to specify the dataset that you are using – contains the population – and the map what you will be applying the data to.  MAPS.CANADA2 is one of the internal maps created in SAS.

The ID statements is linking the variable called ID in the dataset and on the map.  The values of the ID variables have to be the same in both the data and the map.  We know that SAS used the province IDs that Statistics Canada uses.  To determine what the ID variable is on the MAPS – in SAS – open the datafile associated with the map you will be using.

CHORO – is calling on the GMAP procedure to add the 2016 population onto the map.  Without any options on the CHORO line, SAS will create the groupings – 4 by default on our map.

Second map – displaying prisms

Our second map will display prism.  This is a 3-D view of the relationships between our groups.  A larger value of our data will display a higher cutout.

/* reset all graphics options to default values */
goptions reset=all;

/* Create a prism map with a title and footnote */

Proc gmap data=canada_pop map=maps.canada2;
    id province;
    prism pop_2016;
    title “Canadian Population by Province, 2016”;
    title2 “Using Prisms”;
Run;
Quit;

Now we will use the PRISM option rather than CHORO.  Try it out and see what the results are.  Here is a link to a resulting map in PDF form.

Third map – using blocks

Our third map will display a block in each province that depicts the population.  The height of the block relates directly to the value of the population.

/* Reset all graphics options to default values */
goptions reset=all;

/* Create a block map with a title and a footnote */
Proc gmap data=canada_pop map=maps.canada2;
    id province;
    block pop_2016;
    title “Canadian Population by Province, 2016”;
    title2 “Using Blocks”;
Run;
Quit;

Changing the PRISM line to BLOCK in this example will give us our third map.

Conclusion

This was a brief overview of how we can incorporate mapping features into SAS.  We have only seen the very basic maps available in PROC GMAP.  We tend to use other GIS mapping software packages to create GIS maps, but be aware that we can use SAS.  Another option available with PC-SAS, is the GIS ESRI bridge, this option provides a bridge between one of the most popular GIS programs, ESRI, and SAS.

Screen Shot 2013-11-18 at 7.33.07 PM

 

Calculating differences using DIF( ) and LAG( ) functions

This may be a short post but I think one that folks will find may come in handy.  I was recently asked how to calculate a difference of a variable listed in 2 rows.  As a made-up example.  I have 2 weight observations on some unit and I want to calculate the difference between the weights.  Old school me thought oh boys – I’ll have to transpose the data and then calculate the difference and then re-create the dataset in the same original format.  Way too much work!!

DIF Function

Turns out I was missing this really cool DATA function called “diff”.  Who knew it could be that simple.  So let’s work through an example using the DIFF function and then we’ll work through it again using a different DATA function called “lag”.

Dataset to play with can be downloaded here.

The Proc sort is used to sort the data so that we have the 2 weight measures we want to calculate the difference between  beneath themselves.  In this case we want to calculate the difference between Time2 and Time1 for each ID Breed combination.

Proc sort data=wtdiff; 
  by id breed time;
Run;

Then we will create a new dataset called “test” and by using the BY statement in the DATA step we are telling SAS that we want to read the data to be read by the sorted ID Breed Time variables.  The first.time  tells SAS to look at the line of data of the first ID Breed Time combination then calculate the difference between the next time and this line – dif(varname) command.

Data test; 
    set wtdiff; 
    by id breed time; 
    if first.time then wtdiff = dif(weight);
Run;

Note that because the ID Breed Time changes every line that the first.time really isn’t doing anything.  You can remove this part of the statement and simply create the new variable  wtdiff = dif(weight); and that will work as well.  I’m providing this piece of code for situations when you will have more than 1 observation per grouping.

I then like to use Proc Print to see whether we were successful.

Proc print data=test;
Run;

You’ll notice that in the output dataset we now have a weight difference calculation for each set of lines except the first one – which makes sense – there is nothing for it to subtract from.  Now, our goal was to only have the difference for each ID Breed combination – so we need to delete those observations where the difference makes no sense.  If you look closely at the dataset, because of the sorting we did, we want to set the new variable wtdiff to a . whenever the time = 1.

Data test; 
    set wtdiff; 
    by id breed time; 
    if first.time then wtdiff = dif(weight);
if time = 1 then wtdiff = .;

Run;

Run a quick Proc Print to confirm the results.

Much, much quicker than transposing the data, calculating the difference and then reforming the data 🙂

LAG Function

Another way to accomplish the same task is to use the LAG function available in the DATA step.  LAG(varname) will list the last observation for that variable.  Best way to explain this is to see it in action.

Data test;
    set wtdiff;
    by id breed time;
    lastwt = lag(weight);
    secondlastwt = lag2(weight);
Run;

lastwt = lag(weight) provides you with the last weight measure, whereas the secondlastwt = lag2(weight) goes back 2 datalines.  You can increase the lag# as much as you would like.

Proc print data=test;
Run;

So let’s drop the secondlastwt variable as this was more for demonstration purposes.  If you look at the new resulting dataset you can see that for every line where time = 2 – we want to calculate the difference between weight and lastwt to obtain the same wtdiff variable we did above using the DIF( ) function.

Data test;
    set wtdiff;
    by id breed time;
    lastwt = lag(weight); 
   if time = 2 then do;
      wtdiff = weight – lastwt;
      drop lastwt;
    end;
Run;

To accomplish this, I am using a DO Loop – if time =2 then I want you to do the following steps.  Create the variable wtdiff as the difference between weight and lastwt.  The second step is to drop or delete the variable called lastwt.  I then need to close the DO Loop and Run the Data step.

Of course, use the Proc Print to confirm the results.

Conclusion

Two ways to accomplish the same task using 2 different DATA Step functions  DIF( ) and LAG( ).

Which way do you prefer???

Screen Shot 2013-11-18 at 7.33.07 PM

Revisit: How do we use @ and @@ to enter data

I’ve been reminded of how we can take advantage of the @ and @@ in SAS when entering data.  There have been a few students using this code, which I think is wonderful, and something I’d like to revisit and maybe remind folks on what these symbols do and how we can use them.

We will revisit a previous session we did a number of years ago on this very topic.  Hope to see you next Friday, February 16 to discuss Tips and Tricks: @ and @@

Follow-up Question and Code

A great situation for demonstration purposes was suggested in the session today.  Using the sheep dataset we have, but now we have measures for 2 weeks:

CageID  Trmt  A1_wk1  A2_wk1  A3_wk1  A4_wk1  A1_wk2  A2_wk2 A3_wk2 A4_wk2
101        DietA       5            8              9             6            12           13          14          12

Can we use SAS to read this data into a dataset that would show:
CageID, Trmt, Week, Animal, Fibre

Try this out!  A do-loop within a do-loop all within an INPUT statement.  See the placement of the “@”

Data fibretrial;
input CageID Treatment $ @;
do week = 1 to 2;
do animal = 1 to 4;
input fibre @;
output;
end;
end;
datalines;
101 DietA 5 8 9 6 12 13 14 12
102 DietA 6 7 8 5 12 14 15 21
201 DietB 10 9 11 14 22 21 24 25
202 DietB 15 12 11 10 23 25 26 27
;
Run;

 

Screen Shot 2013-11-18 at 7.33.07 PM

Revisit Subsetting in SAS

With new students learning SAS on campus, sometimes it is handy to revisit some of the posts we created a few years ago.  One such post is Subsetting data with the IF and WHERE statement.  In this week’s SAS workshop we’re looking at Merging data and creating new variables, so revisiting Subsetting is a nice fit.

Hope to see you all on Friday, Feb 2, in Crop Science Rm 121A at 12noon.

Screen Shot 2013-11-18 at 7.33.07 PM

 

ODS Excel – saving results directly into an Excel spreadsheet

I think it is very safe for me to say that most of us use Excel to enter our data.  We may need to  massage it to bring it into SAS for analysis, but that’s what we do.

So now let’s sit back and think….  how many times have you wanted to move your analysis results back into Excel?  Have you ever re-entered some SAS analysis results back into Excel?  Maybe to plot out your results? Or maybe to create a table of your results?  I suspect many of us have done this at one time or another.  Let me show you a quick way to accomplish this, without re-typing all those numbers!

ODS – Output Delivery System – we’ve seen this before.  You may have used it to save your output as a PDF file, or as a Word (RTF) file.  Well, there is also an Excel option here.  The general form of the statement is:

ods excel file=”   ” options(….);

ods excel close;

Looks a littler familiar right?  Let’s work through an example to see how this works.  I am using a small dataset where I would like to calculate the means and standard errors for each treatment group in the data, save my results in an Excel spreadsheet so I can use them to create a bar graph in Excel.

Data excel_test;
input trmt$ height;
datalines;
a 14
a 15
a 12
a 15
b 22
b 27
b 28
b 25
c 14
c 15
c 16
c 18
d 31
d 42
d 24
d 29
;
Run;

The data contains 4 treatments. To calculate the means and standard errors by treatment I will used the following piece of code:

Proc means data=excel_test mean stderr;
class trmt;
var height;
Run;

For more information on building this code, please review Proc MEANS blogpost

When I run this syntax I will get a table with the means and standard errors for each of the 4 treatments – perfect! Now, this is not a lot of information, so I could go ahead and retype these numbers in a blank Excel spreadsheet, but I am not a great typer, so let’s add the ODS Excel code.

ods excel file=”C:\Users\edwardsm\Documents\Workshops\SASsyFridays\20180119\height_means.xlsx” options(
sheet_name=”Summary”
embedded_titles=”yes”);

Proc means data=excel_test mean stderr;
class trmt;
var height;
title “Height means and standard errors”;
run;

ods excel close;

The Proc MEANS code is the same as above – no changes here! We simply tell SAS that we want the results of our Proc MEANS to be saved as an Excel file, by wrapping the code with the ODS Excel and ODS Excel Close code.

ods excel file=”C:\Users\edwardsm\Documents\Workshops\SASsyFridays\20180119\height_means.xlsx” options(
sheet_name=”Summary”
embedded_titles=”yes”);

ods excel – tells SAS that we want to use Excel as our output format.
file=”C:\Users\edwardsm\Documents\Workshops\SASsyFridays\20180119\height_means.xlsx” – tells SAS where the file should be saved and what to call it. Once I run this code, I should have a file called height_means.xlsx located in my Workshops\SASsyFridays\20180119 directory.

Then I have added a couple of options. Please note that there are a LOT of options you can use here, I have chosen to use the basic ones. We can do another session later in the year on other available options, if there is interest.

The first option is sheet_name = “Summary” – this is where I tell SAS what I want the Excel worksheet to be called. In this case I have selected Summary. This can be anything you want.

The second option is embedded_titles=”yes” – this allows SAS to send the titles from the Proc MEANS output to Excel. So you will see the title that I have provided for the Procedure as well as the titles from the Means output.

When you run this code, it will open Excel and the file that was created. It will also provide the output in the SAS Results Viewer.

After the Proc MEANS, I need to close the Excel file by adding the ODS Excel Close statement.

Now, you can use the information in the Excel file and create your graph or your table.

Screen Shot 2013-11-18 at 7.33.07 PM