Data Analytics and Statistics Track

Parallel Processing with Base SAS

Jim Barbour

Parallel processing (multi-threading) is a potentially powerful tool for SAS performance tuning. There’s a myth out there that one needs to license SAS MP Connect in order to conduct parallel processing. Not true! Parallel processing is eminently practicable using just the Base SAS product.

The simplest form of parallel processing relies on the use of SAS option statements alone. More complex parallel processing involves design changes and requires coding in support of those changes as well as the use of the SYSTASK and WAITFOR commands.
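
As a minimal sketch of the SYSTASK/WAITFOR approach (the program paths and task names below are hypothetical, not taken from the paper, and the session must permit operating system commands, i.e. XCMD), two independent SAS programs can be launched as child processes and the parent session suspended until both finish:

   /* launch two independent SAS programs as background (child) tasks */
   systask command "sas /myproject/step1.sas -log /myproject/step1.log" taskname=t1;
   systask command "sas /myproject/step2.sas -log /myproject/step2.log" taskname=t2;

   /* suspend the parent session until both child tasks have completed */
   waitfor _all_ t1 t2;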

This paper will examine three examples of parallel processing: 1) reducing the run time of a specific PROC MEANS example by a factor of eight using just SAS options, 2) running multiple non-interdependent SAS procedures in parallel by launching subordinate threads, and 3) a more complex example of breaking a single DATA step into multiple threads in order to read a large data set (200+ million rows, 300+ columns) in less than a quarter of the single-threaded run time. Examples 2 and 3 will then be combined in a SAS process whose overall run time went from 9 hours to 3 hours.

All examples were run with SAS 9.4 in a multi-core UNIX environment. SAS Enterprise Guide 7.15 was used as the program editor and primary means of submission, but other editors and UNIX command line submission may be used in lieu of Enterprise Guide.

Using PROC MCMC in a Process Control Setting: An Illustrative Example with PROC IML

Austin Brown

When an experienced observer in a specific discipline examines some process or phenomenon, it is likely that they have prior knowledge about the characteristics and behaviors of what they are monitoring. This knowledge can be useful in estimating the parameters of the process. Combining this prior knowledge of parameter behavior with observed data from the process or phenomenon is the idea upon which Bayesian estimation is based. In this presentation, attendees will be shown, by example, the basic functionality of PROC MCMC for obtaining Bayesian estimates, including diagnostics for the estimators. Attendees will also be shown an example of using Bayesian estimation via PROC MCMC and PROC IML in a process control environment. This example will highlight some of the advantages of Bayesian estimation over traditional maximum likelihood estimation, especially with small sample sizes, as is often the case in process control settings.
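
As a minimal sketch of the kind of PROC MCMC call the presentation describes (the data set, variable, and prior values below are hypothetical), a process mean and variance can be estimated with informative priors:

   /* Bayesian estimation of a process mean and variance; data set PROCESS,
      variable Y, and the prior parameters are hypothetical examples */
   proc mcmc data=process nmc=20000 nbi=2000 seed=1234 diagnostics=all;
      parms mu 0 sigma2 1;                  /* initial values                  */
      prior mu ~ normal(10, var=4);         /* prior reflecting expert opinion */
      prior sigma2 ~ igamma(2, scale=2);    /* weakly informative prior        */
      model y ~ normal(mu, var=sigma2);     /* likelihood                      */
   run;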

Oops, I D-I-D It Again! Advanced Difference-in-Differences Models in SAS

Margaret Warton and Melissa Parker

The quasi-experimental, longitudinal difference-in-differences (D-I-D) study design is increasingly used in epidemiological and healthcare research. D-I-D models generate a causal estimate of the change in an outcome due to an intervention or exposure, after subtracting the expected background change observed in a reference group. Advantages of this method include preservation of time-ordering, accounting for changes in secular trends and regression to the mean, and, when a fixed cohort is used, controlling for unmeasured confounding.  In this companion to our 2016 Western Users of SAS conference paper “How D-I-D You Do That? Basic Difference-in-Differences Models in SAS®”, we introduce advanced difference-in-differences methods and present a practical, step by step approach for implementation in SAS®.  Topics covered include: power and sample size considerations, options for modeling a binary outcome, balancing case mix differences in exposed and reference groups, and assessing heterogeneity of treatment effects.  We will illustrate these methods using data from a study on the impact of introducing a value-based insurance design (VBID) medication plan at Kaiser Permanente Northern California on changes in medication adherence.
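
As a minimal sketch of a basic two-group, pre/post difference-in-differences specification for a continuous outcome (the data set and variable names are hypothetical, and the paper's own models are more elaborate), the D-I-D estimate is the group-by-period interaction:

   /* Continuous-outcome D-I-D with GEE for repeated measures; the data set
      ADHERE and variables ID, EXPOSED, POST, and ADHERENCE are hypothetical */
   proc genmod data=adhere;
      class id exposed post;
      model adherence = exposed post exposed*post;   /* interaction = D-I-D estimate   */
      repeated subject=id / type=exch;               /* accounts for repeated measures */
   run;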

Non-parametric Analysis of Variance in SAS

Hend Aljobaily

Comparing more than two groups has important applications in many fields. Such comparisons are typically made with an analysis of variance using ANOVA or MANOVA. However, these procedures assume normality of the residuals. When researchers cannot determine the distribution of the response, or cannot determine the parameters of that distribution, non-parametric methods are used instead. The Kruskal-Wallis test is a non-parametric test that can be used to perform a one-way analysis of variance when normality of the residuals cannot be assumed. This study discusses methods of modeling non-parametric one-way analysis of variance (ANOVA and MANOVA) using different options in SAS®.
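
As a minimal sketch (the data set and variable names are hypothetical), the Kruskal-Wallis test for a one-way non-parametric analysis of variance can be requested with the WILCOXON option of PROC NPAR1WAY:

   /* Kruskal-Wallis one-way non-parametric ANOVA; data set SCORES with
      grouping variable GROUP and response Y is hypothetical */
   proc npar1way data=scores wilcoxon;
      class group;
      var y;
   run;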

Analyzing international assessments: An ensemble and model comparison approach

Chong Ho Yu, Hyun Seo Lee, Siyan Gan and Emily Lara

Learners in Asian countries and regions are among the top performers in the Programme for the International Assessment of Adult Competencies (PIAAC). Numerous studies have been conducted to identify the predictors of their outstanding performance. However, this type of analysis is challenging due to the large sample size. To address this challenge, this study utilized ensemble methods (bagging and boosting) in SAS and JMP to analyze these international assessment data. Bagging can minimize variance but may inflate bias, whereas boosting can reduce bias and improve predictive power but cannot control variance. In order to identify the best model, both methods were employed and different criteria were examined for model selection.

'Training Wheels' for learning mixed models

YuTing Tian and Russ Lavery

This paper intends to provide insights, rather than math, as part of an introduction to mixed models. It is intended for someone just starting to learn mixed models, and it aims to show the link between one-way ANOVA, two-way ANOVA, and mixed models.

Along the way, it provides a good explanation of fixed-effects ANOVA. A single thread links all of the sections of this paper together: the idea that error, or variability, is the inability of a process to repeat its mean.

The paper also contains worked examples of the matrix calculations and, in the appendix, SAS code that displays the matrices that appear in typical explanations of mixed models.
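
As a minimal sketch of the kind of model the paper builds toward (the data set and variable names are hypothetical, and this is not the appendix code itself), a random-intercept mixed model can be fit in PROC MIXED, with the V and VCORR options printing the marginal covariance matrices that appear in typical textbook explanations:

   /* Random-intercept mixed model; data set GROWTH with variables
      SUBJECT, TIME, and Y is hypothetical */
   proc mixed data=growth method=reml;
      class subject;
      model y = time / solution;
      random intercept / subject=subject v vcorr;   /* print V and its correlation form */
   run;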

Forecasting tourist arrivals to the USA with SARIMA models

Mostafa Zahedjahromi

The aims of this study are to identify the model that best fits the U.S. tourism data and to forecast the number of tourists entering the United States for 2018. The method of maximum likelihood was used to estimate the parameters and to forecast future tourist arrivals. The data from 1998 to 2011, documented annually by Rachel Passmore at Census School of New Zealand, suggest that a SARIMA(0,1,2)(0,1,1)_12 model fits most adequately. The fitted model forecasts that the number of tourists entering the U.S. will reach approximately 540,000 in 6 years, a 2.6-fold increase compared to the number of tourists entering the U.S. in 1998.
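
As a minimal sketch of fitting such a model (assuming monthly observations, since the seasonal period is 12; the data set and variable names are hypothetical), PROC ARIMA can specify the SARIMA(0,1,2)(0,1,1)_12 model multiplicatively:

   /* SARIMA(0,1,2)(0,1,1)_12 fit by maximum likelihood, with a six-year forecast;
      data set TOURISTS with monthly variables DATE and ARRIVALS is hypothetical */
   proc arima data=tourists;
      identify var=arrivals(1,12);      /* regular and seasonal differencing         */
      estimate q=(1 2)(12) method=ml;   /* MA(2) crossed with seasonal MA(1), lag 12 */
      forecast lead=72 interval=month id=date out=fcst;
   run;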

A latent variable location scale model for intensive longitudinal data

Shelley Blozis

Mixed-effects models provide a flexible means for the analysis of longitudinal data. A location scale model, a special case of these models, is useful for intensive longitudinal data in which data are collected for many time points. The model characterizes between-subject variation in response levels over time, as well as within-subject variation in each person’s responses. One part of the model allows for the inclusion of person-level predictors of the between-subject variation in mean response levels to understand individual differences, such as individual differences in daily stress averaged across multiple days. A second part of the model allows for the inclusion of both time-specific and person-specific predictors of the within-subject variation. This provides a way to study why the responses of some individuals may fluctuate more widely relative to those of others. Current applications of the model assume that the responses are measured without error. This assumption is not tenable for some measures, including psychological variables for which the observed measures serve only as indicators of an underlying construct. This paper presents a location scale model based on a latent variable model to address measurement error in measured responses. PROC NLMIXED syntax is developed to estimate the model. An example illustrates how model interpretation can be improved by addressing the measurement error that is common to many measured variables in the social and behavioral sciences.

SAS Techniques to Handle Large Files and Reduce Execution Times

Kaiqing Fan

As SAS developers and users, we constantly struggle with long execution times: sometimes a couple of hours, other times more than 20 or 30 hours, or longer. In my view, such long execution times are not acceptable. SAS offers many techniques that, applied properly, can dramatically shorten execution time. Using them, I reduced one job's execution time from 36 hours to around 1 hour, and another's from around 3 hours to 6 minutes, while processing many large and complex data files. This paper summarizes most of the techniques I used and shares them with you.

When the Mean Isn't Enough: Methods for Assessing Individual Differences using SAS

Melissa McTernan

Many programs of research are focused on understanding the “average” individual, leading to the use of statistical methods that emphasize means (e.g., mean differences, mean trajectories, etc.). However, “typical” values are often insufficient and sometimes not representative of any one individual (or any one group of people) in a sample. In this paper, I will discuss how SAS is a flexible tool for researchers interested in individual differences. I will include methods of data visualization as well as methods of statistical analysis when the individual is the unit of interest. I will primarily focus on longitudinal data analysis and how individuals change across time. This paper will include syntax for PROC MIXED, PROC NLMIXED, PROC GLIMMIX, PROC SGPLOT, and PROC SGPANEL.
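
As a minimal sketch of visualizing individual change alongside the group pattern (the data set and variable names are hypothetical), PROC SGPANEL can overlay each person's trajectory on a per-panel fit line:

   /* Individual trajectories paneled by group, with an overall fit per panel;
      data set LONG with variables ID, GROUP, TIME, and Y is hypothetical */
   proc sgpanel data=long;
      panelby group / columns=2;
      series x=time y=y / group=id lineattrs=(color=gray) transparency=0.6;
      reg x=time y=y / lineattrs=(color=red thickness=2);
   run;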

Modeling Heterogeneous Within-Subject Variability

Madeline Craft and Shelley Blozis

The development of smart phones, which can conveniently collect intensive longitudinal data known as Ecological Momentary Assessment (EMA) data, has allowed us to better explore dynamic within-subject (WS) and between-subject (BS) processes. Typically, multilevel models are applied to EMA data to account for the hierarchical nature of longitudinal data. Whereas multiple regression estimates a single average trajectory for all subjects, multilevel models account for the unique influence of an individual on his or her repeated measurements by estimating an average trajectory for each subject.

Multilevel models can be extended to not only summarize differences in average trajectories but also differences in variances about the average trajectories. This extension of the basic multilevel model, which has been called the location scale model in recent years (see Hedeker, Mermelstein & Demirtas, 2008, 2012; Rast, Hofer & Sparks, 2012), utilizes a log-linear representation of the residual error variance to capture the non-negative nature of variances. Although log-linear models of residual error variances have been around for some time (see Aitkin, 1987; Harvey, 1976), there has been resurging interest in these models due to the recent abundance of EMA data.

SAS® PROC MIXED can be used to estimate models of the WS variance, but SAS® PROC NLMIXED is necessary for models of the BS variance. In addition, SAS® PROC NLMIXED is necessary to fit a WS variance model with a random scale effect. Fitting the full location scale model in SAS® PROC NLMIXED is done in stages. This talk is intended for an audience with any level of SAS proficiency and some interest in longitudinal data methods; however, it is expected that the audience has at least a basic knowledge of multilevel models. It will not be assumed that the audience has experience with SAS® PROC MIXED or NLMIXED.
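
As a minimal sketch of an early stage of such a model (a random location effect plus a log-linear WS variance, without the random scale effect; the data set and variable names are hypothetical), PROC NLMIXED can be set up as follows:

   /* Simplified location scale model: random intercept (location) and a
      log-linear within-subject variance; data set EMA with variables
      ID, X, and Y is hypothetical */
   proc nlmixed data=ema;
      parms b0=0 b1=0 a0=0 a1=0 sdu=1;
      mu    = b0 + b1*x + u;            /* location (mean) model              */
      wsvar = exp(a0 + a1*x);           /* log-linear within-subject variance */
      model y ~ normal(mu, wsvar);
      random u ~ normal(0, sdu*sdu) subject=id;
   run;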

Crime in the USA: Using SAS to Analyze Recidivism Rates

Philip Mayevskiy

This paper focuses on using the Annual Parole Survey (2014), produced by the Bureau of Justice Statistics, to analyze recidivism rates in the United States criminal justice system. The goal of this paper is to illustrate how to use SAS University Edition to analyze the Annual Parole Survey dataset and interpret the results to draw accurate conclusions. Though this paper focuses on using statistical analysis to understand recidivism rates in the United States, the techniques applied are widely applicable to other government statistics.

Agreeing to Disagree: Using SAS to Make Reasoned Decisions When Information Criteria Select Different Models

Wendy Christensen

Information criteria are a useful and flexible tool for model selection when the null hypothesis testing framework is unsuitable or undesirable. Over 30 procedures in SAS/STAT 14.3 provide information criteria for fitted models automatically. Typically, the model with the lowest information criterion value is considered the “best” model, the model with the second-lowest value is considered the “second-best” model, and so on. In practice, many statistical modelers use multiple information criteria (e.g. AIC and BIC) to select among a set of candidate models. It is possible, however, for different information criteria to rank models differently, potentially leading to seemingly conflicting results. The focus of this breakout session is twofold. First, I provide a conceptual overview of why differences in model ranking occur and describe practical approaches to resolving disagreements. Second, I demonstrate how to use Base SAS functions to identify if ranking differences have occurred among a set of fitted candidate models and to compute useful summary statistics to assist with making a well-reasoned decision about which model to select.
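
As a minimal sketch of the kind of summary the session describes (the stacked data set ALLFITS, with one row per candidate model and columns MODEL and AIC, is hypothetical; similar code would apply to BIC), information criterion differences and Akaike weights can be computed with Base SAS:

   /* AIC differences and Akaike weights across candidate models; ALLFITS is a
      hypothetical data set assembled, e.g., from ODS OUTPUT fit-statistics tables */
   proc sql;
      create table deltas as
      select model, aic, aic - min(aic) as delta_aic
      from allfits;
   quit;

   data raw;
      set deltas;
      raw_wt = exp(-0.5 * delta_aic);
   run;

   proc sql;
      create table akaike as
      select model, aic, delta_aic, raw_wt / sum(raw_wt) as akaike_weight
      from raw
      order by akaike_weight desc;
   quit;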

Machine Learning - Why we should know it and how it works

Kevin Lee

The most popular buzzword in the technology world today is “Machine Learning (ML).” Most economists and business experts foresee Machine Learning changing every aspect of our lives in the next 10 years by automating and optimizing processes such as self-driving vehicles, online recommendations on Netflix and Amazon, fraud detection in banks, image and video recognition, natural language processing, question-answering machines (e.g., IBM Watson), and many more. This is leading many organizations to seek experts who can implement Machine Learning in their businesses.

Statistical programmers and statisticians in the pharmaceutical industry are in a very interesting position. We have backgrounds very similar to those of Machine Learning experts, including programming, statistics, and data expertise, and thus already embody the essential technical skill sets. This similarity leads many individuals to ask us about Machine Learning; leaders of biometrics groups get asked even more often.

The paper is intended for statistical programmers and statisticians who are interested in learning and applying Machine Learning to lead innovation in the pharmaceutical industry. The paper will start with an introduction to the basic concepts of Machine Learning: the hypothesis, the cost function, and gradient descent. It will then introduce supervised ML (e.g., Support Vector Machines, Decision Trees, Logistic Regression), unsupervised ML (e.g., clustering), and the most powerful ML algorithm, the Artificial Neural Network (ANN). The paper will also introduce some of the popular SAS® ML procedures and SAS Visual Data Mining and Machine Learning. Finally, the paper will discuss current ML implementations, future implementations, and how programmers and statisticians can lead this exciting and disruptive technology in the pharmaceutical industry.
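
As a toy illustration of the hypothesis/cost-function/gradient-descent idea (not taken from the paper; the data are simulated, and the learning rate and iteration count are arbitrary), batch gradient descent for simple linear regression can be written in SAS/IML:

   /* Batch gradient descent for simple linear regression in SAS/IML */
   proc iml;
      call randseed(1);
      n = 100;
      x = randfun(n, "Normal");              /* simulated predictor                   */
      y = 2 + 3*x + randfun(n, "Normal");    /* true intercept 2, slope 3, plus noise */
      X1 = j(n,1,1) || x;                    /* design matrix with intercept column   */
      theta = {0, 0};                        /* initial guess for the parameters      */
      alpha = 0.05;                          /* learning rate                         */
      do iter = 1 to 2000;
         grad  = X1` * (X1*theta - y) / n;   /* gradient of the mean squared error    */
         theta = theta - alpha*grad;         /* descent step                          */
      end;
      cost = ssq(X1*theta - y) / (2*n);      /* final value of the cost function      */
      print theta cost;
   quit;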

Markov Chains as a Predictive Analytic Approach Using SAS/IML

Gregory McKinney

As a predictive analytics approach, Markov chains provide a powerful tool for modeling complex multi-state dynamic systems. Because of this power, Markov chain models have been used to address many real-world problems, including disease progression, loan portfolio risk, marketing campaign effectiveness, and the evaluation of clinical drug trials; the list of potential applications is endless. SAS/IML provides a potent framework for implementing Markov chain models in your organization. Sample code for the use of SAS/IML for Markov chain models is included.
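
As a toy illustration of the idea (the transition probabilities below are hypothetical, not drawn from any of the applications listed), a three-state Markov chain can be propagated in SAS/IML with matrix powers:

   /* Three-state Markov chain: propagate a starting distribution through
      a transition matrix; all probabilities are illustrative only */
   proc iml;
      P  = {0.80 0.15 0.05,     /* rows = current state, columns = next state */
            0.10 0.70 0.20,     /* each row sums to 1                         */
            0.00 0.05 0.95};
      p0 = {1 0 0};             /* everyone starts in state 1                 */

      p12    = p0 * P**12;      /* state distribution after 12 transitions    */
      steady = p0 * P**200;     /* approximate long-run (steady-state) distribution */

      print p12, steady;
   quit;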

Logistic and Linear Regression Assumptions: Violation Recognition and Control

Deanna Schreiber-Gregory

Regression analyses are among the first steps (aside from data cleaning, preparation, and descriptive analyses) in any analytic plan, regardless of plan complexity. It is therefore worth acknowledging that choosing and implementing the wrong type of regression model, or violating its assumptions, can have detrimental effects on the results and future directions of any analysis. Considering this, it is important to understand the assumptions of these models and to be aware of the processes that can be used to test whether those assumptions are being violated. Because logistic and linear regression are two of the most popular types of regression models in use today, they are the ones covered in this paper. The logistic regression assumptions reviewed include: dependent variable structure, independence of observations, absence of multicollinearity, linearity of independent variables and log odds, and large sample size. For linear regression, the assumptions reviewed include: linearity, multivariate normality, absence of multicollinearity and autocorrelation, homoscedasticity, and measurement level. This paper is intended for any level of SAS® user. It is also written for an audience with a background in theoretical and applied statistics, though the information will be presented in such a way that readers with any level of statistical or mathematical knowledge will be able to understand the content.
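
As a minimal sketch of a few of the linear regression checks discussed (the data set and variables are hypothetical), PROC REG can request variance inflation factors, collinearity diagnostics, a Durbin-Watson statistic, and the standard residual diagnostic panel:

   /* Linear regression assumption checks: VIF and collinearity diagnostics for
      multicollinearity, Durbin-Watson for autocorrelation, and the diagnostic
      plot panel for linearity, normality, and homoscedasticity; the data set
      STUDY and its variables are hypothetical */
   proc reg data=study plots=diagnostics;
      model y = x1 x2 x3 / vif collin dw;
   run;
   quit;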

Regularization Techniques for Multicollinearity: Lasso, Ridge, and Elastic Nets

Deanna Schreiber-Gregory

Multicollinearity can be briefly described as the phenomenon in which two or more identified predictor variables are linearly related, or codependent. The presence of this phenomenon can have a negative impact on the analysis as a whole and can severely limit the conclusions of a research study. In this paper, we will briefly review how to detect multicollinearity and, once it is detected, which regularization techniques are most appropriate to combat it. The nuances and assumptions of L1 (LASSO), L2 (ridge regression), and elastic net penalties will be covered in order to provide adequate background for appropriate analytic implementation. This paper is intended for any level of SAS® user. It is also written for an audience with a background in theoretical and applied statistics, though the information will be presented in such a way that readers with any level of statistical or mathematical knowledge will be able to understand the content.
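
As a minimal sketch (the data set and variables are hypothetical, and SELECTION=ELASTICNET requires a recent SAS/STAT release), LASSO and elastic net fits are available through PROC GLMSELECT, and a ridge trace through the RIDGE= option of PROC REG:

   /* LASSO and elastic net via PROC GLMSELECT, with 5-fold cross-validation
      choosing the tuning; data set STUDY and variables Y, X1-X10 are hypothetical */
   proc glmselect data=study plots=coefficients;
      model y = x1-x10 / selection=lasso(choose=cv) cvmethod=random(5);
   run;

   proc glmselect data=study;
      model y = x1-x10 / selection=elasticnet(choose=cv) cvmethod=random(5);
   run;

   /* Ridge regression: estimates across a grid of ridge parameters */
   proc reg data=study outest=ridge_est ridge=0 to 1 by 0.05;
      model y = x1-x10;
   run;
   quit;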

From FREQing Slow to FREQing Fast: Facilitating a Four-Times-Faster FREQ with Divide-and-Conquer Parallel Processing

Troy Hughes

With great fanfare, the release of SAS® 9 delivered multithreaded processing to a single-threaded SAS world. Procedures such as SORT, SQL, and MEANS could now run faster by taking advantage more fully of system resources through parallel processing paradigms. Multithreading commonly implements divide-and-conquer methodologies in which data sets or data streams are decomposed into subsets and processed in parallel rather than in series. Multithreaded solutions are faster (but typically not more efficient) than their single-threaded counterparts because execution time (but not system resource utilization) is decreased. As the costs of memory and processing power have continued to decrease, however, there remains no excuse for not implementing multithreaded processing wherever possible. To this end, and because SAS unfortunately abandoned some hapless procedures in single-threaded Sheol, this text aims to reunite the single-threaded FREQ procedure with its multithreaded bedfellows. The FREQFAST macro is introduced and espouses divide-and-conquer parallel processing that performs a frequency analysis more than four times faster than the out-of-the-box FREQ procedure. Non-environmental factors affecting FREQ performance (e.g., number of observations, number of unique observations, file size) are elucidated and modeled to demonstrate and predict performance improvement delivered through FREQFAST.
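
As a minimal sketch of the combine step in a divide-and-conquer frequency analysis (this is not the FREQFAST macro itself; the data set and variable names are hypothetical, and in practice each per-subset PROC FREQ would run in a separate parallel session), subset-level counts can be re-aggregated with a WEIGHT statement:

   /* Frequency counts computed per subset, then combined; PART1, PART2, and
      the variable ACCT_TYPE are hypothetical */
   proc freq data=part1 noprint;
      tables acct_type / out=counts1;
   run;

   proc freq data=part2 noprint;
      tables acct_type / out=counts2;
   run;

   data allcounts;
      set counts1 counts2;
   run;

   /* the WEIGHT statement rolls the subset counts up to overall frequencies */
   proc freq data=allcounts;
      tables acct_type;
      weight count;
   run;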

Time Series Analysis of Hate Speech in Social Media

David Corliss

Mining and analysis of social media text is a powerful tool for the analysis of thoughts, preferences, and actions of individuals and populations. While commonly used today in marketing and other business applications, Data for Good researchers have begun to apply these methods to the analysis of hate speech in social media. Different channels, including Twitter, Facebook, and Google searches, are found to have distinctive characteristics that affect the types of models and analyses each can support. This paper provides a step-by-step description of how to create and deploy a Twitter API and mine the data to extract tweets with user-selected search terms, including keywords and user name of the person or organization sending the tweet. This methodology is used to investigate hate speech, modeling the time series patterns with the aim of estimating the risk of subsequent acts of violence against persons targeted by the speech. These Data For Good analyses have been performed using SAS® University Edition, a free version of SAS® available to students, professors, and non-profit researchers.

Forecasting: Something Old, Something New

Dave Dickey

ARIMA (AutoRegressive Integrated Moving Average) models for data taken over time were popularized in the 1970s by Box and Jenkins in their famous book. The SAS® software procedures PROC ESM (Exponential Smoothing Models) and PROC UCM (Unobserved Components Models, a simple subset of state space models; see PROC SSM) have become available much more recently than PROC ARIMA. Not surprisingly, since ARIMA models are universal approximators for most reasonable time series, the models fit by these newer procedures are very closely related to ARIMA models. In this talk, some of these relationships are shown and several examples of the techniques are given. At the end, the listener will find that there is something quite familiar about these seemingly new innovations in forecasting and will have more insight into how these methods work in practice. The talk is meant to introduce the topics to anyone with some basic knowledge of ARIMA models, and the examples should be of interest to anyone planning to analyze data taken over time.
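
As a minimal sketch of one of the classical connections (the data set and variable names are hypothetical), simple exponential smoothing in PROC ESM corresponds to an ARIMA(0,1,1) model in PROC ARIMA:

   /* Simple exponential smoothing and its ARIMA(0,1,1) counterpart;
      data set SALES with monthly variables DATE and Y is hypothetical */
   proc esm data=sales lead=12 print=estimates;
      id date interval=month;
      forecast y / model=simple;
   run;

   proc arima data=sales;
      identify var=y(1);            /* first difference                                 */
      estimate q=1 method=ml;       /* MA(1): theta corresponds to 1 - smoothing weight */
      forecast lead=12 interval=month id=date;
   run;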