An Introduction to Data Management and Cleaning for Analysis

Tuesday, April 4, 2017

10:00am - 12:00pm PST | Online

This webinar series consists of six two-hour sessions, each delivered from 10:00am to 12:00pm PST.

Session 1: Tues April 4 | Session 2: Thurs April 6
Session 3: Tues April 11 | Session 4: Thurs April 13
Session 5: Tues April 18 | Session 6: Thurs April 20

Overview

This webinar series provides an overview of basic data management and data cleaning techniques using SAS software.

In this course, you will learn a systematic approach to managing and cleaning your data for statistical analysis. The approach starts with understanding the big picture and then develops a strategy for translating it into concrete problem-solving steps. The workflow is illustrated using a synthesized administrative data set and honed through a variety of applied exercises. During the course, you will have access to practical tools, including SAS code, case studies and web resources, to help you build a sustainable and effective workflow for your future data analysis projects. The overall goal of the course is to give you the conceptual and practical tools you need to handle your data preparation needs with confidence.

Homework activities will be provided for practice between sessions.

Prior required knowledge

Participants are expected to be familiar with the use of administrative data, to have basic knowledge of SAS functions (e.g., descriptive statistics, merging and sorting) and to understand logistic regression.

Webinar objectives

By the end of this webinar series, participants will be able to:

Identify key types of data errors commonly found in administrative data
Address and correct data errors using a systematic process
Subset, filter and aggregate data in preparation for statistical analyses
Define the role of key variables for statistical analyses
Recode qualitative variables as required
Transform quantitative variables as required

Webinar-based course format

The interactive webinar software will give students remote access to view the instructor's screen, listen to the lecture in real time, and ask questions. The instructor will provide lecture slides (PowerPoint) for reading before the start of each webinar. A training data set with associated SAS code will be provided for both webinar demonstrations and homework practice activities.

Course texts and pre-webinar reading

A recommended text for this course is Cody’s Data Cleaning Techniques Using SAS, by Ron Cody.

Instructor biography

Brandon Wagar, PhD, is the Director of Clinical Analytics for Island Health, and an Adjunct Assistant Professor at the University of Victoria, School of Health Information Science. Previously, he was a Methodologist at the Canadian Institute for Health Information (CIHI) for eight years. He received his PhD in Behavioural Neuroscience from the University of Waterloo, and completed a post-doctoral fellowship in Cognition and Brain Sciences at the University of Victoria. Brandon developed and has taught “From Data to Meaningful Information: Tools and Techniques for working with Large Healthcare Datasets” within the Health Information Science Master's program offered by the University of Victoria.

Workshop fees

Regular Rate: $295
Student Rate: $195

Course content

Session 1: Understand the 'big picture' and know your research variables

Introduction
- Course outline
- Homework assignments
Understand the 'big picture'
- What is the question I am asking?
- What analysis will I use to answer the question?
Know your research variables
- Identify the variables of interest for the statistical analyses
- Determine the role each variable will play in the statistical analyses (e.g., dependent variable, independent variable, grouping variable)
- Decide how each variable will be treated (e.g., nominal, ordinal, continuous, numeric, date, identifier) and, if applicable, determine the level of grouping in the data structure
- Identify whether the variables of interest are already available in the data or need to be created or recoded

Session 2: Bring all your variables together under the same roof: Building a study data set

Take stock of all the data sets involved in the research project
- Can I use the data I have, or must I recode?
- Do I need to link data from other sources? If so, what will my linkage key(s) be?
Examine each data set briefly to understand what information it contains, why that information was collected and how it fits into the 'big picture'
Determine whether the data exhibit any natural grouping (e.g., patients nested in medical clinics, patients from different health authorities)
Extract the relevant portions from each data set and merge the appropriate portions together to bring all of the available variables under the same roof
- Are there data elements that I don’t need for my analysis and can therefore discard?
Perform simple linkage techniques
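
As an illustration of simple linkage on a shared key, here is a minimal sketch. The course itself works in SAS; this example uses plain Python only to show the idea, and the data, field names and the link function are all hypothetical.

```python
# Hypothetical example data; field names are illustrative assumptions.
patients = [
    {"patient_id": 1, "sex": "F", "birth_year": 1950},
    {"patient_id": 2, "sex": "M", "birth_year": 1982},
    {"patient_id": 3, "sex": "F", "birth_year": 1975},
]
visits = [
    {"patient_id": 1, "clinic": "A", "visit_date": "2016-01-05"},
    {"patient_id": 1, "clinic": "B", "visit_date": "2016-03-12"},
    {"patient_id": 3, "clinic": "A", "visit_date": "2016-02-20"},
    {"patient_id": 9, "clinic": "C", "visit_date": "2016-04-01"},
]


def link(left, right, key):
    """Join each record in `right` to its match in `left` on `key`.

    Assumes `key` is unique in `left` (one-to-many linkage).
    Returns the linked records and the `right` records with no match.
    """
    index = {row[key]: row for row in left}
    linked = [{**index[row[key]], **row} for row in right if row[key] in index]
    unmatched = [row for row in right if row[key] not in index]
    return linked, unmatched


study, unmatched = link(patients, visits, "patient_id")
print(len(study), "linked records;", len(unmatched), "unmatched")
```

Returning the unmatched records alongside the linked ones makes it easy to check how many records failed to link before moving on.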

Session 3: Data cleaning/screening, diagnosing and editing

Examine the distribution of each variable using both visual and numerical means
Find “errors” in the data (e.g., improperly coded data values, implausible or invalid data values, out-of-range data values, duplicated data values)
Rectify the “errors” found in the data
Make a mental note of the distribution of the “error”-free data (e.g., is the distribution normal; if not, does it exhibit skewness, multi-modality, gaps, or outliers?)
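
The screening step can be sketched as a set of rule-based checks. This is a hedged Python illustration rather than the SAS approach taught in the course; the records, the list of valid codes and the plausible age range are assumptions made up for the example.

```python
from collections import Counter

# Hypothetical records; the valid codes and age range are assumed rules.
records = [
    {"id": 101, "sex": "F", "age": 34},
    {"id": 102, "sex": "M", "age": 212},  # implausible age
    {"id": 103, "sex": "X", "age": 58},   # code not in the assumed code list
    {"id": 101, "sex": "F", "age": 34},   # duplicated identifier
]
VALID_SEX = {"F", "M"}
AGE_RANGE = (0, 120)


def screen(rows):
    """Return (row index, problem) pairs for every rule a row violates."""
    id_counts = Counter(r["id"] for r in rows)
    problems = []
    for i, r in enumerate(rows):
        if r["sex"] not in VALID_SEX:
            problems.append((i, "invalid sex code"))
        if not AGE_RANGE[0] <= r["age"] <= AGE_RANGE[1]:
            problems.append((i, "age out of range"))
        if id_counts[r["id"]] > 1:
            problems.append((i, "duplicate id"))
    return problems


problems = screen(records)
for index, problem in problems:
    print(index, problem)
```

Flagging problems rather than silently fixing them keeps a record of what was found, so each correction can be reviewed and documented.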

Session 4: Date and time manipulation and sequencing

Work with variables recorded at different time scales (e.g., hour, day, month, year)
Perform various operations involving date and time variables (e.g., computing the time between two dates, computing the number of events that occurred within a specified time interval)
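
The two date operations named above can be sketched as follows. This is a Python illustration under assumed data (the course itself uses SAS); the admission, discharge and visit dates and the 30-day window are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical dates for one patient.
admit = date(2016, 1, 28)
discharge = date(2016, 2, 3)
visit_dates = [
    date(2016, 1, 15),
    date(2016, 2, 10),
    date(2016, 2, 25),
    date(2016, 3, 20),
]

# Time between two dates, in days.
length_of_stay = (discharge - admit).days

# Number of events within a specified interval:
# visits in the 30 days following discharge (assumed window).
window_end = discharge + timedelta(days=30)
visits_in_window = sum(1 for d in visit_dates if discharge < d <= window_end)

print(length_of_stay, visits_in_window)
```

Subtracting two dates gives a duration object, so month lengths and leap years are handled by the library rather than by hand.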

Session 5: Transformation, grouping, deriving

Create new (numeric) variables by transforming available (numeric) variables using transformations such as log, square root, etc.
In data sets that exhibit a natural grouping structure (e.g., patients nested in medical clinics, patients from different health authorities), create new variables by aggregating information across the groups present in the data
Cluster variables (e.g., hospitals, clinics, etc.) that need to be modelled as random variables
Create categorical variables from continuous variables
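
A minimal sketch of three of these steps (a log transformation, a categorical variable derived from a continuous one, and a group-level aggregate) is shown below. It is written in Python for illustration only, since the course works in SAS; the data, the age cut points and the variable names are assumptions.

```python
import math
from collections import defaultdict

# Hypothetical patient-level rows nested in clinics.
rows = [
    {"clinic": "A", "cost": 100.0, "age": 34},
    {"clinic": "A", "cost": 1000.0, "age": 70},
    {"clinic": "B", "cost": 10.0, "age": 17},
]


def age_group(age):
    """Categorize a continuous age using assumed cut points."""
    if age < 18:
        return "0-17"
    if age < 65:
        return "18-64"
    return "65+"


# Transform and derive new patient-level variables.
for r in rows:
    r["log_cost"] = math.log10(r["cost"])
    r["age_group"] = age_group(r["age"])

# Aggregate across groups: mean cost per clinic.
costs_by_clinic = defaultdict(list)
for r in rows:
    costs_by_clinic[r["clinic"]].append(r["cost"])
clinic_mean_cost = {c: sum(v) / len(v) for c, v in costs_by_clinic.items()}

# Attach the group-level value back to each patient-level row.
for r in rows:
    r["clinic_mean_cost"] = clinic_mean_cost[r["clinic"]]
```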

Session 6: Analysis and interpretation

Making comparisons
- Standardization and risk adjustment
Understanding limitations
Interpreting results
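
As a small illustration of the standardization topic listed above, the sketch below computes a directly standardized rate. It is a Python example under assumed numbers, not the SAS material from the course; the strata, event counts and standard population weights are hypothetical.

```python
# Hypothetical counts by age stratum for one region.
strata = {
    "young": {"events": 10, "population": 1000},
    "old": {"events": 50, "population": 1000},
}
# Assumed standard population weights (must sum to 1).
standard_weights = {"young": 0.75, "old": 0.25}

# Crude rate: total events over total population.
crude_rate = (
    sum(s["events"] for s in strata.values())
    / sum(s["population"] for s in strata.values())
)

# Direct standardization: weight each stratum-specific rate
# by that stratum's share of the standard population.
standardized_rate = sum(
    (s["events"] / s["population"]) * standard_weights[name]
    for name, s in strata.items()
)

print(round(crude_rate, 4), round(standardized_rate, 4))
```

The crude and standardized rates differ here because the region's age mix differs from the standard population, which is the reason to standardize before comparing regions.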
