Data Mining

A one-day workshop

Given at SSC/IMS/WNAR 2001 and IISA 2002, and as an IIQP public course (Advanced Data Mining).
I also gave half of a similar workshop, Statistical Learning and Data Mining,
for the SSC Biostatistics Section at SSC 2004, with Rob Tibshirani.

Hugh Chipman

University of Waterloo

``Using data mining, businesses and organizations have discovered hidden knowledge in large databases that yield new customers, greater efficiency, and enormous profits.'' Is this fact or advertising hype? A bit of both? This one-day workshop will introduce and critically examine data mining techniques using real case studies, such as the data mining case study from the 2000 SSC meeting.

Data mining problems are characterized by large amounts of data (both in the number of observations and the number of variables recorded on each observation), application of flexible modelling techniques, and intensive computation. Research fields involved in data mining include statistics, machine learning (artificial intelligence), and databases. From a statistical perspective, many data mining tools could be described as flexible models and methods for exploratory data analysis.

For example, suppose a direct marketing company has a database of potential customers, with information on whether each responded to a direct mailing campaign. Data mining techniques could help answer such questions as ``Which customers are most likely to respond to a mailing?'', ``Are there groups (or segments) of customers with similar characteristics or behaviour?'', and ``Are there interesting relationships between customer characteristics?'' Some of the techniques discussed, such as logistic regression and hierarchical clustering, will probably be familiar to many statisticians; others, such as neural networks, decision trees, boosting, and interactive dynamic graphics, may be less so.
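
As a rough illustration of the first question, the following is a minimal Python sketch (the workshop examples themselves use R) that fits a logistic regression by gradient ascent and scores customers by estimated response probability. The data, the single predictor (number of past purchases), and all function names are hypothetical, invented only for this example.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Fit intercept and slopes by batch gradient ascent on the log-likelihood."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)  # w[0] is the intercept
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            pred = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = yi - pred  # residual drives the gradient
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w

def predict_proba(w, x):
    """Estimated probability that a customer with features x responds."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

# Made-up data: number of past purchases, and whether the customer responded.
purchases = [[0], [1], [1], [2], [2], [3], [3], [4], [5], [6]]
responded = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

w = fit_logistic(purchases, responded)
for x in ([0], [3], [6]):
    print(x[0], "past purchases -> estimated response probability",
          round(predict_proba(w, x), 3))
```

Customers would then be ranked by estimated probability, and the mailing sent to those at the top of the list. In practice one would use a tested routine (for example `glm` in R) rather than hand-written optimization.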

Because the presentation is vendor-neutral, it can separate the hype from the facts. Data mining will be related to recent and possible future research in statistics and computer science.

Course Outline

The workshop will include examples using the R statistical computing environment and the xgobi multivariate data visualization package. Both packages are freely available on the internet.

Introduction

Introductory examples and application areas
Data mining jargon
Databases
Data collection

Data Preparation

Data cleaning: outliers, missing values, etc.
Simple summaries

Graphical exploration of data

Effective graphical techniques
Visualizing low and high dimensional data with static plots (projections, smoothing, density estimation, methods of encoding information)
Dynamic and interactive graphics
Sampling for large datasets and informal hypothesis testing

Prediction and Classification

Linear models (regression, logistic regression)
K-nearest neighbours
Neural Nets
Trees
Multitudes of other models, and basis functions
Multiple models: methods for generating (bagging, boosting, Bayesian), selecting, and combining predictive models
Comparing classifiers: Gain/Lift charts, misclassification rate
Controlling model complexity
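
To make the classifier comparison measures listed above concrete, here is a small Python sketch of the misclassification rate and of lift at a given mailing depth. The definitions used are the common ones (lift is the response rate among the top-scored fraction divided by the overall response rate); the scores, labels, and function names are hypothetical.

```python
def misclassification_rate(y_true, y_pred):
    """Fraction of cases whose predicted class differs from the true class."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

def lift_at(y_true, scores, fraction):
    """Response rate in the top `fraction` of cases ranked by score,
    divided by the overall response rate (assumes at least one responder)."""
    n = len(y_true)
    k = max(1, round(fraction * n))  # number of cases in the top group
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(t for _, t in ranked[:k]) / k
    base_rate = sum(y_true) / n
    return top_rate / base_rate

# Made-up classifier scores and true responses for ten customers.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responded = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]

# Classify as "responder" when the score is at least 0.5.
predicted = [1 if s >= 0.5 else 0 for s in scores]

print("misclassification rate:", misclassification_rate(responded, predicted))
print("lift in top 30%:", lift_at(responded, scores, 0.3))
```

A lift chart plots this ratio over a range of fractions. A lift above 1 at a given depth means that mailing to the top-scored customers does better than mailing to a random sample of the same size.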

Clustering

Distance based methods
Mixture models
Graphical techniques
Scalability to large problems

Dependency modelling

Graphical models
Association rules
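
As a small illustration of association rules, the sketch below computes the usual support, confidence, and lift of a rule "A implies B" from a list of transactions. The market-basket data and function names are hypothetical, and the code is only a sketch of the definitions, not of an efficient rule-mining algorithm.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent): support of both over support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def rule_lift(transactions, antecedent, consequent):
    """Confidence divided by the support of the consequent; 1 suggests independence."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))

# Made-up market-basket data: each transaction is a set of items.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

print("support(bread, milk):", support(baskets, {"bread", "milk"}))
print("confidence(bread -> milk):", confidence(baskets, {"bread"}, {"milk"}))
print("lift(bread -> milk):", rule_lift(baskets, {"bread"}, {"milk"}))
```

On large databases the hard part is searching the space of candidate itemsets efficiently, which this sketch does not attempt.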

About the Speaker

Hugh Chipman is an associate professor and Canada Research Chair in Mathematical Modelling in the Department of Mathematics and Statistics at Acadia University. He received his PhD in statistics from the University of Waterloo in 1994. He was an assistant professor at the University of Chicago Graduate School of Business from 1994 to 1997, and then an assistant and associate professor at the University of Waterloo before returning to Acadia in 2004.

His research interests include tree models, variable selection in linear models, data mining, Bayesian methods, and industrial statistics. He has taught several introductory courses on data mining to people in industry through the Institute for the Improvement in Quality and Productivity, and also teaches a graduate-level course on topics related to data mining and the computational exploration of data.

Last updated November 10, 2004