Data Mining

A one-day workshop


Given at SSC/IMS/WNAR 2001, at IISA 2002, and as an IIQP public course (Advanced Data Mining).
I also gave half of a similar workshop on Statistical Learning and Data Mining
for the SSC Biostatistics Section, at SSC2004, with Rob Tibshirani.

Hugh Chipman

University of Waterloo

Mapping the taste of beer

About this picture

``Using data mining, businesses and organizations have discovered hidden knowledge in large databases that yield new customers, greater efficiency, and enormous profits.'' Is this fact or advertising hype? A bit of both? This one-day workshop will introduce and critically examine data mining techniques using real case studies, such as the data mining case study from the 2000 SSC meeting.

Data mining problems are characterized by large amounts of data (both in the number of observations and the number of variables recorded on each observation), application of flexible modelling techniques, and intensive computation. Research fields involved in data mining include statistics, machine learning (artificial intelligence), and databases. From a statistical perspective, many data mining tools could be described as flexible models and methods for exploratory data analysis.

For example, suppose a direct marketing company has a database of potential customers, including information on whether each responded to a direct mailing campaign. Data mining techniques can help answer questions such as ``What customers are most likely to respond to a mailing?'', ``Are there groups (or segments) of customers with similar characteristics or behaviour?'', and ``Are there interesting relationships between customer characteristics?'' Some of the techniques discussed, such as logistic regression or hierarchical clustering, will probably be familiar to many statisticians. Others, such as neural networks, decision trees, boosting, and interactive dynamic graphics, will also be covered.

By being vendor-neutral, the presentation will separate the hype from the facts. Data mining will be related to recent and possible future research in statistics and computer science.

Course Outline

The workshop will include examples using the R statistical computing environment and the xgobi multivariate data visualization package. Both are freely available on the internet.
  1. Introduction
  2. Data Preparation
  3. Graphical exploration of data
  4. Prediction and Classification
  5. Clustering
  6. Dependency modeling

About the Speaker

Hugh Chipman is an associate professor and Canada Research Chair in Mathematical Modelling in the Department of Mathematics and Statistics at Acadia University. He received his PhD in statistics from Waterloo in 1994. He was an assistant professor at the University of Chicago Graduate School of Business from 1994 to 1997, and then an assistant and associate professor at the University of Waterloo before returning to Acadia in 2004.

His research interests include tree models, variable selection in linear models, data mining, Bayesian methods, and industrial statistics. He has taught several introductory courses on data mining to people in industry through the Institute for the Improvement in Quality and Productivity, and also teaches a graduate-level course on topics related to data mining and the computational exploration of data.



Last updated November 10, 2004