Data Mining for Business Analytics

Concepts, Techniques, and Applications in R

Galit Shmueli, Peter Bruce, Inbal Yahav, Nitin R. Patel, Kenneth C. Lichtendahl

Book (hardcover, English)
€106.99
incl. statutory VAT
Available for delivery within 3 weeks
Free shipping

Other formats

Hardcover

from €106.99


eBook

from €103.99


Description

Data Mining for Business Analytics: Concepts, Techniques, and Applications in R presents an applied approach to data mining concepts and methods, using R software for illustration.
Readers will learn how to implement a variety of popular data mining algorithms in R (a free and open-source software) to tackle business problems and opportunities.
This is the fifth version of this successful text, and the first using R. It covers both statistical and machine learning algorithms for prediction, classification, visualization, dimension reduction, recommender systems, clustering, text mining and network analysis. It also includes:
* Two new co-authors, Inbal Yahav and Casey Lichtendahl, who bring expertise in teaching business analytics courses using R as well as data mining consulting experience in business and government
* Updates and new material based on feedback from instructors teaching MBA, undergraduate, diploma and executive courses, and from their students
* More than a dozen case studies demonstrating applications for the data mining techniques described
* End-of-chapter exercises that help readers gauge and expand their comprehension of, and competency with, the material presented
* A companion website with more than two dozen data sets, and instructor materials including exercise solutions, PowerPoint slides, and case solutions
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R is an ideal textbook for graduate and upper-undergraduate level courses in data mining, predictive analytics, and business analytics. This new edition is also an excellent reference for analysts, researchers, and practitioners working with quantitative methods in the fields of business, finance, marketing, computer science, and information technology.
" This book has by far the most comprehensive review of business analytics methods that I have ever seen, covering everything from classical approaches such as linear and logistic regression, through to modern methods like neural networks, bagging and boosting, and even much more business specific procedures such as social network analysis and text mining. If not the bible, it is at the least a definitive manual on the subject."
Gareth M. James, University of Southern California and co-author (with Witten, Hastie and Tibshirani) of the best-selling book An Introduction to Statistical Learning, with Applications in R
Galit Shmueli, PhD, is Distinguished Professor at National Tsing Hua University's Institute of Service Science. She has designed and instructed data mining courses since 2004 at University of Maryland, Statistics.com, Indian School of Business, and National Tsing Hua University, Taiwan. Professor Shmueli is known for her research and teaching in business analytics, with a focus on statistical and data mining methods in information systems and healthcare. She has authored over 70 publications including books.
Peter C. Bruce is President and Founder of the Institute for Statistics Education at Statistics.com. He has written multiple journal articles and is the developer of Resampling Stats software. He is the author of Introductory Statistics and Analytics: A Resampling Perspective (Wiley) and co-author of Practical Statistics for Data Scientists: 50 Essential Concepts (O'Reilly).
Inbal Yahav, PhD, is Professor at the Graduate School of Business Administration at Bar-Ilan University, Israel. She teaches courses in social network analysis, advanced research methods, and software quality assurance. Dr. Yahav received her PhD in Operations Research and Data Mining from the University of Maryland, College Park.
Nitin R. Patel, PhD, is Chairman and cofounder of Cytel, Inc., based in Cambridge, Massachusetts. A Fellow of the American Statistical Association, Dr. Patel has also served as a Visiting Professor at the Massachusetts Institute of Technology and at Harvard University. He is a Fellow of the Computer Society of India and was a professor at the Indian Institute of Management, Ahmedabad, for 15 years.
Kenneth C. Lichtendahl, Jr., PhD, is Associate Professor at the University of Virginia. He is the Eleanor F. and Phillip G. Rust Professor of Business Administration and teaches MBA courses in decision analysis, data analysis and optimization, and managerial quantitative analysis. He also teaches executive education courses in strategic analysis and decision-making, and managing the corporate aviation function.


Product details

Binding: Hardcover
Publication date: September 5, 2017
Publisher: John Wiley & Sons Inc
Number of pages: 576
Dimensions (L/W/H): 26.3/18.7/3.8 cm
Weight: 1,163 g
Edition: 1st edition
Language: English
ISBN: 978-1-118-87936-8

  • Contents
    Foreword by Gareth James xix
    Foreword by Ravi Bapna xxi
    Preface to the R Edition xxiii
    Acknowledgments xxvii
    PART I PRELIMINARIES
    CHAPTER 1 Introduction 3
    1.1 What Is Business Analytics? 3
    1.2 What Is Data Mining? 5
    1.3 Data Mining and Related Terms 5
    1.4 Big Data 6
    1.5 Data Science 7
    1.6 Why Are There So Many Different Methods? 8
    1.7 Terminology and Notation 9
    1.8 Road Maps to This Book 11
    Order of Topics 11
    CHAPTER 2 Overview of the Data Mining Process 15
    2.1 Introduction 15
    2.2 Core Ideas in Data Mining 16
    Classification 16
    Prediction 16
    Association Rules and Recommendation Systems 16
    Predictive Analytics 17
    Data Reduction and Dimension Reduction 17
    Data Exploration and Visualization 17
    Supervised and Unsupervised Learning 18
    2.3 The Steps in Data Mining 19
    2.4 Preliminary Steps 21
    Organization of Datasets 21
    Predicting Home Values in the West Roxbury Neighborhood 21
    Loading and Looking at the Data in R 22
    Sampling from a Database 24
    Oversampling Rare Events in Classification Tasks 25
    Preprocessing and Cleaning the Data 26
    2.5 Predictive Power and Overfitting 33
    Overfitting 33
    Creation and Use of Data Partitions 35
    2.6 Building a Predictive Model 38
    Modeling Process 39
    2.7 Using R for Data Mining on a Local Machine 43
    2.8 Automating Data Mining Solutions 43
    Data Mining Software: The State of the Market (by Herb Edelstein) 45
    Problems 49
    PART II DATA EXPLORATION AND DIMENSION REDUCTION
    CHAPTER 3 Data Visualization 55
    3.1 Uses of Data Visualization 55
    Base R or ggplot? 57
    3.2 Data Examples 57
    Example 1: Boston Housing Data 57
    Example 2: Ridership on Amtrak Trains 59
    3.3 Basic Charts: Bar Charts, Line Graphs, and Scatter Plots 59
    Distribution Plots: Boxplots and Histograms 61
    Heatmaps: Visualizing Correlations and Missing Values 64
    3.4 Multidimensional Visualization 67
    Adding Variables: Color, Size, Shape, Multiple Panels, and Animation 67
    Manipulations: Rescaling, Aggregation and Hierarchies, Zooming, Filtering 70
    Reference: Trend Lines and Labels 74
    Scaling up to Large Datasets 74
    Multivariate Plot: Parallel Coordinates Plot 75
    Interactive Visualization 77
    3.5 Specialized Visualizations 80
    Visualizing Networked Data 80
    Visualizing Hierarchical Data: Treemaps 82
    Visualizing Geographical Data: Map Charts 83
    3.6 Summary: Major Visualizations and Operations, by Data Mining Goal 86
    Prediction 86
    Classification 86
    Time Series Forecasting 86
    Unsupervised Learning 87
    Problems 88
    CHAPTER 4 Dimension Reduction 91
    4.1 Introduction 91
    4.2 Curse of Dimensionality 92
    4.3 Practical Considerations 92
    Example 1: House Prices in Boston 93
    4.4 Data Summaries 94
    Summary Statistics 94
    Aggregation and Pivot Tables 96
    4.5 Correlation Analysis 97
    4.6 Reducing the Number of Categories in Categorical Variables 99
    4.7 Converting a Categorical Variable to a Numerical Variable 99
    4.8 Principal Components Analysis 101
    Example 2: Breakfast Cereals 101
    Principal Components 106
    Normalizing the Data 107
    Using Principal Components for Classification and Prediction 109
    4.9 Dimension Reduction Using Regression Models 111
    4.10 Dimension Reduction Using Classification and Regression Trees 111
    Problems 112
    PART III PERFORMANCE EVALUATION
    CHAPTER 5 Evaluating Predictive Performance 117
    5.1 Introduction 117
    5.2 Evaluating Predictive Performance 118
    Naive Benchmark: The Average 118
    Prediction Accuracy Measures 119
    Comparing Training and Validation Performance 121
    Lift Chart 121
    5.3 Judging Classifier Performance 122
    Benchmark: The Naive Rule 124
    Class Separation 124
    The Confusion (Classification) Matrix 124
    Using the Validation Data 126
    Accuracy Measures 126
    Propensities and Cutoff for Classification 127
    Performance in Case of Unequal Importance of Classes 131
    Asymmetric Misclassification Costs 133
    Generalization to More Than Two Classes 135
    5.4 Judging Ranking Performance 136
    Lift Charts for Binary Data 136
    Decile Lift Charts 138
    Beyond Two Classes 139
    Lift Charts Incorporating Costs and Benefits 139
    Lift as a Function of Cutoff 140
    5.5 Oversampling 140
    Oversampling the Training Set 144
    Evaluating Model Performance Using a Non-oversampled Validation Set 144
    Evaluating Model Performance if Only Oversampled Validation Set Exists 144
    Problems 147
    PART IV PREDICTION AND CLASSIFICATION METHODS
    CHAPTER 6 Multiple Linear Regression 153
    6.1 Introduction 153
    6.2 Explanatory vs. Predictive Modeling 154
    6.3 Estimating the Regression Equation and Prediction 156
    Example: Predicting the Price of Used Toyota Corolla Cars 156
    6.4 Variable Selection in Linear Regression 161
    Reducing the Number of Predictors 161
    How to Reduce the Number of Predictors 162
    Problems 169
    CHAPTER 7 k-Nearest Neighbors (kNN) 173
    7.1 The k-NN Classifier (Categorical Outcome) 173
    Determining Neighbors 173
    Classification Rule 174
    Example: Riding Mowers 175
    Choosing k 176
    Setting the Cutoff Value 179
    k-NN with More Than Two Classes 180
    Converting Categorical Variables to Binary Dummies 180
    7.2 k-NN for a Numerical Outcome 180
    7.3 Advantages and Shortcomings of k-NN Algorithms 182
    Problems 184
    CHAPTER 8 The Naive Bayes Classifier 187
    8.1 Introduction 187
    Cutoff Probability Method 188
    Conditional Probability 188
    Example 1: Predicting Fraudulent Financial Reporting 188
    8.2 Applying the Full (Exact) Bayesian Classifier 189
    Using the "Assign to the Most Probable Class" Method 190
    Using the Cutoff Probability Method 190
    Practical Difficulty with the Complete (Exact) Bayes Procedure 190
    Solution: Naive Bayes 191
    The Naive Bayes Assumption of Conditional Independence 192
    Using the Cutoff Probability Method 192
    Example 2: Predicting Fraudulent Financial Reports, Two Predictors 193
    Example 3: Predicting Delayed Flights 194
    8.3 Advantages and Shortcomings of the Naive Bayes Classifier 199
    Problems 202
    CHAPTER 9 Classification and Regression Trees 205
    9.1 Introduction 205
    9.2 Classification Trees 207
    Recursive Partitioning 207
    Example 1: Riding Mowers 207
    Measures of Impurity 210
    Tree Structure 214
    Classifying a New Record 214
    9.3 Evaluating the Performance of a Classification Tree 215
    Example 2: Acceptance of Personal Loan 215
    9.4 Avoiding Overfitting 216
    Stopping Tree Growth: Conditional Inference Trees 221
    Pruning the Tree 222
    Cross-Validation 222
    Best-Pruned Tree 224
    9.5 Classification Rules from Trees 226
    9.6 Classification Trees for More Than Two Classes 227
    9.7 Regression Trees 227
    Prediction 228
    Measuring Impurity 228
    Evaluating Performance 229
    9.8 Improving Prediction: Random Forests and Boosted Trees 229
    Random Forests 229
    Boosted Trees 231
    9.9 Advantages and Weaknesses of a Tree 232
    Problems 234
    CHAPTER 10 Logistic Regression 237
    10.1 Introduction 237
    10.2 The Logistic Regression Model 239
    10.3 Example: Acceptance of Personal Loan 240
    Model with a Single Predictor 241
    Estimating the Logistic Model from Data: Computing Parameter Estimates 243
    Interpreting Results in Terms of Odds (for a Profiling Goal) 244
    10.4 Evaluating Classification Performance 247
    Variable Selection 248
    10.5 Example of Complete Analysis: Predicting Delayed Flights 250
    Data Preprocessing 251
    Model-Fitting and Estimation 254
    Model Interpretation 254
    Model Performance 254
    Variable Selection 257
    10.6 Appendix: Logistic Regression for Profiling 259
    Appendix A: Why Linear Regression Is Problematic for a Categorical Outcome 259
    Appendix B: Evaluating Explanatory Power 261
    Appendix C: Logistic Regression for More Than Two Classes 264
    Problems 268
    CHAPTER 11 Neural Nets 271
    11.1 Introduction 271
    11.2 Concept and Structure of a Neural Network 272
    11.3 Fitting a Network to Data 273
    Example 1: Tiny Dataset 273
    Computing Output of Nodes 274
    Preprocessing the Data 277
    Training the Model 278
    Example 2: Classifying Accident Severity 282
    Avoiding Overfitting 283
    Using the Output for Prediction and Classification 283
    11.4 Required User Input 285
    11.5 Exploring the Relationship Between Predictors and Outcome 287
    11.6 Advantages and Weaknesses of Neural Networks 288
    Problems 290
    CHAPTER 12 Discriminant Analysis 293
    12.1 Introduction 293
    Example 1: Riding Mowers 294
    Example 2: Personal Loan Acceptance 294
    12.2 Distance of a Record from a Class 296
    12.3 Fisher's Linear Classification Functions 297
    12.4 Classification Performance of Discriminant Analysis 300
    12.5 Prior Probabilities 302
    12.6 Unequal Misclassification Costs 302
    12.7 Classifying More Than Two Classes 303
    Example 3: Medical Dispatch to Accident Scenes 303
    12.8 Advantages and Weaknesses 306
    Problems 307
    CHAPTER 13 Combining Methods: Ensembles and Uplift Modeling 311
    13.1 Ensembles 311
    Why Ensembles Can Improve Predictive Power 312
    Simple Averaging 314
    Bagging 315
    Boosting 315
    Bagging and Boosting in R 315
    Advantages and Weaknesses of Ensembles 315
    13.2 Uplift (Persuasion) Modeling 317
    A-B Testing 318
    Uplift 318
    Gathering the Data 319
    A Simple Model 320
    Modeling Individual Uplift 321
    Computing Uplift with R 322
    Using the Results of an Uplift Model 322
    13.3 Summary 324
    Problems 325
    PART V MINING RELATIONSHIPS AMONG RECORDS
    CHAPTER 14 Association Rules and Collaborative Filtering 329
    14.1 Association Rules 329
    Discovering Association Rules in Transaction Databases 330
    Example 1: Synthetic Data on Purchases of Phone Faceplates 330
    Generating Candidate Rules 330
    The Apriori Algorithm 333
    Selecting Strong Rules 333
    Data Format 335
    The Process of Rule Selection 336
    Interpreting the Results 337
    Rules and Chance 339
    Example 2: Rules for Similar Book Purchases 340
    14.2 Collaborative Filtering 342
    Data Type and Format 343
    Example 3: Netflix Prize Contest 343
    User-Based Collaborative Filtering: "People Like You" 344
    Item-Based Collaborative Filtering 347
    Advantages and Weaknesses of Collaborative Filtering 348
    Collaborative Filtering vs. Association Rules 349
    14.3 Summary 351
    Problems 352
    CHAPTER 15 Cluster Analysis 357
    15.1 Introduction 357
    Example: Public Utilities 359
    15.2 Measuring Distance Between Two Records 361
    Euclidean Distance 361
    Normalizing Numerical Measurements 362
    Other Distance Measures for Numerical Data 362
    Distance Measures for Categorical Data 365
    Distance Measures for Mixed Data 366
    15.3 Measuring Distance Between Two Clusters 366
    Minimum Distance 366
    Maximum Distance 366
    Average Distance 367
    Centroid Distance 367
    15.4 Hierarchical (Agglomerative) Clustering 368
    Single Linkage 369
    Complete Linkage 370
    Average Linkage 370
    Centroid Linkage 370
    Ward's Method 370
    Dendrograms: Displaying Clustering Process and Results 371
    Validating Clusters 373
    Limitations of Hierarchical Clustering 375
    15.5 Non-Hierarchical Clustering: The k-Means Algorithm 376
    Choosing the Number of Clusters (k) 377
    Problems 382
    PART VI FORECASTING TIME SERIES
    CHAPTER 16 Handling Time Series 387
    16.1 Introduction 387
    16.2 Descriptive vs. Predictive Modeling 389
    16.3 Popular Forecasting Methods in Business 389
    Combining Methods 389
    16.4 Time Series Components 390
    Example: Ridership on Amtrak Trains 390
    16.5 Data-Partitioning and Performance Evaluation 395
    Benchmark Performance: Naive Forecasts 395
    Generating Future Forecasts 396
    Problems 398
    CHAPTER 17 Regression-Based Forecasting 401
    17.1 A Model with Trend 401
    Linear Trend 401
    Exponential Trend 405
    Polynomial Trend 407
    17.2 A Model with Seasonality 407
    17.3 A Model with Trend and Seasonality 411
    17.4 Autocorrelation and ARIMA Models 412
    Computing Autocorrelation 413
    Improving Forecasts by Integrating Autocorrelation Information 416
    Evaluating Predictability 420
    Problems 422
    CHAPTER 18 Smoothing Methods 433
    18.1 Introduction 433
    18.2 Moving Average 434
    Centered Moving Average for Visualization 434
    Trailing Moving Average for Forecasting 435
    Choosing Window Width (w) 439
    18.3 Simple Exponential Smoothing 439
    Choosing Smoothing Parameter 440
    Relation Between Moving Average and Simple Exponential Smoothing 440
    18.4 Advanced Exponential Smoothing 442
    Series with a Trend 442
    Series with a Trend and Seasonality 443
    Series with Seasonality (No Trend) 443
    Problems 446
    PART VII DATA ANALYTICS
    CHAPTER 19 Social Network Analytics 455
    19.1 Introduction 455
    19.2 Directed vs. Undirected Networks 457
    19.3 Visualizing and Analyzing Networks 458
    Graph Layout 458
    Edge List 460
    Adjacency Matrix 461
    Using Network Data in Classification and Prediction 461
    19.4 Social Data Metrics and Taxonomy 462
    Node-Level Centrality Metrics 463
    Egocentric Network 463
    Network Metrics 465
    19.5 Using Network Metrics in Prediction and Classification 467
    Link Prediction 467
    Entity Resolution 467
    Collaborative Filtering 468
    19.6 Collecting Social Network Data with R 471
    19.7 Advantages and Disadvantages 474
    Problems 476
    CHAPTER 20 Text Mining 479
    20.1 Introduction 479
    20.2 The Tabular Representation of Text: Term-Document Matrix and "Bag-of-Words" 480
    20.3 Bag-of-Words vs. Meaning Extraction at Document Level 481
    20.4 Preprocessing the Text 482
    Tokenization 484
    Text Reduction 485
    Presence/Absence vs. Frequency 487
    Term Frequency-Inverse Document Frequency (TF-IDF) 487
    From Terms to Concepts: Latent Semantic Indexing 488
    Extracting Meaning 489
    20.5 Implementing Data Mining Methods 489
    20.6 Example: Online Discussions on Autos and Electronics 490
    Importing and Labeling the Records 490
    Text Preprocessing in R 491
    Producing a Concept Matrix 491
    Fitting a Predictive Model 492
    Prediction 492
    20.7 Summary 494
    Problems 495
    PART VIII CASES
    CHAPTER 21 Cases 499
    21.1 Charles Book Club 499
    The Book Industry 499
    Database Marketing at Charles 500
    Data Mining Techniques 502
    Assignment 504
    21.2 German Credit 505
    Background 505
    Data 506
    Assignment 507
    21.3 Tayko Software Cataloger 510
    Background 510
    The Mailing Experiment 510
    Data 510
    Assignment 512
    21.4 Political Persuasion 513
    Background 513
    Predictive Analytics Arrives in US Politics 513
    Political Targeting 514
    Uplift 514
    Data 515
    Assignment 516
    21.5 Taxi Cancellations 517
    Business Situation 517
    Assignment 517
    21.6 Segmenting Consumers of Bath Soap 518
    Business Situation 518
    Key Problems 519
    Data 519
    Measuring Brand Loyalty 519
    Assignment 521
    21.7 Direct-Mail Fundraising 521
    Background 521
    Data 522
    Assignment 523
    21.8 Catalog Cross-Selling 524
    Background 524
    Assignment 524
    21.9 Predicting Bankruptcy 525
    Predicting Corporate Bankruptcy 525
    Assignment 526
    21.10 Time Series Case: Forecasting Public Transportation Demand 528
    Background 528
    Problem Description 528
    Available Data 528
    Assignment Goal 528
    Assignment 529
    Tips and Suggested Steps 529
    References 531
    Data Files Used in the Book 533
    Index 535