Notes_7 IE

Transcript of Notes_7 IE

  • 8/11/2019 Notes_7 IE

    1/134

    Nonlinear Regression

    Reference reading:

pp. 27-33 and 564-567 for MLE

Ch. 13. Except you can just skim 13.2 (optimization algorithms) and 13.6 (neural networks)

pp. 453-457 for regression trees

pp. 458-464, 529-536, and Lab 5 for bootstrapping

  • 8/11/2019 Notes_7 IE

    2/134

    Maximum Likelihood Estimation (MLE)

Given a parametric model for a set of data, in general, how do you devise a good way to estimate the parameters? i.e., what criterion do you optimize?

MLE is a very general principle from which many parametric model estimators are derived. Many familiar estimators from a first course in statistics turn out to be MLEs.

The model fitting criteria in linear and logistic regression can both be derived as applications of MLE; likewise for many supervised learning models

When a researcher proposes a new model for a problem, they usually start with the MLE principle to fit the model

In statistical modeling software, choosing a method of model fitting is often related to choosing a statistical model for which the method of fitting is the corresponding MLE

  • 8/11/2019 Notes_7 IE

    3/134

    The MLE Principle

Suppose you have some parametric model to represent your data, with parameters denoted by θ = {θ1, θ2, . . ., θp}, and you want to fit the model (i.e., estimate the parameters) based on a random sample of data Y = {y1, y2, . . ., yn}.

Denote the joint distribution of the data by f(y1, y2, . . ., yn; θ1, θ2, . . ., θp), or f(Y; θ) for short. We call f(Y; θ):

the prob. distribution, when viewed as a function of Y, for fixed values of θ, or

the likelihood function, when viewed as a function of θ for the fixed values of Y in your actual data sample.

Basic MLE Principle: Take the estimates of θ to be the values that maximize the likelihood function f(Y; θ). We call these values the MLE of θ.

  • 8/11/2019 Notes_7 IE

    4/134

Example: Estimating μ and σ for a Normal Population

data: Y = {y1, y2, . . ., yn} (suppose i.i.d. sample)

model: Yi ~ NID(μ, σ²)

parameters: θ = {μ, σ} (p = 2)

marginal pdf of Yi:

f(y_i; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y_i-\mu)^2}{2\sigma^2}\right)

joint pdf of Y1, . . ., Yn (aka likelihood function):

f(Y; \mu, \sigma) = (2\pi)^{-n/2}\sigma^{-n}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right)

MLEs of μ and σ are the values that maximize f(Y; μ, σ)

  • 8/11/2019 Notes_7 IE

    5/134

Example: Estimating the Coefficients in Logistic Regression

data: Y = {y1, y2, . . ., yn} (suppose i.i.d. sample)

model: for i = 1, 2, . . ., n, Yi ~ Bernoulli with

p_i = \Pr(Y_i = 1 \mid \mathbf{x}_i) = \frac{\exp(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})}{1 + \exp(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})} = \frac{\exp(\mathbf{x}_i^T\boldsymbol\beta)}{1 + \exp(\mathbf{x}_i^T\boldsymbol\beta)}

where x_i = [1, x_{i1}, . . ., x_{ik}]^T (k predictor variables)

parameters: β = [β0, β1, . . ., βk]^T (p = k + 1)

marginal distribution of Yi:

f(y_i; \boldsymbol\beta) = p_i^{\,y_i}(1-p_i)^{\,1-y_i} = \begin{cases} p_i &: y_i = 1\\ 1-p_i &: y_i = 0\end{cases}

joint distribution of Y1, . . ., Yn:

f(Y; \boldsymbol\beta) = \prod_{i=1}^{n} p_i^{\,y_i}(1-p_i)^{\,1-y_i} = \prod_{i=1}^{n}\frac{\exp\!\left(y_i\,\mathbf{x}_i^T\boldsymbol\beta\right)}{1 + \exp\!\left(\mathbf{x}_i^T\boldsymbol\beta\right)}

  • 8/11/2019 Notes_7 IE

    6/134

[Figure: scatterplot of income vs. car_age, with points labeled by the binary response y = 0 or 1]

  • 8/11/2019 Notes_7 IE

    7/134

Nonlinear Regression Models and Nonlinear Least Squares

A general form of nonlinear regression model is Yi = g(x_i, θ) + ε_i, where:

Yi: response for observation i

x_i: vector of predictors for observation i

θ: vector of model parameters

g(x_i, θ): some parametric nonlinear function

ε_i: zero-mean random error for observation i

We will see shortly that if the random errors are Gaussian and independent of x, the MLE of θ is just nonlinear least squares

  • 8/11/2019 Notes_7 IE

    8/134

Example: Manufacturing Learning Curve

Y = relative efficiency of operation

x1 = facility indicator: x1 = 0 for facility A (older), x1 = 1 for facility B (modern)

x2 = week #

If there were only one facility, and the data looked like below, how would you model it?

[Figure: scatterplot of y vs. x2, with a horizontal reference level at y = 1.0]

  • 8/11/2019 Notes_7 IE

    9/134

    Discussion Points and Questions

If facilities A and B had different asymptotic efficiencies as in Fig. 13.5, how would you modify the model?

If facilities A and B had different exponential rates, how would you modify the model?

If the objective was to determine if the two facilities had different asymptotic efficiencies, how could you do this?

Are the formulae for t-tests, standard errors, etc. in a linear regression still valid? If not, how would you calculate and use the analogous quantities in nonlinear regression?

  • 8/11/2019 Notes_7 IE

    10/134

  • 8/11/2019 Notes_7 IE

    11/134

MLE for General Nonlinear Regression Model with Normal Errors

Yi = g(x_i, θ) + ε_i, with error distribution ε_i ~ NID(0, σ²)

view the x_i's as deterministic, not random

write Yi = μ_i + ε_i, with μ_i = g(x_i, θ) (to simplify notation), so Yi ~ NID(μ_i, σ²)

marginal pdf of Yi:

f(y_i; \theta, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y_i-\mu_i)^2}{2\sigma^2}\right)

joint pdf of Y1, . . ., Yn (aka likelihood function):

f(Y; \theta, \sigma) = (2\pi)^{-n/2}\sigma^{-n}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu_i)^2\right)

  • 8/11/2019 Notes_7 IE

    12/134

MLE of θ: Choose θ̂ to maximize f(Y; θ, σ)

i.e., minimize

\sum_{i=1}^{n}(y_i-\mu_i)^2 = \sum_{i=1}^{n}\bigl(y_i - g(\mathbf{x}_i,\theta)\bigr)^2

i.e., the MLE of θ for the general nonlinear regression model with i.i.d. Gaussian errors (that are independent of x) is "nonlinear least squares"

In general, we need optimization software to fit the model

  • 8/11/2019 Notes_7 IE

    13/134

    Summary of Steps in General MLE

1) Write out the form of the statistical model that you are using to represent the data

2) Find the marginal distribution of each individual observation Yi (for regression problems the x_i's are not treated as random, so you only need to find the marginal distribution of the Yi's given the x_i's)

3) From the marginal distributions in step (2), find the joint distribution f(Y; θ) of the entire set of data Y

4) If tractable, find an analytical expression for the θ that maximizes the likelihood f(Y; θ). Otherwise, use optimization software to minimize −log f(Y; θ)

5) The MLE of θ is the minimizer in step (4), and the Hessian can be used to assess statistical uncertainty (next topic)

  • 8/11/2019 Notes_7 IE

    14/134

    Relevant R Functions and Packages

nlm(): minimize a general nonlinear function, such as implementing MLE for a nonstandard model (but most specific statistical models in R have built-in MLE implementation)

nls(): nonlinear least squares

boot: bootstrapping package

cross-validation is built into many R modeling functions (as an optional argument or as a separate function like cv.tree or cv.glm), or it is not hard to write your own function

  • 8/11/2019 Notes_7 IE

    15/134

R commands for fitting learning curve example using the general optimizer nlm()

    MLC
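(The R code that followed here was cut off in the transcript. A minimal sketch of what it might look like, assuming the data are in a file MLC.txt with columns y, x1, x2 and a learning-curve model of the form y = θ0 + θ1·x1 + θ2·exp(θ3·x2) + ε; the file name, column names, and starting values are assumptions:)

MLC <- read.table("MLC.txt", header = TRUE)
# SSE as a function of theta; for Gaussian errors, minimizing this is the MLE (nonlinear LS)
sse <- function(theta, y, x1, x2) {
  mu <- theta[1] + theta[2]*x1 + theta[3]*exp(theta[4]*x2)
  sum((y - mu)^2)
}
# nlm() minimizes a general function, so pass it the SSE; hessian=TRUE for later uncertainty assessment
MLC.nlm <- nlm(sse, p = c(1, 0, -0.5, -0.1), y = MLC$y, x1 = MLC$x1, x2 = MLC$x2, hessian = TRUE)
MLC.nlm$estimate   # parameter estimates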

  • 8/11/2019 Notes_7 IE

    16/134

R commands for fitting learning curve example using the nonlinear LS function nls()

    MLC
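(The nls() code was also cut off. A sketch under the same assumed model form and data frame as above:)

MLC.nls <- nls(y ~ t0 + t1*x1 + t2*exp(t3*x2), data = MLC,
               start = list(t0 = 1, t1 = 0, t2 = -0.5, t3 = -0.1))
summary(MLC.nls)   # estimates with asymptotic SEs and the residual standard error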

  • 8/11/2019 Notes_7 IE

    17/134

    Statistical Uncertainty in Supervised Learning

With nonlinear regression models, the formulae for assessing statistical uncertainty in linear regression (e.g., F-tests and t-tests for significance of predictors, SEs and CIs for parameters, PIs and CIs for new observations, etc.) do not apply directly

Question: Why might we want to calculate SEs, CIs/PIs, do hypothesis tests, etc.?

For some nonlinear models, we can use approximate asymptotic analytical results (valid for sufficiently large sample size n) to assess statistical uncertainty

Fortunately, we have alternative computational approaches that apply to any nonlinear model:

Cross-validation for deciding which models are the best (which implies which terms belong in the model, among other things)

Bootstrap resampling (or bootstrapping for short) for SEs and CIs on the parameters and CIs and PIs on new observations

  • 8/11/2019 Notes_7 IE

    18/134

    Overview of Bootstrapping

You are given a sample of data of size n observations.

You have estimated some parameter(s) θ (call the estimate θ̂)

Objective: Estimate the sampling distribution of θ̂ and quantities like SE(θ̂) that are derived from it.

Problem: Hypothetically, if we knew the entire population, we could consider using simulation to draw many random samples (each of size n) from the population and calculate a different θ̂ for each sample.

We could construct a histogram of all the θ̂'s and take their sample standard deviation to be an estimate of SE(θ̂) for the single real sample. The problem is we only have the single sample and not the entire population.

  • 8/11/2019 Notes_7 IE

    19/134

Example: How you could use regular simulation to find the SE of a sample average, if you know the underlying distribution (for example, normal)

Generate say 10,000 samples, each of size n = 20, from an N(5.3, 0.4^2) distribution

Calculate the averages {ȳ_sim^(j): j = 1, 2, . . ., 10,000} for the 10,000 replicates

Take

\widehat{SE}(\bar y) = \sqrt{\frac{1}{10{,}000}\sum_{j=1}^{10{,}000}\Bigl(\bar y_{sim}^{(j)} - \bar{\bar y}_{sim}\Bigr)^2},\qquad \bar{\bar y}_{sim} = \frac{1}{10{,}000}\sum_{j=1}^{10{,}000}\bar y_{sim}^{(j)}

y     y(1)  y(2)  y(3)  y(4)
5.32  5.18  4.79  5.40  5.81
5.37  5.78  5.99  4.43  5.21
5.23  5.74  4.87  5.02  4.62
5.33  4.56  4.91  4.99  4.45
6.07  5.07  5.14  5.35  5.15
4.88  5.17  5.15  5.84  5.27
5.38  5.23  5.09  6.09  5.65
5.04  6.25  5.04  5.96  4.66
5.68  5.52  5.66  6.07  5.27
5.44  5.09  5.57  5.15  5.60
5.55  4.72  4.96  4.69  5.15
4.93  5.29  5.31  5.17  6.18
4.71  4.60  5.01  4.27  5.88
4.71  4.79  5.04  5.60  5.49
4.63  5.65  5.54  4.75  4.85
5.26  5.58  5.43  4.92  5.20
5.67  5.35  5.52  5.36  4.94
5.87  6.05  5.49  5.33  5.63
5.74  5.64  5.05  4.93  5.74
5.17  4.82  4.68  5.58  5.56
ave  5.30  5.30  5.21  5.25  5.32
SD   0.40

  • 8/11/2019 Notes_7 IE

    20/134

Example: How you could use bootstrapping to find the SE of a sample average, if you do NOT know the underlying distribution

Generate say 10,000 bootstrap samples, each of size n = 20, from your one real sample

Calculate the averages {ȳ^(b): b = 1, 2, . . ., 10,000} for the 10,000 replicates

Take

\widehat{SE}(\bar y) = \sqrt{\frac{1}{10{,}000}\sum_{b=1}^{10{,}000}\Bigl(\bar y^{(b)} - \bar{\bar y}\Bigr)^2},\qquad \bar{\bar y} = \frac{1}{10{,}000}\sum_{b=1}^{10{,}000}\bar y^{(b)}

y     y(1)  y(2)  y(3)  y(4)
5.32  5.44  5.04  5.38  5.55
5.37  4.63  5.87  4.71  5.74
5.23  5.67  4.93  5.68  6.07
5.33  4.71  4.93  5.23  4.63
6.07  4.71  5.87  5.44  5.67
4.88  4.71  5.23  4.88  5.68
5.38  5.37  5.33  5.38  4.71
5.04  5.38  5.87  4.71  5.23
5.68  5.26  5.04  5.55  5.23
5.44  5.55  5.44  5.23  5.17
5.55  4.63  4.88  5.17  5.23
4.93  5.68  6.07  5.23  5.68
4.71  5.68  4.93  5.33  5.26
4.71  5.67  5.23  4.71  5.17
4.63  5.87  5.17  5.17  4.63
5.26  5.44  5.37  5.04  5.23
5.67  4.88  5.23  5.23  6.07
5.87  5.33  5.33  5.37  5.74
5.74  5.32  5.23  5.68  4.88
5.17  5.23  5.33  5.32  5.37
ave  5.30  5.26  5.32  5.22  5.35
SD   0.40

  • 8/11/2019 Notes_7 IE

    21/134

    Bootstrapping overview continued

Solution: Make a pretend population that consists of your original sample of n observations, copied over and over, an infinite number of times. Then draw many "bootstrap" random samples (each of size n) from the pretend population and calculate a different θ̂ for each sample. You can construct a histogram of all the θ̂'s, take their sample standard deviation to be an estimate of SE(θ̂), etc.

How this is implemented: You do not have to actually copy your original sample over and over. The above construction of each bootstrap sample is equivalent to drawing a random sample of size n from the original sample of data (with replacement).
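(A minimal R sketch of this idea, not from the original slides: resample the single sample with replacement and use the SD of the resampled averages as the SE estimate. The simulated y here is just a stand-in for "your one real sample".)

set.seed(1)
y <- rnorm(20, mean = 5.3, sd = 0.4)   # stand-in for the single real sample of n = 20
B <- 10000
ybar.boot <- replicate(B, mean(sample(y, replace = TRUE)))   # bootstrap averages
sd(ybar.boot)                          # bootstrap estimate of SE(ybar)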

  • 8/11/2019 Notes_7 IE

    22/134

A Different Example (that has nothing to do with nonlinear regression)

Pop0 = population of all grains

Pop1 = population of all grains with thickness < 0.3 and equivalent diameter > 0.6

μR = mean aspect ratio for all grains in Pop1

f = (area of all crystals in Pop1) / (area of all crystals in Pop0) = fraction projected area of grains in Pop1

The patent claim is violated if f > 0.5 AND μR > 8

  • 8/11/2019 Notes_7 IE

    23/134

  • 8/11/2019 Notes_7 IE

    24/134

    Some Details: Bootstrapping in Nonlinear Regression

You have a sample of n observations of a response variable and a set of predictor variables: {(y_i, x_i): i = 1, . . ., n}

You fit a nonlinear regression model to the data to estimate a set of parameters θ

Let θ denote one of the parameters of interest and θ̂ its estimate.

Objective: Estimate the sampling distribution of θ̂, its standard error, a confidence interval for θ, etc.

To do this, follow the steps of the bootstrap procedure on the subsequent slides

  • 8/11/2019 Notes_7 IE

    25/134

    Steps of the Bootstrap Procedure

1) Generate a "bootstrap" sample (with replacement) of n observations from {(y_i, x_i): i = 1, . . ., n}. Denote the bootstrap sample by {(y_i^b, x_i^b): i = 1, . . ., n}

2) Fit the same type of regression model (with the same set of parameters θ and parameter θ of special interest) to the bootstrapped sample. Denote the estimate for the bootstrapped sample by θ̂^b

3) Pick a large number B (e.g., B = 10,000), and repeat Steps (1) and (2) a total of B times, which produces {θ̂^b: b = 1, 2, . . ., B}

  • 8/11/2019 Notes_7 IE

    26/134

    Steps of the Bootstrap Procedure, continued

4) Construct a histogram of {θ̂^b: b = 1, 2, . . ., B} and calculate:

\bar{\hat\theta} = \frac{1}{B}\sum_{b=1}^{B}\hat\theta^{\,b}   (average of all bootstrapped estimates)

\widehat{SE}(\hat\theta) = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\bigl(\hat\theta^{\,b}-\bar{\hat\theta}\bigr)^2}   (standard error of θ̂)

θ̂_{α/2} = upper α/2 quantile of the sample distribution of {θ̂^b: b = 1, . . ., B}

θ̂_{1−α/2} = lower α/2 quantile of the sample distribution of {θ̂^b: b = 1, . . ., B}

  • 8/11/2019 Notes_7 IE

    27/134

  • 8/11/2019 Notes_7 IE

    28/134

    Steps of the Bootstrap Procedure, continued

5) A crude 1 − α confidence interval for θ is:

\left[\hat\theta - z_{\alpha/2}\,\widehat{SE}(\hat\theta),\;\; \hat\theta + z_{\alpha/2}\,\widehat{SE}(\hat\theta)\right]

6) A better 1 − α confidence interval for θ (the reflected interval) is:

\left[2\hat\theta - \hat\theta_{\alpha/2},\;\; 2\hat\theta - \hat\theta_{1-\alpha/2}\right]

  • 8/11/2019 Notes_7 IE

    29/134

Example CI Calculations for θ0 for the Manu. Learning Curve

(from the left-most histogram two slides prior)

θ̂0 = 1.016,  SE(θ̂0) = 0.004,  θ̂0,0.025 = 1.023,  θ̂0,0.975 = 1.007

Crude 95% CI:

θ̂0 ± z_{0.025} SE(θ̂0) = 1.016 ± 1.96(0.004) = (1.008, 1.024)

Reflected 95% CI:

(2θ̂0 − θ̂0,0.025, 2θ̂0 − θ̂0,0.975) = (2(1.016) − 1.023, 2(1.016) − 1.007) = (1.009, 1.025)

  • 8/11/2019 Notes_7 IE

    30/134

    Discussion Points and Questions

What is the difference between the two CIs (crude versus reflected) on the previous slide?

In general, when would the two confidence intervals differ?

What are the effects of increasing B on the bootstrapped histogram of a parameter estimate? Would the histogram become tighter?

What are the effects of increasing n on the bootstrapped histogram of a parameter estimate? Would the histogram become tighter?

Why must n for each bootstrapped sample be the same as n for the real sample?

  • 8/11/2019 Notes_7 IE

    31/134

R commands for bootstrapping parameter SEs/CIs for the manufacturing learning curve

    library(boot) #need to load the boot package

    MLC
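(The code after library(boot) was cut off. A sketch of what the boot() call might look like, assuming the MLC data frame and nls() model form used earlier; the statistic function refits the model to each resampled data set and returns the coefficients, and R = 1000 matches the output shown two slides later.)

MLC.fit <- function(dat, ind) {
  fit <- nls(y ~ t0 + t1*x1 + t2*exp(t3*x2), data = dat[ind, ],
             start = list(t0 = 1, t1 = 0, t2 = -0.5, t3 = -0.1))
  coef(fit)
}
MLCboot <- boot(MLC, MLC.fit, R = 1000)   # MLCboot is the object used on the following slides
MLCboot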

  • 8/11/2019 Notes_7 IE

    32/134

  • 8/11/2019 Notes_7 IE

    33/134

    > plot(MLCboot,index=1)

    > boot.ci(MLCboot,conf=c(.9,.95,.99),index=1,type=c("norm","basic"))

    BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS

    Based on 1000 bootstrap replicates

    CALL :

    boot.ci(boot.out = MLCboot, conf = c(0.9, 0.95, 0.99), type = c("norm",

    "basic"), index = 1)

Intervals :

Level      Normal              Basic
90%   ( 1.010,  1.020 )   ( 1.011,  1.020 )
95%   ( 1.010,  1.021 )   ( 1.010,  1.022 )
99%   ( 1.008,  1.023 )   ( 1.009,  1.023 )

[Figure: plot(MLCboot, index=1) output: a histogram of t* (the bootstrapped estimates, roughly 1.005 to 1.025) and a normal Q-Q plot of t*]

  • 8/11/2019 Notes_7 IE

    34/134

    Discussion Points and Questions

1) In boot.ci, type = "norm" gives our crude CI based on the SE and the normal percentiles, but translated by subtracting out the estimated Bias (taken to be the bootstrap average minus the original parameter estimate); type = "basic" gives the better CI obtained by reflecting the percentiles.

2) How can we determine if there is statistically significant evidence that the asymptotic relative efficiencies of the two manufacturing facilities differ?

3) What is a 95% CI on the asymptotic relative efficiency of the older facility (x1 = 0)?

4) What is a 95% CI on the asymptotic relative efficiency of the newer facility (x1 = 1)?

5) In general, given the covariance matrix Σ of a random vector Z, the variance of the linear combination a^T Z is Var(a^T Z) = a^T Σ a

  • 8/11/2019 Notes_7 IE

    35/134

    Comments on Bootstrapping

Comparison of class versus textbook (KNN) notation:

                                                        Class                          KNN
bootstrap parameter estimate:                           θ̂^b                            b1*
upper α/2 percentile of distribution of
bootstrapped parameters:                                θ̂_{α/2}                        b1*(1 − α/2)
lower α/2 percentile of distribution of
bootstrapped parameters:                                θ̂_{1−α/2}                      b1*(α/2)
bootstrap sample of data:                               {(y_i^b, x_i^b): i = 1,...,n}  {(y_i*, x_i*): i = 1,...,n}

In R, use the "boot" command in the "boot" package

In Matlab, use the "bootstrp" command in the stats toolbox

  • 8/11/2019 Notes_7 IE

    36/134

    Some Common Blackbox Nonlinear Regressionand Classification Models

If you have knowledge of the structure of the relationship between Y and x, then the best approach is to use it (e.g., if you think it is a linear, exponential, quadratic, etc. relationship, then fit that model)

For many data sets (especially large "data mining" applications), we might doubt a linear model will fit but have no idea of the structure of the nonlinearities.

In this case, unless there are only a few predictors, polynomial (e.g., quadratic) models are not the preferred next step to try beyond linear models

Why not?

There are many blackbox nonlinear modeling approaches

We will cover some common ones (neural networks, CART models, nearest neighbors) that span the spectrum of methods

Almost all can be used equally well for either regression or classification

  • 8/11/2019 Notes_7 IE

    37/134

    Neural Networks

Clever original idea and memorable name: became very popular in the 1980s and 1990s.

They have evolved to have less resemblance to how the human brain processes information (but better effectiveness at modeling nonlinear relationships in complicated data sets)

To fit a neural network model (and all of the other blackbox models), the training data must be available in the same format as for linear/logistic regression:

A 2D array of observations

Each column is a different variable; each row a different case

One column is the response variable (Y) and the other columns are any number of predictor variables (X's)

The neural network hidden variables (H's) are internal variables that you do not enter or even care about

  • 8/11/2019 Notes_7 IE

    38/134

    Standard Graphical Depiction of a Neural Network

  • 8/11/2019 Notes_7 IE

    39/134

Mathematical Definition of What a Neural Network Model Really Is

each "node" represents an activation function (labeled as the function output, with function input a linear combo of the previous layer's function outputs)

X's: input (i.e., predictor) variables, in "input layer"

Y: output (i.e., response) variable, in "output layer"

H's: internal dummy variables, in "hidden layer"

α's and β's: model parameters, to be estimated

the NN model:

for m = 1, 2, . . ., M,

H_m = \frac{\exp(\alpha_{m,0} + \alpha_{m,1}x_1 + \cdots + \alpha_{m,k}x_k)}{1 + \exp(\alpha_{m,0} + \alpha_{m,1}x_1 + \cdots + \alpha_{m,k}x_k)}

Y = \frac{\exp(\beta_0 + \beta_1 H_1 + \cdots + \beta_M H_M)}{1 + \exp(\beta_0 + \beta_1 H_1 + \cdots + \beta_M H_M)} + \varepsilon

  • 8/11/2019 Notes_7 IE

    40/134

    Neural Network Activation Functions

For classification, it is common to use the same sigmoidal (logistic) activation function for each node:

h(z) = \frac{\exp(z)}{1 + \exp(z)} = \frac{1}{1 + \exp(-z)}

where z = linear combo of variables from previous layer

For regression, it is usually preferable to use sigmoidal activation functions for all hidden nodes and a linear activation function [i.e., h(z) = z] for the output layer nodes:

Y = \beta_0 + \beta_1 H_1 + \cdots + \beta_M H_M + \varepsilon

  • 8/11/2019 Notes_7 IE

    41/134

  • 8/11/2019 Notes_7 IE

    42/134

An S-shaped function with multivariate input

Recall that this is what the S-shaped logistic function looks like when there are multiple input variables

  • 8/11/2019 Notes_7 IE

    43/134

    Discussion Points and Questions

Y is an S-shaped (or sometimes linear) function of the dummy variables (H's), which are in turn S-shaped functions of the predictors (X's)

When you combine them together, substituting for the H's to get Y as a function of the X's, you can think of the neural network model as

Y = g(\mathbf{x}, \theta) + \varepsilon

for some (very messy) g(x, θ), with θ = {all α's and β's}

What kind of functional X-Y relationships can you capture with the neural network model structure?

  • 8/11/2019 Notes_7 IE

    44/134

    Fitting A Neural Network Model

1) Standardize predictors via

\tilde x_{ij} = \frac{x_{ij} - \bar x_j}{s_{x_j}}

where \bar x_j, s_{x_j} = average, stdev of jth predictor (jth column)

2) If using logistic output activation function, scale response to interval [0,1] via

\tilde y_i = \frac{y_i - y_{min}}{y_{max} - y_{min}}

Why do we need to do this rescaling for a logistic output activation function?

  • 8/11/2019 Notes_7 IE

    45/134

    Fitting A Neural Network Model, continued

3) Choose:

# hidden layers

# nodes in each hidden layer

output activation function (usually linear or logistic)

other options and tuning parameters (e.g., λ)

4) Software estimates parameters to minimize (nonlinear LS with shrinkage):

\sum_{i=1}^{n}\bigl(y_i - g(\mathbf{x}_i,\theta)\bigr)^2 + \lambda\left[\sum_{m=1}^{M}\sum_{j=0}^{k}\alpha_{m,j}^2 + \sum_{m=0}^{M}\beta_m^2\right]

g(x_i, θ) denotes the neural network response prediction

  • 8/11/2019 Notes_7 IE

    46/134

where

θ = {all α's and β's}

λ = user-chosen shrinkage parameter

g(\mathbf{x}_i,\theta) = \frac{\exp(\beta_0 + \beta_1 H_{i,1} + \cdots + \beta_M H_{i,M})}{1 + \exp(\beta_0 + \beta_1 H_{i,1} + \cdots + \beta_M H_{i,M})},\qquad H_{i,m} = \frac{\exp(\alpha_{m,0} + \alpha_{m,1}x_{i,1} + \cdots + \alpha_{m,k}x_{i,k})}{1 + \exp(\alpha_{m,0} + \alpha_{m,1}x_{i,1} + \cdots + \alpha_{m,k}x_{i,k})}

The shrinkage term is analogous to the term that we add to the SSE in ridge regression

Why do we need to include the shrinkage term when fitting a neural network, even if we have no multicollinearity?

  • 8/11/2019 Notes_7 IE

    47/134

    Example: Predictive Modeling of CPUperformance

Data in cpus.txt, which is the same as the cpus data in the MASS package

209 cases, with 9 variables and 6 predictor variables

perf is the response, which is CPU performance. Ignore estperf, which was another author's estimated performance.

The six numerical predictors are cycle time (nanoseconds), cache size (Kb), min and max main memory size (Kb), and min and max number of channels. See V&R for additional discussion

The objective is to learn the predictive relationship between CPU performance and the predictor variables

    Example with a bigger data set coming up shortly

  • 8/11/2019 Notes_7 IE

    48/134

    Neural Network Modeling of CPU data

#######R code for reading in cpus data set, taking log(response) and then converting to [0,1] interval, and standardizing predictors##############

    CPUS
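(The code was cut off here. A sketch of the described preprocessing, assuming cpus.txt has the same columns as MASS::cpus; the object name CPUS1 is taken from the later slides.)

CPUS <- read.table("cpus.txt", header = TRUE)
CPUS1 <- CPUS[, c("syct", "mmin", "mmax", "cach", "chmin", "chmax", "perf")]
CPUS1$perf <- log10(CPUS1$perf)                                                       # log of the response
CPUS1$perf <- (CPUS1$perf - min(CPUS1$perf)) / (max(CPUS1$perf) - min(CPUS1$perf))    # rescale to [0,1]
CPUS1[, 1:6] <- scale(CPUS1[, 1:6])                                                   # standardize predictors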

  • 8/11/2019 Notes_7 IE

    49/134

    Matrix scatterplot of transformed cpus data

[Figure: pairwise scatterplot matrix of the standardized predictors syct, mmin, mmax, cach, chmin, chmax and the rescaled response perf]

  • 8/11/2019 Notes_7 IE

    50/134

    CPUS Example Continued

    #############Fit a neural network model to the CPUS1 data####################

library(nnet)
cpus.nn1
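(The fit itself was cut off. A sketch, with size and decay values that are only illustrative assumptions; the logistic output activation is nnet's default, which is why perf was rescaled to [0,1].)

cpus.nn1 <- nnet(perf ~ ., data = CPUS1, size = 10, decay = 0.05, maxit = 1000)
summary(cpus.nn1)                       # fitted weights
yhat <- predict(cpus.nn1)
1 - sum((CPUS1$perf - yhat)^2)/sum((CPUS1$perf - mean(CPUS1$perf))^2)   # training r^2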

  • 8/11/2019 Notes_7 IE

    51/134

    Discussion Points and Questions

Why do we need to standardize the predictors (and the response variable when using a linear output activation function)?

How can we get r^2 for this example (the nnet function in R does not spit it out)?

Which predictor variables appear to be the most important, and what R output do we look at to determine this?

What value of λ will give us the smallest training SSE?

How can we decide the best value of λ?

  • 8/11/2019 Notes_7 IE

    52/134

    CPUS Example Continued

    #######A function to determine the indices in a CV partition##################

    CVInd
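(The CVInd function body was cut off. A sketch of a function that does what the comment describes: return a list of K index vectors forming a random partition of n observations.)

CVInd <- function(n, K) {
  m <- floor(n/K)              # approximate fold size
  I <- sample(n, n)            # random permutation of the row indices
  Ind <- list()
  for (k in 1:K) {
    if (k < K) Ind[[k]] <- I[((k-1)*m + 1):(k*m)]
    else Ind[[k]] <- I[((k-1)*m + 1):n]   # last fold gets the remainder
  }
  Ind
}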

  • 8/11/2019 Notes_7 IE

    53/134

    CPUS Example Continued

    ##Now use the same CV partition to compare Neural Net and linear reg models###

    Ind
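(The comparison code was cut off. A sketch of the idea: compute CV SSE for the neural net and for linear regression on the same random partition; K, size, and decay are assumed values.)

K <- 10
Ind <- CVInd(nrow(CPUS1), K)
yhat.nn <- yhat.lm <- rep(NA, nrow(CPUS1))
for (k in 1:K) {
  train <- CPUS1[-Ind[[k]], ]; test <- CPUS1[Ind[[k]], ]
  fit.nn <- nnet(perf ~ ., data = train, size = 10, decay = 0.05, maxit = 1000, trace = FALSE)
  fit.lm <- lm(perf ~ ., data = train)
  yhat.nn[Ind[[k]]] <- predict(fit.nn, test)
  yhat.lm[Ind[[k]]] <- predict(fit.lm, test)
}
c(NN = sum((CPUS1$perf - yhat.nn)^2), LM = sum((CPUS1$perf - yhat.lm)^2))   # CV SSEs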

  • 8/11/2019 Notes_7 IE

    54/134

    Discussion Points and Questions

The best value of λ is the value that results in the smallest CV SSE (or equivalently, the largest CV r^2, smallest CV SD(e), etc.).

How can we decide the best number of hidden layer nodes?

Why should we use the same CV partition when comparing two models?

What are the pros and cons of n-fold CV versus K-fold CV for some smaller K, e.g., 3, 5, or 10?

  • 8/11/2019 Notes_7 IE

    55/134

    Example: Predictive Modeling of Income Data

Data in adult_train.csv is from the 1994 US Census (also see http://archive.ics.uci.edu/ml/datasets/Census+Income)

32561 cases, with 15 variables. This is a small sample from the US census with 15 potentially relevant variables. Each row represents a "similar" population segment with weight given by "fnlwgt"

income has been converted to a binary categorical variable (<=50k vs. >50k) with roughly a 75%/25% population split

Later we will fit predictive models to classify income based on the other variables (classification). Here, the objective is to predict the number of hours per week spent working based on the other variables (regression)

This is already a very cleaned data set, but we may need to do a little additional cleaning

What should we do about the missing "?" values?

  • 8/11/2019 Notes_7 IE

    56/134

    The First Few Rows

age  workclass  fnlwgt  education  education-num  marital-status  occupation  relationship  race  sex  capital-gain  capital-loss  hours-per-week  native-country  income
39  State-gov  77516  Bachelors  13  Never-married  Adm-clerical  Not-in-family  White  Male  2174  0  40  United-States  <=50K

  • 8/11/2019 Notes_7 IE

    57/134

    Read in the Data

    XX
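(The read-in code was cut off. A minimal sketch:)

XX <- read.csv("adult_train.csv", stringsAsFactors = TRUE)
dim(XX); names(XX)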

  • 8/11/2019 Notes_7 IE

    58/134

    Some Preliminary Exploratory Analyses

    ##exploring individual variables

par(mfrow=c(2,3)); for (i in c(1,5,11,12,13)) hist(XX[[i]],xlab=names(XX)[i]); plot(XX[[15]])
par(mfrow=c(1,1)); plot(XX[[2]],cex.names=.7)

for (i in c(2,4,6,7,8,9,10,14,15)) print(table(XX[[i]])/nrow(XX))

Should we be concerned with anything here or do any further cleaning?

[Figures: histograms of age, education.num, capital.gain, capital.loss, and hours.per.week, and a barplot of the binary income factor]

  • 8/11/2019 Notes_7 IE

    59/134

    Some Preliminary Exploratory Analyses

    ##exploring pairwise predictor/response relationships

par(mfrow=c(2,1))
plot(jitter(XX$age,3),jitter(XX$hours.per.week,3),pch=16,cex=.5)
plot(jitter(XX$education.num,3),jitter(XX$hours.per.week,3),pch=16,cex=.5)
par(mfrow=c(1,1))
barplot(tapply(XX$hours.per.week,XX$education,mean),ylim=c(30,50),cex.names=.7,xpd=F)
for (i in c(2,4,6,7,8,9,14,15)) {print(tapply(XX$hours.per.week,XX[[i]],mean)); cat("\n")}

Some points to consider regarding correlation versus functional dependence (points that apply to ANY regression analysis):

If hours.per.week appears correlated with another variable, it does not mean that hours.per.week has a functional dependence on that variable

The two could appear correlated because they both depend on another variable (either one of the existing variables or an unrecorded nuisance variable)

If you have recorded enough nuisance variables, a multiple regression analysis can sometimes distinguish which correlations are truly due to a functional dependence

If your goal is pure prediction (and not explanatory), does it matter?

  • 8/11/2019 Notes_7 IE

    60/134

[Figure: barplot of mean hours.per.week (roughly 30 to 50) by education category: 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, Bachelors, Doctorate, HS-grad, Masters, Preschool, Prof-school, Some-college]

  • 8/11/2019 Notes_7 IE

    61/134

    A Typical Next Step in Predictive Modeling

    ##linear regression with all predictors included

    Inc.lm
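(The lm() call was cut off. A sketch fitting all predictors; whether any columns were excluded in the original is not recoverable from the transcript.)

Inc.lm <- lm(hours.per.week ~ ., data = XX)
summary(Inc.lm)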

  • 8/11/2019 Notes_7 IE

    62/134

    Some Typical Next Steps

    ##linear regression including interactions

    Inc.lm.full
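(Also cut off. A sketch of a larger model with two-way interactions, which could then be compared to the main-effects model by a partial F-test or CV.)

Inc.lm.full <- lm(hours.per.week ~ .^2, data = XX)
anova(Inc.lm, Inc.lm.full)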

  • 8/11/2019 Notes_7 IE

    63/134

    Now Try a Neural Network Model

    ##Neural network model

library(nnet)
Inc.nn1
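(The neural network call was cut off. A sketch; size, decay, and maxit are illustrative assumptions, and linout = TRUE gives the linear output activation appropriate for this regression response.)

Inc.nn1 <- nnet(hours.per.week ~ ., data = XX, size = 10, decay = 0.1,
                linout = TRUE, maxit = 500, trace = FALSE)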

  • 8/11/2019 Notes_7 IE

    64/134

    Multi-Response Neural Networks

Neural networks also apply to the situation in which we have more than one (say K) response variables

We handle this by including K nodes in the output layer (see the following slide)

This is different than fitting K separate neural networks, one for each response, because the K responses share the same hidden layer node functions

This is generally more effective than fitting K separate neural network models if the response variables have similar functional dependencies on the predictors. If the responses have completely different dependencies on the predictors, then you are better off fitting K separate neural network models

  • 8/11/2019 Notes_7 IE

    65/134

Graphical Depiction of Neural Network with K Response Variables

  • 8/11/2019 Notes_7 IE

    66/134

    Neural Networks for Classification

The most common application of multi-response neural networks is for classification when we have a categorical response with K categories (aka classes). Note that this also applies to binary responses (K = 2)

To handle this (most software does this internally), make a K-length 0/1 response vector, e.g., for the fgl data:

Type    y1 y2 y3 y4 y5 y6
WinF     1  0  0  0  0  0
WinNF    0  1  0  0  0  0
Veh      0  0  1  0  0  0
Con      0  0  0  1  0  0
Tabl     0  0  0  0  1  0
Head     0  0  0  0  0  1

  • 8/11/2019 Notes_7 IE

    67/134

    Example: Predicting Glass Type in Forensics

Data in fgl.txt, which is the same as the fgl data in the MASS package. See V&R for additional discussion

214 cases, with 9 predictor variables and a categorical response

Each row contains the results of an analysis of a fragment of glass

type is the response, one of six different glass types: window float glass (WinF: 70 rows), window non-float glass (WinNF: 76 rows), vehicle window glass (Veh: 17 rows), containers (Con: 13 rows), tableware (Tabl: 9 rows), and vehicle headlamps (Head: 29 rows).

Eight of the predictors are the chemical composition of the fragment, and the ninth (RI) is the refractive index

The objective is to train a predictive model to predict the glass type based on a fragment of the glass, for forensic purposes

  • 8/11/2019 Notes_7 IE

    68/134

    Read the Data and Transform some Variables

    ######Read data, convert response to binary, and standardize predictors#####

    FGL
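(The code was cut off. A sketch of the described steps, assuming fgl.txt has the same columns as MASS::fgl; whether Veh was grouped with the window classes in the binary response is an assumption.)

FGL <- read.table("fgl.txt", header = TRUE)
FGL1 <- FGL
FGL1[, 1:9] <- scale(FGL1[, 1:9])                 # standardize the nine numeric predictors
FGL1$type.bin <- factor(ifelse(FGL$type %in% c("WinF", "WinNF", "Veh"), "Win", "Other"))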

  • 8/11/2019 Notes_7 IE

    69/134

    First Few Rows of fgl.txt data

    RI Na Mg Al Si K Ca Ba Fe type

    3.01 13.64 4.49 1.1 71.78 0.06 8.75 0 0 WinF

    -0.39 13.89 3.6 1.36 72.73 0.48 7.83 0 0 WinF

    -1.82 13.53 3.55 1.54 72.99 0.39 7.78 0 0 WinF

    -0.34 13.21 3.69 1.29 72.61 0.57 8.22 0 0 WinF

    -0.58 13.27 3.62 1.24 73.08 0.55 8.07 0 0 WinF

    -2.04 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 WinF

    -0.57 13.3 3.6 1.14 73.09 0.58 8.17 0 0 WinF

    -0.44 13.15 3.61 1.05 73.24 0.57 8.24 0 0 WinF

    1.18 14.04 3.58 1.37 72.08 0.56 8.3 0 0 WinF

    -0.45 13 3.6 1.36 72.99 0.57 8.4 0 0.11 WinF

    -2.29 12.72 3.46 1.56 73.2 0.67 8.09 0 0.24 WinF

    -0.37 12.8 3.66 1.27 73.01 0.6 8.56 0 0 WinF

    -2.11 12.88 3.43 1.4 73.28 0.69 8.05 0 0.24 WinF

    -0.52 12.86 3.56 1.27 73.21 0.54 8.38 0 0.17 WinF

    -0.37 12.61 3.59 1.31 73.29 0.58 8.5 0 0 WinF

    -0.39 12.81 3.54 1.23 73.24 0.58 8.39 0 0 WinF

    -0.16 12.68 3.67 1.16 73.11 0.61 8.7 0 0 WinF

    3.96 14.36 3.85 0.89 71.36 0.15 9.15 0 0 WinF

    1.11 13.9 3.73 1.18 72.12 0.06 8.89 0 0 WinF

    -0.65 13.02 3.54 1.69 72.73 0.54 8.44 0 0.07 WinF


  • 8/11/2019 Notes_7 IE

    70/134

Mathematical Definition of K-Class Neural Network Model

for m = 1, 2, . . ., M, (same as before)

H_m = \frac{\exp(\alpha_{m,0} + \alpha_{m,1}x_1 + \cdots + \alpha_{m,k}x_k)}{1 + \exp(\alpha_{m,0} + \alpha_{m,1}x_1 + \cdots + \alpha_{m,k}x_k)}

for l = 1, 2, . . ., K, (multinomial logistic model)

\Pr(Y = l \mid \mathbf{x}) = \frac{\exp(\beta_{l,0} + \beta_{l,1}H_1 + \cdots + \beta_{l,M}H_M)}{\sum_{j=1}^{K}\exp(\beta_{j,0} + \beta_{j,1}H_1 + \cdots + \beta_{j,M}H_M)}

Note: For K = 2, this reduces to:

\Pr(Y = 1 \mid \mathbf{x}) = \frac{\exp(\beta_0 + \beta_1 H_1 + \cdots + \beta_M H_M)}{1 + \exp(\beta_0 + \beta_1 H_1 + \cdots + \beta_M H_M)}

  • 8/11/2019 Notes_7 IE

    71/134

    Fitting A Neural Network Model for Classification

1)-3) The first three steps are the same as before

4) For classification, software estimates parameters to minimize (negative log-likelihood + shrinkage penalty):

-\sum_{i=1}^{n}\sum_{l=1}^{K} y_{i,l}\,\log \widehat{\Pr}(Y = l \mid \mathbf{x}_i) \;+\; \lambda\left[\sum_{\text{all } m}\sum_{\text{all } j}\alpha_{m,j}^2 + \sum_{\text{all } l}\sum_{\text{all } m}\beta_{l,m}^2\right]

5) CV should be used to choose any tuning parameters (λ, number of nodes, etc.)

  • 8/11/2019 Notes_7 IE

    72/134

Fitting a Neural Net Classifier for the FGL Data (binary response case)

#############Fit a neural network classification model to the FGL1 data######

library(nnet)
fgl.nn1
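(The fit was cut off. A sketch; size and decay are assumptions, and the variable names come from the read-in sketch a few slides back.)

fgl.nn1 <- nnet(type.bin ~ . - type, data = FGL1, size = 10, decay = 0.1, maxit = 1000, trace = FALSE)
phat <- predict(fgl.nn1)                      # predicted Pr(type.bin = "Win")
y <- as.numeric(FGL1$type.bin == "Win")
plot(phat, jitter(y, 0.05))                   # the plot shown on the next slide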

  • 8/11/2019 Notes_7 IE

    73/134

    response vs. predicted probability for fgl data

[Figure: jittered binary response y vs. predicted probability phat for the fgl data]

  • 8/11/2019 Notes_7 IE

    74/134

    Using CV to Compare Models for the FGL Data

    Ind
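(The CV code was cut off. A sketch comparing the CV misclassification rate of the neural net with a logistic regression on the same partition; tuning values are assumptions.)

K <- 10
Ind <- CVInd(nrow(FGL1), K)
pred.nn <- pred.lr <- rep(NA, nrow(FGL1))
for (k in 1:K) {
  train <- FGL1[-Ind[[k]], ]; test <- FGL1[Ind[[k]], ]
  fit.nn <- nnet(type.bin ~ . - type, data = train, size = 10, decay = 0.1, maxit = 1000, trace = FALSE)
  fit.lr <- glm(type.bin ~ . - type, data = train, family = binomial)
  pred.nn[Ind[[k]]] <- ifelse(predict(fit.nn, test) > 0.5, "Win", "Other")
  pred.lr[Ind[[k]]] <- ifelse(predict(fit.lr, test, type = "response") > 0.5, "Win", "Other")
}
c(NN = mean(pred.nn != FGL1$type.bin), LR = mean(pred.lr != FGL1$type.bin))   # CV misclassification rates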

  • 8/11/2019 Notes_7 IE

    75/134

    Discussion Points and Questions

What is the best neural network model, in terms of the tuning parameters (decay, size, etc.)?

What is the best CV misclassification rate? Is this good?

What other model(s) would you compare to the best neural network?

  • 8/11/2019 Notes_7 IE

    76/134

    Classification for the 6-Class FGL Response

    #############Same, but use the original 6-category response######

library(nnet)
fgl.nn1
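(Cut off. A sketch of the 6-class fit; nnet() handles a K-category factor response by internally building the K-column 0/1 indicator matrix described two slides back.)

fgl.nn1 <- nnet(type ~ . - type.bin, data = FGL1, size = 10, decay = 0.1, maxit = 1000, trace = FALSE)
phat <- predict(fgl.nn1)                    # n x 6 matrix of predicted class probabilities
yhat <- predict(fgl.nn1, type = "class")    # predicted class labels
table(FGL1$type, yhat)                      # training confusion matrix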

  • 8/11/2019 Notes_7 IE

    77/134

    Neural Network Classification of Income Data

Reconsider the data in adult_train.csv

Instead of predicting the number of hours (regression), we will now predict the binary income categorization (<=50k vs. >50k) using the other predictor variables

Recall that for the entire sample, roughly 75% are <=50k

  • 8/11/2019 Notes_7 IE

    78/134

    Read the Data and Fit Models

    XX
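(The code was cut off. A sketch that re-reads the data and fits both classifiers; dropping the survey weight fnlwgt and the tuning values are assumptions.)

XX <- read.csv("adult_train.csv", stringsAsFactors = TRUE)
Inc.nn2 <- nnet(income ~ . - fnlwgt, data = XX, size = 10, decay = 0.1, maxit = 300, trace = FALSE)
Inc.lr <- glm(income ~ . - fnlwgt, data = XX, family = binomial)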

  • 8/11/2019 Notes_7 IE

    79/134

    Discussion Points and Questions

Which model (neural network or logistic regression) appears to be better? How good does it appear?

  • 8/11/2019 Notes_7 IE

    80/134

    Pros and Cons of Neural Networks

Pros:

very flexible; with enough nodes, can model almost any nonlinear relationship

can efficiently model linear behavior if the relationship is truly linear

often very good predictive power

Cons:

model fitting can be unstable and sensitive to initial guesses

for very large data sets, model fitting can be very slow relative to some methods like trees and linear models, which makes CV very computationally expensive

overfitting (but can avoid by using CV to choose λ)

sensitive to user-chosen "tuning parameters" (but can use CV to choose them wisely)

poor interpretability

  • 8/11/2019 Notes_7 IE

    81/134

Classification and Regression Tree (CART) Models

Perhaps the single most widely used generic nonlinear modeling method

Very simple idea and very interpretable models

They usually do not have the best predictive power, but they serve as the basis for many more advanced supervised learning methods (e.g., boosting, random forests) that have excellent predictive power

As with neural networks (and most of the methods we will cover), you can use tree models for either regression or classification. We will start with regression.

  • 8/11/2019 Notes_7 IE

    82/134

    Structure of a Regression Tree

A final fitted CART model divides the predictor (x) space by successively splitting into rectangular regions and models the response (Y) as constant over each region

This can be schematically represented as a "tree":

each interior node of the tree indicates on which predictor variable you split and where you split

each terminal node (aka leaf) represents one region and indicates the value of the predicted response in that region

The following slide illustrates a fitted tree model for an example from the KNN text (Figure 11.12), in which the objective is to predict college GPA (the response) as a function of HS rank and ACT score (two predictors)

To use a fitted CART for prediction, you start at the root node and follow the splitting rules down to a leaf

  • 8/11/2019 Notes_7 IE

    83/134

[Figure: fitted regression tree for the college GPA example (KNN Fig. 11.12)]

  • 8/11/2019 Notes_7 IE

    84/134

    Mathematical Representation of Regression Tree

Can still view the tree model as Y = g(x; θ) + ε, where:

g(\mathbf{x}; \theta) = \sum_{m=1}^{M} c_m\, I(\mathbf{x}\in R_m)

M = total number of regions (terminal nodes)

R_m = mth region

I(x ∈ R_m) = indicator function = 1 if x ∈ R_m; 0 if x ∉ R_m

c_m = constant predictor over R_m

θ = all parameters and structure (M, splits defining the R_m's, the c_m's, etc.)

Note that for \mathbf{x}_i \in R_j:\quad g(\mathbf{x}_i; \theta) = \sum_{m=1}^{M} c_m\, I(\mathbf{x}_i\in R_m) = c_j

  • 8/11/2019 Notes_7 IE

    85/134

    Discussion Points and Questions

What kind of functional x-Y relationships can you capture with a regression tree model structure?

Can a regression tree represent a linear relationship? Can it represent a linear relationship as efficiently as a neural network?

Which type of model (neural network or regression tree) is more interpretable?

Which type of model (neural network or regression tree) is easier to fit?

Given a set of regions, how would you estimate the coefficients {c_m: m = 1, 2, . . ., M}?

  • 8/11/2019 Notes_7 IE

    86/134

    Fitting a Regression Tree

A CART model is fit using an array of training data structured just like in regression (one response column and many predictor columns)

Fitting the model entails growing the tree one node at a time (see next slide for an example)

At each step, the single best next split (which predictor and where to split) is the one that gives the biggest reduction in SSE (a sketch of this search for a single predictor follows below)

The fitted or predicted response over any region is simply the average response over that region. The errors used to calculate the SSE are the response values minus the fitted values.

Stop splitting when the reduction in SSE with the next split is below a specified threshold, all node sizes are below a threshold, etc.

Most algorithms overfit then prune back branches

After fitting a CART model, software spits out the final fitted tree, which can be used for prediction/interpretation

  • 8/11/2019 Notes_7 IE

    87/134


  • 8/11/2019 Notes_7 IE

    88/134

    SSE is Calculated as Follows

For a given set of splits:

c_m = \operatorname{ave}(y_i \mid \mathbf{x}_i \in R_m) = \frac{1}{N_m}\sum_{\mathbf{x}_i\in R_m} y_i

N_m = \#\{\mathbf{x}_i \in R_m\} = "size" of mth terminal node (region)

SSE = \sum_{m=1}^{M}\sum_{\mathbf{x}_i\in R_m}(y_i - c_m)^2

Note that for \mathbf{x}_i \in R_j:\quad y_i - g(\mathbf{x}_i;\theta) = y_i - \sum_{m=1}^{M} c_m\, I(\mathbf{x}_i\in R_m) = y_i - c_j

  • 8/11/2019 Notes_7 IE

    89/134

    Pruning

Pruning a branch means that you collapse one of the internal nodes into a single terminal node

Pruning the tree means that you prune a number of branches

Pruning algorithms in software will usually optimally prune back a tree in a manner that minimizes SSE + λM, where M and SSE are for the pruned tree. The best value for λ is determined via CV

There is a nice computational trick ("weakest link pruning") that allows this optimal pruning to be done very fast. See HTF for further discussion.

  • 8/11/2019 Notes_7 IE

    90/134

    Regression Tree Ex. (cpus data)

    #do not have to standardize or transform predictors to fit trees

library(tree)
control = tree.control(nobs=nrow(CPUS), mincut = 5, minsize = 10, mindev = 0.002)

    #default is mindev = 0.01, which only gives a 10-node tree

    cpus.tr
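(The tree fit was cut off. A sketch that grows an over-large tree and then examines CV deviance vs. size; using log10(perf) as the response is an assumption based on the earlier transformation.)

cpus.tr <- tree(log10(perf) ~ syct + mmin + mmax + cach + chmin + chmax, data = CPUS, control = control)
plot(cpus.tr); text(cpus.tr)
cpus.cv <- cv.tree(cpus.tr)     # CV deviance vs. size (and vs. the cost-complexity parameter k)
plot(cpus.cv)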

  • 8/11/2019 Notes_7 IE

    91/134

[Figures: deviance vs. tree size (top axis shows the corresponding cost-complexity values k, i.e., λ, from 24.00 down to 0.22) from cv.tree() for the cpus regression tree, sizes 2-14]

  • 8/11/2019 Notes_7 IE

    92/134

    Discussion Points and Questions

What is the best size tree for the CPUS example?

Provide an interpretation of which predictor variables are most important

Do there appear to be any interactions between mmax and cach?

Why must minsize be at least twice mincut?

The "deviance" measure that is plotted versus tree size is −2 log f(y, θ). Why does this correspond to the SSE for a nonlinear regression model with normal errors?

  • 8/11/2019 Notes_7 IE

    93/134

    Classification Trees Overview

Fitting and using classification trees with a K-category response is similar to fitting and using regression trees.

For classification trees, we model p_k(x) = Pr{Y = k | x} (k = 1, 2, . . ., K) as constant over each region

Compare to regression trees, for which we model g(x; θ) = E[Y | x] as constant over each region

At each step in the fitting algorithm, the best next split is the one that most reduces some criterion measuring the impurity within the regions

  • 8/11/2019 Notes_7 IE

    94/134

    Classification Trees Some Details

In the region R_m, the fitted class probabilities and best class prediction are:

\hat p_{m,k} = \frac{1}{N_m}\sum_{\mathbf{x}_i\in R_m} I(y_i = k)   (class-k sample fraction in region R_m)

\hat k_m = \arg\max_k\, \hat p_{m,k}   (most common class in region R_m)

Some common impurity measures:

Misclassification error: \sum_{m=1}^{M}\sum_{\mathbf{x}_i\in R_m} I(y_i \ne \hat k_m) = \sum_{m=1}^{M} N_m\bigl(1 - \hat p_{m,\hat k_m}\bigr)

Gini index: \sum_{m=1}^{M} N_m \sum_{k=1}^{K}\hat p_{m,k}\bigl(1 - \hat p_{m,k}\bigr)

deviance: -\sum_{m=1}^{M} N_m \sum_{k=1}^{K}\hat p_{m,k}\log \hat p_{m,k}   (negative log-likelihood)

  • 8/11/2019 Notes_7 IE

    95/134

    Example Illustrating the Notation

Suppose you have K = 4 classes, and the predictors for N_m = 100 training cases fall into a particular region R_m. For those 100 cases, suppose we have the following breakdown of the number of cases with response value that fell into the four categories:

Class, k    # obsvns with Y in Class k
1           10
2           20
3           65
4           5

What is p̂_{m,k} for k = 1, 2, 3, 4?

What is k̂_m?

  • 8/11/2019 Notes_7 IE

    96/134

K = 2 Class Example Illustrating Notation and Splitting Based on Impurity

In the following, where would the first split that minimizes the misclassification rate be, and what would the p̂_{m,k} and k̂_m be?

[Figure: plot of the class labels y_i (1 or 2) versus a single predictor x_i]

  • 8/11/2019 Notes_7 IE

    97/134

Classification Tree Ex. (fgl data)

library(tree)
control = tree.control(nobs=nrow(FGL), mincut = 5, minsize = 10, mindev = 0.005)
#default is mindev = 0.01, which only gives a 10-node tree

    fgl.tr
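(Cut off. A sketch for the binary Win/Other response created earlier:)

fgl.tr <- tree(type.bin ~ . - type, data = FGL1, control = control)
plot(fgl.tr); text(fgl.tr)
fgl.cv <- cv.tree(fgl.tr)       # CV deviance vs. size
plot(fgl.cv)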

  • 8/11/2019 Notes_7 IE

    98/134

[Figures: deviance vs. tree size from cv.tree() for the binary fgl tree, and the pruned 3-leaf classification tree with splits on Mg < 2.695 and RI < 6.22, leaves labeled Other, Win, Win]

    Discussion Points and Questions

  • 8/11/2019 Notes_7 IE

    99/134

What is the best tree size for the FGL data?

Which predictors appear to be the most important, and what are their effect(s)?

If you want a summary measure of the predictive quality of the tree model, what would it be?

If you wanted to decide whether the neural network is better than the tree for predicting glass type, how would you do this?

How can you tell what impurity measure R used to fit the model?


  • 8/11/2019 Notes_7 IE

    100/134

Same, but for the original 6-category response

    control = tree.control(nobs=nrow(FGL), mincut = 5, minsize = 10, mindev = 0.005)

    #default is mindev = 0.01, which only gives a 10-node tree

    fgl.tr
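(Cut off. A sketch, same as the binary case but with the original 6-category response:)

fgl.tr <- tree(type ~ . - type.bin, data = FGL1, control = control)
plot(fgl.tr); text(fgl.tr)
plot(cv.tree(fgl.tr))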

  • 8/11/2019 Notes_7 IE

    101/134

    Discussion Points and Questions

  • 8/11/2019 Notes_7 IE

    102/134

You choose the best size the same way: choosing the size with the lowest CV deviance (the best M was about 7)

Which predictors appear to be the most important, and does this seem to agree with the best predictors for the 2-class tree when we chose M = 3?

If you want a summary measure of the predictive quality for the 6-class tree model, what would it be?

  • 8/11/2019 Notes_7 IE

    103/134

    Numerical Assessment of Variable Importance

  • 8/11/2019 Notes_7 IE

    104/134

For a visual assessment of the importance of each predictor in a tree, inspect the tree graph (the importance of x_j is reflected by how many times it appears in internal nodes, how close they are to the root node, and the length of the branch for that split if using type = "p" in plot.tree):

##Replot 6-class FGL tree with branch lengths proportional to reduction in impurity##

fgl.tr1
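(The replot command was cut off. A sketch; pruning to 7 terminal nodes reflects the CV-chosen size mentioned on the earlier discussion slide.)

fgl.tr1 <- prune.tree(fgl.tr, best = 7)
plot(fgl.tr1, type = "proportional"); text(fgl.tr1)   # branch lengths proportional to impurity reduction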

  • 8/11/2019 Notes_7 IE

    105/134

    Use CV to compare trees with any other model

In the previous regression tree example (cpus data), the best value of the complexity parameter was λ = 0.4, which translated to M = 11 terminal nodes.

To compare a regression tree with a neural network model we would:

Form a random CV partition (e.g., using the CVInd function)

Compute the CV SSE for a neural network model with 10 hidden layer nodes and λ = 0.05 (which CV earlier said was roughly the best value)

Compute the CV SSE for a regression tree model with either λ = 0.4 or M = 11, using the same partition

Repeat the previous 3 steps as many times as you can, averaging the results, and select the best model as the one with the lower average CV SSE

    Return to the Income Data Example

  • 8/11/2019 Notes_7 IE

    106/134

    Reconsider the data in adult_train.csv

    Before, we fit a neural network model for regression, predicting thenumber of hours per week worked. And we also fit a neural networkfor classification, predicting the binary income categorization ( 50k)

    Here, we will fit similar regression and classification models, but

    using trees instead of neural networks

  • 8/11/2019 Notes_7 IE

    107/134

    The Best-sized Regression Tree for INCOME Data

  • 8/11/2019 Notes_7 IE

    108/134

[Figures: CV deviance vs. tree size (1 to 50) for the income regression tree, and the fitted tree itself: splits involve age (18.5, 22.5, 63.5, 64.5), education.num < 9.5, relationship, sex, occupation, workclass, and income, with leaf predictions for hours.per.week ranging from about 23.1 to 55.5]

    Discussion Points and Questions

  • 8/11/2019 Notes_7 IE

    109/134

When we include native.country, we get an error because no more than 32 categories are allowed for a categorical predictor. If you really wanted to include native.country, how would you handle this?

Relative to the CPUS example, would you increase mincut, minsize, and mindev, or decrease them?

As always, you should deliberately overgrow the tree and then prune it back. How do you know if you have overgrown the tree?

Comparing the tree to the neural network:

Which was faster to fit?

    Which had better predictive quality, and how can you tell?

    Which was easier to interpret?

    How about for comparing a tree to a linear regression?

    Try a Classification Tree

  • 8/11/2019 Notes_7 IE

    110/134

    control = tree.control(nobs=nrow(INCOME), mincut = 20, minsize = 40, mindev = 0.0005)

    #default is mindev = 0.01, which only gives a 10-node tree

    Inc.tr
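(The tree fit was cut off. A sketch; the INCOME data frame presumably drops native.country (which has more than 32 levels) as discussed on the earlier slide, and dropping fnlwgt is an assumption.)

INCOME <- XX[, names(XX) != "native.country"]
Inc.tr <- tree(income ~ . - fnlwgt, data = INCOME, control = control)
plot(Inc.tr); text(Inc.tr)
plot(cv.tree(Inc.tr))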

  • 8/11/2019 Notes_7 IE

    111/134

    K-Nearest Neighbors

  • 8/11/2019 Notes_7 IE

    112/134

A generic nonlinear modeling tool that is extremely flexible

Perhaps the simplest modeling idea of all

For simple data sets with large n, small k, and no categorical predictors, almost as widely used as CART

Based on the name, can you guess how K-nearest neighbors works?

    Structure of 1-Nearest Neighbors (for regression)

  • 8/11/2019 Notes_7 IE

    113/134

You need a set of training data {y_i, x_i: i = 1, 2, . . ., n}, but you do not fit a model.

For 1-Nearest Neighbors, to predict Y for a new case with predictors x:

find the x_i in your training set that is the closest neighbor to x

then take the predicted Y to be the response value for that training observation

    Illustration of K-NN for Gas Mileage data

  • 8/11/2019 Notes_7 IE

    114/134

    library(scatterplot3d)

    library(rgl)

    GAS
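(The read-in was cut off. A sketch; the file name and the exact plotting calls are assumptions.)

GAS <- read.table("gas_mileage.txt", header = TRUE)
GAS[, c("Displacement", "Rear_axle_ratio")] <- scale(GAS[, c("Displacement", "Rear_axle_ratio")])
scatterplot3d(GAS$Displacement, GAS$Rear_axle_ratio, GAS$Mpg)
plot(GAS$Displacement, GAS$Rear_axle_ratio)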

  • 8/11/2019 Notes_7 IE

    115/134

[Figures: 3D scatterplot of Mpg vs. standardized Displacement and Rear_axle_ratio, and a 2D scatterplot of Rear_axle_ratio vs. Displacement]

    Calculating Distances to find Nearest Neighbors

  • 8/11/2019 Notes_7 IE

    116/134

If we want to predict Y(x) for a new case with predictors x, the distance between x and the predictors x_i for the ith training case (i = 1, 2, . . ., n) is measured via

d(\mathbf{x}, \mathbf{x}_i) = \sqrt{(\mathbf{x}-\mathbf{x}_i)^T(\mathbf{x}-\mathbf{x}_i)} = \sqrt{\sum_{j=1}^{k}(x_j - x_{ij})^2}

For 1-nearest neighbor, the prediction of Y(x) is

\hat y(\mathbf{x}) = y_{i_1(\mathbf{x})}

where i_1(x) = index of closest neighbor of x

    Discussion Points and Questions

  • 8/11/2019 Notes_7 IE

    117/134

You should always standardize your predictors as a first step. Why?

How do you handle categorical predictors?

For 1-nearest neighbor, what would a plot of ŷ(x) versus x look like for the gas mileage example with only Displacement as a predictor?

[Figure: scatterplot of Mpg vs. standardized Displacement]

Structure of K-Nearest Neighbors (for regression)

  • 8/11/2019 Notes_7 IE

    118/134

More generally, for K-nearest neighbors, you use exactly the same procedure, except you:

find the K closest training x_i's to x, and

then take the predicted Y to be the average response value for these K training observations:

\hat y(\mathbf{x}) = \frac{1}{K}\sum_{l=1}^{K} y_{i_l(\mathbf{x})}

where {i_1(x), i_2(x), . . ., i_K(x)} = indices of the K closest neighbors of x

The tradeoff of using large vs. small K is exactly the classic bias/variance tradeoff

    Large Versus Small K (single predictor example)

  • 8/11/2019 Notes_7 IE

    119/134

    Why is the predictor in the left plot high variance and low bias?

    Why is the predictor in the right plot low variance and high bias?

[Figures: fitted K-NN regression curves of Mpg vs. standardized Displacement for K = 1 (left) and K = 20 (right)]

    Bias and Variance of K-Nearest Neighbors

  • 8/11/2019 Notes_7 IE

    120/134

Assume the true relationship: Y = g(x, θ) + ε, with fixed training x's and ε ~ i.i.d. (0, σ²) (not necessarily normal)

The predictor for fixed x is:

\hat y(\mathbf{x}) = \frac{1}{K}\sum_{l=1}^{K} y_{i_l(\mathbf{x})} = \frac{1}{K}\sum_{l=1}^{K} g\bigl(\mathbf{x}_{i_l(\mathbf{x})};\theta\bigr) + \frac{1}{K}\sum_{l=1}^{K}\varepsilon_{i_l(\mathbf{x})}

MSE = E\bigl[(Y - \hat y(\mathbf{x}))^2\bigr] = \sigma^2 + \mathrm{Var}[\hat y(\mathbf{x})] + \mathrm{Bias}^2[\hat y(\mathbf{x})]

where

\mathrm{Var}[\hat y(\mathbf{x})] = \frac{\sigma^2}{K}

\mathrm{Bias}[\hat y(\mathbf{x})] = E[\hat y(\mathbf{x})] - E[Y(\mathbf{x})] = \frac{1}{K}\sum_{l=1}^{K} g\bigl(\mathbf{x}_{i_l(\mathbf{x})};\theta\bigr) - g(\mathbf{x};\theta)

\mathrm{Var}[Y - \hat y(\mathbf{x})] = \sigma^2 + \frac{\sigma^2}{K} = \sigma^2\,\frac{K+1}{K}

    Another Example of Large Vs. Small K

  • 8/11/2019 Notes_7 IE

    121/134

This is a classification example from HTF with two response categories (blue or orange in the figures) and two predictors. The following scatterplots are x1 vs. x2, also showing the decision boundaries for the K-nearest neighbors classifiers with K = 15 and K = 1

    K-NN for CPUS data

  • 8/11/2019 Notes_7 IE

    122/134

    library(yaImpute)

    CPUS
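(Cut off. A sketch of K-NN training predictions for the CPUS data using yaImpute's ann() nearest-neighbor search; note that each training point's closest "neighbor" is itself, so these are optimistic training fits.)

Xc <- as.matrix(CPUS1[, 1:6]); yc <- CPUS1$perf
K <- 6
nn <- ann(Xc, Xc, K, verbose = FALSE)
fit <- apply(nn$knnIndexDist[, 1:K, drop = FALSE], 1, function(idx) mean(yc[idx]))
plot(fit, yc)    # training fit, as in the next slide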

  • 8/11/2019 Notes_7 IE

    123/134

[Figures: actual vs. fitted response (ytest vs. fit) for the K-NN training fits with K = 2 (left) and K = 6 (right)]

    CV to Choose the Best K

    Nrep

  • 8/11/2019 Notes_7 IE

    124/134

    K
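(The CV code, starting with Nrep, was cut off. A simpler sketch of the same idea using leave-one-out CV over a grid of K values; the self-match is dropped by asking for K+1 neighbors.)

Kvals <- 1:20
cv.sse <- sapply(Kvals, function(K) {
  nn <- ann(Xc, Xc, K + 1, verbose = FALSE)
  ind <- nn$knnIndexDist[, 2:(K + 1), drop = FALSE]   # drop the first (self) neighbor
  yhat <- apply(ind, 1, function(idx) mean(yc[idx]))
  sum((yc - yhat)^2)
})
Kvals[which.min(cv.sse)]   # best K by CV SSE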

  • 8/11/2019 Notes_7 IE

    125/134

[Figures: actual vs. predicted response (y vs. yhat) from n-fold CV for K = 2 (left) and K = 6 (right)]

    Discussion Points and Questions

  • 8/11/2019 Notes_7 IE

    126/134

Did the 6-nearest neighbors method do better than the best neural network model or the linear regression model?

How can you tell which predictors have the largest effect on the response in the K-nearest neighbors model?

    What are the parameters of the fitted model?

    K-Nearest Neighbors for Classification

  • 8/11/2019 Notes_7 IE

    127/134

Like CART models, it is straightforward to use nearest neighbors for classification.

For binary classification, to predict Pr{Y = 1 | x} for a new case, find the K nearest neighbors as before, and take the predicted Pr{Y = 1 | x} to be the fraction of the K nearest neighbors having y = 1 responses

If we have more than two response categories, we take the predicted probability for each category to be the fraction of nearest neighbors with response values belonging to that category.

    K-NN for FGL data

  • 8/11/2019 Notes_7 IE

    128/134

    FGL
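(Cut off. A sketch of K-NN classification for the binary fgl response, with the predicted probability taken to be the fraction of "Win" neighbors:)

Xf <- as.matrix(FGL1[, 1:9]); yf <- as.numeric(FGL1$type.bin == "Win")
K <- 10
nn <- ann(Xf, Xf, K, verbose = FALSE)
phat <- apply(nn$knnIndexDist[, 1:K], 1, function(idx) mean(yf[idx]))
plot(phat, jitter(yf, amount = 0.05))   # the plot on the next slide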

  • 8/11/2019 Notes_7 IE

    129/134

[Figure: K = 10 training fit for the binary fgl response: jittered indicator of "Win" vs. predicted probability phat]

    Discussion Points and Questions

  • 8/11/2019 Notes_7 IE

    130/134

In the preceding, what was the training misclassification rate? What would happen to the training misclassification rate if we decreased K? What would it be for K = 1?

To find the best K for K-nearest neighbors for classification problems, you must use CV, just like for any other method.

What CV measure would you use to find the best K for the FGL data?

Is finding the CV errors for K-nearest neighbors substantially more computationally expensive than finding the training errors, like it is for all of the other methods we have covered?

Pros and Cons of Nearest Neighbors

  • 8/11/2019 Notes_7 IE

    131/134

Pros:

The most flexible of all: can represent any nonlinear relationship (as long as you have sufficiently large n).

Easy to use. No model fitting required

Cons:

There is no real fitted model (nor even any indication of which predictors are most important), so not suitable for interpretation or explanatory purposes.

Because there is no fitted model, you need to retain all the training data to predict.

With large k (the number of predictors), you need very large n, because neighbors get further away in higher dimensions.

For most supervised learning methods, large n increases the computational expense for training, but not for new case prediction. Large n is more problematic for nearest neighbors, because the "training" occurs for every new case prediction

With very large n, we need computational tricks (e.g., tree-based methods) to efficiently search for nearest neighbors.

Not well suited for categorical predictors

Effect of dimension (k) on distance between neighbors

  • 8/11/2019 Notes_7 IE

    132/134

[Figures for k = 1, k = 2, and k = 3: a dotplot of Displacement alone, a scatterplot of Rear_axle_ratio vs. Displacement, and a 3D scatterplot of Comp_ratio vs. Rear_axle_ratio vs. Displacement]

Software Implementation in Matlab (in case you want to know)

  • 8/11/2019 Notes_7 IE

    133/134

Neural Networks: Neural Networks toolbox (IE computer lab does NOT have the NN toolbox)

CART: CLASSREGTREE (part of the Stats toolbox)

Nearest Neighbors: No model to fit. Easy to write your own script in Matlab.

Cross-Validation: CROSSVAL (part of the Stats toolbox). You must write an appropriate function call for your specific model.

Bootstrapping: BOOTSTRP (part of the Stats toolbox). You must write an appropriate function call for your specific model.

    Some Other "Data Mining" Tools

  • 8/11/2019 Notes_7 IE

    134/134

Two big categories of problems:

supervised learning (we have a response Y and predictors x and want to model Y as a function of x).

unsupervised learning (we have no Y; just an x and we want to find relationships among elements of x)

IEMS 304 covered the foundations and primary tools of supervised learning. There are many more advanced methods, but most are extensions of what we have already covered

    Examples of unsupervised learning:

    clustering

    association rules