Transcript of Eric P. Xing, Michael I. Jordan, & Richard M. Karp, Apr 30th 2002

1
Feature Selection for High-Dimensional Genomic Microarray Data
Eric P. Xing, Michael I. Jordan, & Richard M. Karp
@ Division of Computer Science, UC Berkeley
Presented by Degui Zhi @ UCSD
Apr 30th 2002
2
Microarray biology in a slide

• Human genome contains ~30K genes
• Not all genes are expressed at the same time
• A microarray is a systematic way to test the expression levels of thousands of genes in a single experiment
  – A "snapshot" or a state vector of a cell/tissue
• Multiple microarrays: compare expression levels under different conditions:
  – Binary (normal vs. diseased; treated vs. untreated)
  – Multi-class (different types of cancers, populations)
  – Continuous (time course; dose response)
3
Experiments: the leukemia dataset
• Data are drawn from Golub et al. 1999.
• 7130 genes in a microarray (6817 in Golub’s)
Leukemia type    Training set    Test set    Total
ALL              27 (71%)        20 (59%)    47
AML              11              14          25
Total            38              34          72

• Training set: same tissue, same age, and same lab
• Test set: different tissue, different age, and different lab
• Data matrix: 7130 genes × 72 experiments
4
Prediction by Golub et al.

• Feature selection: rank genes by Pearson correlation to the class label
  – 1100 genes show significant correlation
  – The 50 genes with the highest correlation are used for classification
• Classification: simple linear classifier, sign(w·x)
  – Training: cross-validation success for 36 of the 38 samples
  – Test: success for 29 of the 34 independent samples
  – The remaining samples are uncertain due to lack of significance
  – Predictors built from the top 10-200 genes can all be trained to make no mistakes
5
Results of Golub et al.

[Prediction plots: Training set | Independent test set]
6
Machine learning in microarrays

3 inter-connected questions
• Feature selection
– Eliminate genes that are irrelevant or redundant
• Clustering genes
– Group genes that are expressed together
• Classification
– Predict gene classes – classify columns
– Predict tissue types – classify rows
• More ambitious: genetic network modeling
7
Feature selection

• The concept to learn is F → {0, 1}
• |F| is too large, so we want to find a small but informative subset G⊆F and learn G →{ 0,1} instead
2 popular approaches for feature selection

• Wrapper
  1. Find a feature subset G
  2. Optimize the classifier C for G; measure the error ε(C(G))
  3. Find G = argmin ε(C(G))
• Filter
  Find a feature subset G independent of any classifier C
8
Heuristics for feature selection

Biological knowledge (assumption)           Feature selection filter
Gene expression is 'on' or 'off'            Testing of bimodal distribution
Not all genes respond to a single event     Ranking by information gain
Genes are highly redundant                  Filter using Markov blanket
9
Outline of procedure

All features
  → mixture-of-Gaussians test
  → rank by information gain
  → filter using Markov blanket    (these three steps: feature selection)
  → selected features
  → classification
10
Outline of procedure

Two paths from all features to classification:
• Feature selection: mixture-of-Gaussians test → rank by information gain → filter using Markov blanket → selected features → classification
• Regularization: all features, weighted → classification
11
Feature modeling: bimodal

• Heuristic: a feature with discriminative power should have a bimodal distribution
• A simple bimodal model: a mixture of 2 univariate Gaussians

[Histograms over normalized expression level with fitted mixtures]
12
Gaussian mixtures

• For a feature F, we have measurements f = {f_1, …, f_N}
• Model: a mixture of K univariate Gaussians with parameters θ = {(μ_k, σ_k, π_k), 1 ≤ k ≤ K}, where π_k is the class prior
• The likelihood of f_n under the k-th Gaussian is

  P(f_n | μ_k, σ_k) = (1 / (√(2π) σ_k)) exp( −(f_n − μ_k)² / (2σ_k²) )

  and the posterior responsibility is

  P(k | f_n, θ) ∝ π_k P(f_n | μ_k, σ_k)

• Learn θ from the sample data using EM
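The EM fit sketched above can be written in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function name, the min/max initialization of the means, and the fixed iteration count are my choices.

```python
import math

def em_gaussian_mixture(data, n_iter=100):
    """EM for a mixture of two univariate Gaussians (illustrative sketch).

    Returns (pis, mus, sigmas): priors pi_k, means mu_k, std devs sigma_k.
    Initializing the means at the min/max of the data is a simplification.
    """
    n = len(data)
    mus = [min(data), max(data)]
    mean = sum(data) / n
    sigma0 = math.sqrt(sum((x - mean) ** 2 for x in data) / n) or 1.0
    sigmas = [sigma0, sigma0]
    pis = [0.5, 0.5]

    def pdf(x, mu, sigma):
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    for _ in range(n_iter):
        # E-step: responsibilities P(k | f_n, theta) ∝ pi_k * P(f_n | mu_k, sigma_k)
        resp = []
        for x in data:
            w = [pis[k] * pdf(x, mus[k], sigmas[k]) for k in range(2)]
            s = sum(w)
            resp.append([wk / s for wk in w])
        # M-step: re-estimate priors, means, and variances from responsibilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pis[k] = nk / n
            mus[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigmas[k] = math.sqrt(var) or 1e-6
    return pis, mus, sigmas
```

On clearly bimodal data the two recovered means converge to the two cluster means.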
13
Mixture overlap

• The mixture overlap ε is the minimal error achievable by any classifier h(·) on this EM-trained Gaussian mixture model
• ε can be used as a measure of the discriminability of feature F
• h(·) can be used to quantize the continuous value f_i, which is needed by the later filters based on information theory
14
Feature selection via information gain

• For a reference partition Q = {S_0, S_1} of the training set S
• Entropy of this partition:

  H(Q) = − Σ_{c∈{0,1}} P(S_c) log P(S_c)

• A test on feature F induces a partition E = {E_0, E_1}
• Partition Q projected onto E_k forms the subpartition Q_k = {S_c ∩ E_k, c ∈ {0,1}}
• Entropy of Q_k:

  H(Q_k) = − Σ_{c∈{0,1}} P(S_c | E_k) log P(S_c | E_k)
15
Ranking by information gain

• The information gain due to F w.r.t. the reference partition is

  I(Q | E) = H(Q) − H(Q | E),  where  H(Q | E) = Σ_{k=1..K} P(E_k) H(Q_k)

[Worked example with colored partitions: H(Q) = −2 · (1/2) · log(1/2) = 1; H(Q|E) = (1/2)·H(Q_0) + (1/2)·H(Q_1) = 0.9188; a pure subpartition has H = 0. Features are then ranked by infogain.]
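The entropy and infogain quantities above are straightforward to compute from empirical counts. A short sketch (function names are mine, not the paper's; log base 2 matches the H = 1 bit example):

```python
import math
from collections import Counter

def entropy(labels):
    """H(Q) = -sum_c P(S_c) log2 P(S_c), from empirical class frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_bins):
    """I(Q|E) = H(Q) - sum_k P(E_k) H(Q_k), where E is the partition of
    the samples induced by the (quantized) feature values."""
    n = len(labels)
    groups = {}
    for y, e in zip(labels, feature_bins):
        groups.setdefault(e, []).append(y)
    h_cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - h_cond
```

A perfectly predictive binary feature gains the full entropy H(Q) = 1 bit; a split that leaves each subpartition as mixed as the whole gains 0.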
16
Redundant features

Let F be the full feature set and G ⊆ F. Let C be the class label.
• A feature F_i is redundant in G if the classification results are the same with or without it.
• That is, conditional independence:

  P(C | G − {F_i}) = P(C | G)

  for all values of the features in G.
• To be more precise: if there is a subset M ⊆ G with F_i ∉ M, but

  P(C | M ∪ {F_i}) = P(C | M)

• then M is called a Markov blanket of F_i
17
Markov blanket filtering (Koller & Sahami)

For a complete feature set F, let G be a subset of F, and G′ = G − {F_i}. If M ⊆ G is a Markov blanket of F_i, then

  Δ(G′, F) = Δ(G, F)

where Δ is any divergence function between 2 pdfs. If Δ is the expected KL divergence,

  Δ(G, F) = E{ D( P(C | F) ‖ P(C | G) ) }

Key point: once we find a Markov blanket of F_i in G, we can safely remove F_i from G without increasing the divergence from P(C | F).

Filtering: iteratively remove a feature if it has a Markov blanket.
18
Approximate Markov blankets

Practically, we only search for Markov blankets of limited size. It is still expensive to find an exact Markov blanket.

Observation: if M is really a Markov blanket for F_i, then for any feature value f_i,

  D( P(C | M, F_i = f_i) ‖ P(C | M) ) = 0

Heuristic: find M so that the following quantity (expected KL divergence) is small:

  δ(F_i | M) = E_{f_i}{ D( P(C | M, F_i = f_i) ‖ P(C | M) ) }
19
Approximate MB algorithm

Initialize:
  G = F
Iterate:
  For each feature F_i ∈ G, let M(F_i) be the set of k features F_j ∈ G − {F_i} having the highest correlation with F_i
  Compute δ(F_i | M(F_i)) for each F_i ∈ G
  Choose F_i = argmin_F δ(F_i | M(F_i))
  Update G := G − {F_i}
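A toy sketch of this loop over already-quantized feature columns. All names, the empirical estimation of the KL terms, and the tie-breaking are mine; the paper runs this on the expression values quantized by the mixture model:

```python
import math
from collections import Counter, defaultdict

def delta(labels, columns, i, blanket):
    """Empirical delta(F_i | M) = E[ D( P(C|M,F_i) || P(C|M) ) ]
    over discrete feature columns (lists of quantized values)."""
    n = len(labels)
    by_m = defaultdict(list)
    for idx, y in enumerate(labels):
        m_val = tuple(columns[j][idx] for j in blanket)
        by_m[m_val].append((columns[i][idx], y))
    total = 0.0
    for pairs in by_m.values():
        p_c_m = Counter(y for _, y in pairs)       # counts of C given M = m
        by_fi = defaultdict(list)
        for fi, y in pairs:
            by_fi[fi].append(y)
        for ys in by_fi.values():
            p_c_mf = Counter(ys)                   # counts of C given M = m, F_i = f
            weight = len(ys) / n                   # P(M = m, F_i = f)
            for c, cnt in p_c_mf.items():
                p1 = cnt / len(ys)
                p2 = p_c_m[c] / len(pairs)
                total += weight * p1 * math.log2(p1 / p2)
    return total

def mb_filter(labels, columns, k=1, keep=2):
    """Iteratively drop the feature whose size-k candidate blanket
    (its most correlated surviving features) makes it most redundant."""
    def corr(a, b):  # Pearson correlation between two columns
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        va = math.sqrt(sum((x - ma) ** 2 for x in a))
        vb = math.sqrt(sum((y - mb) ** 2 for y in b))
        return cov / (va * vb) if va and vb else 0.0
    G = list(range(len(columns)))
    while len(G) > keep:
        scored = []
        for fi in G:
            others = sorted((f for f in G if f != fi),
                            key=lambda f: -abs(corr(columns[fi], columns[f])))
            scored.append((delta(labels, columns, fi, others[:k]), fi))
        G.remove(min(scored)[1])
    return G
```

A feature that is an exact copy of another has delta 0 given that copy as its blanket, so duplicates are removed first.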
20
Classification algorithms

3 classifiers are applied after filtering:
• Multivariate Gaussian classifier
• Logistic regression
• K nearest neighbor
21
Gaussian classifier

• A Gaussian classifier is a generative classifier assuming the data are distributed as a mixture of C class-conditional Gaussians
• The model θ consists of a prior probability π_c for each class c, with class-conditional density N(μ_c, Σ_c)
• For the binary case (C = 2), the ratio of posterior probabilities is

  r = P(y=1 | x, θ) / P(y=0 | x, θ)
    = [ π_1 |Σ_1|^(−1/2) exp{ −(x − μ_1)′ Σ_1⁻¹ (x − μ_1) / 2 } ] / [ π_0 |Σ_0|^(−1/2) exp{ −(x − μ_0)′ Σ_0⁻¹ (x − μ_0) / 2 } ]

  so that

  log r = −(1/2) x′ (Σ_1⁻¹ − Σ_0⁻¹) x + β′x + γ

  where β and γ are functions of the model parameters π_c, μ_c, Σ_c.
22
Gaussian classifier: decision boundary

• The decision boundary is a quadratic surface in the feature space

[Two plots over [−50, 50] × [−50, 50] showing quadratic decision boundaries]
23
Logistic regression

• Logistic regression is a discriminative classifier. The parameter θ is a weight vector for x:

  p(y = 1 | x, θ) = 1 / (1 + exp{ −θ′x })

• Geometrically, this classifier corresponds to a sigmoid-shaped ramp across the decision hyperplane.
• θ can be estimated by stochastic gradient ascent:

  θ := θ + ρ (y_n − ŷ_n) x_n,  where  ŷ_n = 1 / (1 + exp{ −θ′x_n })
24
Regularization vs. feature selection

Problem: too many features implies too much complexity in the hypothesis space

Solutions:
• Feature selection: reduces the number of features
• Regularization: constrains the norm of the parameters

Trade-offs:
• Regularization can cope with overfitting
• Feature selection is easier to compute and interpret
25
Regularization vs. feature selection

In a maximum likelihood setting, learning θ given data set D:
• Without regularization,

  θ̂ = argmax_θ ℓ(θ | D)

• With regularization,

  θ̂ = argmax_θ { ℓ(θ | D) − λ ‖θ‖ }

  where ‖θ‖ is an L1 or L2 norm and λ is the regularization parameter
• In case of the L2 norm, we obtain the stochastic gradient update for the parameter:

  θ := θ + ρ ( (y_n − ŷ_n) x_n − λθ )
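The plain and L2-regularized stochastic-gradient updates for logistic regression can be sketched together. The step size ρ, penalty λ, and epoch count here are arbitrary illustrative values, not from the paper:

```python
import math

def train_logistic_sgd(X, y, rho=0.1, lam=0.0, epochs=200):
    """Stochastic gradient ascent for logistic regression.
    With lam > 0 this is the L2-regularized update
    theta := theta + rho * ((y_n - yhat_n) * x_n - lam * theta)."""
    d = len(X[0])
    theta = [0.0] * d
    for _ in range(epochs):
        for xn, yn in zip(X, y):
            z = sum(t * xj for t, xj in zip(theta, xn))
            yhat = 1.0 / (1.0 + math.exp(-z))
            theta = [t + rho * ((yn - yhat) * xj - lam * t)
                     for t, xj in zip(theta, xn)]
    return theta

def predict(theta, x):
    """Classify by the sign of theta'x."""
    z = sum(t * xj for t, xj in zip(theta, x))
    return 1 if z > 0 else 0
```

On separable data the unregularized weights keep growing with training, while the penalty pulls the regularized weights toward a smaller norm.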
26
Filtering result: mixture overlap
27
Filtering result: information gain
Most informative
28
δ of top genes by infogain
Most redundant
29
Classification result

Best performance with feature selection (# errors):

Classifier   |G|   Training   Test
kNN          40    0          0
Gaussian     12    0          1
LR           8     0          0

[Plot annotation: 3 errors]
30
Leave-one-out cross validation

• The best number of features to use for a classifier is chosen by minimizing the leave-one-out cross-validation error
• Choosing the smallest |G| that gives 0 LOOCV error, the test error rates are:

Classifier            Test error
kNN                   5.9% (2/34)
Logistic Regression   0
Gaussian              8.8% (3/34)
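Leave-one-out cross validation itself is simple to sketch: hold out each sample in turn, train on the rest, and count misclassifications. This illustrative version uses a kNN classifier as a stand-in:

```python
def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    votes = [train_y[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

def loocv_error(X, y, k=3):
    """Leave-one-out cross validation: hold out each sample in turn."""
    errors = 0
    for i in range(len(X)):
        tx = X[:i] + X[i + 1:]
        ty = y[:i] + y[i + 1:]
        if knn_predict(tx, ty, X[i], k) != y[i]:
            errors += 1
    return errors / len(X)
```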
31
Filter I vs. Filter II

• Leave-one-out cross-validation error for logistic regression with each filter alone:
  – Filter II only: 4 errors
  – Filter I only: 1 error
32
Feature selection vs. regularization
• Gaussian classifier using all features with L1 or L2 penalties.
33
Feature selection vs. regularization
• Gaussian classifier using all features with L1 or L2 penalties.
34
Summary and thoughts

• Microarray data pose a challenge to machine learning: many features, few samples
• Feature selection or regularization is preferred before learning
• A series of filter-based feature selection methods grounded in information theory is proposed in this paper
• The experimental design could be better if the authors:
  – Put more focus on feature selection
  – Compared with others' work
  – Used an information-theory-based classifier
  – Used redundant features for cross validation