Transcript of Eric P. Xing, Michael I. Jordan, & Richard M. Karp, Apr 30th 2002

1
Feature Selection for High-Dimensional Genomic Microarray Data
Eric P. Xing, Michael I. Jordan, & Richard M. Karp
@ Division of Computer Science, UC Berkeley
Presented by Degui Zhi @ UCSD
Apr 30th 2002
2
Microarray biology in a slide

• Human genome contains ~30K genes
• Not all genes are expressed at the same time
• A microarray is a systematic way to test the expression levels of thousands of genes in a single experiment
  – A "snapshot" or a state vector of a cell/tissue
• Multiple microarrays: compare expression levels under different conditions:
  – Binary (normal vs. diseased; treated vs. untreated)
  – Multi-class (different types of cancers, populations)
  – Continuous (time course; dose response)
3
Experiments: the leukemia dataset
• Data are drawn from Golub et al. 1999.
• 7130 genes in a microarray (6817 in Golub’s)
Leukemia type    Training set    Test set    Total
ALL              27 (71%)        20 (59%)    47
AML              11              14          25
Total            38              34          72

• Training set: same tissue, same age, and same lab
• Test set: different tissue, different age, and different lab
• Data matrix: 7130 genes × 72 experiments
4
Prediction by Golub et al.

• Feature selection: rank genes by Pearson correlation to the class label
  – 1100 genes show significant correlation
  – The 50 genes with the highest correlation are used for classification
• Classification: simple linear classifier, sign(w·x)
  – Training: cross-validation success for 36 of the 38 samples
  – Test: success for 29 of the 34 independent samples
  – The remaining samples are uncertain due to lack of significance
  – Predictors built from the top 10-200 genes can all be trained to make no mistakes
5
Results of Golub et al.

[Prediction plots: Training set | Independent test set]
6
Machine learning in microarrays

3 inter-connected questions
• Feature selection
– Eliminate genes that are irrelevant or redundant
• Clustering genes
– Group genes that are expressed together
• Classification
– Predict gene classes – classify columns
– Predict tissue types – classify rows
• More ambitious: genetic network modeling
7
Feature selection

• The concept to learn is F → {0, 1}
• |F| is too large, so we want to find a small but informative subset G⊆F and learn G →{ 0,1} instead
2 popular approaches for feature selection

• Wrapper
  1. Find a feature subset G
  2. Optimize the classifier C for G; measure the error ε(C(G))
  3. Find G = argmin ε(C(G))
• Filter
  Find a feature subset G independent of any classifier C
8
Heuristics for feature selection

Biological knowledge (assumption)           Feature selection filter
Gene expression is 'on' or 'off'            Testing of bimodal distribution
Not all genes respond to a single event     Ranking by information gain
Genes are highly redundant                  Filter using Markov blanket
9
Outline of procedure

All features
  → mixture-of-Gaussians test
  → rank by information gain
  → filter using Markov blanket    (these three steps: feature selection)
  → selected features
  → classification
10
Outline of procedure

Two paths from all features to classification:
• Feature selection: mixture-of-Gaussians test → rank by information gain → filter using Markov blanket → selected features → classification
• Regularization: all features, weighted → classification
11
Feature modeling: bimodal

• Heuristic: a feature with discriminative power should have a bimodal distribution
• A simple bimodal model: a mixture of 2 univariate Gaussians

[Histograms over normalized expression level with fitted mixtures]
12
Gaussian mixtures

• For a feature F, we have measurements f = {f_1, …, f_N}
• Model: a mixture of K univariate Gaussians with parameters θ = {(μ_k, σ_k, π_k), 1 ≤ k ≤ K}, where π_k is the class prior
• The likelihood of f_n under the k-th Gaussian is

  P(f_n | μ_k, σ_k) = (1 / (√(2π) σ_k)) exp( −(f_n − μ_k)² / (2σ_k²) )

  and the posterior responsibility is

  P(k | f_n, θ) ∝ π_k P(f_n | μ_k, σ_k)

• Learn θ from the sample data using EM
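The EM fit sketched above can be written in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function name, the min/max initialization of the means, and the fixed iteration count are my choices.

```python
import math

def em_gaussian_mixture(data, n_iter=100):
    """EM for a mixture of two univariate Gaussians (illustrative sketch).

    Returns (pis, mus, sigmas): priors pi_k, means mu_k, std devs sigma_k.
    Initializing the means at the min/max of the data is a simplification.
    """
    n = len(data)
    mus = [min(data), max(data)]
    mean = sum(data) / n
    sigma0 = math.sqrt(sum((x - mean) ** 2 for x in data) / n) or 1.0
    sigmas = [sigma0, sigma0]
    pis = [0.5, 0.5]

    def pdf(x, mu, sigma):
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    for _ in range(n_iter):
        # E-step: responsibilities P(k | f_n, theta) ∝ pi_k * P(f_n | mu_k, sigma_k)
        resp = []
        for x in data:
            w = [pis[k] * pdf(x, mus[k], sigmas[k]) for k in range(2)]
            s = sum(w)
            resp.append([wk / s for wk in w])
        # M-step: re-estimate priors, means, and variances from responsibilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pis[k] = nk / n
            mus[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigmas[k] = math.sqrt(var) or 1e-6
    return pis, mus, sigmas
```

On clearly bimodal data the two recovered means converge to the two cluster means.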
13
Mixture overlap

• The mixture overlap ε is the minimal error achievable by any classifier h(·) on this EM-trained Gaussian mixture model
• ε can be used as a measure of the discriminability of feature F
• h(·) can be used to quantize the continuous value f_i, which is needed by the later filters based on information theory
14
Feature selection via information gain

• For a reference partition Q = {S_0, S_1} of the training set S
• Entropy of this partition:

  H(Q) = − Σ_{c∈{0,1}} P(S_c) log P(S_c)

• A test on feature F induces a partition E = {E_0, E_1}
• Partition Q projected onto E_k forms the subpartition Q_k = {S_c ∩ E_k, c ∈ {0,1}}
• Entropy of Q_k:

  H(Q_k) = − Σ_{c∈{0,1}} P(S_c | E_k) log P(S_c | E_k)
15
Ranking by information gain

• The information gain due to F w.r.t. the reference partition is

  I(Q | E) = H(Q) − H(Q | E),  where  H(Q | E) = Σ_{k=1..K} P(E_k) H(Q_k)

[Worked example with colored partitions: H(Q) = −2 · (1/2) · log(1/2) = 1; H(Q|E) = (1/2)·H(Q_0) + (1/2)·H(Q_1) = 0.9188; a pure subpartition has H = 0. Features are then ranked by infogain.]
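The entropy and infogain quantities above are straightforward to compute from empirical counts. A short sketch (function names are mine, not the paper's; log base 2 matches the H = 1 bit example):

```python
import math
from collections import Counter

def entropy(labels):
    """H(Q) = -sum_c P(S_c) log2 P(S_c), from empirical class frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_bins):
    """I(Q|E) = H(Q) - sum_k P(E_k) H(Q_k), where E is the partition of
    the samples induced by the (quantized) feature values."""
    n = len(labels)
    groups = {}
    for y, e in zip(labels, feature_bins):
        groups.setdefault(e, []).append(y)
    h_cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - h_cond
```

A perfectly predictive binary feature gains the full entropy H(Q) = 1 bit; a split that leaves each subpartition as mixed as the whole gains 0.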
16
Redundant features

Let F be the full feature set and G ⊆ F. Let C be the class label.
• A feature F_i is redundant in G if the classification results are the same with or without it.
• That is, conditional independence:

  P(C | G − {F_i}) = P(C | G)

  for all values of the features in G.
• To be more precise: if there is a subset M ⊆ G with F_i ∉ M, but

  P(C | M ∪ {F_i}) = P(C | M)

• then M is called a Markov blanket of F_i
17
Markov blanket filtering (Koller & Sahami)

For a complete feature set F, let G be a subset of F, and G′ = G − {F_i}. If M ⊆ G is a Markov blanket of F_i, then

  Δ(G′, F) = Δ(G, F)

where Δ is any divergence function between 2 pdfs. If Δ is the expected KL divergence,

  Δ(G, F) = E{ D( P(C | F) ‖ P(C | G) ) }

Key point: once we find a Markov blanket of F_i in G, we can safely remove F_i from G without increasing the divergence from P(C | F).

Filtering: iteratively remove a feature if it has a Markov blanket.
18
Approximate Markov blankets

Practically, we only search for Markov blankets of limited size. It is still expensive to find an exact Markov blanket.

Observation: if M is really a Markov blanket for F_i, then for any feature value f_i,

  D( P(C | M, F_i = f_i) ‖ P(C | M) ) = 0

Heuristic: find M so that the following quantity (expected KL divergence) is small:

  δ(F_i | M) = E_{f_i}{ D( P(C | M, F_i = f_i) ‖ P(C | M) ) }
19
Approximate MB algorithm

Initialize:
  G = F
Iterate:
  For each feature F_i ∈ G, let M(F_i) be the set of k features F_j ∈ G − {F_i} having the highest correlation with F_i
  Compute δ(F_i | M(F_i)) for each F_i ∈ G
  Choose F_i = argmin_F δ(F_i | M(F_i))
  Update G := G − {F_i}
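A toy sketch of this loop over already-quantized feature columns. All names, the empirical estimation of the KL terms, and the tie-breaking are mine; the paper runs this on the expression values quantized by the mixture model:

```python
import math
from collections import Counter, defaultdict

def delta(labels, columns, i, blanket):
    """Empirical delta(F_i | M) = E[ D( P(C|M,F_i) || P(C|M) ) ]
    over discrete feature columns (lists of quantized values)."""
    n = len(labels)
    by_m = defaultdict(list)
    for idx, y in enumerate(labels):
        m_val = tuple(columns[j][idx] for j in blanket)
        by_m[m_val].append((columns[i][idx], y))
    total = 0.0
    for pairs in by_m.values():
        p_c_m = Counter(y for _, y in pairs)       # counts of C given M = m
        by_fi = defaultdict(list)
        for fi, y in pairs:
            by_fi[fi].append(y)
        for ys in by_fi.values():
            p_c_mf = Counter(ys)                   # counts of C given M = m, F_i = f
            weight = len(ys) / n                   # P(M = m, F_i = f)
            for c, cnt in p_c_mf.items():
                p1 = cnt / len(ys)
                p2 = p_c_m[c] / len(pairs)
                total += weight * p1 * math.log2(p1 / p2)
    return total

def mb_filter(labels, columns, k=1, keep=2):
    """Iteratively drop the feature whose size-k candidate blanket
    (its most correlated surviving features) makes it most redundant."""
    def corr(a, b):  # Pearson correlation between two columns
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        va = math.sqrt(sum((x - ma) ** 2 for x in a))
        vb = math.sqrt(sum((y - mb) ** 2 for y in b))
        return cov / (va * vb) if va and vb else 0.0
    G = list(range(len(columns)))
    while len(G) > keep:
        scored = []
        for fi in G:
            others = sorted((f for f in G if f != fi),
                            key=lambda f: -abs(corr(columns[fi], columns[f])))
            scored.append((delta(labels, columns, fi, others[:k]), fi))
        G.remove(min(scored)[1])
    return G
```

A feature that is an exact copy of another has delta 0 given that copy as its blanket, so duplicates are removed first.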
20
Classification algorithms

3 classifiers are applied after filtering:
• Multivariate Gaussian classifier
• Logistic regression
• K nearest neighbor
21
Gaussian classifier

• A Gaussian classifier is a generative classifier assuming the data are distributed as a mixture of C class-conditional Gaussians
• The model θ consists of a prior probability π_c for each class c, with class-conditional density N(μ_c, Σ_c)
• For the binary case (C = 2), the ratio of posterior probabilities is

  r = P(y=1 | x, θ) / P(y=0 | x, θ)
    = [ π_1 |Σ_1|^(−1/2) exp{ −(x − μ_1)′ Σ_1⁻¹ (x − μ_1) / 2 } ] / [ π_0 |Σ_0|^(−1/2) exp{ −(x − μ_0)′ Σ_0⁻¹ (x − μ_0) / 2 } ]

  so that

  log r = −(1/2) x′ (Σ_1⁻¹ − Σ_0⁻¹) x + β′x + γ

  where β and γ are functions of the model parameters π_c, μ_c, Σ_c.
22
Gaussian classifier: decision boundary

• The decision boundary is a quadratic surface in the feature space

[Two plots over [−50, 50] × [−50, 50] showing quadratic decision boundaries]
23
Logistic regression

• Logistic regression is a discriminative classifier. The parameter θ is a weight vector for x:

  p(y = 1 | x, θ) = 1 / (1 + exp{ −θ′x })

• Geometrically, this classifier corresponds to a sigmoid-shaped ramp across the decision hyperplane.
• θ can be estimated by stochastic gradient ascent:

  θ := θ + ρ (y_n − ŷ_n) x_n,  where  ŷ_n = 1 / (1 + exp{ −θ′x_n })
24
Regularization vs. feature selection

Problem: too many features implies too much complexity in the hypothesis space

Solutions:
• Feature selection: reduces the number of features
• Regularization: constrains the norm of the parameters

Trade-offs:
• Regularization can cope with overfitting
• Feature selection is easier to compute and interpret
25
Regularization vs. feature selection

In a maximum likelihood setting, learning θ given data set D:
• Without regularization,

  θ̂ = argmax_θ ℓ(θ | D)

• With regularization,

  θ̂ = argmax_θ { ℓ(θ | D) − λ ‖θ‖ }

  where ‖θ‖ is an L1 or L2 norm and λ is the regularization parameter
• In case of the L2 norm, we obtain the stochastic gradient update for the parameter:

  θ := θ + ρ ( (y_n − ŷ_n) x_n − λθ )
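The plain and L2-regularized stochastic-gradient updates for logistic regression can be sketched together. The step size ρ, penalty λ, and epoch count here are arbitrary illustrative values, not from the paper:

```python
import math

def train_logistic_sgd(X, y, rho=0.1, lam=0.0, epochs=200):
    """Stochastic gradient ascent for logistic regression.
    With lam > 0 this is the L2-regularized update
    theta := theta + rho * ((y_n - yhat_n) * x_n - lam * theta)."""
    d = len(X[0])
    theta = [0.0] * d
    for _ in range(epochs):
        for xn, yn in zip(X, y):
            z = sum(t * xj for t, xj in zip(theta, xn))
            yhat = 1.0 / (1.0 + math.exp(-z))
            theta = [t + rho * ((yn - yhat) * xj - lam * t)
                     for t, xj in zip(theta, xn)]
    return theta

def predict(theta, x):
    """Classify by the sign of theta'x."""
    z = sum(t * xj for t, xj in zip(theta, x))
    return 1 if z > 0 else 0
```

On separable data the unregularized weights keep growing with training, while the penalty pulls the regularized weights toward a smaller norm.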
26
Filtering result: mixture overlap
27
Filtering result: information gain
Most informative
28
δ of top genes by infogain
Most redundant
29
Classification result

Best performance with feature selection (# errors):

Classifier   |G|   Training   Test
kNN          40    0          0
Gaussian     12    0          1
LR           8     0          0

[Plot annotation: 3 errors]
30
Leave-one-out cross validation

• The best number of features to use for a classifier is chosen by minimizing the leave-one-out cross-validation error
• Choosing the smallest |G| that gives 0 LOOCV error, the test error rates are:

Classifier            Test error
kNN                   5.9% (2/34)
Logistic Regression   0
Gaussian              8.8% (3/34)
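Leave-one-out cross validation itself is simple to sketch: hold out each sample in turn, train on the rest, and count misclassifications. This illustrative version uses a kNN classifier as a stand-in:

```python
def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    votes = [train_y[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

def loocv_error(X, y, k=3):
    """Leave-one-out cross validation: hold out each sample in turn."""
    errors = 0
    for i in range(len(X)):
        tx = X[:i] + X[i + 1:]
        ty = y[:i] + y[i + 1:]
        if knn_predict(tx, ty, X[i], k) != y[i]:
            errors += 1
    return errors / len(X)
```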
31
Filter I vs. Filter II

• Leave-one-out cross-validation error for logistic regression with each filter alone:
  – Filter II only: 4 errors
  – Filter I only: 1 error
32
Feature selection vs. regularization
• Gaussian classifier using all features with L1 or L2 penalties.
33
Feature selection vs. regularization
• Gaussian classifier using all features with L1 or L2 penalties.
34
Summary and thoughts

• Microarray data pose a challenge to machine learning: many features, few samples
• Feature selection or regularization is preferred before learning
• A series of filter-based feature selection methods grounded in information theory is proposed in this paper
• The experimental design could be better if the authors:
  – Put more focus on feature selection
  – Compared with others' work
  – Used an information-theory-based classifier
  – Used redundant features for cross validation