27/02/2012
HiPEAC 2012 / OpenGPU Workshop, January 25th 2012
MetaProf: a large-scale clustering example with the TGCC CURIE HPC cluster
F. Boumezbeur (INRA), Dany Tello, Vincent Ducrot (AS+)
MICALIS (MIcrobiologie de la Chaîne ALImentaire au service de la Santé)
INRA CRJ - 78350 Jouy-en-Josas (France)
AS+ - 22 rue René Coche 92170 Vanves
The Human gut flora
Human intestinal bacterial flora contains ~1000 bacterial species. Dominant bacteria are present in high abundance.
What species are there?
Some are beneficial, but others seem to be associated with intestinal disorders such as inflammatory diseases and obesity.
To what extent?
~500 species found in the human gut are being sequenced (cultivated strains).
The majority of the gut bacteria are uncultivable.
What about uncultivable species?
The Human gut flora
« MetaHIT » project: a 3.3 million gene catalog accounting for 80% of the flora diversity was established in 2010 (Junjie Qin et al., 2010).
October 2011: the MetaHIT catalog was updated to 3.9 million genes.
(150-fold more genes than the human genome)
50% of the genes couldn't be assigned to known species.
The two catalogs contain information on uncultivable species.
But neither of them is structured in species … yet!
A successful attempt to describe the human gut flora
Manimozhiyan Arumugam et al., 2011
Quantitative metagenomics relying on known bacterial genomes
Quantitative metagenomics pipeline
Metabolomics
Ecology
Statistics
Identification
Quantification
Polymorphism
iMOMi (interactive MetaOmics Mining)
NGS data = DNA
MetaHIT gene catalogue
An avalanche of NGS data!
~500 samples analysed during the last 24 months
200,000+ files (20+ TB!)
Acquisition of 2 SOLiD 5500 machines
Over 2-fold increase in data flow!
Identification & Quantification
[Figure: human gut bacteria → high-throughput sequencing (HTS) → relative abundances per sample]
Clustering
[Figure: relative abundances per sample → classification of genes in correlated abundance groups]
The Human gut flora: input data
Relative abundance relying on the MetaHIT gene catalogue: each line is a gene count vector, each column is a sample count vector.
2D matrix of floating point values … with lots of zeros!
Quantitative metagenomics pipeline
[Figure: pipeline diagram. Labels: sequencing output (Sample 1, Sample 2, … Sample n); quality control, preliminary analysis, statistics and diagnostic for each sample i; filtered data from sample i; reads assembling; reference data bank; DNA metagenomic database; annotation; functional database (NR); mapping with references; mapping criteria (mismatch parameters); gene count matrix; statistical analysis; establishment of gene sets or species-like entities for the system under study; sets of genes; sets of unique-species genes; gene sets vs samples; usable intermediate data; iMOMi database & tools.]
[Figure: same pipeline diagram, annotated with MetaProf.]
MetaProf (Metagenomic Profiles)
A pair-wise Spearman correlation calculator
Iterative and incremental development
Literate Programming

Gene i    Gene j       Correlation coefficient
1         2            0.153642
1         3            0.252210
1         4            0.166666
…
1         3 312 399    0.899999
2         3            0.009781
…

1st MetaHIT gene catalogue:
3.3 million (3.3×10^6) genes
5500 billion (5.5×10^12) correlations
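As a small-scale illustration of what a pair-wise Spearman correlation calculator produces, here is a minimal pure-Python sketch. It is not MetaProf's code: the function names are hypothetical, and the use of average ranks for tied values (relevant because the input matrix contains many zeros) is an assumption, since the slides do not say how ties are handled.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2.0 + 1.0
        for k in order[i:j + 1]:
            ranks[k] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (nan if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return float("nan")
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def pairwise_spearman(matrix):
    """Return (gene i, gene j, coefficient) for every pair i < j, 1-based."""
    ranked = [average_ranks(row) for row in matrix]  # rank each gene once
    return [(i + 1, j + 1, pearson(ranked[i], ranked[j]))
            for i, j in combinations(range(len(ranked)), 2)]


def pair_count(n_genes):
    """Number of distinct gene pairs, n(n-1)/2."""
    return n_genes * (n_genes - 1) // 2
```

For 3.3 million genes, `pair_count(3_300_000)` gives about 5.4×10^12 pairs, the same order of magnitude as the 5500 billion correlations quoted on the slide. Ranking each gene once and reusing the ranks for every pair is the design choice that keeps the per-pair cost down to a single dot-product-like pass.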
MetaProf (Metagenomic Profiles)
MetaProf timeline (2009 to 2012)
V5.0 Sequential
V5.1 Sequential optimized
V5.2 OpenMP
V5.3 MPI/OpenMP
V6 OpenCL
V7 CUDA
V7.1 MPI/CUDA/GT200
V7.2 MPI/CUDA/GF100

MetaProf roadmap
Speedup is given for 100 000 genes and 400 samples; expected duration is given for 3M genes.

Version        Hardware platform                                        Programming model  Speedup  Expected duration
MetaProf V5.2  Single node, 2 x Intel Xeon X5650 Westmere, 6 cores      OpenMP             3.5      18 days
MetaProf V5.3  4 nodes, 4 x Intel Xeon E5450 Harpertown, 4 cores each   MPI+OpenMP         7        4 days
MetaProf V7.0  Single node, 1 x Nvidia C1060                            CUDA               9.3      3 days
MetaProf V7.1  TGCC/Titane, 192 nodes, 2 x Nvidia S1070 each            MPI+CUDA
MetaProf V7.2  TGCC/Curie, 144 nodes, 2 x Nvidia 2050 each              MPI+CUDA
To be detailed
MPI + CUDA implementation
Requirements for a faster implementation
• In most recent studies 3 300 000 genes have to be processed.
Technological challenges
• Data distribution between nodes.
• MPI load balancing.
• CUDA kernel optimization: balance between GPU latency and occupancy.
Target
• CEA TGCC hybrid clusters: Titane / Curie
MetaProf v7.1 - Input data
Each MPI process loads the entire matrix file into memory.
[Figure: text file → input matrix (genes × samples) allocated in memory]
MetaProf v7.1 - CUDA kernel
• One kernel performs the whole correlation computation.
• Input data are tiled again to fit into shared memory (tile dimensions depend on the compute capability of the targeted GPUs).
[Figure: MPI tile A and MPI tile B in global memory, sub-tiles in shared memory, result matrix; dimensions labelled "number of samples" and "size of MPI tile"]
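The tiling idea can be sketched on the CPU as follows. This is a hedged illustration, not the v7.1 kernel: it assumes each gene profile has already been rank-transformed and standardized (zero mean, unit norm) so that a correlation reduces to a dot product, and the names `standardize`, `correlate_tiles` and `chunk` are hypothetical. The slices taken along the sample axis stand in for the sub-tiles a GPU block would stage in shared memory.

```python
from math import sqrt


def standardize(row):
    """Centre a profile and scale it to unit Euclidean norm."""
    mean = sum(row) / len(row)
    centred = [v - mean for v in row]
    norm = sqrt(sum(v * v for v in centred))
    if norm == 0.0:
        raise ValueError("constant profile: correlation is undefined")
    return [v / norm for v in centred]


def correlate_tiles(tile_a, tile_b, chunk):
    """Correlate every row of tile_a with every row of tile_b.

    Rows must already be standardized, so each correlation is a dot
    product. The sample axis is walked in slices of `chunk` values and
    partial dot products are accumulated, mimicking the staging of
    sub-tiles into a small, fast memory.
    """
    n_samples = len(tile_a[0])
    result = [[0.0] * len(tile_b) for _ in tile_a]
    for start in range(0, n_samples, chunk):
        sub_a = [row[start:start + chunk] for row in tile_a]
        sub_b = [row[start:start + chunk] for row in tile_b]
        for i, a in enumerate(sub_a):
            for j, b in enumerate(sub_b):
                result[i][j] += sum(p * q for p, q in zip(a, b))
    return result
```

The result does not depend on `chunk` (up to floating-point rounding), which is what lets the slice size be chosen purely from the amount of fast memory available on the target GPU.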
MetaProf v7.1 – Load balancing
[Figure: the input matrix (genes × samples) feeds the correlation computation; the output matrix (genes × genes) is divided into regions assigned to MPI rank 0 … MPI rank n]
MetaProf v7.1 - Data compute and store
[Figure, animated over several slides: MPI process 0 runs the CUDA correlation kernel on the MPI rank 0 region of the output matrix and stores the results in the process 0 bin file]
MetaProf v7.1 - Benchmarks
• CURIE hybrid cluster:
  • 2 x Intel Westmere per node
  • 2 x Tesla 2090 GPU (Fermi - 512 CUDA cores) per node
[Figure: time in seconds vs. MPI processes (2 per node), for 1 000 000 genes & 800 samples]

MPI processes             4           8        16       32       64       128
Correlation compute time  ~1h 17 min  ~38 min  ~19 min  ~10 min  ~6 min   ~3 min
Total execution time      ~1h 23 min  ~44 min  ~25 min  ~16 min  ~12 min  ~9 min
MetaProf v7.1 - Weaknesses
• Input matrix is still a text file!
• Each MPI process loads the entire input matrix
  High memory occupation with 3.3 million genes
  Loading takes more than 50% of the total execution time with 1 million genes (when running on 128 MPI processes)
• CUDA computation is not optimal
Only one kernel, too many registers used
Some memory accesses are not coalesced
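The first two weaknesses can be checked with a little arithmetic on the v7.1 benchmark figures. The sketch below is only an estimate: the timings are the approximate values read from the benchmark chart, the time "outside the kernel" lumps together loading, output and any other overhead (so it is an upper bound on load time), and the 4 bytes per value and 800 samples used for the memory estimate are assumptions.

```python
# Approximate v7.1 timings in minutes, keyed by number of MPI processes,
# for 1 000 000 genes and 800 samples.
COMPUTE_MIN = {4: 77, 8: 38, 16: 19, 32: 10, 64: 6, 128: 3}
TOTAL_MIN = {4: 83, 8: 44, 16: 25, 32: 16, 64: 12, 128: 9}


def overhead_fraction(processes):
    """Share of the total run time not spent in the correlation kernel."""
    total = TOTAL_MIN[processes]
    return (total - COMPUTE_MIN[processes]) / total


def matrix_bytes(n_genes, n_samples, bytes_per_value=4):
    """Size of a dense genes x samples matrix (4-byte floats are an assumption)."""
    return n_genes * n_samples * bytes_per_value


if __name__ == "__main__":
    for p in sorted(TOTAL_MIN):
        share = round(100 * overhead_fraction(p))
        print(p, "processes:", share, "% of the run time outside the kernel")
    print(matrix_bytes(3_300_000, 800) / 1e9, "GB per MPI process")
```

The non-kernel time stays at roughly 6 minutes whatever the process count, so its share grows as the kernel gets faster. Under the stated assumptions, a dense copy of the 3.3 million gene matrix is on the order of 10 GB per MPI process, and there are two processes per node.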
MetaProf v7.2
• Number of genes >> number of samples: an input block can fit in texture memory, which allows use of the texture cache.
• We need a computation order that lets each process load only part of the input matrix, in order to reduce I/O and memory requirements.
MetaProf v7.2 - Load balancing
• Domain decomposition is based solely on the output matrix (upper triangular).
• For load balancing in MPI we use a divide and conquer approach.
• MPI tiles are in turn divided into CUDA blocks.
[Figure: for N processes, the upper triangle is split into three regions labelled N/4 processes, N/2 processes and N/4 processes]
[Figure series: region labels for increasing N.
For N = 1 process: MPI process 0.
For N = 2 processes: MPI process 0, MPI process 1, MPI process 0.
For N = 4 processes: MPI process 0, 1, 3, 2.
For N = 8 processes: MPI 0, 1, 3, 2, 0, 4, 5, 6, 7, 6.
For N = n processes: the output matrix of the correlation computation is divided among MPI rank 0, MPI rank 1, … MPI rank n-1.]
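One possible reading of this divide and conquer decomposition is sketched below. It is an interpretation of the figures, not the actual v7.2 code: the upper triangle is cut at its midpoint into two half-size triangles and one square, the square receives half of the ranks and each triangle a quarter, and with two ranks one rank takes both small triangles. Splitting the square into horizontal strips, restricting N to powers of two, and all function names are assumptions made for the sketch.

```python
def split_rect(r0, r1, c0, c1, ranks):
    """Cut a rectangle of the output matrix into one row strip per rank."""
    k = len(ranks)
    bounds = [r0 + (r1 - r0) * i // k for i in range(k + 1)]
    return [("rect", bounds[i], bounds[i + 1], c0, c1, ranks[i])
            for i in range(k)]


def split_triangle(lo, hi, ranks):
    """Assign the strict upper triangle over genes [lo, hi) to the ranks."""
    if len(ranks) == 1:
        return [("tri", lo, hi, lo, hi, ranks[0])]
    mid = (lo + hi) // 2
    if len(ranks) == 2:
        # One rank takes both half-size triangles, the other the square.
        return [("tri", lo, mid, lo, mid, ranks[0]),
                ("rect", lo, mid, mid, hi, ranks[1]),
                ("tri", mid, hi, mid, hi, ranks[0])]
    q = len(ranks) // 4  # N/4 ranks per triangle, N/2 for the square
    return (split_triangle(lo, mid, ranks[:q])
            + split_rect(lo, mid, mid, hi, ranks[q:len(ranks) - q])
            + split_triangle(mid, hi, ranks[len(ranks) - q:]))


def load(region):
    """Number of gene pairs (i < j) contained in a region."""
    kind, r0, r1, c0, c1, _rank = region
    if kind == "tri":
        m = r1 - r0
        return m * (m - 1) // 2
    return (r1 - r0) * (c1 - c0)


def loads_per_rank(n_genes, n_procs):
    """Total number of pairs assigned to each rank (n_procs: power of two)."""
    assert n_procs >= 1 and n_procs & (n_procs - 1) == 0, "power of two only"
    loads = {}
    for region in split_triangle(0, n_genes, list(range(n_procs))):
        loads[region[5]] = loads.get(region[5], 0) + load(region)
    return loads
```

For 8 ranks this sketch yields the same set of region labels as the N = 8 figure (ranks 0 and 6 each own two small triangles). Because a half-size triangle holds about half as many pairs as the square beside it, the per-rank loads come out nearly equal without any communication between ranks.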
MetaProf v7.2 – Data compute
[Figure: one MPI process; input blocks X and Y feed the CUDA correlation kernel, which fills the corresponding part of the output matrix]
MetaProf v7.2 – Kernel improvements
• Read memory access on 2 2D textures in GPU global memory
• Coalesced write
• 2 kernels to avoid internal sync and complex index computation

Parameter optimization

Particularity          GT200                   GF100              Consequence
Scheduling unit        Half-warp (16 threads)  Warp (32 threads)  Low-level tiling using a size of 16 or 32 in the x direction
Multiprocessor number  30                      14                 Block number must be adapted
Number of registers    16384 / block           32768 / block      Some constraints on GT200 are relaxed (easier access to high occupancy)

Parameter      GT200      GF100
Block size     16x16x1    32x8x1
Grid size      160x160    80x320
MPI tile size  2560x2560  2560x2560
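The launch parameters in the second table are mutually consistent, which a few lines of arithmetic make explicit. The sketch below only multiplies the figures from the table; the dictionary layout and function names are hypothetical, and the reading that block size times grid size should cover the MPI tile is an assumption about how the kernel maps threads to output elements.

```python
# Launch parameters copied from the table above.
LAUNCH = {
    "GT200": {"block": (16, 16, 1), "grid": (160, 160), "mpi_tile": (2560, 2560)},
    "GF100": {"block": (32, 8, 1), "grid": (80, 320), "mpi_tile": (2560, 2560)},
}


def covered_extent(block, grid):
    """Extent covered in x and y by a grid of thread blocks."""
    return (block[0] * grid[0], block[1] * grid[1])


def threads_per_block(block):
    """Total number of threads in one block."""
    return block[0] * block[1] * block[2]


if __name__ == "__main__":
    for gpu, p in LAUNCH.items():
        print(gpu, covered_extent(p["block"], p["grid"]),
              threads_per_block(p["block"]))
```

Both configurations use 256 threads per block and cover exactly 2560 x 2560; only the block shape changes, with the x dimension matching the scheduling unit of each architecture (16 for the half-warp, 32 for the warp).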
MetaProf v7.2 - Benchmarks
• CURIE hybrid cluster:
  • 2 x Intel Westmere per node
  • 2 x Tesla 2090 GPU (Fermi - 512 CUDA cores) per node
[Figure: correlation compute time for 1 000 000 genes & 800 samples; time in seconds vs. MPI processes (2 per node); series: metaprof 7.2 and metaprof 7.1]

MPI processes  8        16       32       64      128
Data labels    ~48 min  ~25 min  ~12 min  ~6 min  ~3 min
MetaProf v7.2 - Benchmarks
• CURIE hybrid cluster:
  • 2 x Intel Westmere per node
  • 2 x Tesla 2090 GPU (Fermi - 512 CUDA cores) per node
[Figure: total execution time for 1 000 000 genes & 800 samples; time in seconds vs. MPI processes (2 per node); series: metaprof 7.2 and metaprof 7.1]

MPI processes  8        16       32       64      128
Data labels    ~51 min  ~26 min  ~14 min  ~8 min  ~5 min
MetaProf v7.2 - Benchmarks
• CURIE hybrid cluster:
  • 2 x Intel Westmere per node
  • 2 x Tesla 2090 GPU (Fermi - 512 CUDA cores) per node
[Figure: time in seconds vs. MPI processes (2 per node), for 3 299 823 genes & 800 samples]

MPI processes             32          64          128
Correlation compute time  ~2h 11 min  ~1h 7 min   ~33 min
Total execution time      ~2h 19 min  ~1h 12 min  ~40 min
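Since the number of gene pairs grows quadratically with the number of genes, the 3 299 823 gene timings can be compared with what the 1 000 000 gene run would predict. The sketch below is a rough consistency check, not a performance model: it assumes the compute-time labels on the 1 000 000 gene chart (12, 6 and 3 minutes at 32, 64 and 128 processes) belong to the v7.2 series, which the transcript does not confirm, and all values are the approximate figures from the charts.

```python
def pair_count(n_genes):
    """Number of distinct gene pairs, n(n-1)/2."""
    return n_genes * (n_genes - 1) // 2


# Approximate correlation compute times in minutes, keyed by MPI process count.
COMPUTE_1M = {32: 12, 64: 6, 128: 3}      # 1 000 000 genes (series assumed to be v7.2)
COMPUTE_3M = {32: 131, 64: 67, 128: 33}   # 3 299 823 genes


def work_ratio(n_large, n_small):
    """How many times more pairs the larger problem contains."""
    return pair_count(n_large) / pair_count(n_small)


def predicted_minutes(processes):
    """Scale the 1M-gene compute time by the pair-count ratio."""
    return COMPUTE_1M[processes] * work_ratio(3_299_823, 1_000_000)


if __name__ == "__main__":
    for p in sorted(COMPUTE_3M):
        print(p, "processes: predicted", round(predicted_minutes(p)),
              "min, measured about", COMPUTE_3M[p], "min")
```

The pair count grows by a factor of about 10.9 between the two problem sizes, and under the stated assumption the scaled timings land within a few percent of the measured ones, which suggests the compute phase scales with the amount of work rather than degrading at the larger size.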
Conclusion
GPU programming level
• FERMI: more than 100 GFlops DP / GPU
• More than 10^16 operations in less than 40 min
• 20% of theoretical maximum performance, without I/O
Application level
• MetaProf GPU version applied by the INRA MICALIS team to real-life study cases
• Results to be published in 2012: species characterization
Future work
• Benchmark on OpenGPU blade provided by BULL
• Subsequent analysis pipeline should be integrated in CUDA MetaProf
Sébastien Monot Tarik Saidani Victor Arslan Benjamin Rat
Dany Tello Vincent Ducrot
Dusko Ehrlich Sean Kennedy Nicolas Pons
Nathalie Galleron Benoît Quinquis
BAC TEAM
Pierre Renault Bioinformatics Emmanuelle Le Chatellier Mathieu Almeida
Biology Christine Delorme
Eric Guédon Séverine Layec Céline Gautier
Nicolas Sanchez
Jean-Michel Batto
Pierre Léonard
Bouziane Moumen
http://www.netvibes.com/metahit#Live_News
http://twitter.com/metagenomics
http://paper.li/metahit/microbiomics