The Human gut flora - opengpu.netopengpu.net/EN/attachments/154_HiPEAC2012_OpenGPU_INRA-AS+… ·...

27/02/2012


HiPEAC 2012 / OpenGPU Workshop, January 25th 2012

MetaProf: a large scale clustering example with the TGCC CURIE HPC cluster

F. Boumezbeur (INRA), Dany Tello, Vincent Ducrot (AS+)

MICALIS (MIcrobiologie de la Chaîne ALImentaire au service de la Santé)

INRA CRJ - 78350 Jouy-en-Josas (France)

AS+ - 22 rue René Coche 92170 Vanves

The Human gut flora

Human intestinal bacterial flora contains ~1000 bacterial species. Dominant bacteria are present in high abundance.

What species are there? Some are beneficial, but others seem to be associated with intestinal disorders such as inflammatory diseases and obesity.

To what extent?

~500 species found in the human gut are being sequenced (cultivated strains).

The majority of the gut bacteria are uncultivable.

What about uncultivable species?


The Human gut flora

« MetaHIT » project: a 3.3 million gene catalog accounting for 80% of the flora diversity was established in 2010 (Junjie Qin et al., 2010).

October 2011: the MetaHIT catalog has been updated to 3.9 million genes (150-fold more genes than the human genome).

50% of the genes couldn't be assigned to known species.

The two catalogs contain information on uncultivable species.

But neither of them is structured in species … yet!

A successful attempt to describe the human gut flora

Manimozhiyan Arumugam et al., 2011

Quantitative metagenomics relying on known bacterial genomes


Quantitative metagenomics pipeline

Metabolomics

Ecology

Statistics

Identification

Quantification

Polymorphism

iMOMi (interactive MetaOmics Mining)

NGS data = DNA

MetaHIT gene catalogue

~500 samples analysed during the last 24 months

200,000+ files (20+ TB!)

Acquisition of 2 SOLiD 5500 machines

Over 2-fold increase in data flow!

An avalanche of NGS data!

Identification & Quantification

[Figure: human gut bacteria → high-throughput sequencing (HTS) → relative abundances per sample.]

Clustering

[Figure: human gut bacteria → relative abundances per sample → classification of genes in correlated abundance groups.]

The Human gut flora: input data

Relative abundance relying on the MetaHIT gene catalogue: each line is a gene count vector, each column is a sample count vector.

2D matrix of floating point values … with lots of zeros!
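Because most entries are zero, a matrix of this shape can be held compactly by storing only its non-zero values. The following is a minimal sketch with an invented 3-gene by 4-sample matrix; the deck does not say how MetaProf stores its input, so the names `abundance`, `to_sparse` and `zero_fraction` are illustrative assumptions.

```python
# Toy gene x sample abundance matrix: one row per gene, one column per sample.
# The values are made up for illustration only.
abundance = [
    [0.0, 0.0, 1.5, 0.0],   # gene 0
    [0.0, 0.0, 0.0, 0.0],   # gene 1 (absent from every sample)
    [2.0, 0.0, 0.0, 0.5],   # gene 2
]


def to_sparse(matrix):
    """Keep only the non-zero entries as {(gene, sample): value}."""
    return {
        (g, s): value
        for g, row in enumerate(matrix)
        for s, value in enumerate(row)
        if value != 0.0
    }


def zero_fraction(matrix):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0.0)
    return zeros / total


sparse = to_sparse(abundance)
print(sparse)                    # {(0, 2): 1.5, (2, 0): 2.0, (2, 3): 0.5}
print(zero_fraction(abundance))  # 0.75
```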



[Figure: quantitative metagenomics analysis workflow. Labels recoverable from the diagram: sequencing output (Sample 1, Sample 2, …, Sample n); quality control and statistics for each sample i; preliminary analysis; filtered data from sample i; diagnostic; reference data bank; reads assembling; DNA metagenomic database; annotation; functional database (NR); mapping with references; mapping criteria (mismatch parameters); gene count matrix; statistical analysis; establishment of gene sets or species-like entities for the system under study; sets of genes; sets of unique-species genes; gene sets vs samples; usable intermediate data; iMOMi database & tools.]



MetaProf

Iterative and incremental development

Literate Programming

MetaProf (Metagenomic Profiles): a pair-wise Spearman correlation calculator

Gene i | Gene j    | Correlation Coefficient
1      | 2         | 0.153642
1      | 3         | 0.252210
1      | 4         | 0.166666
…
1      | 3 312 399 | 0.899999
2      | 3         | 0.009781
…

1st MetaHIT gene catalogue:

3.3 million (3.3x10^6) genes

5500 billion (5.5x10^12) correlations
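The pair-wise computation above can be illustrated with a small pure-Python sketch. It ranks each gene profile (ties get the mean rank), then takes the Pearson correlation of the ranks, which is the usual definition of Spearman's coefficient. The function names and the toy profiles are assumptions for illustration, as is the choice to report 0.0 for a constant profile; this is not MetaProf's code.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: a constant profile is reported as 0
    return cov / sqrt(vx * vy)


def pairwise_spearman(profiles):
    """Yield (gene_i, gene_j, rho) for every i < j, numbered from 1 as on the slide."""
    ranked = [average_ranks(p) for p in profiles]
    for i, j in combinations(range(len(ranked)), 2):
        yield i + 1, j + 1, pearson(ranked[i], ranked[j])


profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 45.0],  # same ordering as gene 1
    [4.0, 3.0, 2.0, 1.0],      # reversed ordering
]
results = list(pairwise_spearman(profiles))

# Number of distinct gene pairs for a 3.3 million gene catalogue.
n_genes = 3_300_000
n_pairs = n_genes * (n_genes - 1) // 2
print(n_pairs)  # 5444998350000, i.e. about 5.4x10^12
```

The pair count grows quadratically with the number of genes, which is why the slide's figure is in the order of 10^12 correlations.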

MetaProf (Metagenomic Profiles)

MetaProf timeline, 2009 to 2012:

V5.0 Sequential
V5.1 Sequential optimized
V5.2 OpenMP
V5.3 MPI/OpenMP
V6 OpenCL
V7 CUDA
V7.1 MPI/CUDA/GT200
V7.2 MPI/CUDA/GF100

MetaProf roadmap

Version | Hardware platform | Programming model | Speedup for 100 000 genes, 400 samples | Expected duration for 3M genes
MetaProf V5.2 | Single node, 2 x Intel Xeon X5650 (Westmere, 6 cores) | OpenMP | 3.5 | 18 days
MetaProf V5.3 | 4 nodes, 4 x Intel Xeon E5450 (Harpertown, 4 cores each) | MPI+OpenMP | 7 | 4 days
MetaProf V7.0 | Single node, 1 x Nvidia C1060 | CUDA | 9.3 | 3 days
MetaProf V7.1 | TGCC/Titane, 192 nodes, 2 x Nvidia S1070 each | MPI+CUDA | |
MetaProf V7.2 | TGCC/Curie, 144 nodes, 2 x Nvidia 2050 each | MPI+CUDA | |

To be detailed


MPI + CUDA implementation

Requirements for a faster implementation:

• In most recent studies 3 300 000 genes have to be processed.

Technological challenges:

• Data distribution between nodes.

• MPI load balancing.

• CUDA kernel optimization: balance between GPU latency and occupancy.

Target:

• CEA TGCC hybrid clusters: Titane / Curie

MetaProf v7.1 - Input data

Each MPI process loads the entire matrix file into memory.

[Figure: text file → input matrix (genes × samples) allocated in memory.]

MetaProf v7.1 - CUDA kernel

• One kernel performs the whole correlation computation

• Input data are tiled again to fit into shared memory (tile dimensions depend on the compute capability of the targeted GPUs)

[Figure labels: MPI tile A, MPI tile B, global memory, shared memory, number of samples, size of MPI tile, result matrix.]
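The tiling idea can be sketched on the CPU: two tiles of gene profiles are correlated by walking the sample axis in small chunks and accumulating partial dot products, much as a kernel would stage sub-tiles in shared memory. This is only a Python analogue under stated assumptions (rows are pre-standardized so that correlation becomes a dot product; `chunk` stands in for the shared-memory tile width), not the CUDA kernel from the talk. For Spearman's coefficient the rows would be replaced by their ranks first.

```python
from math import sqrt


def standardize(row):
    """Shift to zero mean and scale to unit norm, so that the Pearson
    correlation of two rows becomes a plain dot product."""
    n = len(row)
    mean = sum(row) / n
    centered = [v - mean for v in row]
    norm = sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered] if norm else centered


def tile_correlations(tile_a, tile_b, chunk):
    """Correlate every row of tile_a with every row of tile_b.

    The sample axis is processed `chunk` columns at a time, and each
    output cell accumulates the partial dot products.
    """
    a = [standardize(r) for r in tile_a]
    b = [standardize(r) for r in tile_b]
    n_samples = len(a[0])
    result = [[0.0] * len(b) for _ in a]
    for start in range(0, n_samples, chunk):
        stop = min(start + chunk, n_samples)
        sub_a = [r[start:stop] for r in a]   # staged sub-tile of A
        sub_b = [r[start:stop] for r in b]   # staged sub-tile of B
        for i, ra in enumerate(sub_a):
            for j, rb in enumerate(sub_b):
                result[i][j] += sum(x * y for x, y in zip(ra, rb))
    return result


tile_a = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]
tile_b = [[2.0, 4.0, 6.0, 8.0, 10.0, 12.0], [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]]
block = tile_correlations(tile_a, tile_b, chunk=4)
```

The chunk width only changes the order of the additions, so any value gives the same result up to floating-point rounding.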

MetaProf v7.1 – Load balancing

[Figure: input matrix (genes × samples) → correlation computation → output matrix (genes × genes), divided among MPI rank 0 … MPI rank n.]

MetaProf v7.1 - Data compute and store

[Figure: MPI process 0 runs the CUDA correlation kernel on the part of the output matrix assigned to MPI rank 0 and stores the result in its own binary file (process 0 bin file).]

MetaProf v7.1 - Benchmarks

• CURIE hybrid cluster:
    • 2 x Intel Westmere per node
    • 2 x Tesla 2090 GPU (Fermi - 512 CUDA cores) per node

Time for 1 000 000 genes & 800 samples (chart, y axis: time in sec)

MPI processes (2 per node) | 4 | 8 | 16 | 32 | 64 | 128
Correlation compute time | ~1h 17 min | ~38 min | ~19 min | ~10 min | ~6 min | ~3 min
Total execution time | ~1h 23 min | ~44 min | ~25 min | ~16 min | ~12 min | ~9 min
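The chart values can be turned into speedup and parallel efficiency figures. The sketch below assumes the lower series on the chart is the correlation compute time and the upper one the total execution time, and uses the rounded minute values shown on the slide, so the results are approximate.

```python
# Timings read off the v7.1 chart, in minutes (approximate, as on the slide).
processes = [4, 8, 16, 32, 64, 128]
compute_min = [77, 38, 19, 10, 6, 3]
total_min = [83, 44, 25, 16, 12, 9]


def scaling(times, procs):
    """Speedup and parallel efficiency relative to the smallest run."""
    base_t, base_p = times[0], procs[0]
    rows = []
    for t, p in zip(times, procs):
        speedup = base_t / t
        efficiency = speedup / (p / base_p)
        rows.append((p, round(speedup, 1), round(efficiency, 2)))
    return rows


# Time not spent in the correlation kernel (input loading, output, setup).
overhead_min = [t - c for t, c in zip(total_min, compute_min)]

print(overhead_min)                          # [6, 6, 6, 6, 6, 6]
print(scaling(compute_min, processes)[-1])   # (128, 25.7, 0.8)
print(scaling(total_min, processes)[-1])     # (128, 9.2, 0.29)
```

With these readings the non-compute part stays at about 6 minutes whatever the process count, so it dominates the total once the compute time has shrunk to a few minutes.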

MetaProf v7.1 - Weaknesses

• Input matrix is still a text file!

• Each MPI process loads the entire input matrix

    High memory occupation with 3.3 million genes

    Loading takes more than 50% of total execution time with 1 million genes (when running on 128 MPI processes)

• CUDA computation is not optimal

    Only one kernel, too many registers used

    Some memory accesses are not coalesced
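A rough estimate shows the memory cost of every process holding the whole input matrix. The figures below assume a dense matrix of 4-byte floats, 800 samples (the sample count used in the benchmarks) and 2 MPI processes per node; none of these storage details are stated in the slides.

```python
def dense_matrix_gib(n_genes, n_samples, bytes_per_value=4):
    """Size in GiB of a dense n_genes x n_samples matrix."""
    return n_genes * n_samples * bytes_per_value / 2**30


per_process = dense_matrix_gib(3_300_000, 800)  # one full copy
per_node = 2 * per_process                      # assumed 2 processes per node

print(round(per_process, 1))  # 9.8
print(round(per_node, 1))     # 19.7
```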


MetaProf v7.2

• Number of genes >> number of samples: an input block can fit in texture memory, which allows use of the texture cache.

• We need a computation order that lets us load only part of the input matrix, in order to reduce I/O and memory requirements.

    Domain decomposition is based solely on the output matrix (upper triangular).

• For load balancing in MPI we use a divide-and-conquer approach.

• MPI tiles are in turn divided into CUDA blocks.

MetaProf v7.2 - Load balancing

[Figure: recursive split of the upper-triangular output matrix among MPI processes. Region labels recoverable from each panel:

For N processes: N/4 processes, N/2 processes, N/4 processes.

For N = 1 process: MPI process 0.

For N = 2 processes: MPI process 0, MPI process 1, MPI process 0.

For N = 4 processes: MPI process 0, MPI process 1, MPI process 3, MPI process 2.

For N = 8 processes: MPI 0, MPI 1, MPI 3, MPI 2, MPI 0, MPI 4, MPI 5, MPI 6, MPI 7, MPI 6.

For N = n processes: correlation computation → output matrix shared among MPI rank 0, MPI rank 1, …, MPI rank n-1.]
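One possible reading of the divide-and-conquer figure is sketched below: the upper triangle is cut into two half-size triangles on the diagonal and one off-diagonal square; a quarter of the processes recurse into each triangle and half of them share the square. The way the square is cut into horizontal strips, the power-of-two restriction and all function names are assumptions made for this sketch, not details confirmed by the slides.

```python
def split_square(r0, r1, c0, c1, procs):
    """Cut a full rectangular block into one horizontal strip per process."""
    rows = r1 - r0
    k = len(procs)
    return [
        (p, "square", (r0 + rows * i // k, r0 + rows * (i + 1) // k, c0, c1))
        for i, p in enumerate(procs)
    ]


def split_triangle(lo, hi, procs):
    """Assign the gene pairs lo <= i < j < hi to the given processes."""
    if len(procs) == 1:
        return [(procs[0], "triangle", (lo, hi))]
    mid = (lo + hi) // 2
    if len(procs) == 2:
        # Two small triangles together weigh about as much as the square.
        return [
            (procs[0], "triangle", (lo, mid)),
            (procs[1], "square", (lo, mid, mid, hi)),
            (procs[0], "triangle", (mid, hi)),
        ]
    q = len(procs) // 4
    return (
        split_triangle(lo, mid, procs[:q])
        + split_square(lo, mid, mid, hi, procs[q:len(procs) - q])
        + split_triangle(mid, hi, procs[len(procs) - q:])
    )


def pair_count(kind, bounds):
    """Number of correlations contained in one region."""
    if kind == "triangle":
        m = bounds[1] - bounds[0]
        return m * (m - 1) // 2
    r0, r1, c0, c1 = bounds
    return (r1 - r0) * (c1 - c0)


def loads(n_genes, n_procs):
    """Correlations assigned to each process (n_procs must be a power of two)."""
    if n_procs < 1 or n_procs & (n_procs - 1):
        raise ValueError("this sketch only handles a power-of-two process count")
    work = {p: 0 for p in range(n_procs)}
    for proc, kind, bounds in split_triangle(0, n_genes, list(range(n_procs))):
        work[proc] += pair_count(kind, bounds)
    return work


work = loads(1024, 8)
print(sorted(set(work.values())))  # [65280, 65536]
```

For 1024 genes and 8 processes every process receives either 65280 or 65536 correlations, and the shares add up to the 523776 pairs of the full upper triangle.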

MetaProf v7.2 – Data compute

[Figure: one MPI process; input blocks labelled X and Y feed the CUDA correlation kernel.]

MetaProf v7.2 – Kernel improvements

Read memory accesses go through two 2D textures in GPU global memory

Coalesced writes

2 kernels to avoid internal synchronization and complex index computation

Parameter optimization

Particularity | GT200 | GF100 | Consequence
Scheduling unit | Half-warp (16 threads) | Warp (32 threads) | Low-level tiling using a size of 16 or 32 in the x direction
Multiprocessor count | 30 | 14 | Block number must be adapted
Number of registers | 16384 / block | 32768 / block | Some constraints on GT200 are relaxed (easier access to high occupancy)

Parameter | GT200 | GF100
Block size | 16x16x1 | 32x8x1
Grid size | 160x160 | 80x320
MPI tile size | 2560x2560 | 2560x2560
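The launch parameters in the table can be checked against the MPI tile size with a little arithmetic. The sketch assumes one thread per element of the output tile, which the slides do not state explicitly; the dictionary layout and function names are illustrative.

```python
# Block and grid dimensions copied from the table above.
configs = {
    "GT200": {"block": (16, 16, 1), "grid": (160, 160)},
    "GF100": {"block": (32, 8, 1), "grid": (80, 320)},
}


def covered_tile(block, grid):
    """Output elements covered in x and y, assuming one thread per element."""
    return block[0] * grid[0], block[1] * grid[1]


def threads_per_block(block):
    """Total number of threads in one block."""
    return block[0] * block[1] * block[2]


for name, cfg in configs.items():
    print(name, covered_tile(cfg["block"], cfg["grid"]), threads_per_block(cfg["block"]))
# GT200 (2560, 2560) 256
# GF100 (2560, 2560) 256
```

Both configurations use 256 threads per block and cover the same 2560x2560 tile; only the block shape changes, with an x width of 16 on GT200 and 32 on GF100, matching the scheduling unit listed for each architecture.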

[Figure: correlation compute time for 1 000 000 genes & 800 samples, metaprof 7.2 vs metaprof 7.1. x axis ticks: 8, 16, 32, 64, 128; y axis: time in sec. Data labels shown: ~48 min, ~25 min, ~12 min, ~6 min, ~3 min.]

[Figure: total execution time for 1 000 000 genes & 800 samples, metaprof 7.2 vs metaprof 7.1. x axis ticks: 8, 16, 32, 64, 128; y axis: time in sec. Data labels shown: ~51 min, ~26 min, ~14 min, ~8 min, ~5 min.]




Time for 3 299 823 genes & 800 samples (chart, y axis: time in sec)

x axis | 32 | 64 | 128
Correlation compute time | ~2h 11 min | ~1h 7 min | ~33 min
Total execution time | ~2h 19 min | ~1h 12 min | ~40 min




Conclusion

GPU programming level:

• FERMI: more than 100 GFlops DP per GPU

• More than 10^16 operations in less than 40 min

• 20% of theoretical max performance w/t I/O

Application level:

• MetaProf GPU version applied by the INRA MICALIS team to real-life case studies

• Results to be published in 2012: species characterization

Future work:

• Benchmark on OpenGPU blade provided by BULL

• Subsequent analysis pipeline should be integrated in CUDA MetaProf

Sébastien Monot Tarik Saidani Victor Arslan Benjamin Rat

Dany Tello Vincent Ducrot

Dusko Ehrlich Sean Kennedy Nicolas Pons

Nathalie Galleron Benoît Quinquis

BAC TEAM

Pierre Renault Bioinformatics Emmanuelle Le Chatellier Mathieu Almeida

Biology Christine Delorme

Eric Guédon Séverine Layec Céline Gautier

Nicolas Sanchez

Jean-Michel Batto

Pierre Léonard

Bouziane Moumen

http://www.netvibes.com/metahit#Live_News

http://twitter.com/metagenomics

http://paper.li/metahit/microbiomics
