CLCAR, Panama R.P. · Tsubame 2.0 4,224 Tesla GPUs + 2,816 x86 CPUs 12,784 x86 CPUs Hopper- NERSC...
CLCAR, Panama R.P.
Michael P. Lasen
NVIDIA Professional Solutions Group, Latin America
NVIDIA Processors
• GeForce™ and Quadro™: Visual Computing
• Tesla™: Supercomputing
• Tegra™: Mobile
Why are we talking about high-performance computing,
and how does it, or could it, affect your life?
Scientific research requires 1,000X more computing power.
• Renewable Energy
• Personalized Medicine
• Tools for Scientific Discovery
• Complex Information Management
• Machines That Think
• Natural Human-Machine Interaction
• Prediction of Environmental Change
• Economic and Financial Analysis
Example: Drug Discovery, Simulating a Single Bacterium
[Chart: computing power in gigaflops (log scale, 1 to 1,000,000,000) by year (1982, 1997, 2003, 2006, 2010, 2012), with reference levels at 1 TeraFLOPS, 1 PetaFLOPS, and 1 ExaFLOPS.]
Systems simulated, by size:
• BPTI: 3K atoms
• Estrogen Receptor: 36K atoms
• F1-ATPase: 327K atoms
• Ribosome: 2.7M atoms
• Chromatophore: 50M atoms
• Bacteria: 100s of chromatophores
Chart annotation: Ran for 8 months to simulate 2 nanoseconds
HPC: why you should care
• HPC is a national priority worldwide. Governments are investing heavily, and with good reason.
• HPC is used today more than ever for scientific discovery, in both the private and public sectors.
• HPC is cheaper than ever. It is available and within reach of almost anyone. With advances in heterogeneous technologies, the density of efficient cores is very high and the cost per FLOP is very low.
• HPC is a competitive advantage in academia. Research funding increasingly depends on this technology.
HPC: are you already using it?
• Folding@home: 6.1 PFLOPS
• MilkyWay@Home: 700 TFLOPS
• SETI@Home: 540 TFLOPS
• Einstein@Home: 260 TFLOPS
• GIMPS: 86 TFLOPS
A Whole Yotta FLOPS
NAME         FLOPS
yottaFLOPS   10^24
zettaFLOPS   10^21
exaFLOPS     10^18
petaFLOPS    10^15
teraFLOPS    10^12
gigaFLOPS    10^9
megaFLOPS    10^6
kiloFLOPS    10^3
Petascale (today): a computer system capable of reaching performance in excess of one petaFLOPS, that is, one quadrillion floating point operations per second.
Exascale = Petascale x 1,000: one exaFLOPS is a thousand petaFLOPS, or 10^18 FLOPS.
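The prefix arithmetic on these slides can be checked with a short Python sketch. The exponents come from the prefix table in this deck; the names `FLOPS_PREFIX_EXPONENT`, `to_flops`, and `scale_ratio` are illustrative helpers introduced here, not part of any library.

```python
# Powers of ten for each FLOPS prefix, as listed in the slide's table.
FLOPS_PREFIX_EXPONENT = {
    "kilo": 3, "mega": 6, "giga": 9, "tera": 12,
    "peta": 15, "exa": 18, "zetta": 21, "yotta": 24,
}


def to_flops(value, prefix):
    """Convert a prefixed figure such as (1, 'peta') to raw FLOPS."""
    return value * 10 ** FLOPS_PREFIX_EXPONENT[prefix]


def scale_ratio(bigger, smaller):
    """How many `smaller`-scale units fit into one `bigger`-scale unit."""
    return 10 ** (FLOPS_PREFIX_EXPONENT[bigger] - FLOPS_PREFIX_EXPONENT[smaller])


if __name__ == "__main__":
    print(to_flops(1, "peta"))          # one petaFLOPS in raw FLOPS
    print(scale_ratio("exa", "peta"))   # exascale relative to petascale
```

Under these definitions, `scale_ratio("exa", "peta")` reproduces the "Petascale x 1,000" relation stated on the slide.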
CPU + GPU
Heterogeneous computing. Accelerates applications.
Two Supercomputers Built at the Same Time
Tsubame 2.0: 4,224 Tesla GPUs + 2,816 x86 CPUs; 1.4 megawatts (2,060 homes in Japan). The greenest petaflop supercomputer in the world.
Hopper (NERSC): 12,784 x86 CPUs; 4.0 megawatts (5,860 homes in Japan).
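As a sanity check on the comparison of the two machines, the megawatt and household figures quoted on the slide imply a per-home power draw that should be roughly the same for both. The sketch below only divides the slide's own numbers; the dictionary layout and the `watts_per_home` helper are assumptions made for illustration.

```python
# Figures as quoted on the slide: total power draw and equivalent homes in Japan.
systems = {
    "Tsubame 2.0": {"megawatts": 1.4, "homes": 2060},
    "Hopper (NERSC)": {"megawatts": 4.0, "homes": 5860},
}


def watts_per_home(megawatts, homes):
    """Average power per household implied by the slide's equivalence."""
    return megawatts * 1_000_000 / homes


if __name__ == "__main__":
    for name, figures in systems.items():
        print(f"{name}: {watts_per_home(**figures):.0f} W per home")
```

Both systems come out at just under 700 W per home, so the two household equivalences on the slide use a consistent conversion.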
Worldwide GPU Supercomputer Momentum
[Timeline milestones: Tesla GPUs launched; first double-precision GPU; Tesla 20-series (Fermi) launched.]
Who Uses GPU Supercomputing?
Sectors: Edu/Research, Government, Oil & Gas, Life Sciences, Finance, Manufacturing
Organizations shown: Chinese Academy of Sciences, Air Force Research Laboratory, Naval Research Laboratory, Max Planck Institute, Mass General Hospital
What Commercial Apps are They Running on GPU?
Molecular Dynamics: AMBER ● CHARMM ● DL_POLY ● GAMESS-US ● GROMACS ● LAMMPS ● NAMD
Fluid Dynamics: Altair Acusolve ● Autodesk Moldflow ● OpenFOAM ● Prometech Particlework ● Turbostream
Earth Sciences: ASUCA ● HOMME ● NASA GEOS-5 ● NOAA NIM ● WRF
Engineering Simulation: Agilent EMPro ● ANSYS Mechanical ● ANSYS Nexxim ● CST Microwave Studio ● Impetus AFEA ● Remcom XFdtd ● SIMULIA Abaqus
Others: GADGET2 ● MATLAB ● Mathematica ● NBODY ● Paradigm VoxelGeo ● PARATEC ● Schlumberger Petrel
NAMD is much faster: 7x speedup with GPUs
Benchmarks: ApoA-1 (92,224 atoms), STMV (1,066,628 atoms)
Test Platform: 1 node, dual Tesla M2090 GPU (6 GB), dual Intel 4-core Xeon (2.4 GHz), NAMD 2.8, CUDA 4.0, ECC on. Visit www.nvidia.com/simcluster for more information on speedup results, configuration, and test models.
NAMD 2.8 B1 + unreleased patch, STMV benchmark. A node is a dual-socket, quad-core x5650 with 2 Tesla M2070 GPUs. Performance numbers are for 2 M2070 + 8 cores (GPU+CPU) vs. 8 cores (CPU).
What’s next in supercomputing?
On October 11, 2011, the Oak Ridge National Laboratory announced it was building a 20 petaFLOPS supercomputer, named Titan, which will become operational in 2012. The hybrid Titan system will combine Opteron processors with “Kepler” NVIDIA Tesla graphics processing unit (GPU) technologies.
Given the current speed of progress, supercomputers are projected to reach 1 exaFLOPS (EFLOPS) in 2019. Cray, Inc. announced in December 2009 a plan to build a 1 EFLOPS supercomputer before 2020.
Erik P. DeBenedictis of Sandia National Laboratories theorizes that a zettaFLOPS (ZFLOPS) computer is required to accomplish full weather modeling over a two-week time span. Such systems might be built around 2030.
YottaFLOPS? Finally, the complete simulation of the human brain.
Titan at Oak Ridge National Labs
World’s Top Open Science Computing Research Facility
• 2x faster, 3x more efficient per watt
• More efficient than today's #1 supercomputer (K Computer)
• 18,000 Tesla GPUs
• 20+ petaflops
• ~90% of the FLOPS come from the GPUs
Power Crisis in Supercomputing
Year   Scale      Power              Household Power Equivalent
1982   Gigaflop   60,000 Watts       Block
1996   Teraflop   850,000 Watts      Neighborhood
2008   Petaflop   7,000,000 Watts    Town
2020   Exaflop    25,000,000 Watts   City
Exascale with CPUs today: 2 Gigawatts (Hoover Dam)
DATA: U.S. Dept. of Energy
ARM Enables Energy Efficient Computing
Personal Computing, ARM Servers
ARM is Pervasive and Open
[Chart: annual shipments, units in billions, ARM vs. x86. Source: ARM, Mercury Research, NVIDIA]
Project Denver: NVIDIA-Designed High Performance ARM CPU
[Chart: performance (log scale, 1 to 100) by year (2010 to 2014) for Tegra 2, Tegra 3, Wayne, Logan, and Stark, with Core 2 Duo and Core i5 shown as reference levels.]
CARMA DevKit: CUDA for ARM Development Kit
CUDA GPU + Tegra ARM CPU
• Tegra 3 quad-core ARM A9
• Quadro 1000M (96 CUDA cores)
• Ubuntu
• Gigabit Ethernet
• SATA connector
• HDMI, DisplayPort, USB
Pre-register on www.nvidia.com/CARMADevKit
Launch Q2 2012
World’s First ARM CPU / CUDA GPU Supercomputer
Mont Blanc Research Project
Exploring energy-efficient supercomputer architectures for exascale
ARM CPU + GPU Prototype
256 ARM CPUs + GPUs
http://www.montblanc-project.eu
http://www.eesi-project.eu/media/BarcelonaConference/Day2/13-Mont-Blanc_Overview.pdf
HPC: how much does it cost?
2000: Sandia National Lab, ASCI Red, 2 TFlops (DP)
2011: NVIDIA Personal SuperComputer, 2 TFlops (DP)

                    ASCI Red                 Personal SuperComputer
Performance         2 TFlops                 2 TFlops
Compute nodes       4736                     1
Processor type      Pentium II               Tesla C2075
Processor count     9472                     4
Enclosure           104 racks                1 workstation
Floor space         230 m2                   0.12 m2
Power consumption   0.85 MW                  1400 W
Cost                US$ 100 to 200 million   US$ 30K
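The gap between the two 2 TFlops systems is easier to see as ratios. The following Python sketch uses only the figures from the comparison above; the dictionary keys and the helpers `ratio` and `gflops_per_watt` are names chosen here for illustration.

```python
# Figures from the slide's ASCI Red vs. Personal SuperComputer comparison.
asci_red = {"tflops": 2.0, "watts": 0.85e6, "area_m2": 230.0, "processors": 9472}
personal = {"tflops": 2.0, "watts": 1400.0, "area_m2": 0.12, "processors": 4}


def ratio(key):
    """How many times larger the ASCI Red figure is than the personal system's."""
    return asci_red[key] / personal[key]


def gflops_per_watt(system):
    """Energy efficiency in GFLOPS per watt (1 TFlops = 1000 GFLOPS)."""
    return system["tflops"] * 1000 / system["watts"]


if __name__ == "__main__":
    for key in ("watts", "area_m2", "processors"):
        print(f"{key}: ASCI Red is {ratio(key):,.0f}x larger")
    print(f"ASCI Red: {gflops_per_watt(asci_red):.4f} GFLOPS/W")
    print(f"Personal: {gflops_per_watt(personal):.2f} GFLOPS/W")
```

With the slide's numbers, the same performance is delivered with roughly 600 times less power and in about 1,900 times less floor space.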
Homemade Desktop Supercomputer with Tesla
Universidad Industrial de Santander, Bucaramanga, Colombia
• 8 nodes
• 2 Xeon + 8 Tesla C2050
• 24 CPU + 64 GPU
• 52 TeraFLOPS
A Professional GPU Cluster
Getting Started with GPU and CUDA
• Try: Try CUDA on a laptop or desktop computer with a GPU.
• Develop: Optimize applications on a workstation with Tesla GPUs.
• Scale: Run applications on a GPU cluster for massively parallel computing.
Performance, Efficiency, Accessibility
KEPLER
Tesla CUDA Architecture Roadmap
[Chart: DP GFLOPS per watt (axis 2 to 16) by year (2008 to 2014) for T10, Fermi, Kepler, and Maxwell.]
3x performance per watt
SM (Fermi): control logic + 32 cores; max 512 cores per GPU
SMX (Kepler): control logic + 192 cores; max 1536 cores per GPU
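The per-multiprocessor and per-GPU core counts on this slide imply how many multiprocessors a full chip carries. The Python sketch below simply divides the slide's figures; `max_multiprocessors` is a hypothetical helper, not an NVIDIA API.

```python
def max_multiprocessors(max_cores_per_gpu, cores_per_multiprocessor):
    """Number of multiprocessors implied by the slide's core counts."""
    units, remainder = divmod(max_cores_per_gpu, cores_per_multiprocessor)
    if remainder:
        raise ValueError("core count is not a whole number of multiprocessors")
    return units


# Figures from the slide: Fermi SM has 32 cores, Kepler SMX has 192 cores.
fermi_sm_count = max_multiprocessors(512, 32)
kepler_smx_count = max_multiprocessors(1536, 192)

if __name__ == "__main__":
    print(f"Fermi: {fermi_sm_count} SMs of 32 cores")
    print(f"Kepler: {kepler_smx_count} SMXs of 192 cores")
    print(f"Cores per multiprocessor grew {192 // 32}x")
```

So Kepler reaches three times the core count with half as many, much wider, multiprocessors, each sharing one block of control logic.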
Kepler: Fast and Efficient
Tesla K10 vs M2090: 2x Performance / Watt
Tesla K10 (available now)
• 3x single precision
• 1.8x memory bandwidth
• Imaging, signal, seismic
Tesla K20 (available Q4 2012)
• 3x double precision
• Hyper-Q, Dynamic Parallelism
• CFD, FEA, finance, physics
Tesla K10: Same Power, 2x the Performance of Fermi

Product Name               M2090          K10 (board)     K10 (per GPU)
GPU architecture           Fermi          Kepler GK104
# of GPUs                  1              2
Single-precision flops     1.3 TF         4.58 TF         2.29 TF
Double-precision flops     0.66 TF        0.190 TF        0.095 TF
# CUDA cores               512            3072            1536
Memory size                6 GB           8 GB            4 GB
Memory bandwidth (no ECC)  177.6 GB/s     320 GB/s        160 GB/s
PCI-Express                Gen 2: 8 GB/s  Gen 3: 16 GB/s
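The headline multipliers for the K10 can be recomputed from the board-level figures in the specification table. This Python sketch uses only those numbers; the dictionary layout and the `speedup` helper are assumptions made for illustration.

```python
# Board-level figures from the slide's M2090 vs. K10 comparison.
m2090 = {"sp_tflops": 1.3, "dp_tflops": 0.66, "cuda_cores": 512, "mem_gb_s": 177.6}
k10_board = {"sp_tflops": 4.58, "dp_tflops": 0.190, "cuda_cores": 3072, "mem_gb_s": 320.0}


def speedup(key):
    """K10 board figure divided by the M2090 figure for the same metric."""
    return k10_board[key] / m2090[key]


if __name__ == "__main__":
    print(f"Single precision: {speedup('sp_tflops'):.1f}x")
    print(f"Memory bandwidth: {speedup('mem_gb_s'):.1f}x")
    print(f"Double precision: {speedup('dp_tflops'):.2f}x")
    print(f"CUDA cores: {speedup('cuda_cores'):.0f}x")
```

The single-precision and bandwidth ratios line up with the "3x" and "1.8x" claims made for the K10, while the double-precision ratio is below 1, which fits the deck's positioning of the K20, not the K10, for double-precision workloads.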