[Group C] A Closer Look at GPGPU

Time: March 30, 2016 from 6:30pm to 9pm
Location: Dream Center (IPO hall)
Event Type: seminar
Organized By: Byung-Hak Kim

0630

Introductions for everyone
Dinner: boxed meals

0710

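Hello, Group C members!

The AlphaGo vs. Lee Sedol (9-dan) matches, broadcast live through last week, drew a great deal of attention. The brain that made AlphaGo's deep learning possible is reported to be more than 100 GPUs (Graphics Processing Units).

A single AlphaGo GPU is said to compute at least 8 times faster than the CPU in a personal computer, so playing one move of Go is like using at least 1,000 top-end computers at once.

As the name suggests, GPUs are specialized for graphics. The core idea of GPGPU (General-Purpose GPU) is to use them for a wider range of purposes beyond 3D. Unlike a CPU, which processes code sequentially, a GPGPU distributes the work across many cores, and it is drawing particular attention in supercomputing.

So Group C has prepared its second seminar of 2016 under the title "A Closer Look at GPGPU!". We invite you to hear about GPGPU architecture, the GPGPU computing platform CUDA, and energy-efficient GPGPU computing from a speaker who has done research and development in this field at a university.

Speaker: Prof. Hyeran Jeon (전혜란) (San Jose State University, website)

Time: 3/30 (Wed) 6:30 - 9:00
Location: Dream Center (IPO hall), 1290 Reamwood Ave. Bldg B, Sunnyvale, CA 94089
Fee: $15 (dinner and drinks provided), $10 (on-site registration; dinner is not provided)
Payment: To help us prepare, please pay via PayPal ([email protected]) by 3/29 (Tue).
RSVP: http://www.bayareakgroup.org/events/group-c-seminar-a-closer-look-at-gpgpu

Thank you.
The Group C organizers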

0710

Prof. Hyeran Jeon (전혜란)
The hardware job market is not doing well. Graduate students are trying to move toward software.
AlphaGo Deep learning
why Go was so challenging
- 250^150 ≈ 10^360 possible sequences
- more than the number of atoms in the universe! (≈ 10^80)
- can't traverse all possible moves
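The size comparison above can be checked with a few lines of C. This is only a sketch: the function name decimal_digits_of_power is made up for illustration, and it multiplies the power out digit by digit rather than relying on floating point.

```c
#include <stddef.h>

/* Returns the number of decimal digits of base^exp,
 * or 0 if the result does not fit in the internal buffer. */
size_t decimal_digits_of_power(unsigned base, unsigned exp)
{
    enum { MAX_DIGITS = 1024 };
    unsigned char digits[MAX_DIGITS] = {1};  /* little-endian decimal digits, starts at 1 */
    size_t len = 1;

    for (unsigned e = 0; e < exp; ++e) {
        unsigned long carry = 0;
        for (size_t i = 0; i < len; ++i) {
            unsigned long v = (unsigned long)digits[i] * base + carry;
            digits[i] = (unsigned char)(v % 10);
            carry = v / 10;
        }
        while (carry > 0) {                  /* grow the number as needed */
            if (len == MAX_DIGITS)
                return 0;
            digits[len++] = (unsigned char)(carry % 10);
            carry /= 10;
        }
    }
    return len;
}
```

Calling decimal_digits_of_power(250, 150) gives 360 digits, while 10^80 has only 81, which is why exhaustive search over move sequences is out of reach.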
why AlphaGo was different
- effectively reduced the depth and breadth of the search tree via convolutional neural networks
  - the board position is treated as a 19x19 image
  - trained on human expert games (supervised learning) and self-play (reinforcement learning)
  - 2 networks
    - value network: evaluating positions
    - policy network: sampling actions ("Mastering the game of Go with deep neural networks and tree search", Nature, 2016) => action item***
CNN: convolutional neural network
- widely used in image processing
- face recognition
- NVIDIA cuDNN
- with 192 filters
CNN
- created from large numbers of identical neurons that are highly parallel by nature
- maps naturally onto GPUs, with significant speed-up over CPU-only training
- AlphaGo used 1202 CPUs and 176 GPUs
- NVIDIA cuDNN, Caffe, TensorFlow, Torch
- NVIDIA cuDNN on GPU = 14x (cuDNN on K40 vs. a 24-core CPU, E5-2697 v2)
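software -> deciding what to send to the GPU vs. the CPU
- today, people do all of this by hand
- around 2007-2008 there were a few papers on this
- the GPU is used only to run the neural networks, and the CPU runs the Monte Carlo part
- adding more hardware does not make things linearly faster
- parallelizing an application does not always mean it can use all 100 or 200 cores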
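One standard way to put numbers on the last two points is Amdahl's law. The notes do not record the speaker citing it, so treat this as an illustrative model rather than part of the talk; the function name amdahl_speedup is hypothetical.

```c
/* Amdahl's law: speedup on `cores` cores when a fraction
 * `parallel_fraction` (0.0 to 1.0) of the work can run in parallel.
 * Returns 0.0 for invalid arguments. */
double amdahl_speedup(double parallel_fraction, unsigned cores)
{
    if (cores == 0 || parallel_fraction < 0.0 || parallel_fraction > 1.0)
        return 0.0;
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / (double)cores);
}
```

With 90% of the work parallel, 100 cores give a speedup of only about 9.2, and doubling to 200 cores still stays below 10, because the serial 10% dominates.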
GPU, high performance computing engine
- GFLOP/s graph
- advantage: it's everywhere, so you don't have to buy one separately
- tree -> the CPU is better: depth first, dependencies
- you need to know which parts of your own application are a good fit for the GPU
- CNN:
  - the 19x19 matrix was simply fed in as an image
  - much like texturing
supercomputer with GPU
- Titan -> world #2 supercomputer
CPU vs GPU
- OoO CPU Architecture: lines
- GPU (nvidia fermi) architecture: no fancy branch predictor, in-order execution
  - techreport.com/review//nvida-fermi-gpu-archiecture-revealed
History
- CPU: multiple powerful cores
- GPU: thousands of simple cores
  - an image -> a combination of pixels
  - each core handles one pixel
  - every GPU core performs the same operation, differing only in the value it holds (brightness + 20)
  - SIMD function
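The "same operation, different data" idea can be sketched on the CPU as follows. The function names are hypothetical, and clamping at 255 is an assumption added here so the result stays a valid 8-bit pixel; the notes only mention "brightness + 20".

```c
#include <stddef.h>

/* What a single GPU core would do for its own pixel. */
static unsigned char brighten_pixel(unsigned char value, int delta)
{
    int v = (int)value + delta;
    if (v < 0)   v = 0;
    if (v > 255) v = 255;      /* assumed saturation, not from the talk */
    return (unsigned char)v;
}

/* CPU stand-in: the loop visits pixels one at a time, whereas on a GPU
 * each iteration would be handled by a different core at the same time. */
void brighten_image(unsigned char *pixels, size_t count, int delta)
{
    for (size_t i = 0; i < count; ++i)
        pixels[i] = brighten_pixel(pixels[i], delta);
}
```

No iteration depends on another, which is what makes this kind of work map so well onto thousands of simple cores.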
Typical GPU function
- vertex
- geometry
- pixel
- render output unit(ROP)
- framebuffer
traditional GPU architecture
- <- instruction
underutilization issue in the traditional GPU architecture
- workloads have different processing loads
- heavy geometry workload
- heavy pixel workload
- for a heavy geometry load: 24 points / 8 geometry processors = 3 cycles
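The cycle count in the last bullet is a ceiling division. A minimal sketch, assuming one work item per processor per cycle (the function name cycles_needed is made up for illustration):

```c
/* Cycles needed to process `work_items` on `processors` identical units,
 * one item per unit per cycle. Returns 0 when there are no processors. */
unsigned cycles_needed(unsigned work_items, unsigned processors)
{
    if (processors == 0)
        return 0;
    return work_items / processors + (work_items % processors != 0);
}
```

cycles_needed(24, 8) is 3, matching the slide. If all 24 units could take geometry work, cycles_needed(24, 24) would be 1, which is the gap a fixed-function split leaves on the table.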
GPU pipeline evolution
- year 2006
- before
  - Vertex, Geometry, Pixel: use the simple
- after
  - unified shader model
  - any work can be performed on any shader core -> high performance computing
  - nothing but ALUs
unified shader model
- 24 cores, all identical [G]
  - 24 points / 24 = 1 cycle
- another type of underutilization => memory wall
- memory operations take too long
- shader cores must sit idle until the requested data arrives
- underutilization issue in the unified shader model
stall latency hiding by using multiple thread groups
- thread group 1, , stall, runnable
- now, there's no idle period
- but a larger register file is needed to maintain the context of multiple thread groups
A large register file is put in the GPU so that thread groups can time-share the cores.
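A toy model makes the trade-off concrete. It assumes every thread group alternates a fixed number of compute cycles with a fixed number of stall cycles, and that the core can switch to any ready group for free; both function names and the 4-byte register size are assumptions for illustration, not figures from the talk.

```c
/* Steady-state core utilization (0 to 100 percent) when `groups` thread
 * groups each compute for `compute` cycles and then stall for `stall` cycles. */
unsigned core_utilization_percent(unsigned groups, unsigned compute, unsigned stall)
{
    unsigned long period = (unsigned long)compute + stall;
    unsigned long busy   = (unsigned long)groups * compute;
    if (period == 0 || busy == 0)
        return 0;
    if (busy >= period)
        return 100;            /* enough groups to cover every stall */
    return (unsigned)(busy * 100 / period);
}

/* Register storage needed to keep every group's context resident,
 * assuming 4-byte registers. */
unsigned long register_file_bytes(unsigned groups, unsigned threads_per_group,
                                  unsigned regs_per_thread)
{
    return (unsigned long)groups * threads_per_group * regs_per_thread * 4UL;
}
```

With 10 compute cycles and 40 stall cycles, one group keeps the core 20% busy and five groups reach 100%, but the register storage grows linearly with the number of resident groups.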
current GPU vs traditional GPU
- unified shader core
- stall latency hiding: when a group stalls, work on another group
  - TFLOPS of peak performance
- A single GPU can run several MB worth of threads. The compiler analyzes how many registers and how many bytes of shared memory the code uses.
- Data is moved from memory into registers.
- Each thread's private memory lives in registers.
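Those per-kernel register and shared memory figures determine how many thread blocks can be resident at once. The sketch below is a deliberately simplified estimate: the function name is hypothetical, and it ignores allocation granularity and the hardware caps on blocks and warps per SM that a real occupancy calculation includes.

```c
/* Rough number of thread blocks that fit on one SM at the same time,
 * limited only by registers and shared memory. */
unsigned resident_blocks(unsigned regs_per_thread, unsigned threads_per_block,
                         unsigned smem_per_block,
                         unsigned regs_per_sm, unsigned smem_per_sm)
{
    unsigned long regs_per_block = (unsigned long)regs_per_thread * threads_per_block;
    if (regs_per_block == 0)
        return 0;

    unsigned long by_regs = regs_per_sm / regs_per_block;
    if (smem_per_block == 0)
        return (unsigned)by_regs;          /* shared memory is not a limit */

    unsigned long by_smem = smem_per_sm / smem_per_block;
    return (unsigned)(by_regs < by_smem ? by_regs : by_smem);
}
```

For example, with Fermi-class inputs of 32768 registers and 48 KB of shared memory per SM (used here only as example values), a kernel using 32 registers per thread, 256 threads per block, and 16 KB of shared memory per block is limited to 3 resident blocks by shared memory, not by registers.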
State of the art GPU
- Fermi: 2010 architecture (NVIDIA cores)
  - NVIDIA Tesla
  - warp schedulers: warps are what get scheduled, and there are several scheduling policies
  - large register file
  - identical scalar cores
  - separate memory cores
  - L1/L2 caches
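QA
- AMD -> released a unified memory system first
  - built the APU
  - NVIDIA arranges for the virtual memory addresses to sit in the same address space
- AMD vs. NVIDIA
  - AMD uses OpenCL
  - NVIDIA -> warp, AMD (OpenCL) -> wavefront
- going lockless across threads
  - the GPU mostly handles it serially
  - there were originally no L1/L2 caches, but cache coherence has since been added. It works the same as a CPU cache.
- If you are doing deep learning, is AMD or NVIDIA better?
  - NVIDIA has opened up the market very aggressively. Performance-wise, the alternatives cannot keep up with CUDA.
- Units? It used to be polygons per second and pixels per second, and these days GFLOPS is used. Is there any other unit?
  - operations per watt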
Then what is GPGPU?
- unified shader -> general purpose computing
- General Purpose Computing on GPU
  - GPU + programmability (CUDA, OpenCL)
- CUDA: CUDA C and API
- OpenCL
Terminology
- CUDA C | OpenCL | what it means
- kernel | kernel |
- thread block | work group |
- thread | work item |
- global memory | global memory |
- shared memory | local memory |
- warp | wavefront (AMD term) |
system organization and GPGPU execution model
- GPGPU communicates with CPU via PCI-E bus
  - SM (Streaming Multiprocessor) =>
    - SP, LD/ST, SFU, Register File, Shared Memory, Warp
    - SP => arithmetic operations (int, float), ALU (32-bit), Fermi (64-bit)
    - LD/ST => memory operations
    - SFU => complex operations
- when the CPU invokes the kernel, which plays the role of main,
  - Thread block (CTA) warp -> thread block
QA
GPU size
- Jetson board: Tegra (mobile)
- when running things with students, the specs differ a lot. The desktop has 16 SMs
- the Jetson board has 2 SMs
- warp split
- NVLink => intended to be faster than PCI-E
- external GPU with Thunderbolt
- atomic_sum
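example CUDA code
- simple matrix addition
- N*N
C code:
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      C[i][j] = A[i][j] + B[i][j];
CUDA code (kernel):
  __global__ void matadd(float A[N][N], float B[N][N], float C[N][N]) {
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
  }
host side CUDA code:
  cudaMalloc(), cudaMemcpy() (host to device)
  dim3 dimGrid(1, 1, 1);
  dim3 dimBlock(N, N, 1);
  matadd<<<dimGrid, dimBlock>>>(...);  // <<< >>> launches the computation on the GPU
  cudaMemcpy() -> copies the return values to CPU memory
  cudaThreadSynchronize() => all results are received at once (deprecated in newer CUDA releases in favor of cudaDeviceSynchronize())
  cudaFree()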
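To see how the thread indices map onto matrix elements without a GPU, the launch can be emulated in plain C. This is a CPU-only sketch, not CUDA: the struct thread_index, the function names, and N = 4 are all assumptions for illustration. A real single-block launch of this shape is capped at 1024 threads per block on Fermi-class and later GPUs, so it only works for N up to 32.

```c
#define N 4   /* illustrative size */

/* Stand-in for CUDA's built-in threadIdx. */
struct thread_index { unsigned x, y; };

/* Kernel body: the work done by one thread. */
static void matadd_thread(struct thread_index threadIdx,
                          float A[N][N], float B[N][N], float C[N][N])
{
    unsigned i = threadIdx.x;
    unsigned j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

/* Emulates a launch with dimGrid(1,1,1) and dimBlock(N,N,1).
 * The threads run one after another here; a GPU runs them in parallel. */
void matadd_launch(float A[N][N], float B[N][N], float C[N][N])
{
    for (unsigned y = 0; y < N; ++y) {
        for (unsigned x = 0; x < N; ++x) {
            struct thread_index idx = { x, y };
            matadd_thread(idx, A, B, C);
        }
    }
}
```

The two loops that were explicit in the C version have moved out of the kernel body and into the launch configuration, which is the main mental shift when porting code to CUDA.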

GPU simulator
Two-level compilation
- NVCC: compiles the host code and the GPU code together
  - separates code running on the host from code running on the device
- two-stage compilation
  - virtual isa(PTX)
  - just-in-time device-specific binary
warp execution
- 32 threads at a time
- the program counter is shared
- because threads within a warp share a PC value, in a diverged control flow some threads must execute one path while the others do not
- active mask, warp execution
- if (threadIdx.x % 2 == 0) ret = funcA(); else ret = funcB(); dst[threadIdx.x] = ret;
- active mask -> determined at runtime?
  - diverged control flow
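The effect of a shared program counter on the even/odd branch can be emulated on the CPU. This is a sketch under stated assumptions: funcA and funcB are placeholder bodies invented here, and the two-pass loop is a simplification of how hardware serializes the two paths.

```c
#include <stdint.h>

#define WARP_SIZE 32

/* Placeholder bodies; the notes do not say what funcA and funcB compute. */
static int funcA(unsigned lane) { return (int)lane * 2; }
static int funcB(unsigned lane) { return -(int)lane; }

/* Bit i is set when lane i takes the `threadIdx.x % 2 == 0` path. */
uint32_t even_lane_mask(void)
{
    uint32_t mask = 0;
    for (unsigned lane = 0; lane < WARP_SIZE; ++lane)
        if (lane % 2 == 0)
            mask |= (uint32_t)1 << lane;
    return mask;
}

/* Executes both sides of the branch one after the other, enabling only
 * the lanes in the active mask each time. Returns the number of passes. */
unsigned run_diverged_warp(int dst[WARP_SIZE])
{
    uint32_t taken = even_lane_mask();
    uint32_t masks[2] = { taken, (uint32_t)~taken };
    unsigned passes = 0;

    for (unsigned side = 0; side < 2; ++side) {
        if (masks[side] == 0)
            continue;                       /* no active lane: path is skipped */
        ++passes;
        for (unsigned lane = 0; lane < WARP_SIZE; ++lane)
            if (masks[side] & ((uint32_t)1 << lane))
                dst[lane] = (side == 0) ? funcA(lane) : funcB(lane);
    }
    return passes;
}
```

The mask for the even lanes is 0x55555555, and the warp needs two passes instead of one, so each pass keeps only half of the lanes busy.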
memory hierarchy
CUDA variable type qualifiers
- there are many kinds of memory, so the developer has to choose which one to use
- register, shared, global, constant
- __device__, __shared__
memory access coalescing
- 32-, 64-, or 128-byte transactions
- alignment
- one transaction when the accesses fall within the boundary
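A small counting function shows why alignment and stride matter. It assumes each of the 32 lanes reads one 4-byte word at base + lane * stride, that base and stride are multiples of 4, and that each distinct aligned segment costs one transaction. The name segments_touched is hypothetical, and real coalescing rules differ between GPU generations.

```c
#include <stdint.h>

#define WARP_SIZE 32

/* Number of distinct aligned segments of `segment_bytes` touched by one warp.
 * Addresses never decrease from lane to lane, so counting changes of the
 * segment index counts the distinct segments. Returns 0 for a zero segment size. */
unsigned segments_touched(uint64_t base, uint64_t stride_bytes, uint64_t segment_bytes)
{
    if (segment_bytes == 0)
        return 0;

    unsigned count = 0;
    uint64_t last_segment = 0;
    for (unsigned lane = 0; lane < WARP_SIZE; ++lane) {
        uint64_t segment = (base + lane * stride_bytes) / segment_bytes;
        if (lane == 0 || segment != last_segment) {
            ++count;
            last_segment = segment;
        }
    }
    return count;
}
```

With 128-byte segments, consecutive 4-byte reads starting at an aligned address need 1 transaction, the same reads starting at byte 64 need 2, and a 128-byte stride needs 32.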
N-body example (laptop)
- i7-4870HQ: 4 cores * 2 threads per core, 2.5GHz
- GeForce GT 750M: 2 SMs and 192 cores per SM; CPU version -> OpenMP
- 926 MHz core clock
- laptop GPU: more than 2x faster than the CPU

Industry and Research Trend

NVIDIA server GPU
- DP GFLOPS per watt
- 2008 Tesla: CUDA
- 2010 Fermi: FP64
- 2012 Kepler: dynamic parallelism
- 2014 Maxwell: unified virtual memory
- 2016 Volta: stacked DRAM
QA: CUDA, heaillight -> a higher-level language on top of CUDA
AMD GPU
- HSA Evolution
- Llano
- Trinity
- Kaveri
HSA (Heterogeneous System Architecture) Foundation - ARM, Imagination, TI, Qualcomm, Samsung, MediaTek
- OpenCL
- APU (CPU and GPU in one chip)
- HSA Solution Stack
Industry trend - Libraries
- cuDNN: deep neural network
- cuFFT: fast fourier transform
- cuSPARSE: sparse matrix
- cuBLAS: basic linear algebra subprograms
- cuRAND: random number generation
Tools
- simulators
  - GPGPU-Sim (UBC) **
  - gem5-gpu (Wisconsin)
  - gpuocelot (GaTech)
- Benchmark suites
  - nvidia cuda sdk **
  - rodinia (UVA) **
  - mars
  - parboil (Illinois)
Underutilization
- dynamic warp formation [Fung et al. MICRO’07]
- warp-level dual modular redundancy [Jeon and Annavaram MICRO’12]
Scheduler
- multi-level warp scheduler [Gebhart et al. ISCA’11]
- cache conscious wavefront scheduler [Rogers et al. MICRO’12]
Energy Efficiency
- Large register file leads to high energy dissipation
- register file cache [Gebhart et al. ISCA’11]
- register compression [Lee et al. ISCA’15]
- register file under-provisioning [Jeon et al. MICRO’15]
Security? Reliability? Real-time?
- GPUs are being adopted in embedded systems
- not well studied so far in the embedded domain
- in our group, we are exploring these issues for embedded GPUs - stay tuned!
- automotive partners
- Jetson board
