cuda - Writing from Device to Host and notifying the host -


using cuda 5 vs 2012 , capability 3.5 (titan , k20).

at particular stages of kernel execution, want send generated data chunk host memory , notify host data ready, host operate on it.

i cannot wait until end of kernel execution read data device, because:

  1. the data no longer relevant device once calculated, there no point keeping end.
  2. the data size large fit on device memory , wait until end.
  3. the host should not have wait until end of kernel execution start processing data.

could point me path have take , possible cuda concepts , functions have use achieve requirements? put simply, how can write host , notify host chunk data ready host processing?

n.b. each thread not share generated data other thread, run independently. so, far know (and please correct me if wrong), concept of blocks, threads , warps not affect question. or in other words, if aid answer, free alter combination.

below sample code shows trying do:

#pragma once #include <conio.h> #include <cstdio> #include <cuda_runtime_api.h>  __global__ void kernel(size_t length, float* hresult)  {     int tid = threadidx.x + blockidx.x * blockdim.x;     // processing multiple data chunks     for(int = 0;i < length;i++)     {         // once assigned, don't need on device anymore.         hresult[i + (tid * length)] = * 100;     }  }  void main() {     size_t length = 10;     size_t threads = 2;     float* hresult;     // array hold data threads     cudamallochost((void**)&hresult, threads * length * sizeof(float));     kernel<<<threads,1>>>(length, hresult);     // not want wait end , block data     cudaerror_t error = cudadevicesynchronize();     if (error != cudasuccess) { throw error; }     for(int = 0;i < threads * length;i++)     {         printf("%f\n", hresult[i]);;     }     cudafreehost(hresult);     system("pause"); } 

at high level, on device:

  • you'll need write data either device global memory (allocated cudamalloc) or else directly host memory (allocated cudahostalloc)
  • you may wish data writing region single threadblock, sure data written prior following steps
  • you'll want issue threadfence() (if you're using device global memory) or threadfence_system() call (if using host memory) prior following steps
  • next you'll write special location in device global memory or host memory, let's call mailbox location, specific value indicating data ready.
  • optionally issue threadfence or threadfence_system call

on host:

  • before launching kernel, host need set mailbox location default value.
  • after launching kernel, host thread need "poll" mailbox location, looking specific value indicating data ready
  • once specific value seen, indicating data ready, host can consume data
  • optionally, if want repeat process, host can reset mailbox location default value. device can check default value before updating data block new data.

note above process, there still implied device-wide synchronization needed, if data being generated/created multiple threadblocks. straightforward device-wide synchronization available kernel launch (or completion of kernel, specifically). copying data single threadblock moves requirement device-wide sync out of particular sequence (to somewhere before sequence).

the reasons give don't suggest me code not refactored create data on kernel-launch kernel-launch basis, neatly solve these issues , eliminate need above process well.

edit: responding question in comments. it's difficult more specific how refactor code deliver 1 data chunk per kernel call, without specific example.

let's take image processing case, have video sequence of 30 frames stored in global memory. kernel process each frame according algorithm, make processed data available host.

in proposal, after kernel done processing frame, can signal host data ready, , go on process next frame. problem is, if frame processed multiple threadblocks, there's no easy way know when threadblocks done processing frame. device-wide synchronization barrier might needed, doesn't exist conveniently, except via kernel call mechanism. however, presumably inside such kernel might have sequence this:

  • while (more_frames)
    • process frame
    • signal host
    • increment frame pointer

in refactored approach, move loop outside kernel, host code:

  • while (more_frames)
    • call kernel process frame
    • consume frame
    • increment frame pointer

by doing this, kernel marks explicit synchronization needed know when frame processing complete, , data can consumed.


Comments