Introduction to OpenCL in C# (Cloo Edition): Thread and Group Counts

Thread and Group Counts

Please also read the related post alongside this one.

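In GPGPU programming you normally use many threads.
You hand the work to a large number of threads and let them tear through it all at once.

The number of threads is set on the CPU side.
Sometimes, though, you also want to know on the GPU side how many threads there are.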

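For that, use the get_global_size() function.
It returns how many threads (work-items, in OpenCL terminology) the current command is running along the given dimension.
Reference:
size_t get_global_size(uint dimIndex)
dimIndex is the index of the dimension: pass 0 for the number of threads along the X axis, 1 for the Y axis, and 2 for the Z axis.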

Sample code

Program.cs

using Cloo;
using System.Linq;

class Program
{
    static void Main()
    {
        ComputePlatform platform = ComputePlatform.Platforms[0];
        ComputeDevice[] devices = platform
            .Devices
            .Where(d => d.Type == ComputeDeviceTypes.Gpu)
            .ToArray();
        ComputeContext context = new ComputeContext(
            devices,
            new ComputeContextPropertyList(platform),
            null, 
            System.IntPtr.Zero
            );
        ComputeProgram program = new ComputeProgram(
            context,
            System.IO.File.ReadAllText("myKernelProgram.cl")
            );
        program.Build(devices, null, null, System.IntPtr.Zero);
        const int elementCount = 6;
        ComputeBuffer<float> buffer = new ComputeBuffer<float>(
            context, 
            ComputeMemoryFlags.ReadWrite,
            elementCount
            );

        ComputeKernel kernel = program.CreateKernel("myKernelFunction");
        kernel.SetMemoryArgument(0, buffer);
        ComputeCommandQueue commandQueue = new ComputeCommandQueue(
            context,
            devices[0],
            ComputeCommandQueueFlags.None
            );

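        commandQueue.Execute(
            kernel,
            null,                 // global work offset (none)
            new long[] { 2, 3 },  // global work size: 2 x 3 threads in total
            new long[] { 1, 1 },  // local work size: 1 x 1 threads per group
            null                  // no events to wait on
            );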

        var dataFromGpu = new float[elementCount];
        commandQueue.ReadFromBuffer(
            buffer,
            ref dataFromGpu,
            true,
            null
            );

        foreach (var item in dataFromGpu)
        {
            System.Console.WriteLine(item);
        }

        commandQueue.Dispose();
        kernel.Dispose();
        buffer.Dispose();
        program.Dispose();
        context.Dispose();
    }
}

myKernelProgram.cl

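__kernel void myKernelFunction(__global float* items)
{
	// Each thread stores into the slot given by its X index.
	items[get_global_id(0)] = 
		get_global_size(0);   // thread count along X
		//get_global_size(1); // thread count along Y
}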

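This program creates 2×3 threads and writes the thread count of dimension 0 to the start of the buffer. Because the kernel queries dimension 0, it writes 2 (the 2 of 2×3). Note that the kernel indexes the buffer with get_global_id(0), which only takes the values 0 and 1 here, so only items[0] and items[1] are assigned; the remaining four elements are never written, and whatever is printed for them is not meaningful.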

The result of running it looks like this.

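If you move the comment marker in myKernelProgram.cl to the other line, the kernel writes the count for dimension 1 instead. In that case 3 is written to the start of the buffer (the 3 of 2×3).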
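
To see which buffer elements the kernel actually touches, here is a minimal host-side sketch in plain C that loops over the same 2×3 range and performs the same store. It is only an illustration that runs without a GPU: simulate_global_size and its parameters are hypothetical names, not part of OpenCL or Cloo, and a sentinel value of -1 in the caller's buffer is assumed to mark "never written".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for myKernelFunction: visits every thread of a
   2D global range and stores global_size[dim] at index get_global_id(0). */
void simulate_global_size(const size_t global_size[2], int dim,
                          float *items, size_t item_count)
{
    assert(dim == 0 || dim == 1);
    for (size_t y = 0; y < global_size[1]; ++y) {
        for (size_t x = 0; x < global_size[0]; ++x) {
            /* x plays the role of get_global_id(0). */
            assert(x < item_count);
            items[x] = (float)global_size[dim];
        }
    }
}
```

Called with a global size of {2, 3}, dim 0, and a six-element buffer pre-filled with -1, this leaves 2 in the first two elements and -1 in the other four, which matches the behaviour described above: threads that share an X index overwrite the same slot with the same value.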

Group Size

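So far we have obtained the overall number of threads on the GPU side.
Sometimes, however, you want the number of threads within a single group.
For that, use the get_local_size function.
Reference:
size_t get_local_size(uint dimIndex)
dimIndex is the dimension whose size you want.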

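Suppose, for example, that the threads are executed as {{thread 1, thread 2, thread 3}, {thread 4, thread 5, thread 6}}.
That is six threads in total,
with three threads per group.
In that case get_local_size(0) returns 3.
In other words, it returns the number of threads in one group.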
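
The grouping above can be sketched as plain integer arithmetic. The C snippet below is an illustration only: thread_position and locate_thread are hypothetical names, thread indices are assumed to be zero-based (so "thread 4" above is index 3), and no global offset is assumed. The two fields correspond to what the OpenCL built-ins get_group_id(0) and get_local_id(0) would return for that thread.

```c
#include <assert.h>
#include <stddef.h>

/* Position of one thread inside a 1D launch (illustrative type). */
typedef struct {
    size_t group_id; /* counterpart of get_group_id(0) */
    size_t local_id; /* counterpart of get_local_id(0) */
} thread_position;

/* Splits a zero-based global index into a group index and an index
   within that group, given the group size (get_local_size(0)). */
thread_position locate_thread(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    thread_position pos;
    pos.group_id = global_id / local_size;
    pos.local_id = global_id % local_size;
    return pos;
}
```

With a group size of 3, locate_thread(3, 3) places the fourth thread at the start of the second group (group 1, local index 0), and locate_thread(2, 3) places the third thread at the end of the first group.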
Modify the sample program as follows:

Program.cs

using Cloo;
using System.Linq;

class Program
{
    static void Main()
    {
        ComputePlatform platform = ComputePlatform.Platforms[0];
        ComputeDevice[] devices = platform
            .Devices
            .Where(d => d.Type == ComputeDeviceTypes.Gpu)
            .ToArray();
        ComputeContext context = new ComputeContext(
            devices,
            new ComputeContextPropertyList(platform),
            null, 
            System.IntPtr.Zero
            );
        ComputeProgram program = new ComputeProgram(
            context,
            System.IO.File.ReadAllText("myKernelProgram.cl")
            );
        program.Build(devices, null, null, System.IntPtr.Zero);
        const int elementCount = 24;
        ComputeBuffer<float> buffer = new ComputeBuffer<float>(
            context, 
            ComputeMemoryFlags.ReadWrite,
            elementCount
            );

        ComputeKernel kernel = program.CreateKernel("myKernelFunction");
        kernel.SetMemoryArgument(0, buffer);
        ComputeCommandQueue commandQueue = new ComputeCommandQueue(
            context,
            devices[0],
            ComputeCommandQueueFlags.None
            );

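        commandQueue.Execute(
            kernel,
            null,                 // global work offset (none)
            new long[] { 4, 6 },  // global work size: 4 x 6 threads in total
            new long[] { 2, 3 },  // local work size: 2 x 3 threads per group
            null                  // no events to wait on
            );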

        var dataFromGpu = new float[elementCount];
        commandQueue.ReadFromBuffer(
            buffer,
            ref dataFromGpu,
            true,
            null
            );

        foreach (var item in dataFromGpu)
        {
            System.Console.WriteLine(item);
        }

        commandQueue.Dispose();
        kernel.Dispose();
        buffer.Dispose();
        program.Dispose();
        context.Dispose();
    }
}

myKernelProgram.cl

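__kernel void myKernelFunction(__global float* items)
{
	// Each thread stores into the slot given by its X index.
	items[get_global_id(0)] = 
		get_local_size(0);   // threads per group along X
		//get_local_size(1); // threads per group along Y
}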

Running this gives the following:

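This program first creates 4×6 threads and divides them into groups of 2×3 threads each. It then writes the local size (the number of threads in one group) to the start of the buffer. Each group has 2 threads in the X direction (the 2 of 2×3), so 2 is written. Only the elements indexed by get_global_id(0), here items[0] through items[3], are assigned.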
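
As a rough check of that description, the following C sketch walks the launch group by group and rebuilds each thread's global X index from its group index and local index. It is a host-side illustration rather than real OpenCL code: simulate_local_size is a hypothetical name, the global size is assumed to be an exact multiple of the local size (which OpenCL 1.x requires), and no global offset is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel above: visits every thread of a
   2D launch, group by group, and stores local_size[dim] at the thread's
   global X index. Returns the number of threads visited. */
size_t simulate_local_size(const size_t global_size[2],
                           const size_t local_size[2],
                           int dim, float *items, size_t item_count)
{
    assert(dim == 0 || dim == 1);
    assert(local_size[0] > 0 && local_size[1] > 0);
    assert(global_size[0] % local_size[0] == 0);
    assert(global_size[1] % local_size[1] == 0);

    size_t groups_x = global_size[0] / local_size[0];
    size_t groups_y = global_size[1] / local_size[1];
    size_t visited = 0;

    for (size_t gy = 0; gy < groups_y; ++gy) {
        for (size_t gx = 0; gx < groups_x; ++gx) {
            for (size_t ly = 0; ly < local_size[1]; ++ly) {
                for (size_t lx = 0; lx < local_size[0]; ++lx) {
                    /* global id = group id * local size + local id */
                    size_t global_x = gx * local_size[0] + lx;
                    assert(global_x < item_count);
                    items[global_x] = (float)local_size[dim];
                    ++visited;
                }
            }
        }
    }
    return visited;
}
```

For a global size of {4, 6} and a local size of {2, 3}, the function visits 24 threads, yet only the first four buffer elements receive a value, because the global X index never exceeds 3.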

Number of Groups

Suppose the threads are executed like this:
{{thread 0, thread 1}, {thread 2, thread 3}, {thread 4, thread 5}}
In this case there are 3 groups.

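To get the number of groups, use the get_num_groups() function.
Reference:
size_t get_num_groups(uint dimIndex)
dimIndex is the index of the dimension: pass 0 for the number of groups along the X axis, 1 for the Y axis, and 2 for the Z axis.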
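
Along a single dimension, the group count is simply the global size divided by the group size. The C helper below is a minimal sketch of that relationship; count_groups is a hypothetical name, and it assumes the global size is an exact multiple of the group size, as OpenCL 1.x requires.

```c
#include <assert.h>
#include <stddef.h>

/* Number of groups along one dimension, i.e. the value that
   get_num_groups(dim) would report for these sizes. */
size_t count_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0);
    assert(global_size % local_size == 0);
    return global_size / local_size;
}
```

For the six threads in groups of two shown above, count_groups(6, 2) yields 3.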

Program.cs

using Cloo;
using System.Linq;

class Program
{
    static void Main()
    {
        ComputePlatform platform = ComputePlatform.Platforms[0];
        ComputeDevice[] devices = platform
            .Devices
            .Where(d => d.Type == ComputeDeviceTypes.Gpu)
            .ToArray();
        ComputeContext context = new ComputeContext(
            devices,
            new ComputeContextPropertyList(platform),
            null, 
            System.IntPtr.Zero
            );
        ComputeProgram program = new ComputeProgram(
            context,
            System.IO.File.ReadAllText("myKernelProgram.cl")
            );
        program.Build(devices, null, null, System.IntPtr.Zero);
        const int elementCount = 24;
        ComputeBuffer<float> buffer = new ComputeBuffer<float>(
            context, 
            ComputeMemoryFlags.ReadWrite,
            elementCount
            );

        ComputeKernel kernel = program.CreateKernel("myKernelFunction");
        kernel.SetMemoryArgument(0, buffer);
        ComputeCommandQueue commandQueue = new ComputeCommandQueue(
            context,
            devices[0],
            ComputeCommandQueueFlags.None
            );

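        commandQueue.Execute(
            kernel,
            null,                  // global work offset (none)
            new long[] { 2, 12 },  // global work size: 2 x 12 threads in total
            new long[] { 2, 3 },   // local work size: 2 x 3 threads per group
            null                   // no events to wait on
            );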

        var dataFromGpu = new float[elementCount];
        commandQueue.ReadFromBuffer(
            buffer,
            ref dataFromGpu,
            true,
            null
            );

        foreach (var item in dataFromGpu)
        {
            System.Console.WriteLine(item);
        }

        commandQueue.Dispose();
        kernel.Dispose();
        buffer.Dispose();
        program.Dispose();
        context.Dispose();
    }
}

myKernelProgram.cl

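__kernel void myKernelFunction(__global float* items)
{
	// Each thread stores into the slot given by its X index.
	items[get_global_id(0)] = 
		get_num_groups(0);   // number of groups along X
		//get_num_groups(1); // number of groups along Y
}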

This program creates 2×12 threads and divides them into groups of 2×3 threads each. That gives 1×4 groups (2 ÷ 2 = 1 along X, 12 ÷ 3 = 4 along Y). As a result, the program prints the following:

There are 1×4 groups, so it printed the leading 1. If you move the comment marker in myKernelProgram.cl to the other line, it will print 4 instead.
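
The three queries covered in this post are tied together by one relation per dimension: global size = local size × number of groups. The C sketch below bundles them for a 2D launch so the three sample configurations can be compared side by side. The launch_2d type and the launch_* functions are hypothetical names for illustration, and exact divisibility is assumed, as OpenCL 1.x requires.

```c
#include <assert.h>
#include <stddef.h>

/* The two size arrays passed to commandQueue.Execute (illustrative type). */
typedef struct {
    size_t global_size[2];
    size_t local_size[2];
} launch_2d;

/* Counterpart of get_global_size(dim). */
size_t launch_global_size(const launch_2d *launch, int dim)
{
    assert(dim == 0 || dim == 1);
    return launch->global_size[dim];
}

/* Counterpart of get_local_size(dim). */
size_t launch_local_size(const launch_2d *launch, int dim)
{
    assert(dim == 0 || dim == 1);
    return launch->local_size[dim];
}

/* Counterpart of get_num_groups(dim). */
size_t launch_num_groups(const launch_2d *launch, int dim)
{
    assert(dim == 0 || dim == 1);
    assert(launch->local_size[dim] > 0);
    assert(launch->global_size[dim] % launch->local_size[dim] == 0);
    return launch->global_size[dim] / launch->local_size[dim];
}
```

For the last sample, a launch_2d of {{2, 12}, {2, 3}} reports 1 group along X and 4 along Y; for the second sample, {{4, 6}, {2, 3}} reports 2 groups along each axis.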
