CUDA-oxide: Nvidia's official Rust to CUDA compiler


The cuda-oxide Book

cuda-oxide is an experimental Rust-to-CUDA compiler that lets you write SIMT (single instruction, multiple threads) GPU kernels in safe(ish), idiomatic Rust. It compiles standard Rust code directly to PTX: no DSLs, no foreign language bindings, just Rust.

Note: This book assumes familiarity with the Rust programming language, including ownership, traits, and generics. Later chapters on async GPU programming also assume working knowledge of async/.await and runtimes like tokio. For a refresher, see The Rust Programming Language, Rust by Example, or the Async Book.

Project Status

The v0.1.0 release is an early-stage alpha: expect bugs, incomplete features, and API breakage as we work to improve it. We hope you'll try it and help shape its direction by sharing feedback on your experience. 🚀

Quick start

use cuda_device::{cuda_module, kernel, thread, DisjointSlice};
use cuda_core::{CudaContext, DeviceBuffer, LaunchConfig};

#[cuda_module]
mod kernels {
    use super::*;
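    #[kernel]
    fn vecadd(a: &[f32], b: &[f32], mut c: DisjointSlice<f32>) {
        // Each GPU thread handles the element at its own 1D index.
        let idx = thread::index_1d();
        let i = idx.get();
        // Write only when this thread's index maps to an element of `c`.
        if let Some(c_elem) = c.get_mut(idx) {
            *c_elem = a[i] + b[i];
        }
    }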
}

fn main() {
    let ctx = CudaContext::new(0).unwrap();
    let stream = ctx.default_stream();
    let module = kernels::load(&ctx).unwrap();

    // Two input buffers (all 1.0 and all 2.0) and a zeroed output buffer.
    let a = DeviceBuffer::from_host(&stream, &[1.0f32; 1024]).unwrap();
    let b = DeviceBuffer::from_host(&stream, &[2.0f32; 1024]).unwrap();
    let mut c = DeviceBuffer::<f32>::zeroed(&stream, 1024).unwrap();

    // Launch `vecadd` over 1024 elements via the generated launch method.
    module
        .vecadd(&stream, LaunchConfig::for_num_elems(1024), &a, &b, &mut c)
        .unwrap();

    // Read the output back on the host and check the first element.
    let result = c.to_host_vec(&stream).unwrap();
    assert_eq!(result[0], 3.0);
}

After installing the prerequisites, build and run the example with cargo oxide run vecadd.

Note: #[cuda_module] embeds the generated device artifact into the host binary and generates a typed kernels::load function plus one launch method per kernel. The lower-level load_kernel_module and cuda_launch! APIs remain available when you need to load a specific sidecar artifact or build custom launch code.
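To make the kernel's indexing pattern concrete, here is a CPU-only sketch of what each GPU thread in vecadd does. It does not use cuda-oxide at all: `vecadd_reference` and its `launched_threads` parameter are hypothetical names introduced for illustration, and a plain loop stands in for the parallel threads. The assumption is that a launch may start more threads than there are output elements, in which case the bounds-checked `get_mut` makes the surplus threads do nothing.

```rust
/// CPU-only reference for the `vecadd` kernel (illustrative; not part of cuda-oxide).
/// `launched_threads` may exceed the output length, as can happen when a launch
/// is rounded up to a whole number of thread blocks.
fn vecadd_reference(a: &[f32], b: &[f32], c: &mut [f32], launched_threads: usize) {
    // Each loop iteration plays the role of one GPU thread with index `i`.
    for i in 0..launched_threads {
        // Mirrors `if let Some(c_elem) = c.get_mut(idx)`:
        // threads whose index falls outside `c` skip the write.
        if let Some(c_elem) = c.get_mut(i) {
            *c_elem = a[i] + b[i];
        }
    }
}

fn main() {
    let a = [1.0f32; 1000];
    let b = [2.0f32; 1000];
    let mut c = [0.0f32; 1000];

    // 1024 simulated threads for 1000 elements: the last 24 are skipped.
    vecadd_reference(&a, &b, &mut c, 1024);

    assert!(c.iter().all(|&x| x == 3.0));
}
```

A reference like this can also serve as a host-side check when comparing GPU output element by element.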

Why cuda-oxide?

🦀 Rust on the GPU: Write GPU kernels with Rust's type system and ownership model. Safety is a first-class goal, but GPUs have subtleties, so read about the safety model.

💎 A SIMT Compiler: Not a DSL. A custom rustc codegen backend that compiles pure Rust to PTX.

⚡ Async Execution: Compose GPU work as lazy DeviceOperation graphs. Schedule across stream pools. Await results with .await.
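As a rough intuition for what "lazy" means in the last point, the sketch below composes a small operation graph on the CPU and shows that nothing runs until it is explicitly executed. This is an analogy only, not the cuda-oxide API: `LazyOp`, `Source`, `Map`, and `execute` are hypothetical names, and no GPU, stream, or async runtime is involved.

```rust
use std::cell::Cell;

// Hypothetical illustration only: none of these names come from cuda-oxide.
// A `LazyOp` describes work; it does nothing until `execute` is called.
trait LazyOp {
    type Output;

    /// Run the described work and return its result.
    fn execute(self) -> Self::Output;

    /// Extend the graph with another step without running anything.
    fn map<F, U>(self, f: F) -> Map<Self, F>
    where
        Self: Sized,
        F: FnOnce(Self::Output) -> U,
    {
        Map { inner: self, f }
    }
}

/// A leaf node that produces a value when executed.
struct Source<F>(F);

impl<T, F: FnOnce() -> T> LazyOp for Source<F> {
    type Output = T;
    fn execute(self) -> T {
        (self.0)()
    }
}

/// A node that applies `f` to the result of `inner`.
struct Map<O, F> {
    inner: O,
    f: F,
}

impl<O, F, U> LazyOp for Map<O, F>
where
    O: LazyOp,
    F: FnOnce(O::Output) -> U,
{
    type Output = U;
    fn execute(self) -> U {
        (self.f)(self.inner.execute())
    }
}

fn main() {
    // Counts how many times the source step has actually run.
    let runs = Cell::new(0);

    let graph = Source(|| {
        runs.set(runs.get() + 1);
        vec![1.0f32, 2.0, 3.0]
    })
    .map(|v| v.into_iter().map(|x| x * 2.0).collect::<Vec<f32>>())
    .map(|v| v.iter().sum::<f32>());

    // Building the graph ran nothing.
    assert_eq!(runs.get(), 0);

    // Work happens only on execution.
    let total = graph.execute();
    assert_eq!(runs.get(), 1);
    assert_eq!(total, 12.0);
}
```

Separating the description of work from its execution is what lets a scheduler decide later where and when each step runs.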