How to split 10GB JSON files in seconds without hitting RAM limits

Hi everyone! We had a classic pain point on our project: constantly chewing through massive JSON arrays. Catalogs, analytics dumps, ML datasets: files ranging from a couple of hundred megabytes to tens of gigabytes.

The task was stupidly simple: split a giant JSON array into individual elements so we could chunk them or throw them into parallel processing. No data transformation, no querying by keys. We literally just needed to find where each element starts and ends.

Naturally, we started with the classic approach: json.Unmarshal -> slice -> json.Marshal. On a 10GB file, memory consumption went to the moon 🚀. We ended up spending more time fighting the Go garbage collector (GC) than doing actual work.
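For context, the classic approach looks roughly like the sketch below. The function name splitNaive and the map[string]any element type are assumptions for illustration; the point is that every element is fully decoded into Go values and then encoded again, even though nothing in it changes.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// splitNaive decodes the whole array into memory and re-encodes each element.
// The element type (map[string]any) is an assumption for illustration.
func splitNaive(data []byte) ([][]byte, error) {
	var items []map[string]any
	if err := json.Unmarshal(data, &items); err != nil {
		return nil, err
	}
	out := make([][]byte, 0, len(items))
	for _, it := range items {
		b, err := json.Marshal(it) // allocates a fresh copy of every element
		if err != nil {
			return nil, err
		}
		out = append(out, b)
	}
	return out, nil
}

func main() {
	parts, err := splitNaive([]byte(`[{"id":1},{"id":2,"tags":["a","b"]}]`))
	if err != nil {
		panic(err)
	}
	for _, p := range parts {
		fmt.Println(string(p))
	}
}
```

At any moment this holds the input, the decoded maps, and the re-encoded output, which is where the memory blow-up comes from.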

And then it clicked: to just move the data around, we don’t need to understand what’s inside it. We just need to find the boundaries.

Stop parsing, start scanning 🛑

A typical parse (even with ultra-fast libraries like sonic or simdjson) still tokenizes every value and builds structs or a tree in memory. Instead, you can just treat the JSON as a raw byte stream. Look for structural markers, find the edges, and cut.

The entire logic boils down to a tiny state machine:

  • Nesting counter: { and [ go +1, } and ] go -1.
  • String tracking: keep track of when you are inside "…" so you don’t accidentally react to brackets or commas inside a text field.
  • Escapes: a \" inside a string is a trap, not the end of the string. The opposite trap is \\ right before a quote: there the backslash itself is escaped, so the quote really does close the string.
  • The boundary: whenever you are outside a string and your nesting depth is exactly 0 (counting from inside the outer array brackets), any comma , is where you split.

That’s it. We don’t care about keys or values. We don’t copy a single byte of the payload, we just return memory views (slices) of the original buffer.

Here’s what the concept looks like in Go (oversimplified, ignoring string logic):

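// Chunk is a half-open byte range [Start, End) into the scanned buffer.
type Chunk struct {
    Start, End int
}

// findElements returns the byte range of each top-level element.
// data is the array body, i.e. the bytes between the outer '[' and ']'.
// Simplified: it ignores strings, escapes, and surrounding whitespace.
func findElements(data []byte) []Chunk {
    var chunks []Chunk
    depth := 0 // nesting relative to the array body
    start := 0 // offset where the current element begins
    for i, b := range data {
        switch b {
        case '{', '[':
            depth++
        case '}', ']':
            depth--
        case ',':
            // A comma at depth 0 separates two top-level elements.
            if depth == 0 {
                chunks = append(chunks, Chunk{Start: start, End: i})
                start = i + 1
            }
        }
    }
    // The last element has no trailing comma.
    if start < len(data) {
        chunks = append(chunks, Chunk{Start: start, End: len(data)})
    }
    return chunks
}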

Obviously, this naive code will break on the first string that contains a bracket or a comma, and it leaves surrounding whitespace inside each chunk, but you get the point. We aren’t parsing. We are scanning.
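For reference, here is a minimal sketch of the same loop with the string and escape tracking from the state machine put back in. It is not the code from the repo: splitArray and its variable names are made up for illustration, it takes the whole array including the outer brackets (so element separators sit at depth 1 rather than 0), and it still assumes the input is valid JSON.

```go
package main

import (
	"bytes"
	"fmt"
)

// splitArray returns one sub-slice of data per top-level element of a JSON
// array. It assumes data is a valid JSON array; nothing is validated or copied.
func splitArray(data []byte) [][]byte {
	var out [][]byte
	depth := 0        // bracket nesting; 1 means "directly inside the outer array"
	inString := false // true between an opening and a closing quote
	escaped := false  // true when the previous byte in a string was an unescaped backslash
	start := -1       // offset of the current element, -1 before the outer '['

	// emit records data[start:end] without surrounding whitespace.
	emit := func(end int) {
		if el := bytes.TrimSpace(data[start:end]); len(el) > 0 {
			out = append(out, el)
		}
	}

	for i, b := range data {
		if inString {
			switch {
			case escaped:
				escaped = false // this byte is escaped, whatever it is
			case b == '\\':
				escaped = true
			case b == '"':
				inString = false
			}
			continue // brackets and commas inside strings are ignored
		}
		switch b {
		case '"':
			inString = true
		case '{', '[':
			depth++
			if depth == 1 {
				start = i + 1 // first element begins after the outer '['
			}
		case '}', ']':
			depth--
			if depth == 0 {
				emit(i) // last element ends at the outer ']'
				return out
			}
		case ',':
			if depth == 1 {
				emit(i)
				start = i + 1
			}
		}
	}
	return out
}

func main() {
	for _, el := range splitArray([]byte(`[ {"a":"x,]y"}, [1,2], "s\"t,", 3 ]`)) {
		fmt.Println(string(el))
	}
}
```

Each returned element is a sub-slice of data, so the only allocation is the growing list of slice headers.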

Why is this so damn fast? ⚡

  • Zero allocations in the hot loop. You’re just handing back data[start:end]. No new objects, no copying strings, no building hash maps.
  • Hardware absolutely loves it. Your entire working state is basically a couple of integers and a flag. It easily fits in L1 cache, and memory reads are strictly sequential.
  • The branch predictor is happy. A simple state machine with highly predictable transitions is far easier for the CPU to digest than a full parser juggling dozens of token types.

Look at how much work we are skipping:

Step               Standard Parser   Boundary Scanner
Read bytes         yes               yes
Classify tokens    yes               Only {}[]"\ and ,
Build hash maps    yes               no
Allocate strings   yes               no
Allocate slices    yes               no
Type conversion    yes               no

What you get back: []MyStruct vs [][]byte (sub-slices pointing into the original buffer). We are throwing away roughly 80% of the overhead.

But how fast is it actually? 🏎️

I got a bit carried away and polished this into a production-ready tool. I added proper string handling, escape tracking, and rewrote the hot loop in AVX2 assembly (chewing through 32 bytes per loop iteration using SIMD bitmasks). Tbh, the results surprised even me:

Approach              What it does                    Throughput      Memory Overhead
encoding/json         Full parse → Go structs         ~107 MB/s       3-4x input size
sonic / simdjson-go   Optimized parse → structs/AST   ~400–700 MB/s   ~1.1x
My AVX2 scanner       Just finds boundaries           ~4.1 GB/s       ~1.0x (zero extra)
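To put those throughput numbers next to the "in seconds" claim in the title, here is a quick back-of-the-envelope calculation. It assumes decimal units (1 GB = 1000 MB) and a file that is already in memory, so disk read time is not included; secondsFor is a made-up helper name.

```go
package main

import "fmt"

// secondsFor returns how long it takes to process sizeGB gigabytes at the
// given throughput in MB/s, using decimal units (1 GB = 1000 MB).
func secondsFor(sizeGB, mbPerSec float64) float64 {
	return sizeGB * 1000 / mbPerSec
}

func main() {
	fmt.Printf("encoding/json (~107 MB/s): %.0f s\n", secondsFor(10, 107))
	fmt.Printf("AVX2 scanner (~4.1 GB/s):  %.1f s\n", secondsFor(10, 4100))
}
```

For a 10GB file that works out to about a minute and a half for a full parse versus roughly two and a half seconds for the boundary scan.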

At 4.1 GB/s, the scanning logic isn’t really the bottleneck anymore. Throughput is limited mostly by how fast bytes can be streamed from RAM: the CPU spends much of its time just waiting for the next cache line to arrive.
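If you want a feel for the bitmask idea without reading assembly, here is a portable analogue in plain Go. To be clear, this is not the AVX2 code from the repo: it uses the classic "SIMD within a register" trick on 8-byte words instead of 32-byte vectors, and eqMask, structuralMask and firstStructural are names invented for this sketch. The principle is the same: build a mask of the interesting bytes in a whole block at once, and skip the block if the mask is empty.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

const (
	ones = 0x0101010101010101 // 0x01 in every byte
	lo7  = 0x7F7F7F7F7F7F7F7F // low 7 bits of every byte
)

// The only bytes the boundary scanner has to look at.
var structural = [...]byte{'{', '}', '[', ']', '"', ',', '\\'}

// eqMask returns a word whose byte i is 0x80 if byte i of w equals c, else 0x00.
func eqMask(w uint64, c byte) uint64 {
	x := w ^ (ones * uint64(c)) // bytes equal to c become 0x00
	t := (x & lo7) + lo7        // high bit set iff the low 7 bits are non-zero
	return ^(t | x | lo7)       // 0x80 exactly where the byte of x was zero
}

// structuralMask marks every structural byte in an 8-byte word.
func structuralMask(w uint64) uint64 {
	var m uint64
	for _, c := range structural {
		m |= eqMask(w, c)
	}
	return m
}

// firstStructural returns the index of the first structural byte at or after
// from, or len(data) if there is none.
func firstStructural(data []byte, from int) int {
	i := from
	for ; i+8 <= len(data); i += 8 {
		w := binary.LittleEndian.Uint64(data[i:])
		if m := structuralMask(w); m != 0 {
			// Byte 0 is the lowest byte of w, so count zero bits from the bottom.
			return i + bits.TrailingZeros64(m)/8
		}
	}
	for ; i < len(data); i++ { // fewer than 8 bytes left: check one by one
		switch data[i] {
		case '{', '}', '[', ']', '"', ',', '\\':
			return i
		}
	}
	return len(data)
}

func main() {
	data := []byte(`    "a long run of plain text", 42`)
	fmt.Println(firstStructural(data, 0))
}
```

Long string values and numbers contain few structural bytes, so most blocks are skipped after a handful of arithmetic instructions with no per-byte branching.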

The catch (Tradeoffs) ⚠️

  • Platform-specific: The AVX2 path only works on amd64 CPUs that support AVX2. For ARM (hello MacBooks), you need a pure Go fallback.
  • Memory lifecycle danger: You are getting slices that point directly into the original buffer. If that []byte gets overwritten or reused (or, for a memory-mapped file, unmapped) while you’re still working with the chunks… it’s going to hurt. The flip side: Go’s GC won’t free the buffer while any chunk still references it, so holding on to one tiny chunk keeps the whole 10GB alive.
  • No validation: The scanner takes your word that the JSON is valid. Feed it garbage, and it will silently slice up garbage.

TL;DR

The biggest insight was stupidly simple: stop thinking “I need to parse this JSON” and start thinking “I need to find boundaries in a byte stream”. Once I changed my perspective, the code wrote itself and the performance gap was massive.

Has anyone else suffered through this? How do you guys route or chunk massive JSON payloads in production when you simply can’t fit them into RAM? 👇

If anyone wants to poke around the assembly or run the benchmarks, the repo is here:

🔗 GitHub: GenshIv/silentjson