The emergence of the web data infrastructure layer for AI
AI is booming. New use cases are emerging each day. To capitalize on the technology’s potential, enterprises require data at scale. In many cases, though, the relevant information is blocked or unstructured, which limits its use by AI models.
To understand this challenge, consider the foundation of the web itself. The web was not designed for the automated discovery and retrieval that new AI applications demand. Overcoming this inherent design constraint requires infrastructure. The next frontier in AI may depend on a new web data infrastructure layer that can enable models to discover and map this ever-expanding digital realm.
This layer must be able to navigate hundreds of millions of existing web domains and billions of new URLs created each week, delivering real-time information and overcoming technical barriers. “The data suggests there’s far more data out there,” says Or Lenchner, CEO of Bright Data, a web data collection platform. “Think of the universe: It’s out there, but you don’t know what you don’t know.”
Enabling access to fresh, relevant, and trustworthy data
While early AI breakthroughs were driven by scaling training data and model size, organizations are now encountering a fundamental bottleneck: They need to keep pace with the dynamic, unstructured, and constantly evolving nature of web data in order to ground outputs in current and verifiable information.
AI performance increasingly depends not just on model architecture but on a system’s compute, networking, retrieval, and data engineering capabilities. In other words, it depends on the system’s ability to quickly and reliably retrieve data that is fresh, relevant, and trustworthy.
Traditional model training relies on snapshots of information collected at a particular point in time. Training AI on such static data is no longer sufficient. To track fluctuations such as competitor pricing, consumer sentiment, and market trends, companies need a constant feed of new information, pulling data in real time along with relevant context.
Their infrastructure must therefore be able to handle millions of simultaneous interactions across websites that vary by geography, language, format, and access rules. “If it can’t retrieve real-time information, it lacks context,” Lenchner says. “In a business setting, that’s not acceptable anymore. Stale answers lead to bad decisions and disappointed consumers.”
Speed is not merely a matter of convenience; it’s a matter of necessity. Today’s organizations operate in environments where prices, inventory, markets, security threats, and customer behavior change continuously. Delayed data retrieval can reduce the usefulness of an otherwise sophisticated model.
Using live, high-quality web data can also reduce AI hallucinations because the model has a more relevant knowledge base. This builds user trust. In fact, one survey found that 56% of AI practitioners said businesses need access to real-time web data to improve trust in AI outputs.
To ensure the model runs efficiently and effectively, the information must also be pared down to the appropriate essentials. Despite the introduction of retrieval-augmented generation (RAG), where models pull in external data at the moment of a query, many AI systems still struggle to deliver outputs that are current, contextually relevant, and trustworthy in operational settings.
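The retrieval step described above can be illustrated with a minimal sketch. It assumes a toy in-memory corpus, a simple word-overlap score, and a freshness cutoff; the names `Document`, `retrieve`, and `build_prompt` are hypothetical and do not come from any particular RAG library. Production systems typically use embedding search and a real model call instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Document:
    text: str
    fetched_at: datetime  # when the snippet was retrieved from the web


def retrieve(query, docs, now, max_age, top_k=2):
    """Return up to top_k fresh documents ranked by word overlap with the query."""
    query_words = set(query.lower().split())
    scored = []
    for doc in docs:
        if now - doc.fetched_at > max_age:
            continue  # drop stale snippets before ranking
        overlap = len(query_words & set(doc.text.lower().split()))
        if overlap > 0:
            scored.append((overlap, doc))
    # Sort on the score only, so Document objects are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_prompt(query, context_docs):
    """Pare the context down to the selected snippets and attach the question."""
    context = "\n".join(f"- {doc.text}" for doc in context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Illustrative data only: one fresh price, one stale price, one unrelated snippet.
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
docs = [
    Document("widget price is 19 dollars today", now - timedelta(hours=1)),
    Document("widget price was 25 dollars", now - timedelta(days=30)),
    Document("weather is sunny", now - timedelta(hours=2)),
]
query = "what is the widget price"
fresh = retrieve(query, docs, now, max_age=timedelta(days=1), top_k=1)
prompt = build_prompt(query, fresh)
print(prompt)
```

In this sketch the 30-day-old price never reaches the prompt, which is the point of filtering on freshness before ranking rather than after.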
According to Gartner, 60% of AI projects that are not supported by AI-ready data (accurate, structured, organized, and contextualized) will be abandoned by the end of the year. This is because large-scale retrieval alone does not solve the problem. As Lenchner puts it, “You need to retrieve data at scale, but also in real time. Latency becomes an issue because of the end user who is waiting for the output.”
Accessing fresh, AI-ready data at scale introduces technical and structural challenges. In practice, many enterprise systems combine public web retrieval with APIs, licensed datasets, and proprietary internal data in their AI applications. Integrating these fragmented sources into a timely and usable knowledge layer requires specialized capabilities.
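One way to picture the integration problem is as a merge over records that arrive from different sources at different times. The sketch below is an assumption-laden illustration: the `Record` type, the source labels, and the "newest observation wins" rule are all hypothetical choices, not a description of how any specific platform resolves conflicts.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Record:
    key: str               # entity the fact is about, e.g. a product ID
    value: str
    source: str            # assumed labels: "web", "api", "licensed", "internal"
    observed_at: datetime  # when the source reported this value


def merge_latest(*sources):
    """Combine records from all sources, keeping the newest record per key."""
    latest = {}
    for records in sources:
        for record in records:
            current = latest.get(record.key)
            if current is None or record.observed_at > current.observed_at:
                latest[record.key] = record
    return latest


# Illustrative data only.
web = [Record("sku-1", "19.99", "web", datetime(2025, 1, 2, tzinfo=timezone.utc))]
internal = [Record("sku-1", "24.99", "internal", datetime(2025, 1, 1, tzinfo=timezone.utc))]
api = [Record("sku-2", "5.00", "api", datetime(2025, 1, 1, tzinfo=timezone.utc))]

knowledge = merge_latest(web, internal, api)
for key, record in sorted(knowledge.items()):
    print(key, record.value, record.source)
```

Keeping the source and timestamp on every record means a downstream system can still explain where an answer came from and how old it is. A real knowledge layer would likely also weigh source reliability, not just recency.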
Some research has found that 97% of AI organizations depend on real-time web data infrastructure, but 90% feel boxed in by various restrictions. Companies are increasingly developing technical approaches to navigate these constraints. Lenchner draws this metaphor: “Think of the trained model as intelligence and relevant data as knowledge. A powerful intelligence layer sitting on top of a hollow knowledge layer is like a genius who knows nothing: useless in practice. Intelligence and knowledge have to come together.”
The promise of new infrastructure
A new layer of web data infrastructure can address this developing need for stronger AI inputs by enabling discovery of data, real-time access, and tailoring to a specific context. As Lenchner describes it, “It’s all about collecting data at scale, super-low latency, without being blocked.”
Rather than relying on increased computing power, this type of platform emulates human browsing behavior to access available content and transform raw code into structured data feeds. It can work with websites that traditional scraping tools often cannot handle, such as those that rely heavily on JavaScript or deploy aggressive antibot software.
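The "raw code into structured data" step can be sketched with Python's standard-library HTML parser. The markup and the class names `product`, `name`, and `price` are invented for illustration, and the sketch handles only static HTML; pages that build their content with JavaScript would first need to be rendered in a browser, which this example does not attempt.

```python
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Collect name and price text from elements tagged with assumed class names."""

    def __init__(self):
        super().__init__()
        self.records = []    # finished structured rows
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "product" in classes:
            self.records.append({})  # start a new record
        elif "name" in classes or "price" in classes:
            self._field = "name" if "name" in classes else "price"

    def handle_data(self, data):
        # Only keep text that sits inside a name or price element.
        if self._field and self.records:
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        self._field = None  # stop capturing once any element closes


# Hypothetical page fragment standing in for fetched HTML.
RAW_HTML = """
<div class="product"><span class="name">Widget</span><span class="price">19.99</span></div>
<div class="product"><span class="name">Gadget</span><span class="price">5.00</span></div>
"""

parser = ProductParser()
parser.feed(RAW_HTML)
for row in parser.records:
    print(row)
```

Each product block becomes one dictionary, so the output can be fed directly into a table, a queue, or a retrieval index.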
As Lenchner explains, “It’s basically having infrastructure that can mimic a web user with identifying information: IP address, location, and 1,000 more parameters. And at scale. Think of doing that 80 billion times a day for millions of websites. And every single time, you are looking exactly as the website expects you to look.”
Of course, continuous retrieval introduces new data governance challenges. To address them, platforms can enforce strict compliance protocols aligned with global privacy frameworks, such as the EU’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). They can also be limited to openly accessible content.