Extract IBANs, Currencies, and Addresses from Financial Documents — Validated, Not Just Strings

Regex Doesn’t Understand Financial Data

You need to extract an IBAN from a bank statement. So you write a regex. It matches something that looks like an IBAN: two letters, then two digits, then up to 30 alphanumeric characters. The pattern matches. You store the result. Except the regex can’t validate the IBAN. It doesn’t check the country-specific length. It doesn’t verify the check digits. It matches “DE89 3704 0044 0532 0130 00” and “DE00 0000 0000 0000 0000 00” with equal confidence. One is a real IBAN, the other is garbage.
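
The pattern described above can be written out to show the problem. This is a minimal sketch; the names naiveIbanPattern and compact are illustrative, not part of any API:

```javascript
// Naive pattern: 2 letters, 2 digits, then up to 30 alphanumeric characters.
const naiveIbanPattern = /^[A-Z]{2}\d{2}[A-Z0-9]{1,30}$/;

// Remove the spaces used in the printed, grouped form.
const compact = (s) => s.replace(/\s+/g, "");

const realIban = compact("DE89 3704 0044 0532 0130 00");
const garbage = compact("DE00 0000 0000 0000 0000 00");

console.log(naiveIbanPattern.test(realIban)); // true
console.log(naiveIbanPattern.test(garbage)); // true
```

Both strings pass, so the pattern alone gives you no way to tell them apart.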

Currency amounts are worse. Is “1.234,56” one thousand two hundred thirty-four euros and fifty-six cents? Or is it one point two three four with some trailing nonsense? Depends on the locale. A regex can’t know.

The Iteration Layer’s Document Extraction API has purpose-built field types for financial data: IBAN, CURRENCY_AMOUNT, CURRENCY_CODE, and ADDRESS. They don’t just extract text. They validate, normalize, and structure the results.

IBAN Extraction with Validation

The IBAN field type extracts International Bank Account Numbers and validates them:

const schema = {
  fields: [
    {
      name: "beneficiary_iban",
      type: "IBAN",
      description: "IBAN of the payment recipient",
      is_required: true,
    },
    {
      name: "beneficiary_name",
      type: "TEXT",
      description: "Name of the payment recipient",
    },
  ],
};

The response returns a validated IBAN string:

{
  "beneficiaryIban": {
    "type": "IBAN",
    "value": "DE89370400440532013000",
    "confidence": 0.96
  }
}

This isn’t a regex match. The parser identifies the IBAN in the document, validates its format against the country-specific rules, and returns it in the standard format. If the document contains something that looks like an IBAN but doesn’t validate, the confidence score reflects that.
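
One way to act on that score is to route low-confidence values to manual review instead of storing them directly. A minimal sketch, assuming the response shape shown above; the 0.9 threshold and the needsReview helper are illustrative assumptions, not part of the API:

```javascript
// Response shape copied from the example above.
const response = {
  beneficiaryIban: { type: "IBAN", value: "DE89370400440532013000", confidence: 0.96 },
};

// Hypothetical cutoff; tune it against your own documents.
const REVIEW_THRESHOLD = 0.9;

// Returns true when a field is missing or its confidence is below the cutoff.
function needsReview(field, threshold = REVIEW_THRESHOLD) {
  return !field || field.confidence < threshold;
}

console.log(needsReview(response.beneficiaryIban)); // false
```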

IBAN Validation Edge Cases

IBAN formats vary by country. A German IBAN is 22 characters. A Norwegian IBAN is 15. A Maltese IBAN is 31. A regex that accepts “2 letters + 2 digits + up to 30 alphanumeric characters” matches all of them, and also matches strings that aren’t IBANs at all.

The IBAN field type validates several things that a regex cannot. It checks the country-specific length (DE must be exactly 22 characters, not 20 or 24). It verifies the check digits using the MOD-97 check (ISO 7064 MOD 97-10) that ISO 13616 specifies for IBANs. It ensures the BBAN (Basic Bank Account Number) portion follows the country’s structure. For Germany, that’s an 8-digit bank code followed by a 10-digit account number.
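
To make the length and check-digit steps concrete, here is a rough sketch of what such validation involves. It is not the API’s implementation: the isValidIban function and the IBAN_LENGTHS table are illustrative, the table covers only the three countries named above, and the BBAN structure check is left out.

```javascript
// Total IBAN length for the three countries mentioned above (illustrative subset).
const IBAN_LENGTHS = { DE: 22, NO: 15, MT: 31 };

function isValidIban(input) {
  const iban = input.replace(/\s+/g, "").toUpperCase();
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]+$/.test(iban)) return false;

  // Country-specific length check.
  const expectedLength = IBAN_LENGTHS[iban.slice(0, 2)];
  if (expectedLength === undefined || iban.length !== expectedLength) return false;

  // MOD-97: move the first four characters to the end, map A-Z to 10-35.
  const rearranged = iban.slice(4) + iban.slice(0, 4);
  const digits = rearranged.replace(/[A-Z]/g, (ch) => String(ch.charCodeAt(0) - 55));

  // Reduce digit by digit so the value never exceeds safe integer range.
  let remainder = 0;
  for (const d of digits) {
    remainder = (remainder * 10 + Number(d)) % 97;
  }
  return remainder === 1;
}

console.log(isValidIban("DE89 3704 0044 0532 0130 00")); // true
console.log(isValidIban("DE00 0000 0000 0000 0000 00")); // false
```

The second string has the right length and shape for Germany but fails the check-digit step, which is exactly the case a plain pattern match lets through.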

Documents often contain multiple strings that look like IBANs. Account statements might list the account holder’s IBAN, the sender’s IBAN, a reference number that happens to start with two letters, and a transaction ID with similar formatting. The field description helps the parser identify which one you want. “IBAN of the payment recipient” is more specific than “IBAN”, and that specificity matters when a document has three valid IBANs on the same page.
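
For a statement with several IBANs, a schema along these lines gives each one its own field and description. It reuses the schema shape from the earlier example; the account_holder_iban and sender_iban field names and their descriptions are made up for illustration:

```javascript
const schema = {
  fields: [
    {
      name: "account_holder_iban",
      type: "IBAN",
      description: "IBAN of the account holder shown in the statement header",
    },
    {
      name: "sender_iban",
      type: "IBAN",
      description: "IBAN of the party that sent the payment",
    },
    {
      name: "beneficiary_iban",
      type: "IBAN",
      description: "IBAN of the payment recipient",
      is_required: true,
    },
  ],
};
```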

Currency Amounts Across Locales

The CURRENCY_AMOUNT field type handles the formatting chaos of international financial documents:

  • “$1,234.56” — US format with comma thousands separator and period decimal
  • “1.234,56 €” — European format with period thousands separator and comma decimal
  • “CHF 1’234.56” — Swiss format with apostrophe thousands separator
  • “¥123,456” — no decimal places

A regex parser needs a different pattern for each locale. The CURRENCY_AMOUNT field type handles all of them and returns a normalized numeric value:

const schema = {
  fields: [
    {
      name: "invoice_total",
      type: "CURRENCY_AMOUNT",
      description: "Total invoice amount",
      is_required: true,
    },
    {
      name: "currency",
      type: "CURRENCY_CODE",
      description: "Currency of the invoice (ISO 4217 code)",
    },
  ],
};

{
  "invoiceTotal": {
    "type": "CURRENCY_AMOUNT",
    "value": 1234.56,
    "confidence": 0.95
  },
  "currency": {
    "type": "CURRENCY_CODE",
    "value": "EUR",
    "confidence": 0.94
  }
}

The amount comes back as a number, not a string with locale-specific formatting. The currency comes back as an ISO 4217 code, not a symbol that could mean multiple currencies ($ is used by USD, CAD, AUD, and dozens of others).
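
To see why the locale has to be known before an amount can be normalized, consider this small sketch. It is not how the API works internally; normalizeAmount and the three convention objects are hypothetical, and the caller has to say which separators are in use:

```javascript
// Separator conventions for the formats listed above (illustrative).
const US = { thousands: ",", decimal: "." };
const EUROPEAN = { thousands: ".", decimal: "," };
const SWISS = { thousands: "’", decimal: "." }; // typographic apostrophe, as in the sample

// Convert a formatted amount into a number, given the separators in use.
function normalizeAmount(text, { thousands, decimal }) {
  const cleaned = text
    .replace(/[^\d.,'’]/g, "") // drop currency symbols, codes, and spaces
    .split(thousands).join("") // remove grouping separators
    .replace(decimal, "."); // use a period as the decimal mark
  return Number(cleaned);
}

console.log(normalizeAmount("$1,234.56", US)); // 1234.56
console.log(normalizeAmount("1.234,56 €", EUROPEAN)); // 1234.56
console.log(normalizeAmount("CHF 1’234.56", SWISS)); // 1234.56
console.log(normalizeAmount("¥123,456", US)); // 123456
console.log(normalizeAmount("1.234,56 €", US)); // 1.23456
```

The last call applies the US convention to a European amount and silently returns the wrong number, which is the failure mode a per-locale regex invites.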

Currency Disambiguation

Currency symbols are ambiguous. The $ sign is used by the US dollar, Canadian dollar, Australian dollar, Hong Kong dollar, Singapore dollar, and at least 20 other currencies. The kr symbol could be Swedish krona, Norwegian krone, or Danish krone. Even “FR” on a document could mean French francs (obsolete) or something else entirely.

The CURRENCY_CODE field type returns an ISO 4217 three-letter code such as USD, CAD, AUD, SEK, or NOK. No ambiguity. The parser uses context from the document to determine the correct currency: the issuing bank’s country, the document language, other addresses on the page. A Swiss bank statement showing “Fr. 1’234.56” returns CHF, not some generic “franc” designation.

When a document uses multiple currencies (a foreign exchange confirmation, for example), define separate CURRENCY_CODE fields for each. “Source currency of the exchange” and “target currency of the exchange” give the parser enough context to distinguish them.
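
For the foreign exchange case, a schema might look like the following. It follows the schema shape used earlier; the source_currency and target_currency field names are illustrative:

```javascript
const schema = {
  fields: [
    {
      name: "source_currency",
      type: "CURRENCY_CODE",
      description: "Source currency of the exchange",
    },
    {
      name: "target_currency",
      type: "CURRENCY_CODE",
      description: "Target currency of the exchange",
    },
  ],
};
```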

ADDRESS Decomposition

Financial documents are full of addresses: billing addresses, beneficiary addresses, registered office addresses. The ADDRESS field type doesn’t just extract the address as a text blob. It decomposes it into structured components:

const schema = {
  fields: [
    {
      name: "billing_address",
      type: "ADDRESS",
      description: "Billing address of the customer",
    },
  ],
};

{
  "billingAddress": {
    "type": "ADDRESS",
    "value": {
      "street": "Kurfürstendamm 194",
      "city": "Berlin",
      "region": "Berlin",
      "postal_code": "10707",
      "country": "DE"
    },
    "confidence": 0.93
  }
}

The country is an ISO 3166-1 alpha-2 code. The components are split out and ready for your database, with no address parsing library needed.
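
As an illustration of “ready for your database”, the structured value can be mapped straight onto flat columns. This sketch assumes the response shape shown above; the toAddressRow helper and the column naming scheme are hypothetical:

```javascript
// Response shape copied from the example above.
const response = {
  billingAddress: {
    type: "ADDRESS",
    value: {
      street: "Kurfürstendamm 194",
      city: "Berlin",
      region: "Berlin",
      postal_code: "10707",
      country: "DE",
    },
    confidence: 0.93,
  },
};

// Flatten an ADDRESS field into prefixed column names (an assumed naming scheme).
function toAddressRow(field, prefix) {
  const { street, city, region, postal_code, country } = field.value;
  return {
    [`${prefix}_street`]: street,
    [`${prefix}_city`]: city,
    [`${prefix}_region`]: region,
    [`${prefix}_postal_code`]: postal_code,
    [`${prefix}_country`]: country,
  };
}

const row = toAddressRow(response.billingAddress, "billing");
console.log(row.billing_postal_code); // "10707"
```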

International Address Formats

Addresses are surprisingly hard to decompose. A US address has a street, city, state, and ZIP code in a predictable order. A Japanese address starts with the prefecture and works…