Building Custom Recognizers

Presidio’s built-in recognizers cover the common PII types: names, emails, phone numbers, credit cards, SSNs. But every organization has PII that’s specific to its business. Internal employee IDs that follow a custom format. Project codenames that shouldn’t leak externally. Customer account numbers that don’t match any standard pattern. Medical record numbers, policy IDs, internal ticket references. The built-in recognizers don’t know about these. This part covers four ways to build custom recognizers, from the simplest (a list of words to flag) to the most sophisticated (connecting an external NLP service).

Deny-List Recognizers

The fastest way to add a custom recognizer is a deny list. You give Presidio a list of words or phrases and it flags any exact match as a specific entity type. Use case: your company has internal project codenames (like “Project Titan,” “Sapphire,” “Nightingale”) that are confidential and should never appear in data sent to external services.

from presidio_analyzer import AnalyzerEngine, PatternRecognizer

# Create a deny-list recognizer
project_recognizer = PatternRecognizer(
    supported_entity="INTERNAL_PROJECT",
    deny_list=["Titan", "Sapphire", "Nightingale", "Ironclad", "Meridian"],
    deny_list_score=1.0
)

# Add it to the analyzer
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(project_recognizer)

# Test it
text = "The Titan rollout is scheduled for Q3. Contact sarah@company.com for details."
results = analyzer.analyze(text=text, language="en")

for r in results:
    print(f"{r.entity_type}: '{text[r.start:r.end]}' (score: {r.score:.2f})")

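# Expected output includes the lines below. Result order may vary, and other
# built-in recognizers may report additional matches on the same text.
# INTERNAL_PROJECT: 'Titan' (score: 1.00)
# EMAIL_ADDRESS: 'sarah@company.com' (score: 1.00)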

The deny_list_score parameter sets the confidence assigned to every deny-list match. Set it to 1.0 if the deny list is curated and every match is definitely sensitive. Lower it if some terms might appear in non-sensitive contexts. Deny-list matching is case-insensitive by default, so “titan,” “TITAN,” and “Titan” all match.
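
To make the matching rule concrete, the sketch below approximates deny-list matching with the standard re module: the terms are joined into one case-insensitive regex that only matches whole words. It is an illustration of the behavior described above, not Presidio’s own implementation, and build_deny_list_regex is a hypothetical helper name.

```python
import re


def build_deny_list_regex(deny_list):
    """Compile a deny list into one case-insensitive, whole-word regex."""
    # Escape each term so characters such as '.' or '+' are matched literally.
    alternatives = "|".join(re.escape(term) for term in deny_list)
    # The lookarounds reject matches that sit inside a longer word.
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)


codenames = ["Titan", "Sapphire", "Nightingale", "Ironclad", "Meridian"]
deny_regex = build_deny_list_regex(codenames)

sample = "TITAN and sapphire ship in Q3; the Titanic exhibit is unrelated."
hits = [m.group() for m in deny_regex.finditer(sample)]
print(hits)  # ['TITAN', 'sapphire']
```

Note that “Titanic” is not flagged because the match must end at a word boundary. A codename that is also an ordinary word would still match everywhere it appears, which is the case where a lower deny_list_score makes sense.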

Regex Recognizers

When your PII follows a pattern but the built-in recognizers don’t cover it, write a regex recognizer. Use case: your company uses employee IDs in the format EMP-XXXXX (EMP- followed by 5 digits) and customer account numbers in the format ACC-XXXX-XXXX.

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern

# Employee ID recognizer
emp_id_pattern = Pattern(
    name="employee_id_pattern",
    regex=r"\bEMP-\d{5}\b",
    score=0.9
)
emp_recognizer = PatternRecognizer(
    supported_entity="EMPLOYEE_ID",
    patterns=[emp_id_pattern],
    name="EmployeeIdRecognizer"
)

# Customer account recognizer
account_pattern = Pattern(
    name="account_number_pattern",
    regex=r"\bACC-\d{4}-\d{4}\b",
    score=0.9
)
account_recognizer = PatternRecognizer(
    supported_entity="CUSTOMER_ACCOUNT",
    patterns=[account_pattern],
    name="CustomerAccountRecognizer"
)

# Register both
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(emp_recognizer)
analyzer.registry.add_recognizer(account_recognizer)

text = "Employee EMP-28471 processed refund for account ACC-9921-0047."
results = analyzer.analyze(text=text, language="en")

for r in results:
    print(f"{r.entity_type}: '{text[r.start:r.end]}' (score: {r.score:.2f})")

# Output:
# EMPLOYEE_ID: 'EMP-28471' (score: 0.90)
# CUSTOMER_ACCOUNT: 'ACC-9921-0047' (score: 0.90)

The score in the Pattern object sets the base confidence for matches of that pattern. You can define multiple patterns for the same entity type if the format varies (some systems might use EMP-XXXXX and others use E-XXXXXXX): pass them all in the patterns list of one PatternRecognizer.
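
Before registering several formats under one entity, it can help to check each regex on its own with the standard re module. The sketch below does that for the two formats mentioned above. The second regex is an assumed reading of E-XXXXXXX (E- followed by exactly seven digits), and the dictionary keys and find_employee_ids are hypothetical names.

```python
import re

# Candidate regexes for one entity type; each would become its own Pattern.
EMPLOYEE_ID_REGEXES = {
    "emp_prefix": r"\bEMP-\d{5}\b",
    "e_prefix": r"\bE-\d{7}\b",  # assumed: E- followed by exactly 7 digits
}


def find_employee_ids(text):
    """Return (pattern_name, matched_text) for every regex that matches."""
    found = []
    for name, regex in EMPLOYEE_ID_REGEXES.items():
        for match in re.finditer(regex, text):
            found.append((name, match.group()))
    return found


hits = find_employee_ids(
    "Legacy badge E-1234567 was replaced by EMP-28471; EMP-123 is invalid."
)
print(hits)  # [('emp_prefix', 'EMP-28471'), ('e_prefix', 'E-1234567')]
```

Once the regexes behave as expected, each one can be wrapped in a Pattern with its own score, so a looser format can carry a lower base confidence than a stricter one.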

Context Enhancement

Regex patterns alone can produce false positives. A pattern like \d{5} matches any 5-digit number, not just employee IDs. Context words help Presidio distinguish between a zip code and an employee number.

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern

# A medical record number recognizer with context
mrn_pattern = Pattern(
    name="mrn_pattern",
    regex=r"\b\d{7,10}\b",
    score=0.3  # Low base score because 7-10 digit numbers are common
)
mrn_recognizer = PatternRecognizer(
    supported_entity="MEDICAL_RECORD",
    patterns=[mrn_pattern],
    # Single-word context terms: the default context enhancer compares them
    # against individual tokens near the match, so multi-word phrases such as
    # "medical record" may never match
    context=["medical", "record", "mrn", "patient", "chart", "health"],
    name="MedicalRecordRecognizer"
)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(mrn_recognizer)

# With context: higher confidence
text1 = "Patient medical record number: 4829173"
results1 = analyzer.analyze(text=text1, language="en")
# Score is boosted because context words appear just before the number

# Without context: low confidence (can be dropped with a score threshold)
text2 = "Order 4829173 shipped on Tuesday"
results2 = analyzer.analyze(text=text2, language="en")
# Score stays at base 0.3 because no context words are present

The pattern starts with a low base score (0.3). When context words appear within a configurable window around the match, Presidio boosts the score. When they don’t, the score stays low, and a score threshold (for example, the score_threshold argument to analyze()) can filter the match out. This is the right approach for any pattern that’s too generic on its own. Set a low base score, provide strong context words, and let the context scoring do the disambiguation.
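
The arithmetic behind this can be sketched without Presidio. The toy model below looks at a few tokens before each match and adds a fixed boost when one of them contains a context word. The constants (a 0.35 boost, a 0.4 floor, a window of 5 preceding tokens) are assumptions chosen for illustration and should be checked against the context enhancer settings in your Presidio version. The real enhancer also works on lemmas from the NLP engine rather than on a plain regex tokenizer.

```python
import re

# Assumed values for illustration only; verify against your Presidio version.
BOOST = 0.35        # added to the score when a context word is found
MIN_BOOSTED = 0.4   # floor applied to any boosted score
WINDOW = 5          # number of preceding tokens inspected

CONTEXT_WORDS = ["medical", "record", "mrn", "patient", "chart"]
MRN_REGEX = re.compile(r"\b\d{7,10}\b")


def score_matches(text, base_score=0.3):
    """Return (matched_text, score) pairs using a toy context boost."""
    scored = []
    for match in MRN_REGEX.finditer(text):
        # Lowercased tokens that precede the match, limited to the window.
        preceding = re.findall(r"\w+", text[:match.start()].lower())[-WINDOW:]
        has_context = any(
            word in token for word in CONTEXT_WORDS for token in preceding
        )
        score = base_score
        if has_context:
            score = min(max(base_score + BOOST, MIN_BOOSTED), 1.0)
        scored.append((match.group(), round(score, 2)))
    return scored


def above_threshold(scored, threshold=0.5):
    """Keep only the matches whose score reaches the threshold."""
    return [item for item in scored if item[1] >= threshold]


with_context = score_matches("Patient medical record number: 4829173")
without_context = score_matches("Order 4829173 shipped on Tuesday")
print(with_context)     # [('4829173', 0.65)]
print(without_context)  # [('4829173', 0.3)]
print(above_threshold(with_context + without_context))  # [('4829173', 0.65)]
```

With a threshold of 0.5, the same seven-digit number survives in the clinical sentence and is dropped in the shipping sentence, which is the disambiguation the low base score is designed to enable.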

No-Code Recognizers via YAML

For teams that want to manage recognizers without touching Python code, Presidio supports YAML-based configuration. You define recognizers in a YAML file and load them at startup.

# custom_recognizers.yaml
recognizers:
  - name: "Project Code Recognizer"
    supported_language: "en"
    supported_entity: "INTERNAL_PROJECT"
    deny_list:
      - "Titan"
      - "Sapphire"
      - "Nightingale"
      - "Ironclad"
    deny_list_score: 1.0

  - name: "Employee ID Recognizer"
    supported_language: "en"
    supported_entity: "EMPLOYEE_ID"
    patterns:
      - name: "emp_id"
        regex: "\\bEMP-\\d{5}\\b"
        score: 0.9
    context:
      - "employee"
      - "emp"
      - "staff"
      - "worker"

  - name: "Policy Number Recognizer"
    supported_language: "en"
    supported_entity: "POLICY_NUMBER"
    patterns:
      - name: "policy_format"
        regex: "\\bPOL-[A-Z]{2}-\\d{6}\\b"
        score: 0.95
    context:
      - "policy"
      - "insurance"
      - "coverage"
      - "claim"

Load them into the analyzer by calling analyzer.registry.add_recognizers_from_yaml(file_path), where file_path points to the YAML file.
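
A minimal sketch of the loading step follows, assuming the YAML above is saved as custom_recognizers.yaml in the working directory. It builds a registry explicitly so the built-in recognizers and the YAML-defined ones end up in the same engine. The function name build_analyzer is hypothetical, and the import is placed inside the function only so that importing this module does not load the NLP model.

```python
def build_analyzer(yaml_path="custom_recognizers.yaml"):
    """Return an AnalyzerEngine with built-in plus YAML-defined recognizers."""
    # Deferred import: Presidio loads its NLP engine, which can be slow.
    from presidio_analyzer import AnalyzerEngine, RecognizerRegistry

    registry = RecognizerRegistry()
    registry.load_predefined_recognizers()         # keep the built-in recognizers
    registry.add_recognizers_from_yaml(yaml_path)  # add the custom ones
    return AnalyzerEngine(registry=registry)
```

After analyzer = build_analyzer(), calls such as analyzer.analyze(text=text, language="en") work exactly as in the earlier examples, and edits to the YAML file take effect the next time the engine is built.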