
Building a Legal Document Classifier with Azure OpenAI

5 November 2024·3 min read·Debashis Mandal

A legal operations team manually classified and tagged every incoming contract: contract type, counterparty, effective date, governing law, renewal terms. With 200+ documents a week, this consumed about two paralegal hours every day. The ask was simple: automate it. Here's how we built it.

The Classification Problem

Legal documents are structurally diverse. An NDA looks nothing like a SaaS subscription agreement, which looks nothing like an employment contract. Traditional rule-based classifiers (keyword matching, regex) fail on the long tail of edge cases. GPT-4o handles the variance naturally — it understands context, not just pattern matches.

Our target output for each document:

{
  "contract_type": "Non-Disclosure Agreement",
  "counterparty": "Acme Corp",
  "effective_date": "2024-09-01",
  "expiry_date": "2026-09-01",
  "governing_law": "Delaware, USA",
  "auto_renewal": true,
  "renewal_notice_days": 30,
  "confidence": 0.94
}

Pipeline Architecture

SharePoint Library → Azure Function (trigger) → Document Intelligence → Chunking → GPT-4o → SharePoint Metadata
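As a rough sketch of how those stages chain together, the function below passes one document through each step in order. The function name and the four injected callables are hypothetical placeholders for the real integrations, not the code we deployed; injecting them just makes the flow easy to test with stubs.

```python
from typing import Any, Callable, Dict


def process_document(
    raw_bytes: bytes,
    extract_text: Callable[[bytes], str],
    make_excerpt: Callable[[str], str],
    classify: Callable[[str], Dict[str, Any]],
    write_metadata: Callable[[Dict[str, Any]], None],
) -> Dict[str, Any]:
    """Run one document through the pipeline stages and return its metadata."""
    full_text = extract_text(raw_bytes)   # Document Intelligence stage
    excerpt = make_excerpt(full_text)     # chunking stage
    metadata = classify(excerpt)          # GPT-4o stage
    write_metadata(metadata)              # SharePoint metadata stage
    return metadata
```

In an Azure Function, the trigger handler would call something like this once per uploaded file.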

Step 1: Document Extraction

Azure AI Document Intelligence (formerly Form Recognizer) handled PDF and Word extraction better than the DIY parsing we tried. The layout model preserves table structure and reading order, which matters for contracts with schedules and annexures.

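from azure.ai.documentintelligence import DocumentIntelligenceClient

# endpoint and credential (for example an AzureKeyCredential) are assumed to be defined elsewhere
client = DocumentIntelligenceClient(endpoint, credential)

# document_stream is the raw PDF or Word file opened in binary mode
poller = client.begin_analyze_document("prebuilt-layout", document_stream)
result = poller.result()

# Join paragraphs in reading order; guard against a result with no paragraphs
full_text = "\n".join(p.content for p in (result.paragraphs or []))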

Step 2: Smart Chunking

Most contracts fit within GPT-4o's 128K context window, but we chunk anyway because it keeps latency predictable and caps the per-document cost. We extract the first 8,000 tokens (covers recitals, definitions, and key clauses) and the last 2,000 tokens (covers signatures, governing law, and renewal terms), so each request carries at most about 10,000 document tokens.

This "head + tail" strategy gets 92% classification accuracy at roughly half the token cost of sending the full document.
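A minimal sketch of the head + tail excerpt, under some loud assumptions: the function name, the `[...]` separator, and the default whitespace "tokenizer" are all illustrative. Whitespace splitting only approximates model tokens, so in practice you would pass in the encode and decode functions of a real tokenizer.

```python
from typing import Callable, List


def head_tail_excerpt(
    text: str,
    head_tokens: int = 8000,
    tail_tokens: int = 2000,
    tokenize: Callable[[str], List[str]] = str.split,        # assumption: whitespace tokens
    detokenize: Callable[[List[str]], str] = " ".join,
    separator: str = "\n[...]\n",                            # hypothetical omission marker
) -> str:
    """Keep the first head_tokens and last tail_tokens of a document."""
    tokens = tokenize(text)
    # Short documents are sent whole; nothing to trim.
    if len(tokens) <= head_tokens + tail_tokens:
        return text
    head = detokenize(tokens[:head_tokens])
    # Slice from an explicit index so tail_tokens=0 yields an empty tail.
    tail = detokenize(tokens[len(tokens) - tail_tokens:])
    return head + separator + tail
```

The separator tells the model that text was omitted, which helps it avoid treating the head and tail as one continuous passage.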

Step 3: Structured Output Extraction

GPT-4o's structured outputs (JSON output constrained to a schema) remove the parsing fragility entirely:

import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-08-01-preview",
)

response = client.beta.chat.completions.parse(
    model="gpt-4o",  # on Azure, this is the deployment name
    messages=[
        {"role": "system", "content": CLASSIFICATION_SYSTEM_PROMPT},
        {"role": "user", "content": document_excerpt},
    ],
    response_format=ContractMetadata,  # Pydantic model
)

# parsed is a ContractMetadata instance, or None if the model refused
metadata = response.choices[0].message.parsed

The Pydantic model enforces the schema at the SDK level — if the model returns a malformed date, it raises a validation error before the bad data reaches SharePoint.
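For illustration, here is one way `ContractMetadata` could be defined with Pydantic v2 so that a malformed date fails validation. The field names mirror the target output shown earlier; the field types, the nullable fields, and the validator are assumptions rather than the exact production model. Dates are kept as strings so the values stay JSON-serializable when written back to SharePoint.

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel, ValidationError, field_validator


class ContractMetadata(BaseModel):
    contract_type: str
    counterparty: str
    effective_date: str                  # ISO 8601 date, e.g. "2024-09-01"
    expiry_date: Optional[str]           # None when there is no fixed term (assumption)
    governing_law: str
    auto_renewal: bool
    renewal_notice_days: Optional[int]   # None when no notice period applies (assumption)
    confidence: float

    @field_validator("effective_date", "expiry_date")
    @classmethod
    def must_be_iso_date(cls, value: Optional[str]) -> Optional[str]:
        if value is not None:
            # Raises ValueError on malformed dates; Pydantic reports it
            # as a ValidationError.
            date.fromisoformat(value)
        return value
```

With this definition, a value like "09/01/2024" in either date field raises `ValidationError` instead of flowing through to the write step.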

Step 4: Writing Metadata Back to SharePoint

We use Microsoft Graph to write the extracted metadata as SharePoint column values:

import httpx

headers = {"Authorization": f"Bearer {access_token}", "Content-Type": "application/json"}
response = httpx.patch(
    f"https://graph.microsoft.com/v1.0/sites/{site_id}/lists/{list_id}/items/{item_id}/fields",
    json={
        "ContractType": metadata.contract_type,
        "Counterparty": metadata.counterparty,
        "EffectiveDate": metadata.effective_date,  # must be JSON-serializable, e.g. an ISO 8601 string
        "GoverningLaw": metadata.governing_law,
    },
    headers=headers,
)
# Fail loudly on a 4xx/5xx response instead of silently dropping the update
response.raise_for_status()

Accuracy and Human-in-the-Loop

Raw accuracy on our test set of 500 contracts: 94.2% for contract type, 91.7% for date fields. The confidence score (returned by the model as a self-assessment) proved to be a reliable triage signal — documents with confidence below 0.80 are flagged for human review.
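The triage rule is simple enough to sketch directly. The 0.80 threshold comes from the text above; the function names and the dict shape of each document are hypothetical.

```python
from typing import Dict, List, Tuple

REVIEW_THRESHOLD = 0.80  # documents below this confidence go to a human


def needs_human_review(confidence: float, threshold: float = REVIEW_THRESHOLD) -> bool:
    """Return True when the model's self-assessed confidence is below the threshold."""
    return confidence < threshold


def triage(documents: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Split classified documents into (auto_accepted, flagged_for_review)."""
    accepted = [d for d in documents if not needs_human_review(d["confidence"])]
    flagged = [d for d in documents if needs_human_review(d["confidence"])]
    return accepted, flagged
```

Keeping the threshold as a named constant makes it easy to tune once you have reviewer feedback on the flagged cases.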

This hybrid approach is the right one for legal contexts. The goal isn't to eliminate human judgment; it's to direct human attention to the 8% of cases that genuinely need it.

Cost

At GPT-4o pricing, classifying a 10-page contract costs roughly $0.03. At 200 documents/week, that's $6/week in AI cost — compared to ~$800/week in paralegal time. The ROI case writes itself.
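The arithmetic behind those figures, as a small sketch. The three inputs are the numbers quoted above; the function name is illustrative.

```python
COST_PER_DOCUMENT_USD = 0.03        # rough cost to classify a 10-page contract
DOCUMENTS_PER_WEEK = 200
PARALEGAL_COST_PER_WEEK_USD = 800   # approximate manual cost


def weekly_ai_cost(cost_per_document: float, documents_per_week: int) -> float:
    """Weekly model spend in USD, rounded to cents."""
    return round(cost_per_document * documents_per_week, 2)


ai_cost = weekly_ai_cost(COST_PER_DOCUMENT_USD, DOCUMENTS_PER_WEEK)
ratio = PARALEGAL_COST_PER_WEEK_USD / ai_cost
print(f"AI cost: ${ai_cost:.2f}/week, manual cost is about {ratio:.0f}x higher")
```

This ignores Document Intelligence and hosting charges, so treat it as the model cost only.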

Lessons Learned

Structured outputs > prompt engineering for extraction. Before JSON mode existed, we spent days prompt-engineering reliable JSON output. Structured outputs solved this in an afternoon.
Document Intelligence is worth the cost. DIY PDF parsing with PyPDF2 misses tables, scrambles multi-column layouts, and chokes on scanned documents. The prebuilt-layout model handles all of these.
Version your prompts. Classification prompts are production assets. Store them in source control, version them, and A/B test changes against a held-out evaluation set before deploying.
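To make the last lesson concrete, here is a minimal sketch of comparing prompt versions against a held-out evaluation set. The function names, the `classify(prompt, text)` callable, and the shape of the evaluation set are all hypothetical; in a real setup `classify` would wrap the GPT-4o call.

```python
from typing import Callable, Dict, List, Tuple


def field_accuracy(predictions: List[dict], labels: List[dict], field: str) -> float:
    """Fraction of documents where the predicted field matches the label."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and the same length")
    correct = sum(1 for p, l in zip(predictions, labels) if p.get(field) == l[field])
    return correct / len(labels)


def compare_prompts(
    classify: Callable[[str, str], dict],
    prompts: Dict[str, str],
    eval_set: List[Tuple[str, dict]],
    field: str = "contract_type",
) -> Dict[str, float]:
    """Score each prompt version on the same held-out (text, label) pairs."""
    labels = [label for _, label in eval_set]
    scores = {}
    for version, prompt in prompts.items():
        predictions = [classify(prompt, text) for text, _ in eval_set]
        scores[version] = field_accuracy(predictions, labels, field)
    return scores
```

Running this in CI against the versioned prompt files gives a simple gate: a new prompt ships only if its score does not regress.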
