Building a Legal Document Classifier with Azure OpenAI
A legal operations team manually classified and tagged every incoming contract: contract type, counterparty, effective date, governing law, renewal terms. With 200+ documents a week, this consumed about two hours of paralegal time every day. The ask was simple: automate it. Here's how we built it.
The Classification Problem
Legal documents are structurally diverse. An NDA looks nothing like a SaaS subscription agreement, which looks nothing like an employment contract. Traditional rule-based classifiers (keyword matching, regex) fail on the long tail of edge cases. GPT-4o copes with that variance far better because it reads each clause in context instead of matching surface patterns.
Our target output for each document:
{
  "contract_type": "Non-Disclosure Agreement",
  "counterparty": "Acme Corp",
  "effective_date": "2024-09-01",
  "expiry_date": "2026-09-01",
  "governing_law": "Delaware, USA",
  "auto_renewal": true,
  "renewal_notice_days": 30,
  "confidence": 0.94
}
Pipeline Architecture
SharePoint Library → Azure Function (trigger) → Document Intelligence → Chunking → GPT-4o → SharePoint Metadata
Step 1: Document Extraction
Azure AI Document Intelligence (formerly Form Recognizer) handled PDF and Word extraction more reliably than the DIY parsing we tried. The layout model preserves table structure and reading order, which matters for contracts with schedules and annexures.
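from azure.ai.documentintelligence import DocumentIntelligenceClient

# endpoint and credential (e.g. an AzureKeyCredential) come from your configuration;
# document_stream is the contract file opened in binary mode.
client = DocumentIntelligenceClient(endpoint, credential)
poller = client.begin_analyze_document("prebuilt-layout", document_stream)
result = poller.result()
# result.paragraphs can be None when no text is detected, so fall back to an empty list.
full_text = "\n".join(p.content for p in (result.paragraphs or []))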
Step 2: Smart Chunking
Most contracts fit within GPT-4o's 128K context window, but we chunk anyway: it keeps latency predictable and puts a ceiling on per-document cost. We extract the first 8,000 tokens (covers recitals, definitions, and key clauses) and the last 2,000 tokens (covers signatures, governing law, and renewal terms).
This "head + tail" strategy gets 92% classification accuracy at roughly half the token cost of sending the full document.
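The head + tail selection can be sketched roughly as follows. This is a minimal illustration, not the production code: `head_tail_excerpt`, its parameters, and the `[...]` separator are hypothetical, and the whitespace tokenizer is a stand-in so the example runs without extra dependencies. Token counts from a real tokenizer will differ from word counts.

```python
from typing import Callable, List, Sequence


def head_tail_excerpt(
    text: str,
    encode: Callable[[str], Sequence],
    decode: Callable[[Sequence], str],
    head_tokens: int = 8000,
    tail_tokens: int = 2000,
    separator: str = "\n[...]\n",
) -> str:
    """Keep the first `head_tokens` and the last `tail_tokens` tokens of `text`."""
    tokens = encode(text)
    # Short documents are sent whole; slicing would only duplicate content.
    if len(tokens) <= head_tokens + tail_tokens:
        return text
    head = tokens[:head_tokens]
    # Guard against tail_tokens == 0, where tokens[-0:] would return everything.
    tail = tokens[-tail_tokens:] if tail_tokens > 0 else []
    return decode(head) + separator + decode(tail)


# Stand-in tokenizer (assumption): one token per whitespace-separated word.
def split_words(text: str) -> List[str]:
    return text.split()


def join_words(tokens: Sequence[str]) -> str:
    return " ".join(tokens)


# A fake 100-"token" contract, trimmed to 8 head tokens and 2 tail tokens.
contract = " ".join(f"w{i}" for i in range(100))
excerpt = head_tail_excerpt(contract, split_words, join_words, head_tokens=8, tail_tokens=2)
```

Passing the tokenizer in as two callables keeps the slicing logic independent of any one library. With tiktoken installed, the `encode` and `decode` methods of `tiktoken.get_encoding("o200k_base")` could be supplied instead of the word-based stand-ins.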
Step 3: Structured Output Extraction
GPT-4o's structured outputs (responses constrained to a JSON schema, a stricter guarantee than plain JSON mode) remove most of the parsing fragility:
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-08-01-preview",
)
response = client.beta.chat.completions.parse(
    model="gpt-4o",  # on Azure this is the deployment name
    messages=[
        {"role": "system", "content": CLASSIFICATION_SYSTEM_PROMPT},
        {"role": "user", "content": document_excerpt},
    ],
    response_format=ContractMetadata,  # Pydantic model
)
# parsed is None if the model refused to answer, so check before using it
metadata = response.choices[0].message.parsed
The Pydantic model enforces the schema at the SDK level: as long as the date fields are typed or validated as dates, a malformed date from the model raises a validation error before the bad data reaches SharePoint.
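The post does not show `ContractMetadata`, so here is one way it could be defined, assuming Pydantic v2 and field names taken from the target output above. Keeping dates as strings with a validator (instead of typing them as `date`) is a choice made for this sketch so the generated JSON schema stays plain; the validator names and the range check on `confidence` are also assumptions.

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel, ValidationError, field_validator


class ContractMetadata(BaseModel):
    contract_type: str
    counterparty: str
    effective_date: str                  # ISO 8601, e.g. "2024-09-01"
    expiry_date: Optional[str]           # key is required, value may be null
    governing_law: str
    auto_renewal: bool
    renewal_notice_days: Optional[int]   # null when there is no renewal clause
    confidence: float

    @field_validator("effective_date", "expiry_date")
    @classmethod
    def _must_be_iso_date(cls, value: Optional[str]) -> Optional[str]:
        if value is not None:
            date.fromisoformat(value)  # raises ValueError on malformed dates
        return value

    @field_validator("confidence")
    @classmethod
    def _must_be_probability(cls, value: float) -> float:
        if not 0.0 <= value <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        return value


sample = {
    "contract_type": "Non-Disclosure Agreement",
    "counterparty": "Acme Corp",
    "effective_date": "2024-09-01",
    "expiry_date": "2026-09-01",
    "governing_law": "Delaware, USA",
    "auto_renewal": True,
    "renewal_notice_days": 30,
    "confidence": 0.94,
}
metadata = ContractMetadata(**sample)

# A US-style date is rejected before it can reach downstream systems.
try:
    ContractMetadata(**{**sample, "effective_date": "09/01/2024"})
except ValidationError as exc:
    rejected_field = exc.errors()[0]["loc"][0]
```

Optional fields without defaults stay required in the schema while still allowing `null`, which suits contracts that have no expiry date or renewal clause.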
Step 4: Writing Metadata Back to SharePoint
We use Microsoft Graph to write the extracted metadata as SharePoint column values:
import httpx

headers = {"Authorization": f"Bearer {access_token}", "Content-Type": "application/json"}
response = httpx.patch(
    f"https://graph.microsoft.com/v1.0/sites/{site_id}/lists/{list_id}/items/{item_id}/fields",
    json={
        "ContractType": metadata.contract_type,
        "Counterparty": metadata.counterparty,
        # str() leaves ISO strings unchanged and converts date objects, which json cannot serialize
        "EffectiveDate": str(metadata.effective_date),
        "GoverningLaw": metadata.governing_law,
    },
    headers=headers,
)
response.raise_for_status()  # surface failed writes instead of silently dropping them
Accuracy and Human-in-the-Loop
Raw accuracy on our test set of 500 contracts: 94.2% for contract type, 91.7% for date fields. The confidence score (returned by the model as a self-assessment) proved to be a reliable triage signal on that set. Documents with confidence below 0.80 are flagged for human review.
This hybrid approach is the right one for legal contexts. The goal isn't to eliminate human judgment; it's to direct human attention to the 8% of cases that genuinely need it.
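The triage rule itself is small enough to sketch. The 0.80 threshold comes from the text; `TriageDecision`, `triage`, and the route names are hypothetical, and sending out-of-range scores to a reviewer is an assumption about how a malformed self-assessment should be handled.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.80  # from the text: below this, a human checks the result


@dataclass(frozen=True)
class TriageDecision:
    route: str   # "auto_apply" or "human_review"
    reason: str


def triage(confidence: float, threshold: float = REVIEW_THRESHOLD) -> TriageDecision:
    """Decide whether extracted metadata is written directly or queued for review."""
    # Anything outside [0, 1] (including NaN) is treated as untrustworthy.
    if not 0.0 <= confidence <= 1.0:
        return TriageDecision("human_review", "confidence out of range")
    if confidence < threshold:
        return TriageDecision(
            "human_review", f"confidence {confidence:.2f} below {threshold:.2f}"
        )
    return TriageDecision("auto_apply", "confidence at or above threshold")


decision = triage(0.94)
flagged = triage(0.79)
```

Returning a reason alongside the route gives reviewers some context for why a document landed in their queue.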
Cost
At GPT-4o pricing, classifying a 10-page contract costs roughly $0.03. At 200 documents/week, that's $6/week in AI cost — compared to ~$800/week in paralegal time. The ROI case writes itself.
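The arithmetic behind those figures can be reproduced with a short calculation. The weekly numbers come from the text; the token counts (a full 10,000-token head + tail excerpt plus about 150 output tokens) and the per-million-token prices are illustrative assumptions, so check the current Azure OpenAI price sheet before relying on them.

```python
def cost_per_document(
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimate the model cost of one classification call in US dollars."""
    return (
        input_tokens * usd_per_million_input
        + output_tokens * usd_per_million_output
    ) / 1_000_000


# Assumed inputs: 8,000 + 2,000 excerpt tokens, ~150 tokens of JSON output,
# and illustrative prices of $2.50 (input) and $10.00 (output) per million tokens.
per_doc = cost_per_document(10_000, 150, 2.50, 10.00)

# Figures taken from the text.
docs_per_week = 200
rounded_cost_per_doc = 0.03
paralegal_cost_per_week = 800.0

weekly_ai_cost = docs_per_week * rounded_cost_per_doc
savings_ratio = paralegal_cost_per_week / weekly_ai_cost

print(f"per document: ${per_doc:.4f}")
print(f"per week:     ${weekly_ai_cost:.2f}")
print(f"paralegal time costs about {savings_ratio:.0f}x the model cost")
```

Under these assumptions the per-document estimate lands close to the $0.03 quoted above. Document Intelligence and Azure Functions charges are not included.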
Lessons Learned
- Structured outputs > prompt engineering for extraction. Before JSON mode existed, we spent days prompt-engineering reliable JSON output. Structured outputs solved this in an afternoon.
- Document Intelligence is worth the cost. DIY PDF parsing with PyPDF2 misses tables, scrambles multi-column layouts, and chokes on scanned documents. The prebuilt-layout model handles all of these.
- Version your prompts. Classification prompts are production assets. Store them in source control, version them, and A/B test changes against a held-out evaluation set before deploying.
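As a rough sketch of the last point, a prompt change can be gated on held-out accuracy before it ships. Everything here is hypothetical: `accuracy`, `pick_prompt`, the version labels, and the keyword-based stand-in classifiers, which take the place of real calls to the model with each prompt version.

```python
from typing import Callable, Dict, List, Tuple

LabeledExample = Tuple[str, str]  # (document excerpt, expected contract type)


def accuracy(classify: Callable[[str], str], examples: List[LabeledExample]) -> float:
    """Fraction of held-out examples the classifier labels correctly."""
    if not examples:
        raise ValueError("evaluation set is empty")
    correct = sum(1 for text, expected in examples if classify(text) == expected)
    return correct / len(examples)


def pick_prompt(
    candidates: Dict[str, Callable[[str], str]],
    examples: List[LabeledExample],
    baseline: str,
) -> str:
    """Return the best-scoring version; the baseline wins ties."""
    scores = {version: accuracy(fn, examples) for version, fn in candidates.items()}
    best = baseline
    for version, score in scores.items():
        if score > scores[best]:
            best = version
    return best


# Stand-ins for "call the model with prompt v1 / v2" (assumption).
def classify_v1(text: str) -> str:
    return "Non-Disclosure Agreement" if "confidential" in text.lower() else "Other"


def classify_v2(text: str) -> str:
    if "subscription" in text.lower():
        return "SaaS Subscription Agreement"
    return classify_v1(text)


held_out = [
    ("Each party shall keep Confidential Information secret.", "Non-Disclosure Agreement"),
    ("The Customer's subscription term renews annually.", "SaaS Subscription Agreement"),
    ("Employee shall report to the CTO.", "Other"),
    ("Recipient must protect confidential data.", "Non-Disclosure Agreement"),
]

candidates = {"v1": classify_v1, "v2": classify_v2}
winner = pick_prompt(candidates, held_out, baseline="v1")
```

Keeping the baseline on ties avoids churning the production prompt for changes that do not measurably help.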