Data entry is the silent killer of productivity. It’s boring, error-prone, and takes time away from work that actually matters.
Let’s fix that with Python and Claude.
The Scenario
Imagine you receive PDFs, emails, and spreadsheets with customer information that needs to go into your CRM. The formats vary, the data quality is inconsistent, and it takes your team hours every day.
The Solution: Intelligent Parsing
Instead of writing rigid parsing rules, we use Claude to understand the intent of the data:
```python
import json
from pathlib import Path

import anthropic

client = anthropic.Anthropic()


def extract_customer_data(document_text: str) -> dict:
    """Extract structured customer data from unstructured text."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"""Extract customer information from this document.

Return JSON with these fields:
- name (string)
- email (string)
- phone (string, normalized to +1XXXXXXXXXX)
- company (string)
- notes (string, any relevant context)

Document:
{document_text}""",
        }],
    )
    # Assumes the reply is bare JSON; see the stricter variant below
    return json.loads(response.content[0].text)
```
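One caveat: `json.loads` will choke if Claude wraps the JSON in prose or a code fence. A cheap guard is to prefill the assistant turn with an opening brace, so the reply has to continue as bare JSON. Here’s a minimal sketch of that variant (the function name is mine):

```python
def extract_customer_data_strict(document_text: str) -> dict:
    """Like extract_customer_data, but prefill '{' so the reply
    must continue as bare JSON rather than prose around JSON."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        messages=[
            {"role": "user", "content": (
                "Extract customer information from this document. "
                "Return JSON with keys name, email, phone, company, notes.\n\n"
                f"Document:\n{document_text}"
            )},
            # Prefilled assistant turn: Claude continues from here
            {"role": "assistant", "content": "{"},
        ],
    )
    # Re-attach the prefilled brace before parsing
    return json.loads("{" + response.content[0].text)
```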
Handling Multiple Formats
The beauty of using Claude is format flexibility:
```python
def extract_text(file_path: Path) -> str:
    """Pull plain text out of PDFs, emails, spreadsheets, whatever."""
    if file_path.suffix == '.pdf':
        return extract_pdf_text(file_path)
    elif file_path.suffix == '.eml':
        return extract_email_text(file_path)
    elif file_path.suffix in ('.csv', '.xlsx'):
        return extract_spreadsheet_text(file_path)
    return file_path.read_text()


def process_any_format(file_path: Path) -> dict:
    """Handle PDFs, emails, CSVs, whatever."""
    # Claude handles the rest once we have plain text
    return extract_customer_data(extract_text(file_path))
```
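The three `extract_*_text` helpers are assumed above. Here’s one plausible set of implementations, using `pypdf` for PDFs, the standard-library `email` package for .eml files, and `csv`/`openpyxl` for spreadsheets. Treat this as a sketch and swap in whatever extraction libraries you already use:

```python
import csv
from email import policy
from email.parser import BytesParser

from pypdf import PdfReader          # pip install pypdf
from openpyxl import load_workbook   # pip install openpyxl


def extract_pdf_text(file_path: Path) -> str:
    """Concatenate the text layer of every page."""
    reader = PdfReader(file_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_email_text(file_path: Path) -> str:
    """Subject line plus the plain-text body of an .eml file."""
    with file_path.open("rb") as f:
        msg = BytesParser(policy=policy.default).parse(f)
    body = msg.get_body(preferencelist=("plain",))
    return f"Subject: {msg['subject'] or ''}\n\n{body.get_content() if body else ''}"


def extract_spreadsheet_text(file_path: Path) -> str:
    """Flatten rows into tab-separated lines Claude can read."""
    if file_path.suffix == '.csv':
        with file_path.open(newline="") as f:
            rows = list(csv.reader(f))
    else:  # .xlsx
        sheet = load_workbook(file_path, read_only=True).active
        rows = [["" if cell is None else str(cell) for cell in row]
                for row in sheet.iter_rows(values_only=True)]
    return "\n".join("\t".join(map(str, row)) for row in rows)
```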
Adding Validation
Trust, but verify. We validate Claude’s output before inserting it into the CRM:
```python
from pydantic import BaseModel, EmailStr, validator  # pydantic v1-style; on v2, use field_validator


class CustomerData(BaseModel):
    name: str
    email: EmailStr  # needs the email-validator extra: pip install "pydantic[email]"
    phone: str
    company: str
    notes: str = ""

    @validator('phone')
    def normalize_phone(cls, v):
        # Strip non-digits and format as +1XXXXXXXXXX
        digits = ''.join(c for c in v if c.isdigit())
        if len(digits) == 10:
            return f"+1{digits}"
        return v


def safe_extract(document_text: str) -> CustomerData | None:  # Python 3.10+ union syntax
    """Extract and validate in one step."""
    try:
        raw_data = extract_customer_data(document_text)
        return CustomerData(**raw_data)
    except Exception as e:
        log_error(f"Extraction failed: {e}")
        return None
```
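A quick smoke test with a fabricated snippet (the contact details are made up):

```python
sample = """
Hi team, please onboard our new contact:
Jane Doe, Head of Operations at Acme Corp.
Reach her at jane.doe@acme.com or (555) 123-4567.
"""

customer = safe_extract(sample)
if customer:
    print(customer.dict())
    # e.g. {'name': 'Jane Doe', 'email': 'jane.doe@acme.com',
    #       'phone': '+15551234567', 'company': 'Acme Corp', 'notes': '...'}
```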
The Complete Pipeline
Everything comes together in one loop that routes each file into the CRM, a processed folder, or an errors folder:

```python
async def process_inbox():
    """Process all new documents in the inbox."""
    inbox = Path("./inbox")
    processed = Path("./processed")
    errors = Path("./errors")
    for folder in (processed, errors):
        folder.mkdir(exist_ok=True)

    for file in inbox.iterdir():
        if not file.is_file():
            continue
        try:
            # Extract text per format, then extract + validate
            customer = safe_extract(extract_text(file))
            if customer:
                # Insert into CRM
                await crm.create_customer(customer.dict())
                # Move to processed
                file.rename(processed / file.name)
            else:
                # Move to errors for manual review
                file.rename(errors / file.name)
        except Exception as e:
            log_error(f"Failed to process {file}: {e}")
            file.rename(errors / file.name)
```
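`crm` and `log_error` are whatever your stack provides. For a runnable demo you can stub them and kick things off with `asyncio.run` (the stub below is purely hypothetical):

```python
import asyncio


def log_error(message: str) -> None:
    print(f"[error] {message}")


class StubCRM:
    """Stand-in for a real CRM client."""
    async def create_customer(self, data: dict) -> None:
        print(f"Would create CRM record for {data['name']}")


crm = StubCRM()

if __name__ == "__main__":
    asyncio.run(process_inbox())
```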
Results
For one client processing ~200 documents daily:
- Time spent: 6 hours → 15 minutes (review only)
- Error rate: 8% → 0.5%
- Consistency: 100% normalized formatting
Pro Tips
- Batch similar documents: Claude is faster with context
- Cache common patterns: don’t re-extract known formats (see the sketch below)
- Human review queue: low-confidence extractions go to humans
- Feedback loop: log corrections to improve prompts
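The caching tip is the easiest to bolt on: key each extraction by a hash of the document text so identical documents never hit the API twice. A minimal in-memory sketch:

```python
import hashlib

_extraction_cache: dict[str, dict] = {}


def cached_extract(document_text: str) -> dict:
    """Skip the API call when we've already seen this exact text."""
    key = hashlib.sha256(document_text.encode()).hexdigest()
    if key not in _extraction_cache:
        _extraction_cache[key] = extract_customer_data(document_text)
    return _extraction_cache[key]
```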
Drowning in data entry? Let’s automate it.