Data entry is the silent killer of productivity. It’s boring, error-prone, and takes time away from work that actually matters.

Let’s fix that with Python and Claude.

The Scenario

Imagine you receive PDFs, emails, and spreadsheets with customer information that needs to go into your CRM. The formats vary, the data quality is inconsistent, and it takes your team hours every day.

The Solution: Intelligent Parsing

Instead of writing rigid parsing rules, we use Claude to understand the intent of the data:

import json
from pathlib import Path

import anthropic

client = anthropic.Anthropic()

def extract_customer_data(document_text: str) -> dict:
    """
    Extract structured customer data from unstructured text.
    """
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"""Extract customer information from this document.
            Return only a JSON object (no prose, no code fences) with these fields:
            - name (string)
            - email (string)
            - phone (string, normalized to +1XXXXXXXXXX)
            - company (string)
            - notes (string, any relevant context)

            Document:
            {document_text}
            """
        }]
    )

    return json.loads(response.content[0].text)
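One practical hedge: even when asked for bare JSON, Claude occasionally wraps its reply in a markdown code fence, which makes a naive json.loads call throw. A small helper (my own addition, not part of the anthropic SDK) makes the parse step more forgiving:

```python
import json

def parse_json_reply(text: str) -> dict:
    """Parse a JSON object from a model reply, tolerating markdown fences."""
    lines = [
        line for line in text.strip().splitlines()
        if not line.strip().startswith("`")  # drop any fence lines
    ]
    return json.loads("\n".join(lines))
```

Swap this in for the raw json.loads if you see intermittent parse failures in your logs.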

Handling Multiple Formats

The beauty of using Claude is format flexibility:

def process_any_format(file_path: Path) -> dict:
    """Handle PDFs, emails, CSVs, whatever."""

    # Extract text based on file type
    if file_path.suffix == '.pdf':
        text = extract_pdf_text(file_path)
    elif file_path.suffix == '.eml':
        text = extract_email_text(file_path)
    elif file_path.suffix in ['.csv', '.xlsx']:
        text = extract_spreadsheet_text(file_path)
    else:
        text = file_path.read_text()

    # Claude handles the rest
    return extract_customer_data(text)
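The extract_* helpers are left to your stack (pdfplumber for PDFs, the stdlib email module for .eml, and so on), but the spreadsheet case is simple enough to sketch with the stdlib alone. A minimal CSV-only version of extract_customer_data's input stage, assuming a header row:

```python
import csv
from pathlib import Path

def extract_spreadsheet_text(file_path: Path) -> str:
    """Flatten a CSV (header row assumed) into labeled lines of text."""
    with file_path.open(newline="") as f:
        rows = csv.DictReader(f)
        # "name: Ada, email: ada@example.com" reads naturally to Claude
        return "\n".join(
            ", ".join(f"{key}: {value}" for key, value in row.items())
            for row in rows
        )
```

Real spreadsheets (.xlsx, merged cells, multiple sheets) need openpyxl or pandas; this covers the happy path.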

Adding Validation

Trust but verify. We validate Claude’s output before inserting:

from pydantic import BaseModel, EmailStr, field_validator

class CustomerData(BaseModel):
    name: str
    email: EmailStr
    phone: str
    company: str
    notes: str = ""

    @field_validator('phone')
    @classmethod
    def normalize_phone(cls, v: str) -> str:
        # Strip non-digits, then prepend the US country code
        digits = ''.join(c for c in v if c.isdigit())
        if len(digits) == 10:
            return f"+1{digits}"
        return v

def safe_extract(document_text: str) -> CustomerData | None:
    """Extract and validate in one step."""
    try:
        raw_data = extract_customer_data(document_text)
        return CustomerData(**raw_data)
    except Exception as e:
        log_error(f"Extraction failed: {e}")
        return None
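The validator above handles bare 10-digit numbers; real documents also arrive with a leading 1 and assorted punctuation. A standalone version of the same normalization, extended for that case (my extension, not part of the model above):

```python
def normalize_phone(raw: str) -> str:
    """Normalize US phone numbers to +1XXXXXXXXXX where possible."""
    digits = "".join(c for c in raw if c.isdigit())
    if len(digits) == 10:
        return f"+1{digits}"
    if len(digits) == 11 and digits.startswith("1"):
        return f"+{digits}"
    # Anything else (international, malformed) is left untouched
    # so it surfaces in manual review rather than being mangled
    return raw
```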

The Complete Pipeline

async def process_inbox():
    """Process all new documents in the inbox."""
    inbox = Path("./inbox")
    processed = Path("./processed")
    errors = Path("./errors")
    processed.mkdir(exist_ok=True)
    errors.mkdir(exist_ok=True)

    for file in inbox.iterdir():
        if not file.is_file():
            continue
        try:
            # Format-aware extraction, then validation
            raw_data = process_any_format(file)
            customer = CustomerData(**raw_data)

            # Insert into CRM (crm: your async CRM client,
            # configured elsewhere)
            await crm.create_customer(customer.dict())

            # Move to processed
            file.rename(processed / file.name)

        except Exception as e:
            # Extraction and validation failures both land in
            # errors/ for manual review
            log_error(f"Failed to process {file}: {e}")
            file.rename(errors / file.name)

Results

For one client processing ~200 documents daily:

  • Time spent: 6 hours → 15 minutes (review only)
  • Error rate: 8% → 0.5%
  • Consistency: 100% normalized formatting

Pro Tips

  1. Batch similar documents - Claude is faster with context
  2. Cache common patterns - Don’t re-extract known formats
  3. Human review queue - Low-confidence extractions go to humans
  4. Feedback loop - Log corrections to improve prompts
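Tip 2 is worth a sketch. Keying a small on-disk cache on a hash of the document text means an identical document never hits the API twice (cached_extract is a hypothetical helper; pass it your real extractor):

```python
import hashlib
import json
from pathlib import Path

def cached_extract(document_text: str, extract, cache_dir: Path) -> dict:
    """Run extract() once per unique document, caching results on disk."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(document_text.encode()).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        # Seen this exact document before -- skip the API call
        return json.loads(cache_file.read_text())
    result = extract(document_text)
    cache_file.write_text(json.dumps(result))
    return result
```

Note this only catches byte-identical repeats; near-duplicates still go to the model.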

Drowning in data entry? Let’s automate it.