
Google LangExtract: Step-by-Step Tutorial, Examples, and Use Cases

Gagan Goswami
August 04, 2025
5 min read

LangExtract is an open-source Python library from Google that uses large language models (LLMs) to extract structured information from unstructured text, with precise source grounding and an interactive visualization of the results.

Step-by-Step Tutorial: Getting Started with LangExtract

Step 1: Set Up Your Environment

  • Install Python 3.10 or 3.11.
  • Install LangExtract using pip.
  • For Mac users, install libmagic.
  • Set the API key for cloud models.
  • Verify installation.
bash
python -m venv langextract_env
source langextract_env/bin/activate
pip install langextract
brew install libmagic  # Mac users
export LANGEXTRACT_API_KEY="your-google-ai-studio-key"
python -c "import langextract; print(langextract.__version__)"

Step 2: Understand Core Concepts

  • Define what to extract: an extraction schema (expressed here as JSON Schema) plus a few few-shot examples that show the model the expected output.
  • Configure the extractor with the model, chunk size, and multi-pass settings.
  • Handle outputs as JSON/JSONL and as an interactive HTML visualization, as shown in the sketch after this list.

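The exact result object depends on your schema and library version, but every extraction is grounded to the span of text it came from. As an illustrative shape only (not the library's exact output format), a grounded record might look like this:

python
# Illustrative only: shows the idea of source grounding, not the exact output format.
text = "Gagan Goswami works as a Lead Software Engineer in Columbus, Ohio."

result = {
    "entities": [
        {"name": "Gagan Goswami", "type": "person", "source_offset": 0},
        {"name": "Columbus, Ohio", "type": "location", "source_offset": 51},
    ]
}

# Source grounding means each extraction can be verified against the original text.
for entity in result["entities"]:
    start = entity["source_offset"]
    assert text[start:].startswith(entity["name"])
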
Step 3: Run Your First Extraction

python
import langextract as le

# JSON schema describing the entities to extract. The source_offset field keeps
# each extraction grounded to its character position in the input text.
schema = {
    "type": "object",
    "properties": {
        "entities": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "type": {"type": "string", "enum": ["person", "organization", "location"]},
                    "source_offset": {"type": "integer"}
                }
            }
        }
    }
}

text = "Gagan Goswami works as a Lead Software Engineer in Columbus, Ohio, at a tech firm focused on AWS and Gen AI."

# Run the extraction and write an interactive HTML view for manual verification.
extractor = le.Extractor(model="gemini-2.5-flash", schema=schema)
results = extractor.extract(text)
print(results)
le.visualize(results, output_file="first_extraction.html")

Run It: python first_extraction.py

What Happens: The LLM extracts entities like {"name": "Gagan Goswami", "type": "person", "source_offset": 0}, grounded to the exact text position. The HTML file lets you hover over extractions for verification.

Step 4: Scale for Production

  • Handle Long Documents: Set chunk_size=10000 and max_passes=2 on the extractor for documents over roughly 25,000 words.
  • Parallel Processing: Use max_workers=8 for multi-threaded extraction; this scales well on larger AWS EC2 instances.
  • Dockerize for AWS Deployment:
dockerfile
FROM python:3.11-slim
RUN pip install langextract
COPY your_script.py .
CMD ["python", "your_script.py"]

Build and run:

bash
docker build -t langextract-app .
docker run -e LANGEXTRACT_API_KEY=your-key langextract-app
  • Integration Tip: From your Java/Spring apps, call this Python service over REST or through AWS Lambda for Agentic AI flows (see the sketch below).
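
One lightweight way to expose the extraction to a Spring client is a small HTTP wrapper. The sketch below uses Flask; run_extraction() is a hypothetical helper that wraps whichever LangExtract call you use above, not a library function:

python
# Minimal REST wrapper sketch (Flask). run_extraction() is a hypothetical helper
# around your LangExtract extraction code from the earlier steps.
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_extraction(text: str) -> dict:
    # Placeholder: call LangExtract here and return a JSON-serializable result.
    raise NotImplementedError

@app.route("/extract", methods=["POST"])
def extract_endpoint():
    payload = request.get_json(force=True)
    text = payload.get("text", "")
    if not text:
        return jsonify({"error": "missing 'text' field"}), 400
    return jsonify(run_extraction(text))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

A Spring client can then POST {"text": "..."} to /extract and map the JSON response onto its own DTOs.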

Step 5: Troubleshoot and Optimize

  • Add more few-shot examples when extractions miss fields or drift from the schema.
  • Monitor API costs, or switch to a locally hosted model for high-volume or sensitive workloads.
  • Build a small test suite that checks extractions against documents with known answers (see the sketch below).
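
For the validation bullet, a small pytest file that runs documents with known answers through your pipeline is usually enough. extract_entities() below is a hypothetical wrapper around your LangExtract call, not part of the library:

python
# test_extraction.py -- a sketch; extract_entities() is a hypothetical wrapper
# around whatever extraction code your pipeline uses.
from my_pipeline import extract_entities  # hypothetical module

KNOWN_TEXT = "Gagan Goswami works as a Lead Software Engineer in Columbus, Ohio."

def test_known_entities_are_found():
    result = extract_entities(KNOWN_TEXT)
    names = {e["name"] for e in result["entities"]}
    assert "Gagan Goswami" in names
    assert "Columbus, Ohio" in names

def test_offsets_are_grounded():
    result = extract_entities(KNOWN_TEXT)
    for entity in result["entities"]:
        start = entity["source_offset"]
        assert KNOWN_TEXT[start:].startswith(entity["name"])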

Practical Examples with Code

Example 1: Extracting Technical Specs from AWS Docs

python
import langextract as le

schema = {
    "type": "object",
    "properties": {
        "instances": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "type": {"type": "string"},
                    "vCPU": {"type": "integer"},
                    "cost_per_hour": {"type": "number"},
                    "source_start": {"type": "integer"}
                }
            }
        }
    }
}

text = "EC2 t3.micro offers 2 vCPU at $0.0104 per hour. m5.large provides 2 vCPU at $0.096 per hour."

extractor = le.Extractor(model="gemini-2.5-flash", schema=schema, chunk_size=2000)
results = extractor.extract(text)
le.export_to_jsonl(results, "aws_specs.jsonl")

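Once the results are in JSONL, downstream processing needs nothing beyond the standard library. A sketch, assuming each line mirrors the schema above with one JSON object per line:

python
# Read the exported JSONL and pick out the cheapest instance type.
import json

records = []
with open("aws_specs.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            records.append(json.loads(line))

instances = [inst for rec in records for inst in rec.get("instances", [])]
if instances:
    cheapest = min(instances, key=lambda i: i["cost_per_hour"])
    print(f"Cheapest: {cheapest['type']} at ${cheapest['cost_per_hour']}/hour")
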
Example 2: Multi-Pass Medication Extraction from Clinical Notes

python
import langextract as le

schema = {
    "type": "object",
    "properties": {
        "medications": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "drug_name": {"type": "string"},
                    "dosage": {"type": "string"},
                    "source_end": {"type": "integer"}
                }
            }
        }
    }
}

text = "Patient prescribed aspirin 81mg daily and metformin 500mg twice daily for diabetes management."

extractor = le.Extractor(model="gemini-2.5-pro", schema=schema, max_passes=2, max_workers=4)
results = extractor.extract(text)
le.visualize(results, "meds_visual.html")

Example 3: Relationship Mapping in Research Papers

python
import langextract as le

schema = {
    "type": "object",
    "properties": {
        "findings": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "hypothesis": {"type": "string"},
                    "result": {"type": "string"},
                    "citation_offset": {"type": "integer"}
                }
            }
        }
    }
}

text = "Study shows Gen AI improves efficiency by 40% (Smith et al., 2025). However, challenges remain in scalability."

extractor = le.Extractor(model="gemini-2.5-flash", schema=schema, use_world_knowledge=True)
results = extractor.extract(text)
print(results)

Use Cases Across Domains

Healthcare - Radiology Report Analysis

Apply LangExtract to structure X-ray reports for AI-assisted diagnostics. Define a schema for findings (e.g., "nodule size", "location"), process batches of reports, and visualize for radiologists. In your Gen AI work, integrate this with AWS S3 for storing visualized HTML outputs, potentially reducing report review time in clinical trials.
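
Pushing the generated HTML to S3 is a one-call job with boto3. A sketch, assuming a bucket you already own and standard AWS credentials on the instance or role:

python
# Upload a generated visualization to S3 so reviewers can open it from a signed URL.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="radiology_report.html",             # assumption: HTML produced by the visualize step
    Bucket="my-langextract-reports",              # assumption: your bucket name
    Key="radiology/2025-08-04/report-123.html",   # assumption: your key layout
    ExtraArgs={"ContentType": "text/html"},
)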

Legal - Contract Risk Assessment

Automate risk assessment in contracts by extracting clauses such as "termination rights" or "liability limits." Use multi-pass extraction on long PDFs (convert them to text first, as in the sketch below), export to JSONL, and feed the results into a Spring-based dashboard. This scales well for due diligence in enterprise mergers.
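
Contracts usually arrive as PDFs, so a text-conversion step comes first. A sketch using pypdf (an assumption; any PDF-to-text tool works), after which the plain text goes through the same multi-pass extraction shown in Example 2:

python
# Convert a contract PDF to plain text before extraction (pip install pypdf).
from pypdf import PdfReader

def pdf_to_text(path: str) -> str:
    reader = PdfReader(path)
    pages = [page.extract_text() or "" for page in reader.pages]
    return "\n".join(pages)

contract_text = pdf_to_text("master_services_agreement.pdf")  # assumption: your file
# contract_text can now be passed to the extractor with multi-pass settings.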

Research and Gen AI - Literature Review Automation

Extract hypotheses, methods, and results from AI papers to build knowledge graphs. Process a folder of PDFs, fan the work across parallel workers on AWS EC2, and visualize the relationships. Ideal for your Agentic AI projects, such as summarizing how frameworks like Spring have evolved.
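
Fanning a folder of converted papers across worker threads needs only the standard library. extract_findings() below is a hypothetical wrapper around the extraction call from Example 3:

python
# Process a folder of paper texts in parallel. extract_findings() is a
# hypothetical wrapper around your extraction code.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def extract_findings(text: str) -> dict:
    raise NotImplementedError  # call LangExtract here

paper_dir = Path("papers_txt")                    # assumption: pre-converted .txt files
paths = sorted(paper_dir.glob("*.txt"))
texts = [p.read_text(encoding="utf-8") for p in paths]

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(extract_findings, texts))

for path, result in zip(paths, results):
    print(path.name, len(result.get("findings", [])), "findings")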

Business - Customer Feedback Analytics

Analyze support tickets to extract sentiments and issues. Define a schema for "complaint type" and "resolution," with each extraction grounded to the original customer quote. Deploy as a Lambda function for real-time processing in your AWS setup, turning feedback into actionable metrics.
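
A Lambda front end for ticket analysis can be a single handler. A sketch for an API Gateway proxy integration; analyze_ticket() is a hypothetical wrapper around the extraction call, and the event shape is an assumption:

python
# AWS Lambda handler sketch for real-time ticket analysis (API Gateway proxy event).
import json

def analyze_ticket(text: str) -> dict:
    raise NotImplementedError  # call LangExtract here

def lambda_handler(event, context):
    body = json.loads(event.get("body") or "{}")
    text = body.get("ticket_text", "")
    if not text:
        return {"statusCode": 400, "body": json.dumps({"error": "missing ticket_text"})}
    return {"statusCode": 200, "body": json.dumps(analyze_ticket(text))}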

These steps, examples, and use cases should give you a solid foundation to experiment with LangExtract in your tech stack. If you're building agents, try chaining extractions with other Gen AI tools for more complex workflows. Let me know if you need tweaks for specific integrations!
