Skip to content

Extracting Data with extract() and extract_multiple()

The extraction methods provide powerful ways to parse structured data from text, constructing typed objects from natural language descriptions. These methods can work with any Python class that has a proper constructor.

Basic Usage

Single Instance Extraction

class Person:
    def __init__(self, name: str, age: int | None = None):
        self.name = name
        self.age = age

text = "John is a 30-year-old software developer."
person = await agent.talk.extract(text, as_type=Person)
print(f"Found: {person.name}, {person.age}")  # "John", 30

Multiple Instances

text = """
Team members:
- Alice (25) leads development
- Bob (32) handles testing
- Carol (28) manages deployment
"""
team = await agent.talk.extract_multiple(text, as_type=Person)
for member in team:
    print(f"Member: {member.name}, {member.age}")

Extraction Modes

Both methods support two extraction modes:

Structured Mode (Default)

# More robust, single-round extraction
result = await agent.talk.extract(
    text,
    as_type=Person,
    mode="structured"  # default
)

Tool Calls Mode

# Legacy mode, might be useful for experimentation
result = await agent.talk.extract(
    text,
    as_type=Person,
    mode="tool_calls"
)

Working with Complex Types

Nested Objects

class Address:
    def __init__(self, street: str, city: str):
        self.street = street
        self.city = city

class Employee:
    def __init__(self, name: str, address: Address):
        self.name = name
        self.address = address

text = "Alice lives at 123 Main St in New York"
employee = await agent.talk.extract(text, as_type=Employee)

Custom Prompts

You can provide custom prompts for more specific extraction:

result = await agent.talk.extract(
    text,
    as_type=Person,
    prompt="""
    Extract person information focusing on:
    1. Full name (first and last)
    2. Exact age if mentioned
    3. Ignore titles or honorifics
    """
)

For multiple extraction:

results = await agent.talk.extract_multiple(
    text,
    as_type=Person,
    prompt="""
    Extract all team members mentioned.
    Look for:
    - Names in any format
    - Ages if provided
    - Skip generic references
    """
)

Technical Details: How It Works

Structured Mode (Default)

The structured mode works by converting the target class's constructor into a Pydantic model and using it for validation:

  1. Constructor Analysis

    # Your class
    class Person:
        def __init__(self, name: str, age: int | None = None):
            self.name = name
            self.age = age
    
    # Internally converted to Pydantic model
    class PersonModel(BaseModel):
        name: str
        age: int | None = None
    

  2. Response Container

    # For single extraction
    class Extraction(BaseModel):
        instance: PersonModel
    
    # For multiple extraction
    class Extraction(BaseModel):
        instances: list[PersonModel]
    

  3. LLM Interaction

  4. Single structured response from LLM

  5. LLM sees complete schema upfront
  6. Natural mapping of fields
  7. One-shot validation through Pydantic

  8. Instance Creation

# Convert validated model to target class
result = as_type(**model_instance.model_dump())

Benefits: - Single round-trip to LLM - Clear schema validation - Better handling of complex types - More robust extraction

Tool Calls Mode (Legacy)

The tool calls mode uses function calling to construct objects:

  1. Constructor to Function Schema

    # Your class
    class Person:
        def __init__(self, name: str, age: int | None = None):
            self.name = name
            self.age = age
    
    # Converted to OpenAI function schema
    {
        "name": "Person",
        "description": "Create a Person instance",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer", "nullable": true}
            },
            "required": ["name"]
        }
    }
    

  2. Constructor Tool

async def construct(**kwargs: Any) -> T:
    """Dynamic constructor tool."""
    return as_type(**kwargs)
  1. LLM Interaction

  2. LLM makes function calls

  3. For multiple: Repeated calls to add instances
  4. Each call validated separately
  5. State maintained between calls

  6. Instance Creation

  7. Through tool call execution

  8. One instance per call
  9. Results collected in list (for multiple)

Trade-offs:

  • More flexible but less robust
  • Multiple round-trips for multiple instances
  • Harder to maintain complex state
  • More points of potential failure

When to Use Which

This is a tough question. The Tool call mechanism was implemented first by MarvinAI, but seems less robust compared to our new StructuredResponse approach. It doesnt include a sequence of tool calls and will use the validation capabilites of Pydantic-AI in case that is the provider chosen.