[Docs]: How does cognify() handle entity property conflicts and missing properties during deduplication? #1831

@schmelli

Description

Documentation Type

Missing documentation

Documentation Location

📚 Related

Issue Description

📋 Summary

The documentation states that cognify() with incremental_loading=True skips already-processed data, but it does not explain how deduplication handles updates to an existing entity when:

  • A new data ingestion adds missing properties to an existing entity
  • A new data ingestion provides different values for existing properties

🎯 Use Case

I'm building a knowledge graph for specific backpacking products where product specifications come from multiple sources (manufacturer specs, reviews, retailer data). A typical workflow would be:

# Initial ingestion - partial data
await cognee.add({
    "name": "Osprey Atmos 65",
    "type": "backpack",
    "weight": 2040  # grams
}, "gear_products")
await cognee.cognify()

# Later ingestion - additional/conflicting data
await cognee.add({
    "name": "Osprey Atmos 65",
    "type": "backpack",
    "capacity": 65,  # NEW property
    "weight": 2100   # CONFLICTING value
}, "gear_products")
await cognee.cognify(incremental_loading=True)

❓ Questions

Scenario A - Missing Properties:
If an entity already exists but is missing certain properties, will cognify() add the missing properties to the existing entity?

Scenario B - Conflicting Values:
If an entity already exists with a property value that differs from new data, which strategy does Cognee use?

  • Last Write Wins: New value overwrites old value
  • First Write Wins: Old value is preserved, new value ignored
  • Merge/Append: Both values are somehow preserved (e.g., as list)
  • Error/Skip: The entire entity is skipped due to conflict
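
To make the four candidate strategies concrete, here is a minimal sketch in plain Python. It is not Cognee's implementation: `merge_properties` and the strategy names are hypothetical, chosen only to mirror the list above. The sketch assumes that properties missing from the existing entity are always added (Scenario A) and that a strategy is consulted only when two values actually differ.

```python
from typing import Any


def merge_properties(
    existing: dict[str, Any],
    incoming: dict[str, Any],
    strategy: str,
) -> dict[str, Any]:
    """Hypothetical helper: merge `incoming` into a copy of `existing`."""
    merged = dict(existing)  # never mutate the caller's dict
    for key, new_value in incoming.items():
        if key not in merged:
            # Scenario A: the property is missing, so add it.
            merged[key] = new_value
            continue
        old_value = merged[key]
        if old_value == new_value:
            continue  # identical values, nothing to resolve
        # Scenario B: the values differ, so apply the chosen strategy.
        if strategy == "last_write_wins":
            merged[key] = new_value
        elif strategy == "first_write_wins":
            continue  # keep the old value
        elif strategy == "append":
            values = old_value if isinstance(old_value, list) else [old_value]
            merged[key] = values if new_value in values else [*values, new_value]
        elif strategy == "error":
            raise ValueError(f"conflict on {key!r}: {old_value!r} != {new_value!r}")
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
    return merged


existing = {"name": "Osprey Atmos 65", "type": "backpack", "weight": 2040}
incoming = {"name": "Osprey Atmos 65", "type": "backpack", "capacity": 65, "weight": 2100}

lww = merge_properties(existing, incoming, "last_write_wins")
fww = merge_properties(existing, incoming, "first_write_wins")
appended = merge_properties(existing, incoming, "append")
```

With the backpack example, `lww` ends up with weight 2100, `fww` keeps 2040, and `appended` holds both values as a list; all three gain the new capacity property. The "error" variant here raises instead of skipping the entity, which is one of several ways that option could be interpreted.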

Scenario C - Partial Updates:
Is there a way to explicitly update specific properties of an existing entity without re-processing the entire dataset?

🔍 Current Understanding

From the documentation and code inspection:

  • add_data_points() calls deduplicate_nodes_and_edges(nodes, edges) before inserting into the graph DB
  • DataPoints receive UUID, version, and timestamp
  • The Redis blog post mentions "deduplicates identical entities"

However, the exact merge strategy for non-identical entities (same ID, different properties) is not documented.
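
The gap matters because a simple ID-keyed deduplication can silently drop data. The sketch below is an assumption for illustration only: `Node` and `dedupe_keep_first` are hypothetical stand-ins, not Cognee's DataPoint or its deduplicate_nodes_and_edges. It shows what would happen if deduplication kept only the first node seen for each ID.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:
    """Hypothetical stand-in for a graph node; not Cognee's DataPoint."""
    id: str
    properties: dict[str, Any] = field(default_factory=dict)


def dedupe_keep_first(nodes: list[Node]) -> list[Node]:
    """Keep the first node seen for each ID and discard later ones."""
    first_by_id: dict[str, Node] = {}
    for node in nodes:
        # setdefault stores the node only if the ID has not been seen yet.
        first_by_id.setdefault(node.id, node)
    return list(first_by_id.values())


batch = [
    Node("product-1", {"name": "Test", "weight": 500}),
    Node("product-1", {"capacity": 40}),  # same ID, different properties
    Node("product-2", {"name": "Other"}),
]
deduped = dedupe_keep_first(batch)
```

Under this assumption the capacity value never reaches the graph, which is the first-write-wins outcome from Scenario B. Whether Cognee behaves this way or merges the properties is what this issue asks to have documented.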

Suggested Improvement

💡 Suggestions

This behavior should ideally be:

  1. Documented in the core concepts or cognify operation documentation
  2. Configurable via a parameter (e.g., merge_strategy="update"|"ignore"|"error")
  3. Testable with clear examples in the test suite

🧪 Proposed Test Cases

# Test 1: Missing property addition
await cognee.add({"id": "product-1", "name": "Test", "weight": 500})
await cognee.cognify()

await cognee.add({"id": "product-1", "capacity": 40})
await cognee.cognify(incremental_loading=True)

result = await cognee.search("product-1 properties")
# Expected: Should the entity have both weight AND capacity?

# Test 2: Property value conflict
await cognee.add({"id": "product-1", "name": "Test", "weight": 500})
await cognee.cognify()

await cognee.add({"id": "product-1", "name": "Test", "weight": 520})
await cognee.cognify(incremental_loading=True)

result = await cognee.search("product-1 weight")
# Expected: Is weight 500, 520, or [500, 520]?

Additional Context

🏷️ Environment

  • Cognee version: 0.4.0
  • Graph DB: Memgraph
  • Use case: Multi-source product knowledge graph

Pre-submission Checklist

  • I have searched existing issues to ensure this documentation issue hasn't been reported already
  • I have provided a clear description of the documentation issue
  • I have specified the location of the documentation issue

Metadata

Labels

documentation: Improvements or additions to documentation
