Description
Documentation Type
Missing documentation
Documentation Location
📚 Related
- Documentation: https://docs.cognee.ai/core-concepts/main-operations/cognify
- Code reference: cognee/modules/graph/utils/deduplicate_nodes_and_edges
- Related to incremental loading feature
Issue Description
📋 Summary
The documentation states that cognify() with incremental_loading=True skips already processed data, but it's unclear how the deduplication logic handles entity updates when:
- A new data ingestion adds missing properties to an existing entity
- A new data ingestion provides different values for existing properties
🎯 Use Case
I'm building a knowledge graph for specific backpacking products where product specifications come from multiple sources (manufacturer specs, reviews, retailer data). A typical workflow would be:
# Initial ingestion - partial data
await cognee.add({
"name": "Osprey Atmos 65",
"type": "backpack",
"weight": 2040 # grams
}, "gear_products")
await cognee.cognify()
# Later ingestion - additional/conflicting data
await cognee.add({
"name": "Osprey Atmos 65",
"type": "backpack",
"capacity": 65, # NEW property
"weight": 2100 # CONFLICTING value
}, "gear_products")
await cognee.cognify(incremental_loading=True)
❓ Questions
Scenario A - Missing Properties:
If an entity already exists but is missing certain properties, will cognify() add the missing properties to the existing entity?
Scenario B - Conflicting Values:
If an entity already exists with a property value that differs from new data, which strategy does Cognee use?
- Last Write Wins: New value overwrites old value
- First Write Wins: Old value is preserved, new value ignored
- Merge/Append: Both values are somehow preserved (e.g., as list)
- Error/Skip: The entire entity is skipped due to conflict
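To make the four options concrete, here is a minimal, library-independent sketch of what each one would produce for the backpack example. `merge_properties` and the strategy names are hypothetical, invented for this issue; they are not part of Cognee's API and say nothing about what Cognee actually does.

```python
from typing import Any


def merge_properties(
    existing: dict[str, Any], incoming: dict[str, Any], strategy: str
) -> dict[str, Any]:
    """Merge two property dicts for the same entity.

    Hypothetical helper for illustration only; not Cognee's implementation.
    """
    # Keys present on both sides whose values differ.
    conflicts = {
        key for key in existing.keys() & incoming.keys()
        if existing[key] != incoming[key]
    }
    if strategy == "last_write_wins":
        # Incoming values overwrite existing ones; new keys are added.
        return {**existing, **incoming}
    if strategy == "first_write_wins":
        # Existing values are kept; only missing keys are added.
        return {**incoming, **existing}
    if strategy == "append":
        # Conflicting values are preserved side by side as a list.
        merged = {**existing, **incoming}
        for key in conflicts:
            merged[key] = [existing[key], incoming[key]]
        return merged
    if strategy == "skip":
        # Any conflict rejects the whole incoming record.
        return dict(existing) if conflicts else {**existing, **incoming}
    raise ValueError(f"unknown strategy: {strategy}")


old = {"name": "Osprey Atmos 65", "type": "backpack", "weight": 2040}
new = {"name": "Osprey Atmos 65", "type": "backpack", "capacity": 65, "weight": 2100}

for name in ("last_write_wins", "first_write_wins", "append", "skip"):
    print(name, merge_properties(old, new, name))
```

Note that in this sketch "first write wins" still adds the missing `capacity` property, while "skip" drops it along with the conflicting `weight`. Whether Scenario A and Scenario B are coupled like this in Cognee is part of what I'd like the docs to clarify.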
Scenario C - Partial Updates:
Is there a way to explicitly update specific properties of an existing entity without re-processing the entire dataset?
🔍 Current Understanding
From the documentation and code inspection:
- add_data_points() calls deduplicate_nodes_and_edges(nodes, edges) before inserting into the graph DB
- DataPoints receive a UUID, version, and timestamp
- The Redis blog post mentions "deduplicates identical entities"
However, the exact merge strategy for non-identical entities (same ID, different properties) is not documented.
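To show why this gap matters, consider a deduplication step keyed only on ID. The sketch below is an assumption made for illustration: `dedupe_by_id` is a hypothetical helper and does not reproduce the actual `deduplicate_nodes_and_edges` code. If the real logic behaves anything like it, the second record's properties would be dropped silently.

```python
from typing import Any


def dedupe_by_id(nodes: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the first node seen for each ID.

    Hypothetical helper for illustration only; not Cognee's implementation.
    """
    seen: dict[str, dict[str, Any]] = {}
    for node in nodes:
        # setdefault stores the node only if this ID has not been seen yet.
        seen.setdefault(node["id"], node)
    return list(seen.values())


nodes = [
    {"id": "product-1", "name": "Test", "weight": 500},
    {"id": "product-1", "capacity": 40},  # same ID, different properties
]
result = dedupe_by_id(nodes)
print(result)  # the "capacity" property from the second record is gone
```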
Suggested Improvement
💡 Suggestions
This behavior should ideally be:
- Documented in the core concepts or cognify operation documentation
- Configurable via a parameter (e.g., merge_strategy="update"|"ignore"|"error")
- Testable with clear examples in the test suite
🧪 Proposed Test Cases
# Test 1: Missing property addition
await cognee.add({"id": "product-1", "name": "Test", "weight": 500})
await cognee.cognify()
await cognee.add({"id": "product-1", "capacity": 40})
await cognee.cognify(incremental_loading=True)
result = await cognee.search("product-1 properties")
# Expected: Should entity have both weight AND capacity?
# Test 2: Property value conflict
await cognee.add({"id": "product-1", "name": "Test", "weight": 500})
await cognee.cognify()
await cognee.add({"id": "product-1", "name": "Test", "weight": 520})
await cognee.cognify(incremental_loading=True)
result = await cognee.search("product-1 weight")
# Expected: Is weight 500, 520, or [500, 520]?
Additional Context
🏷️ Environment
- Cognee version: 0.4.0
- Graph DB: Memgraph
- Use case: Multi-source product knowledge graph
Pre-submission Checklist
- I have searched existing issues to ensure this documentation issue hasn't been reported already
- I have provided a clear description of the documentation issue
- I have specified the location of the documentation issue