Bookmark Delta ETL Pipeline

An ETL pipeline for bookmark data that implements delta loading, inspired by j-94/deltaload, with AI enrichment through the Vercel AI SDK.

Features

🔄 Delta Load ETL

Change Detection: Tracks changes using content hashing
Incremental Processing: Only processes new or modified bookmarks
State Management: Maintains processing state between runs
Snapshot Recovery: Creates snapshots for data recovery
Change History: Logs all additions, modifications, and deletions
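
As a rough illustration of hash-based change detection, the sketch below hashes a canonical form of each bookmark and compares it with the hash stored from the previous run. The `Bookmark` shape, `contentHash`, and `needsProcessing` are hypothetical names chosen for this example, not the pipeline's actual API, and the choice of hashed fields is an assumption.

```typescript
import { createHash } from "node:crypto";

// Hypothetical minimal bookmark shape; the real schema has more fields.
interface Bookmark {
  url: string;
  title: string;
  tags?: string[];
}

// Hash only the fields that should count as a "change" (an assumption here).
// Tags are sorted so that reordering them does not alter the hash.
function contentHash(b: Bookmark): string {
  const canonical = JSON.stringify({
    url: b.url,
    title: b.title,
    tags: [...(b.tags ?? [])].sort(),
  });
  return createHash("sha256").update(canonical).digest("hex");
}

// A bookmark needs processing if it is new or its hash differs from the stored one.
function needsProcessing(b: Bookmark, previous: Map<string, string>): boolean {
  return previous.get(b.url) !== contentHash(b);
}
```

Keying the stored hashes by URL keeps the lookup cheap; a pipeline that allows duplicate URLs would need a different identifier.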

🤖 AI Enrichment

Content Analysis: Deep analysis of webpage content
Topic Extraction: Identifies main topics and entities
Social Signals: Analyzes virality and engagement potential
Temporal Analysis: Tracks content lifecycle and relevance
Quality Scoring: Rates content quality and uniqueness
Clustering: Groups similar bookmarks automatically

📊 DocETL Integration

Batch Processing: Efficient processing of large datasets
Intelligence Reports: Generates comprehensive insights
Pattern Recognition: Identifies content patterns and gaps
Recommendation Engine: Suggests curation actions

Installation

cd bookmark-delta-etl
bun install
bun run check

Usage

Basic ETL Run

bun run etl bookmarks.json

Full Enrichment

bun run etl:full bookmarks.json

Options

--no-ai: Disable AI enrichment
--no-docetl: Disable DocETL processing
--output <dir>: Output directory (default: ./bookmark-delta-data)
--content: Include webpage content fetching
--social: Include social signal analysis
--temporal: Include temporal analysis
--batch <size>: AI enrichment batch size (default: 10)

Examples

# Process multiple files
bun run etl bookmarks.json raindrop-export.json

# AI-only enrichment with small batches
bun run etl --no-docetl --batch 5 *.json

# Full analysis with custom output
bun run etl --output ./my-data --content --social bookmarks.csv

Data Sources & .env

Copy .env.example to .env and fill in what you have:

RAINDROP_API_KEY (Raindrop.io REST)
GITHUB_TOKEN (GitHub PAT for starred repos)
X_CT0, X_AUTH_TOKEN, X_KDT (optional; X.com cookies for Grok scraping)
OPENAI_API_KEY, ANTHROPIC_API_KEY (optional; enrichment)
UNIFIED_CHAT_PATH (optional; path to your unified chat JSON)

Then check credentials and guidance:

bun check-credentials.ts --setup --test
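
To give a sense of what a credential check involves, here is a minimal sketch that reports which sources have all of their variables set. It is not the code in check-credentials.ts; `REQUIRED_BY_SOURCE` and `availableSources` are hypothetical, and treating the three X.com cookies as a single all-or-nothing group is an assumption.

```typescript
// Environment variables grouped by the source that needs them.
// The grouping is an assumption made for this example.
const REQUIRED_BY_SOURCE: Record<string, string[]> = {
  raindrop: ["RAINDROP_API_KEY"],
  github: ["GITHUB_TOKEN"],
  x: ["X_CT0", "X_AUTH_TOKEN", "X_KDT"],
};

type Env = Record<string, string | undefined>;

// Return the sources whose variables are all present and non-blank.
function availableSources(env: Env): string[] {
  return Object.keys(REQUIRED_BY_SOURCE).filter((source) =>
    REQUIRED_BY_SOURCE[source].every((key) => (env[key] ?? "").trim() !== ""),
  );
}
```

In a Bun or Node script this could be called as `availableSources(process.env)`.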

Fetch + Ingest (Unified)

Fetch from APIs (Raindrop, GitHub) and ingest into the unified delta store:

# Fetch sources individually
bun bin/fetch-raindrop.ts
bun bin/fetch-github-stars.ts

# Or fetch everything available and ingest to unified
bun bin/fetch-all.ts

# Verify + report
bun unified/verify.ts ./unified
bun unified/report.ts ./unified

Raw API dumps will be written to raw/. Unified outputs land in unified/.

Data Flow

Extract: Load bookmarks from JSON/CSV sources
Transform: Enrich with metadata and calculate metrics
Delta Detection: Compare with previous state
AI Enrichment: Analyze content and generate insights
Load: Store enriched data with change tracking
Report: Generate intelligence reports and statistics

Output Structure

bookmark-delta-data/
├── bookmarks.json             # Current bookmark state
├── delta-state.json           # Delta load state
├── changes-*.json             # Change logs
├── bookmark-clusters.json     # AI-generated clusters
├── bookmark-insights.json     # AI insights
├── etl-report.md              # Summary report
└── snapshot-*.json            # Recovery snapshots

Enhanced Bookmark Schema

The pipeline enriches bookmarks with:

Content Metrics: Length, reading time, language
Technical Details: Response time, SSL validity, mobile friendliness
AI Analysis: Topics, sentiment, readability, quality scores
Social Signals: Sharing potential, virality, engagement
Temporal Data: Content lifespan, update frequency, stability
Delta Tracking: Change frequency, content stability
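
One way to picture these groups is as optional sections on a bookmark record. The interface below is only a sketch: every field name is hypothetical and chosen to mirror the list above, so the real definitions in enhanced-bookmark-schema.ts may differ.

```typescript
// Hypothetical shape mirroring the enrichment groups; not the actual schema.
interface EnhancedBookmark {
  url: string;
  title: string;
  content?: { length: number; readingTimeMinutes: number; language: string };
  technical?: { responseTimeMs: number; sslValid: boolean; mobileFriendly: boolean };
  ai?: { topics: string[]; sentiment: string; readability: number; qualityScore: number };
  social?: { sharingPotential: number; virality: number; engagement: number };
  temporal?: { lifespanDays: number; updateFrequency: string; stability: number };
  delta?: { changeCount: number; contentStability: number };
}

// Each group is optional, so a partially enriched record is still valid.
const example: EnhancedBookmark = {
  url: "https://example.com/post",
  title: "Example post",
  content: { length: 4800, readingTimeMinutes: 4, language: "en" },
  ai: { topics: ["etl", "bookmarks"], sentiment: "neutral", readability: 0.7, qualityScore: 0.8 },
};
```

Keeping each group optional matches the CLI flags, where content, social, and temporal analysis can be switched on independently.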

Integration with Search UI

The enriched data can be used with the search interface:

import { readFile } from 'node:fs/promises';

// Load and parse the enriched bookmarks and clusters
const bookmarks = JSON.parse(await readFile('bookmark-delta-data/bookmarks.json', 'utf8'));
const clusters = JSON.parse(await readFile('bookmark-delta-data/bookmark-clusters.json', 'utf8'));

// Use in search API (assumes an Express-style `app` defined elsewhere)
app.get('/api/bookmarks/search', async (req, res) => {
  const { q } = req.query;
  // Search through enriched bookmarks
});
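
The handler above leaves the matching logic open. A minimal sketch of one option is a case-insensitive substring match over title, URL, and topics; `EnrichedBookmark` and `searchBookmarks` are hypothetical names, and the `topics` field is assumed to come from AI enrichment.

```typescript
// Hypothetical subset of an enriched bookmark record.
interface EnrichedBookmark {
  url: string;
  title: string;
  topics?: string[];
}

// Case-insensitive substring match over title, URL, and topics.
// A blank query returns no results rather than every bookmark.
function searchBookmarks(bookmarks: EnrichedBookmark[], q: string): EnrichedBookmark[] {
  const needle = q.trim().toLowerCase();
  if (needle === "") return [];
  return bookmarks.filter((b) =>
    [b.title, b.url, ...(b.topics ?? [])].some((field) =>
      field.toLowerCase().includes(needle),
    ),
  );
}
```

Inside the route, the result could be returned with something like `res.json(searchBookmarks(bookmarks, String(q ?? "")))`. A linear scan is fine for a few thousand bookmarks; larger collections would likely want an index.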

Monitoring & Statistics

Get ETL statistics:

const stats = await etl.getStatistics();
console.log(`Total bookmarks: ${stats.totalBookmarks}`);
console.log(`Recent changes: ${stats.changesLast24h}`);

Best Practices

Regular Runs: Schedule daily/weekly runs to track changes
Batch Sizes: Use smaller batches (5-10) for better AI quality
Snapshots: Create snapshots before major operations
Content Fetching: Enable selectively to avoid rate limits
Review Reports: Check attention_required bookmarks regularly

Troubleshooting

Rate Limits: Reduce batch size or add delays
Memory Issues: Process in smaller chunks
API Errors: Check the OPENAI_API_KEY or ANTHROPIC_API_KEY value and the provider's quota
DocETL Errors: Ensure DocETL is installed (pip install docetl)
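
The rate-limit and memory advice both come down to working in small batches with a pause between them. The sketch below shows that pattern in isolation; `chunk`, `sleep`, and `processInBatches` are hypothetical helpers, not functions exported by this pipeline.

```typescript
// Split items into consecutive batches of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run `worker` on one batch at a time, pausing between batches
// (but not after the last one).
async function processInBatches<T, R>(
  items: T[],
  size: number,
  delayMs: number,
  worker: (batch: T[]) => Promise<R[]>,
): Promise<R[]> {
  const results: R[] = [];
  const batches = chunk(items, size);
  for (let i = 0; i < batches.length; i++) {
    results.push(...(await worker(batches[i])));
    if (i < batches.length - 1) await sleep(delayMs);
  }
  return results;
}
```

With `--batch 5` semantics in mind, a call such as `processInBatches(bookmarks, 5, 1000, enrich)` would send five bookmarks at a time with a one-second pause, where `enrich` stands in for whatever enrichment call is being throttled.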

Contributing

This pipeline follows the delta load patterns from j-94/deltaload. Contributions welcome!

Unified Format + Delta Loading (New)

In addition to the bookmark-specific ETL, the repo includes a source-agnostic unified schema and delta loader under unified/ that normalize multiple data types (currently bookmarks and unified chat).

Unified schema: unified/unified-schema.ts (Zod)
Normalizers: unified/normalize.ts (bookmarks, unified chat)
Delta engine: unified/unified-delta.ts (snapshot/append modes)
CLI: unified/cli.ts
Verification: unified/verify.ts

Usage examples:

# Normalize bookmarks JSONL and write unified outputs to ./unified
bun unified/cli.ts bookmark:more-bookmarks.jsonl

# Process unified chat (conversations[]) and bookmarks together in snapshot mode
bun unified/cli.ts --out ./unified --mode snapshot \
  bookmark:/Users/imac/Desktop/Donkeyv1/context/bookmark/docetl_bookmarkv0/processed_bookmarks_for_docetl.jsonl \
  chat_unified:/Users/imac/Desktop/Donkeyv1/context/chat_history/unified_data/unified_chat_properly_separated_20250613_150354.json

# Verify unified outputs
bun unified/verify.ts ./unified

# Generate a report and markdown snapshot
bun unified/report.ts ./unified

Outputs:

unified/run-<timestamp>.jsonl — normalized records
unified/changes-<timestamp>.json — delta summary (added/modified/deleted/unchanged)
unified/unified-manifest.json — persistent hashes for delta detection
unified/REPORT.md — human-friendly summary written by report tool
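
To show how a manifest of persistent hashes can yield the added/modified/deleted/unchanged summary, here is a minimal sketch. It is not the code in unified/unified-delta.ts: `diffManifests` and `DeltaSummary` are hypothetical names, and the sketch assumes a full snapshot in which an id missing from the current run counts as deleted.

```typescript
interface DeltaSummary {
  added: string[];
  modified: string[];
  deleted: string[];
  unchanged: string[];
}

// Both maps go from record id to content hash.
function diffManifests(
  previous: Map<string, string>,
  current: Map<string, string>,
): DeltaSummary {
  const summary: DeltaSummary = { added: [], modified: [], deleted: [], unchanged: [] };
  current.forEach((hash, id) => {
    if (!previous.has(id)) summary.added.push(id);
    else if (previous.get(id) !== hash) summary.modified.push(id);
    else summary.unchanged.push(id);
  });
  // Ids present before but absent now are treated as deletions (snapshot assumption).
  previous.forEach((_hash, id) => {
    if (!current.has(id)) summary.deleted.push(id);
  });
  return summary;
}
```

An append-style run would presumably skip the deletion pass, since a partial input says nothing about records it does not mention.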

