Version 1.6.4-dev1 | Offline-First AI-Centric SAST & Code Intelligence Platform
Modern static analysis reimagined: Database-driven, AI-optimized, zero-fallback architecture for Python and JavaScript/TypeScript projects.
Requires Python >=3.14
TheAuditor is a production-grade offline SAST tool that indexes your entire codebase into a structured SQLite database, enabling:
- 200+ security vulnerability patterns detected with 1-2% false positive rate
- Complete data flow analysis with cross-file taint tracking
- Architectural intelligence with hotspot detection and circular dependency analysis
- AI-optimized output designed for LLM consumption (<65KB chunks)
- Database-first queries replacing slow file I/O (100x faster than grep-based tools)
- Framework-aware detection for Django, Flask, FastAPI, React, Vue, Express, and more
Key Differentiator: While most SAST tools scan files repeatedly, TheAuditor indexes once, queries infinitely - enabling sub-second queries across 100K+ LOC.
# Install
pip install theauditor
# Run complete security audit (auto-creates .pf/ directory)
aud full
# View findings
cat .pf/readthis/summary.jsonOutput: .pf/readthis/ contains AI-optimized finding reports (<65KB per file)
| Category | Detections | False Positive Rate |
|---|---|---|
| Injection | SQL, Command, Code, Template, LDAP, NoSQL, XPath | <1% |
| XSS | DOM, Response, Template, PostMessage, JavaScript Protocol | 1-2% |
| Authentication | JWT (11 checks), OAuth, Session, Missing Auth | <1% |
| Cryptography | Weak algorithms, ECB mode, insecure random, broken KDF | <1% |
| Secrets | AWS, GitHub, Stripe, Google (10+ providers) + entropy analysis | 2-3% |
| API Security | Rate limiting, auth bypass, key exposure | 1-2% |
| PII Protection | 200+ patterns, 15 privacy regulations (GDPR, CCPA, HIPAA) | 2-3% |
| Infrastructure | Docker, AWS CDK, Terraform, GitHub Actions | 1-2% |
Total: 50+ CWE coverage, 15+ frameworks supported
aud taint --mode forwardTaint Engine V3 - Complete rewrite with 7x performance improvement:
- Hybrid Analysis: Forward DFS from entries + backward IFDS from sinks
- In-Memory Graph: Entire data flow graph cached in memory (10x speedup)
- Semantic Deduplication: 4000 path permutations reduced to 1-2 distinct flows
- ORM-Aware: Automatically expands
user→user.postsvia database relationships - Unified Sanitizers: Single registry for Joi, Zod, express-validator, DOMPurify
Performance: 6.6 minutes for 100K LOC (vs 45+ minutes in V2)
Detection Examples:
# Source
user_input = request.args.get('query')
# Intermediate (tracked across files)
result = process_query(user_input) # theauditor/api.py:42
# Sink (detected as SQL injection)
cursor.execute(f"SELECT * FROM {result}") # theauditor/db.py:156See TAINT_ARCHITECTURE.md for technical details.
aud graph build
aud graph analyzeHotspot Detection:
- Identifies files with highest dependency connectivity (in-degree + out-degree)
- Scores using PageRank centrality (transitive importance)
- Escalates security findings in hotspots to CRITICAL severity
Circular Dependencies:
- DFS-based cycle detection
- Reports cycle size and participating modules
- Flags architectural debt clusters
Impact Analysis:
aud impact --file auth.py --line 42Shows blast radius: which files would be affected by changing this function?
Control Flow Graphs (CFG):
aud cfg analyze --complexity-threshold 15- McCabe cyclomatic complexity measurement
- Dead code detection (unreachable blocks)
- Visual CFG generation (DOT/SVG/PNG)
Linting Orchestration:
aud lintRuns all available linters (ruff, mypy, eslint, tsc, prettier) and normalizes output to unified format.
All findings chunked into <65KB JSON files optimized for LLM context windows:
.pf/readthis/
├── summary.json # Executive summary
├── patterns_chunk01.json # Security patterns
├── taint_chunk01.json # Taint analysis results
├── terraform_chunk01.json # Infrastructure findings
└── *_chunk*.json # Maximum 3 chunks per analysis type
Design Goal: Enable AI assistants to consume complete audit results without context overflow.
aud query --symbol authenticate --show-callersQuery indexed AST data instead of grepping files:
- 100x faster than file-based search
- 100% accurate - no regex guessing
- Relationship-aware - knows who calls what, who imports what
- Cross-language - queries Python and JavaScript in single query
Savings: 5,000-10,000 tokens per refactoring iteration vs traditional file reading.
# Basic ML training
aud learn --enable-git
aud suggest --topk 10
# Advanced: Include AI agent behavior analysis (Tier 5)
aud learn --session-dir ~/.claude/projects/YourProject --session-analysis --print-statsLearns from execution history to predict:
- Which files are root causes of failures
- Which files will need editing next
- Risk scores for prioritization
Features: 97 dimensions across 5 tiers:
- Tier 1-4: Pipeline logs, journal events, security patterns, git history
- Tier 5 (NEW): Agent behavior intelligence from session logs
- Workflow compliance (blueprint_first, query_before_edit)
- Risk scores from SAST-scored diffs
- Blind edit rates (edits without prior reads)
- User engagement (INVERSE: lower = agent self-sufficient)
Session Analysis: Analyzes Claude Code session logs to correlate agent execution patterns with code quality. Shows which workflow violations lead to failures.
aud planning init --name "Auth0 Migration"
aud planning add-task 1 --title "Migrate routes"
aud planning verify-task 1 1Database-centric task tracking with spec-based verification:
- Tracks refactoring progress with deterministic verification
- Per-task checkpoint sequences (independent rollback)
- RefactorProfile YAML specs (compatible with
aud refactor)
aud cdk analyze # AWS CDK security
aud terraform # Terraform compliance
aud docker-analyze # Docker securityDetects misconfigurations in cloud resource definitions before deployment.
repo_index.db (~180MB, regenerated fresh every aud full):
- 250 normalized relational tables across 9 schema domains
- Core (24 tables): symbols, assignments, function_call_args, CFG blocks
- Python (59 tables): ORM models, routes, decorators, async, pytest, Django, Flask, FastAPI
- JavaScript/Node (26 tables): React/Vue components, TypeScript types, Prisma, Angular
- Infrastructure (18 tables): Docker, Terraform, CDK, GitHub Actions
- GraphQL (8 tables): Schema analysis, resolvers, execution edges
- Security (7 tables): SQL queries, JWT patterns, env vars, taint sources/sinks
- Frameworks (5 tables): Cross-language ORM relationships, API endpoints
- Planning (9 tables): Task tracking, verification specs, checkpoints
graphs.db (~130MB, optional):
- Pre-computed graph structures built from repo_index.db
- Used only by graph commands (not core analysis)
- Call graphs, import graphs, data flow graphs
Why separate? Different query patterns (point lookups vs graph traversal). Merging would make indexing 53% slower.
Critical Design Principle: Database regenerated fresh every run - if data is missing, analysis FAILS hard (not graceful degradation).
Banned Patterns:
- ❌ No database fallback queries
- ❌ No try/except with alternative logic
- ❌ No table existence checks
- ❌ No regex fallbacks when database query fails
Rationale: Fallbacks hide bugs. If query fails, pipeline is broken and should crash immediately.
Layer 1: ORCHESTRATOR
└─> Coordinates file discovery, AST parsing, extractor selection
Layer 2: EXTRACTORS (12 languages)
└─> Python (28 specialized modules), JavaScript/TypeScript, Terraform, Docker, Prisma, Rust, SQL, GitHub Actions, GraphQL
Layer 3: STORAGE (Handler dispatch)
└─> 100+ handlers mapping data types to database operations
Layer 4: DATABASE (Multiple inheritance)
└─> 11 domain-specific mixins with schema-driven code generation
Performance: 30-60s indexing for 100K LOC, 10-30s analysis.
| Language | Frameworks | Tables | Key Features |
|---|---|---|---|
| Python | Django, Flask, FastAPI, SQLAlchemy, Pydantic, Celery, Marshmallow, DRF, WTForms | 59 | ORM models, routes, decorators, async, pytest, signals, middleware, validators |
| JavaScript/TypeScript | React, Vue, Angular, Express, Next.js, Prisma, Sequelize, BullMQ | 26 | Components, hooks, TypeScript types, JSX, job queues |
| GraphQL | Apollo, graphql-core | 8 | Schema analysis, resolvers, execution edges, field mapping |
| Terraform | All providers | 5 | Resources, variables, outputs, data sources |
| Docker | Compose, Dockerfile | 8 | Images, services, env vars, healthchecks |
| AWS CDK | Python + TypeScript CDK | 3 | Constructs, properties, IAM policies |
| GitHub Actions | Workflows | 7 | Jobs, steps, permissions, dependencies |
| Rust | Generic | 2 | Functions, imports (tree-sitter) |
| SQL | DDL | 1 | Tables, indexes, views |
- Python 3.14+ (required - uses modern type hints and PEP 695 syntax)
- Git (for temporal analysis)
- Node.js (for JavaScript analysis)
pip install theauditorgit clone https://github.com/yourusername/theauditor.git
cd theauditor
pip install -e ".[dev,linters]"aud setup-ai --target .Downloads OSV vulnerability database, npm audit data, sandbox runtime.
# Initialize (creates .pf/ with databases)
aud init
# Run complete audit
aud full
# View findings
cat .pf/readthis/summary.json# Create workset (changed files + dependencies)
aud workset --diff main..feature
# Analyze only changed code
aud taint-analyze --workset
aud lint --workset# Find function
aud query --symbol authenticate
# Show callers
aud query --symbol authenticate --show-callers
# Show API dependencies
aud query --api "/users" --show-dependencies# Build graphs
aud graph build
# Detect cycles and hotspots
aud graph analyze
# Visualize
aud graph viz --view cycles --format svg# Train models
aud learn --enable-git
# Get predictions
aud suggest --topk 10| Code | Meaning | Commands |
|---|---|---|
| 0 | Success, no critical issues | All commands |
| 1 | High severity findings | aud full, aud taint-analyze |
| 2 | Critical vulnerabilities | aud full, aud deps --vuln-scan |
| 3 | Analysis incomplete/failed | aud full, aud impact |
| Feature | TheAuditor | Semgrep | Bandit | SonarQube |
|---|---|---|---|---|
| Offline-First | ✅ | ❌ | ✅ | ❌ |
| Database-Driven | ✅ (SQLite) | ❌ | ❌ | ✅ (PostgreSQL) |
| Cross-File Taint | ✅ (5+ hops) | ❌ | ✅ | |
| Framework-Aware | ✅ (15+) | ✅ | ✅ | |
| AI-Optimized Output | ✅ (<65KB chunks) | ❌ | ❌ | ❌ |
| Graph Analysis | ✅ (hotspots, cycles) | ❌ | ❌ | ✅ |
| ML Risk Prediction | ✅ | ❌ | ❌ | |
| False Positive Rate | 1-2% | 2-5% | 3-10% | 1-3% |
| Query Language | SQL | Custom | N/A | Custom |
| Cost | Free (AGPL-3.0) | Free/Paid | Free | Paid |
Key Advantage: TheAuditor's database-first design enables complex queries (e.g., "show all functions that process user input AND call SQL") in milliseconds, while other tools require multiple scans.
Create .pf/config.json:
{
"limits": {
"max_file_size": 2097152,
"max_chunk_size": 65536
},
"timeouts": {
"analysis_timeout": 1800,
"lint_timeout": 300
}
}export THEAUDITOR_LIMITS_MAX_FILE_SIZE=4194304
export THEAUDITOR_TIMEOUTS_ANALYSIS=3600Add to pyproject.toml:
[tool.theauditor]
exclude_patterns = ["tests/", "migrations/"]
severity_threshold = "high"| Project Size | Indexing | Analysis | Database Size | Memory |
|---|---|---|---|---|
| Small (5K LOC) | ~30s | ~10s | ~20MB | ~200MB |
| Medium (20K LOC) | ~60s | ~30s | ~80MB | ~500MB |
| Large (100K LOC) | ~180s | ~90s | ~400MB | ~1.5GB |
| Monorepo (500K+ LOC) | ~600s | ~300s | ~2GB | ~4GB |
Second Run: 5-10x faster due to AST caching (.pf/.ast_cache/)
# Regenerate database
aud index --exclude-self# Reduce batch size
export THEAUDITOR_LIMITS_BATCH_SIZE=100# Exclude test files
aud index --exclude-patterns "tests/" "node_modules/"Use absolute paths with backslashes:
cd C:\Users\YourName\Desktop\TheAuditor
aud index --root C:\Users\YourName\Desktop\TheAuditorSee Contributing.md for development setup, coding standards, and testing guidelines.
Note: Contributions are temporarily paused while legal entity formation is completed. See Contributing.md for details.
- Architecture: Architecture.md - Complete system architecture and design
- How to Use: HowToUse.md - Comprehensive command reference (43 commands)
- Contributing: Contributing.md - Development guidelines
- Developer Guide: CLAUDE.md - Coding standards and conventions (AI assistant context)
- Taint Engine: docs/TAINT_ARCHITECTURE.md - IFDS-based flow analysis
- CDK Analysis: docs/CDK_ARCHITECTURE.md - AWS CDK security scanning
AGPL-3.0 - See LICENSE file for details.
Built with:
- tree-sitter - AST parsing
- scikit-learn - Machine learning
- NetworkX - Graph algorithms
- Click - CLI framework
- SQLite - Database engine
- TypeScript/JavaScript CDK support (completed in v1.6.4)
- GraphQL analysis and security rules (completed in v1.6.4)
- Python framework parity (Django, Flask, FastAPI, Celery) (completed in v1.6.4)
- IFDS-based taint analysis with field sensitivity (completed in v1.6.4)
- Real-time analysis (file watcher mode)
- VS Code extension
- GitHub Action for CI/CD
- Web UI for visualization
- Plugin system for custom rules
- Issues: https://github.com/TheAuditorTool/Auditor/issues
- Discussions: https://github.com/TheAuditorTool/Auditor/discussions
- Documentation: https://github.com/TheAuditorTool/Auditor
Made with precision engineering for AI assistants and security engineers.