Skip to content

TheAuditorTool/Auditor

Repository files navigation

TheAuditor

Version 1.6.4-dev1 | Offline-First AI-Centric SAST & Code Intelligence Platform

Modern static analysis reimagined: Database-driven, AI-optimized, zero-fallback architecture for Python and JavaScript/TypeScript projects.

Requires Python >=3.14


What is TheAuditor?

TheAuditor is a production-grade offline SAST tool that indexes your entire codebase into a structured SQLite database, enabling:

  • 200+ security vulnerability patterns detected with 1-2% false positive rate
  • Complete data flow analysis with cross-file taint tracking
  • Architectural intelligence with hotspot detection and circular dependency analysis
  • AI-optimized output designed for LLM consumption (<65KB chunks)
  • Database-first queries replacing slow file I/O (100x faster than grep-based tools)
  • Framework-aware detection for Django, Flask, FastAPI, React, Vue, Express, and more

Key Differentiator: While most SAST tools scan files repeatedly, TheAuditor indexes once, queries infinitely - enabling sub-second queries across 100K+ LOC.


Quick Start

# Install
pip install theauditor

# Run complete security audit (auto-creates .pf/ directory)
aud full

# View findings
cat .pf/readthis/summary.json

Output: .pf/readthis/ contains AI-optimized finding reports (<65KB per file)


Core Capabilities

1. Security Detection (200+ Patterns)

Category Detections False Positive Rate
Injection SQL, Command, Code, Template, LDAP, NoSQL, XPath <1%
XSS DOM, Response, Template, PostMessage, JavaScript Protocol 1-2%
Authentication JWT (11 checks), OAuth, Session, Missing Auth <1%
Cryptography Weak algorithms, ECB mode, insecure random, broken KDF <1%
Secrets AWS, GitHub, Stripe, Google (10+ providers) + entropy analysis 2-3%
API Security Rate limiting, auth bypass, key exposure 1-2%
PII Protection 200+ patterns, 15 privacy regulations (GDPR, CCPA, HIPAA) 2-3%
Infrastructure Docker, AWS CDK, Terraform, GitHub Actions 1-2%

Total: 50+ CWE coverage, 15+ frameworks supported

2. Taint Analysis V3 (Cross-File Data Flow)

aud taint --mode forward

Taint Engine V3 - Complete rewrite with 7x performance improvement:

  • Hybrid Analysis: Forward DFS from entries + backward IFDS from sinks
  • In-Memory Graph: Entire data flow graph cached in memory (10x speedup)
  • Semantic Deduplication: 4000 path permutations reduced to 1-2 distinct flows
  • ORM-Aware: Automatically expands useruser.posts via database relationships
  • Unified Sanitizers: Single registry for Joi, Zod, express-validator, DOMPurify

Performance: 6.6 minutes for 100K LOC (vs 45+ minutes in V2)

Detection Examples:

# Source
user_input = request.args.get('query')

# Intermediate (tracked across files)
result = process_query(user_input)  # theauditor/api.py:42

# Sink (detected as SQL injection)
cursor.execute(f"SELECT * FROM {result}")  # theauditor/db.py:156

See TAINT_ARCHITECTURE.md for technical details.

3. Architectural Intelligence

aud graph build
aud graph analyze

Hotspot Detection:

  • Identifies files with highest dependency connectivity (in-degree + out-degree)
  • Scores using PageRank centrality (transitive importance)
  • Escalates security findings in hotspots to CRITICAL severity

Circular Dependencies:

  • DFS-based cycle detection
  • Reports cycle size and participating modules
  • Flags architectural debt clusters

Impact Analysis:

aud impact --file auth.py --line 42

Shows blast radius: which files would be affected by changing this function?

4. Code Quality Analysis

Control Flow Graphs (CFG):

aud cfg analyze --complexity-threshold 15
  • McCabe cyclomatic complexity measurement
  • Dead code detection (unreachable blocks)
  • Visual CFG generation (DOT/SVG/PNG)

Linting Orchestration:

aud lint

Runs all available linters (ruff, mypy, eslint, tsc, prettier) and normalizes output to unified format.

5. AI-Optimized Output

All findings chunked into <65KB JSON files optimized for LLM context windows:

.pf/readthis/
├── summary.json              # Executive summary
├── patterns_chunk01.json     # Security patterns
├── taint_chunk01.json        # Taint analysis results
├── terraform_chunk01.json    # Infrastructure findings
└── *_chunk*.json             # Maximum 3 chunks per analysis type

Design Goal: Enable AI assistants to consume complete audit results without context overflow.


Advanced Features

Database-First Queries

aud query --symbol authenticate --show-callers

Query indexed AST data instead of grepping files:

  • 100x faster than file-based search
  • 100% accurate - no regex guessing
  • Relationship-aware - knows who calls what, who imports what
  • Cross-language - queries Python and JavaScript in single query

Savings: 5,000-10,000 tokens per refactoring iteration vs traditional file reading.

Machine Learning Risk Prediction

# Basic ML training
aud learn --enable-git
aud suggest --topk 10

# Advanced: Include AI agent behavior analysis (Tier 5)
aud learn --session-dir ~/.claude/projects/YourProject --session-analysis --print-stats

Learns from execution history to predict:

  • Which files are root causes of failures
  • Which files will need editing next
  • Risk scores for prioritization

Features: 97 dimensions across 5 tiers:

  • Tier 1-4: Pipeline logs, journal events, security patterns, git history
  • Tier 5 (NEW): Agent behavior intelligence from session logs
    • Workflow compliance (blueprint_first, query_before_edit)
    • Risk scores from SAST-scored diffs
    • Blind edit rates (edits without prior reads)
    • User engagement (INVERSE: lower = agent self-sufficient)

Session Analysis: Analyzes Claude Code session logs to correlate agent execution patterns with code quality. Shows which workflow violations lead to failures.

Planning & Verification System

aud planning init --name "Auth0 Migration"
aud planning add-task 1 --title "Migrate routes"
aud planning verify-task 1 1

Database-centric task tracking with spec-based verification:

  • Tracks refactoring progress with deterministic verification
  • Per-task checkpoint sequences (independent rollback)
  • RefactorProfile YAML specs (compatible with aud refactor)

Infrastructure-as-Code Analysis

aud cdk analyze          # AWS CDK security
aud terraform            # Terraform compliance
aud docker-analyze       # Docker security

Detects misconfigurations in cloud resource definitions before deployment.


Architecture Highlights

Two-Database System

repo_index.db (~180MB, regenerated fresh every aud full):

  • 250 normalized relational tables across 9 schema domains
  • Core (24 tables): symbols, assignments, function_call_args, CFG blocks
  • Python (59 tables): ORM models, routes, decorators, async, pytest, Django, Flask, FastAPI
  • JavaScript/Node (26 tables): React/Vue components, TypeScript types, Prisma, Angular
  • Infrastructure (18 tables): Docker, Terraform, CDK, GitHub Actions
  • GraphQL (8 tables): Schema analysis, resolvers, execution edges
  • Security (7 tables): SQL queries, JWT patterns, env vars, taint sources/sinks
  • Frameworks (5 tables): Cross-language ORM relationships, API endpoints
  • Planning (9 tables): Task tracking, verification specs, checkpoints

graphs.db (~130MB, optional):

  • Pre-computed graph structures built from repo_index.db
  • Used only by graph commands (not core analysis)
  • Call graphs, import graphs, data flow graphs

Why separate? Different query patterns (point lookups vs graph traversal). Merging would make indexing 53% slower.

Zero Fallback Policy

Critical Design Principle: Database regenerated fresh every run - if data is missing, analysis FAILS hard (not graceful degradation).

Banned Patterns:

  • ❌ No database fallback queries
  • ❌ No try/except with alternative logic
  • ❌ No table existence checks
  • ❌ No regex fallbacks when database query fails

Rationale: Fallbacks hide bugs. If query fails, pipeline is broken and should crash immediately.

4-Layer Pipeline Architecture

Layer 1: ORCHESTRATOR
  └─> Coordinates file discovery, AST parsing, extractor selection

Layer 2: EXTRACTORS (12 languages)
  └─> Python (28 specialized modules), JavaScript/TypeScript, Terraform, Docker, Prisma, Rust, SQL, GitHub Actions, GraphQL

Layer 3: STORAGE (Handler dispatch)
  └─> 100+ handlers mapping data types to database operations

Layer 4: DATABASE (Multiple inheritance)
  └─> 11 domain-specific mixins with schema-driven code generation

Performance: 30-60s indexing for 100K LOC, 10-30s analysis.


Supported Languages & Frameworks

Language Frameworks Tables Key Features
Python Django, Flask, FastAPI, SQLAlchemy, Pydantic, Celery, Marshmallow, DRF, WTForms 59 ORM models, routes, decorators, async, pytest, signals, middleware, validators
JavaScript/TypeScript React, Vue, Angular, Express, Next.js, Prisma, Sequelize, BullMQ 26 Components, hooks, TypeScript types, JSX, job queues
GraphQL Apollo, graphql-core 8 Schema analysis, resolvers, execution edges, field mapping
Terraform All providers 5 Resources, variables, outputs, data sources
Docker Compose, Dockerfile 8 Images, services, env vars, healthchecks
AWS CDK Python + TypeScript CDK 3 Constructs, properties, IAM policies
GitHub Actions Workflows 7 Jobs, steps, permissions, dependencies
Rust Generic 2 Functions, imports (tree-sitter)
SQL DDL 1 Tables, indexes, views

Installation

Requirements

  • Python 3.14+ (required - uses modern type hints and PEP 695 syntax)
  • Git (for temporal analysis)
  • Node.js (for JavaScript analysis)

Install from PyPI

pip install theauditor

Install from Source

git clone https://github.com/yourusername/theauditor.git
cd theauditor
pip install -e ".[dev,linters]"

Setup AI Tools (Optional, ~500MB)

aud setup-ai --target .

Downloads OSV vulnerability database, npm audit data, sandbox runtime.


Usage Examples

Basic Workflow

# Initialize (creates .pf/ with databases)
aud init

# Run complete audit
aud full

# View findings
cat .pf/readthis/summary.json

Incremental Analysis (10-100x faster)

# Create workset (changed files + dependencies)
aud workset --diff main..feature

# Analyze only changed code
aud taint-analyze --workset
aud lint --workset

Query Relationships

# Find function
aud query --symbol authenticate

# Show callers
aud query --symbol authenticate --show-callers

# Show API dependencies
aud query --api "/users" --show-dependencies

Graph Analysis

# Build graphs
aud graph build

# Detect cycles and hotspots
aud graph analyze

# Visualize
aud graph viz --view cycles --format svg

Machine Learning

# Train models
aud learn --enable-git

# Get predictions
aud suggest --topk 10

Exit Codes

Code Meaning Commands
0 Success, no critical issues All commands
1 High severity findings aud full, aud taint-analyze
2 Critical vulnerabilities aud full, aud deps --vuln-scan
3 Analysis incomplete/failed aud full, aud impact

Comparison to Other SAST Tools

Feature TheAuditor Semgrep Bandit SonarQube
Offline-First
Database-Driven ✅ (SQLite) ✅ (PostgreSQL)
Cross-File Taint ✅ (5+ hops) ⚠️ (limited)
Framework-Aware ✅ (15+) ⚠️
AI-Optimized Output ✅ (<65KB chunks)
Graph Analysis ✅ (hotspots, cycles)
ML Risk Prediction ⚠️
False Positive Rate 1-2% 2-5% 3-10% 1-3%
Query Language SQL Custom N/A Custom
Cost Free (AGPL-3.0) Free/Paid Free Paid

Key Advantage: TheAuditor's database-first design enables complex queries (e.g., "show all functions that process user input AND call SQL") in milliseconds, while other tools require multiple scans.


Configuration

Runtime Configuration

Create .pf/config.json:

{
  "limits": {
    "max_file_size": 2097152,
    "max_chunk_size": 65536
  },
  "timeouts": {
    "analysis_timeout": 1800,
    "lint_timeout": 300
  }
}

Environment Variables

export THEAUDITOR_LIMITS_MAX_FILE_SIZE=4194304
export THEAUDITOR_TIMEOUTS_ANALYSIS=3600

Project-Specific Config

Add to pyproject.toml:

[tool.theauditor]
exclude_patterns = ["tests/", "migrations/"]
severity_threshold = "high"

Performance Characteristics

Project Size Indexing Analysis Database Size Memory
Small (5K LOC) ~30s ~10s ~20MB ~200MB
Medium (20K LOC) ~60s ~30s ~80MB ~500MB
Large (100K LOC) ~180s ~90s ~400MB ~1.5GB
Monorepo (500K+ LOC) ~600s ~300s ~2GB ~4GB

Second Run: 5-10x faster due to AST caching (.pf/.ast_cache/)


Troubleshooting

"Schema mismatch" error

# Regenerate database
aud index --exclude-self

Out of memory

# Reduce batch size
export THEAUDITOR_LIMITS_BATCH_SIZE=100

Slow indexing

# Exclude test files
aud index --exclude-patterns "tests/" "node_modules/"

Windows path issues

Use absolute paths with backslashes:

cd C:\Users\YourName\Desktop\TheAuditor
aud index --root C:\Users\YourName\Desktop\TheAuditor

Contributing

See Contributing.md for development setup, coding standards, and testing guidelines.

Note: Contributions are temporarily paused while legal entity formation is completed. See Contributing.md for details.


Documentation


License

AGPL-3.0 - See LICENSE file for details.


Credits

Built with:

  • tree-sitter - AST parsing
  • scikit-learn - Machine learning
  • NetworkX - Graph algorithms
  • Click - CLI framework
  • SQLite - Database engine

Roadmap

  • TypeScript/JavaScript CDK support (completed in v1.6.4)
  • GraphQL analysis and security rules (completed in v1.6.4)
  • Python framework parity (Django, Flask, FastAPI, Celery) (completed in v1.6.4)
  • IFDS-based taint analysis with field sensitivity (completed in v1.6.4)
  • Real-time analysis (file watcher mode)
  • VS Code extension
  • GitHub Action for CI/CD
  • Web UI for visualization
  • Plugin system for custom rules

Support


Made with precision engineering for AI assistants and security engineers.