TheAuditor

Version 1.6.4-dev1 | Offline-First AI-Centric SAST & Code Intelligence Platform

Modern static analysis reimagined: Database-driven, AI-optimized, zero-fallback architecture for Python and JavaScript/TypeScript projects.

Requires Python >=3.14

What is TheAuditor?

TheAuditor is a production-grade offline SAST tool that indexes your entire codebase into a structured SQLite database, enabling:

200+ security vulnerability patterns detected with 1-2% false positive rate
Complete data flow analysis with cross-file taint tracking
Architectural intelligence with hotspot detection and circular dependency analysis
AI-optimized output designed for LLM consumption (<65KB chunks)
Database-first queries replacing slow file I/O (100x faster than grep-based tools)
Framework-aware detection for Django, Flask, FastAPI, React, Vue, Express, and more

Key Differentiator: While most SAST tools rescan files on every run, TheAuditor indexes the codebase once and answers every later query from the database, enabling sub-second queries across 100K+ LOC.

Quick Start

# Install
pip install theauditor

# Run complete security audit (auto-creates .pf/ directory)
aud full

# View findings
cat .pf/readthis/summary.json

Output: .pf/readthis/ contains AI-optimized finding reports (<65KB per file)

Core Capabilities

1. Security Detection (200+ Patterns)

Category	Detections	False Positive Rate
Injection	SQL, Command, Code, Template, LDAP, NoSQL, XPath	<1%
XSS	DOM, Response, Template, PostMessage, JavaScript Protocol	1-2%
Authentication	JWT (11 checks), OAuth, Session, Missing Auth	<1%
Cryptography	Weak algorithms, ECB mode, insecure random, broken KDF	<1%
Secrets	AWS, GitHub, Stripe, Google (10+ providers) + entropy analysis	2-3%
API Security	Rate limiting, auth bypass, key exposure	1-2%
PII Protection	200+ patterns, 15 privacy regulations (GDPR, CCPA, HIPAA)	2-3%
Infrastructure	Docker, AWS CDK, Terraform, GitHub Actions	1-2%

Total: 50+ CWE coverage, 15+ frameworks supported

2. Taint Analysis V3 (Cross-File Data Flow)

aud taint --mode forward

Taint Engine V3 - Complete rewrite with 7x performance improvement:

Hybrid Analysis: Forward DFS from entries + backward IFDS from sinks
In-Memory Graph: Entire data flow graph cached in memory (10x speedup)
Semantic Deduplication: 4000 path permutations reduced to 1-2 distinct flows
ORM-Aware: Automatically expands user → user.posts via database relationships
Unified Sanitizers: Single registry for Joi, Zod, express-validator, DOMPurify

Performance: 6.6 minutes for 100K LOC (vs 45+ minutes in V2)

Detection Examples:

# Source
user_input = request.args.get('query')

# Intermediate (tracked across files)
result = process_query(user_input)  # theauditor/api.py:42

# Sink (detected as SQL injection)
cursor.execute(f"SELECT * FROM {result}")  # theauditor/db.py:156

See docs/TAINT_ARCHITECTURE.md for technical details.
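
The source-to-sink example above can be pictured as a reachability search over a data flow graph. The sketch below is a minimal illustration of the forward DFS half only and is not TheAuditor's engine: the graph contents, node names, and the `find_taint_paths` function are all hypothetical, and the backward IFDS pass, sanitizers, and deduplication are left out.

```python
from collections.abc import Iterator

# Hypothetical data flow graph: each node is "file:variable" and each edge
# means "the value flows from the key to every listed node".
FLOW_GRAPH: dict[str, list[str]] = {
    "api.py:user_input": ["api.py:result"],
    "api.py:result": ["db.py:query_arg"],
    "db.py:query_arg": ["db.py:cursor.execute"],
    "api.py:page_size": ["api.py:limit"],  # never reaches a sink
}

SOURCES = {"api.py:user_input"}
SINKS = {"db.py:cursor.execute"}


def find_taint_paths(
    graph: dict[str, list[str]], sources: set[str], sinks: set[str]
) -> Iterator[list[str]]:
    """Yield every acyclic path from a source to a sink using forward DFS."""
    for source in sorted(sources):
        stack = [[source]]
        while stack:
            path = stack.pop()
            node = path[-1]
            if node in sinks:
                yield path
                continue
            for nxt in graph.get(node, []):
                if nxt not in path:  # do not loop on cyclic flows
                    stack.append(path + [nxt])


paths = list(find_taint_paths(FLOW_GRAPH, SOURCES, SINKS))
for path in paths:
    print(" -> ".join(path))
```

Each reported path crosses a file boundary (api.py to db.py), which is the property cross-file taint tracking depends on.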

3. Architectural Intelligence

aud graph build
aud graph analyze

Hotspot Detection:

Identifies files with highest dependency connectivity (in-degree + out-degree)
Scores using PageRank centrality (transitive importance)
Escalates security findings in hotspots to CRITICAL severity
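
To make the two hotspot scores concrete, here is a small sketch that computes degree connectivity and PageRank over a toy import graph. It assumes a hypothetical graph and hand-rolled helpers (`degree_scores`, `pagerank`); it is only an illustration of the metrics named above, not how TheAuditor computes them.

```python
# Hypothetical import graph: file -> files it imports.
# Every imported file also appears as a key.
IMPORTS = {
    "app.py": ["auth.py", "db.py"],
    "auth.py": ["db.py", "utils.py"],
    "reports.py": ["db.py"],
    "db.py": ["utils.py"],
    "utils.py": [],
}


def degree_scores(graph):
    """Return in-degree + out-degree for every file."""
    scores = {node: len(targets) for node, targets in graph.items()}
    for targets in graph.values():
        for target in targets:
            scores[target] = scores.get(target, 0) + 1
    return scores


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; files with no imports spread rank evenly."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new_rank = {
            node: (1 - damping) / n + damping * dangling / n for node in nodes
        }
        for node in nodes:
            for target in graph[node]:
                new_rank[target] += damping * rank[node] / len(graph[node])
        rank = new_rank
    return rank


degrees = degree_scores(IMPORTS)
ranks = pagerank(IMPORTS)
print("highest degree:", max(degrees, key=degrees.get))
print("highest PageRank:", max(ranks, key=ranks.get))
```

In this toy graph the two metrics disagree: db.py has the most direct connections, while utils.py ranks highest by PageRank because heavily used files depend on it. That difference is what "transitive importance" adds over a plain degree count.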

Circular Dependencies:

DFS-based cycle detection
Reports cycle size and participating modules
Flags architectural debt clusters
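
DFS-based cycle detection can be sketched in a few lines. The example below is an assumption-laden illustration (hypothetical module names and a hypothetical `find_cycles` helper): it reports one cycle per back edge found during the traversal, which is enough to show cycle size and participants but does not enumerate every elementary cycle.

```python
# Hypothetical module dependency graph: module -> modules it imports.
MODULES = {
    "models": ["services"],
    "services": ["views"],
    "views": ["models"],  # closes the cycle
    "cli": ["models"],
    "utils": [],
}

WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished


def find_cycles(graph):
    """Return one cycle (as a list of modules) for every back edge found."""
    color = {node: WHITE for node in graph}
    stack = []
    cycles = []

    def visit(node):
        color[node] = GRAY
        stack.append(node)
        for nxt in graph.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # nxt is on the current path, so path[nxt:] is a cycle
                cycles.append(stack[stack.index(nxt):])
            elif state == WHITE:
                visit(nxt)
        stack.pop()
        color[node] = BLACK

    for node in graph:
        if color[node] == WHITE:
            visit(node)
    return cycles


cycles = find_cycles(MODULES)
for cycle in cycles:
    print(f"cycle of size {len(cycle)}: {' -> '.join(cycle)}")
```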

Impact Analysis:

aud impact --file auth.py --line 42

Shows the blast radius: which files would be affected by changing the code at this location.
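
One way to think about blast radius is as a reverse traversal of the dependency graph. The sketch below works at file level with a hypothetical graph and a hypothetical `blast_radius` helper; the real command takes a file and line, so treat this as a simplified model rather than its implementation.

```python
from collections import deque

# Hypothetical dependency graph: file -> files it depends on.
DEPENDS_ON = {
    "routes.py": ["auth.py"],
    "admin.py": ["routes.py"],
    "auth.py": ["db.py"],
    "billing.py": ["db.py"],
    "db.py": [],
}


def blast_radius(depends_on, changed):
    """Return every file that directly or transitively depends on `changed`."""
    # Invert the edges so we can walk from a file to its dependents.
    dependents = {}
    for module, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(module)

    affected = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


print(sorted(blast_radius(DEPENDS_ON, "auth.py")))
```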

4. Code Quality Analysis

Control Flow Graphs (CFG):

aud cfg analyze --complexity-threshold 15

McCabe cyclomatic complexity measurement
Dead code detection (unreachable blocks)
Visual CFG generation (DOT/SVG/PNG)
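
For readers unfamiliar with McCabe complexity, the sketch below approximates it by counting decision points in a Python AST and adding one. This is a simplification that uses only the standard `ast` module; it is not a CFG-based measurement and the `cyclomatic_complexity` helper is hypothetical, so its numbers may differ from what `aud cfg analyze` reports.

```python
import ast

# Node types that each add one decision point in this simplified count.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)

SAMPLE = '''
def classify(n):
    if n < 0 and n != -1:
        return "negative"
    for i in range(n):
        if i % 2:
            continue
    return "done"
'''


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity as 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra branches
            complexity += len(node.values) - 1
    return complexity


score = cyclomatic_complexity(SAMPLE)
print(score)
```

The sample has two `if` statements, one `for` loop, and one `and`, so the count is 1 + 4 = 5, well under the threshold of 15 used in the command above.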

Linting Orchestration:

aud lint

Runs all available linters (ruff, mypy, eslint, tsc, prettier) and normalizes their output into a unified format.

5. AI-Optimized Output

All findings chunked into <65KB JSON files optimized for LLM context windows:

.pf/readthis/
├── summary.json              # Executive summary
├── patterns_chunk01.json     # Security patterns
├── taint_chunk01.json        # Taint analysis results
├── terraform_chunk01.json    # Infrastructure findings
└── *_chunk*.json             # Maximum 3 chunks per analysis type

Design Goal: Enable AI assistants to consume complete audit results without context overflow.
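
The chunking idea can be sketched as a greedy packer that closes a chunk before it would exceed the size limit. Everything here is an assumption for illustration: the `chunk_findings` helper, the finding fields, and the choice to report truncation when more than three chunks would be needed. The 65536-byte limit mirrors `max_chunk_size` from the Runtime Configuration example.

```python
import json

MAX_CHUNK_BYTES = 65536  # same value as max_chunk_size in the config example
MAX_CHUNKS = 3           # "Maximum 3 chunks per analysis type"


def chunk_findings(findings, max_bytes=MAX_CHUNK_BYTES, max_chunks=MAX_CHUNKS):
    """Greedily pack findings into lists whose JSON stays within max_bytes.

    Returns (chunks, truncated). A single finding larger than max_bytes
    would still get its own oversized chunk in this simplified version.
    """
    chunks, current = [], []
    for finding in findings:
        candidate = current + [finding]
        if current and len(json.dumps(candidate).encode("utf-8")) > max_bytes:
            chunks.append(current)
            current = [finding]
        else:
            current = candidate
    if current:
        chunks.append(current)
    truncated = len(chunks) > max_chunks
    return chunks[:max_chunks], truncated


# Hypothetical findings, roughly 1KB each.
findings = [{"id": i, "message": "x" * 1000} for i in range(200)]
chunks, truncated = chunk_findings(findings)
print(len(chunks), "chunks, truncated:", truncated)
```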

Advanced Features

Database-First Queries

aud query --symbol authenticate --show-callers

Query indexed AST data instead of grepping files:

100x faster than file-based search
100% accurate - no regex guessing
Relationship-aware - knows who calls what, who imports what
Cross-language - queries Python and JavaScript in single query

Savings: 5,000-10,000 tokens per refactoring iteration vs traditional file reading.
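
To illustrate why an indexed lookup beats re-reading files, here is a minimal sketch using Python's built-in `sqlite3` module. The two-table schema, the sample rows, and the `find_callers` helper are hypothetical and far simpler than the real repo_index.db; the point is only that "who calls X" becomes a single indexed query.

```python
import sqlite3

# Hypothetical, simplified schema; the real database has many more tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE symbols (name TEXT, kind TEXT, file TEXT, line INTEGER);
    CREATE TABLE calls (caller TEXT, callee TEXT, file TEXT, line INTEGER);
    CREATE INDEX idx_calls_callee ON calls (callee);
    """
)
conn.executemany(
    "INSERT INTO symbols VALUES (?, ?, ?, ?)",
    [
        ("authenticate", "function", "auth/core.py", 10),
        ("login", "function", "api/routes.py", 30),
        ("refresh_token", "function", "api/session.py", 12),
    ],
)
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?, ?)",
    [
        ("login", "authenticate", "api/routes.py", 42),
        ("refresh_token", "authenticate", "api/session.py", 17),
        ("login", "render", "api/routes.py", 50),
    ],
)


def find_callers(db, symbol):
    """Return (caller, file, line) for every call site of `symbol`."""
    rows = db.execute(
        "SELECT caller, file, line FROM calls WHERE callee = ? ORDER BY file, line",
        (symbol,),
    )
    return rows.fetchall()


for caller, file, line in find_callers(conn, "authenticate"):
    print(f"{caller} calls authenticate at {file}:{line}")
```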

Machine Learning Risk Prediction

# Basic ML training
aud learn --enable-git
aud suggest --topk 10

# Advanced: Include AI agent behavior analysis (Tier 5)
aud learn --session-dir ~/.claude/projects/YourProject --session-analysis --print-stats

Learns from execution history to predict:

Which files are root causes of failures
Which files will need editing next
Risk scores for prioritization

Features: 97 dimensions across 5 tiers:

Tier 1-4: Pipeline logs, journal events, security patterns, git history
Tier 5 (NEW): Agent behavior intelligence from session logs
- Workflow compliance (blueprint_first, query_before_edit)
- Risk scores from SAST-scored diffs
- Blind edit rates (edits without prior reads)
- User engagement (INVERSE: lower = agent self-sufficient)
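
As an example of a Tier 5 feature, the blind edit rate (edits without prior reads) could be computed from an ordered list of agent actions. The event format and the `blind_edit_rate` helper below are assumptions made for illustration; the actual session log schema is not described here.

```python
def blind_edit_rate(events):
    """Fraction of edits made to files the agent had not read beforehand.

    `events` is an ordered list of (action, path) pairs where action is
    "read" or "edit". This event shape is hypothetical.
    """
    seen = set()
    edits = 0
    blind = 0
    for action, path in events:
        if action == "read":
            seen.add(path)
        elif action == "edit":
            edits += 1
            if path not in seen:
                blind += 1
    return blind / edits if edits else 0.0


session = [
    ("read", "a.py"),
    ("edit", "a.py"),   # informed edit
    ("edit", "b.py"),   # blind: b.py not read yet
    ("read", "b.py"),
    ("edit", "c.py"),   # blind: c.py never read
    ("edit", "b.py"),   # informed edit
]
rate = blind_edit_rate(session)
print(rate)
```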

Session Analysis: Analyzes Claude Code session logs to correlate agent execution patterns with code quality. Shows which workflow violations lead to failures.

Planning & Verification System

aud planning init --name "Auth0 Migration"
aud planning add-task 1 --title "Migrate routes"
aud planning verify-task 1 1

Database-centric task tracking with spec-based verification:

Tracks refactoring progress with deterministic verification
Per-task checkpoint sequences (independent rollback)
RefactorProfile YAML specs (compatible with aud refactor)

Infrastructure-as-Code Analysis

aud cdk analyze          # AWS CDK security
aud terraform            # Terraform compliance
aud docker-analyze       # Docker security

Detects misconfigurations in cloud resource definitions before deployment.

Architecture Highlights

Two-Database System

repo_index.db (~180MB, regenerated from scratch on every aud full run):

250 normalized relational tables across 9 schema domains
Core (24 tables): symbols, assignments, function_call_args, CFG blocks
Python (59 tables): ORM models, routes, decorators, async, pytest, Django, Flask, FastAPI
JavaScript/Node (26 tables): React/Vue components, TypeScript types, Prisma, Angular
Infrastructure (18 tables): Docker, Terraform, CDK, GitHub Actions
GraphQL (8 tables): Schema analysis, resolvers, execution edges
Security (7 tables): SQL queries, JWT patterns, env vars, taint sources/sinks
Frameworks (5 tables): Cross-language ORM relationships, API endpoints
Planning (9 tables): Task tracking, verification specs, checkpoints

graphs.db (~130MB, optional):

Pre-computed graph structures built from repo_index.db
Used only by graph commands (not core analysis)
Call graphs, import graphs, data flow graphs

Why separate? Different query patterns (point lookups vs graph traversal). Merging would make indexing 53% slower.

Zero Fallback Policy

Critical Design Principle: The database is regenerated from scratch on every run, so if data is missing, analysis fails hard instead of degrading gracefully.

Banned Patterns:

❌ No database fallback queries
❌ No try/except with alternative logic
❌ No table existence checks
❌ No regex fallbacks when database query fails

Rationale: Fallbacks hide bugs. If a query fails, the pipeline is broken and should crash immediately.
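
The difference between the banned style and the preferred style is easiest to see side by side. The two helper functions below are hypothetical and use a made-up `symbols` table; they only illustrate the policy, not any actual TheAuditor code.

```python
import sqlite3


def count_symbols_with_fallback(db):
    """Banned style: a missing table is silently reported as zero symbols."""
    try:
        return db.execute("SELECT COUNT(*) FROM symbols").fetchone()[0]
    except sqlite3.OperationalError:
        return 0  # hides the fact that indexing never created the table


def count_symbols_strict(db):
    """Preferred style: let the error propagate so the pipeline stops."""
    return db.execute("SELECT COUNT(*) FROM symbols").fetchone()[0]


# An empty database simulates an indexing step that failed to create tables.
empty_db = sqlite3.connect(":memory:")
print(count_symbols_with_fallback(empty_db))  # looks like a clean, empty project
```

Calling `count_symbols_strict(empty_db)` raises `sqlite3.OperationalError` instead, which surfaces the broken index immediately rather than producing a misleading "no findings" result.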

4-Layer Pipeline Architecture

Layer 1: ORCHESTRATOR
  └─> Coordinates file discovery, AST parsing, extractor selection

Layer 2: EXTRACTORS (12 languages)
  └─> Python (28 specialized modules), JavaScript/TypeScript, Terraform, Docker, Prisma, Rust, SQL, GitHub Actions, GraphQL

Layer 3: STORAGE (Handler dispatch)
  └─> 100+ handlers mapping data types to database operations

Layer 4: DATABASE (Multiple inheritance)
  └─> 11 domain-specific mixins with schema-driven code generation
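
Layer 3's handler dispatch can be pictured as a lookup table from data type to a storage function. The sketch below is an assumption about the general pattern only: the handler names, record fields, and two-table schema are hypothetical, and the real system is described as having 100+ handlers.

```python
import sqlite3


def store_symbol(db, record):
    """Hypothetical handler: persist one extracted symbol."""
    db.execute(
        "INSERT INTO symbols (name, file) VALUES (?, ?)",
        (record["name"], record["file"]),
    )


def store_import(db, record):
    """Hypothetical handler: persist one extracted import."""
    db.execute(
        "INSERT INTO imports (file, module) VALUES (?, ?)",
        (record["file"], record["module"]),
    )


# Dispatch table: extracted data type -> function that writes it.
HANDLERS = {
    "symbol": store_symbol,
    "import": store_import,
}


def store(db, data_type, record):
    """Route a record to its handler; unknown types raise KeyError."""
    HANDLERS[data_type](db, record)


db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE symbols (name TEXT, file TEXT);
    CREATE TABLE imports (file TEXT, module TEXT);
    """
)
store(db, "symbol", {"name": "authenticate", "file": "auth.py"})
store(db, "import", {"file": "auth.py", "module": "hashlib"})
symbol_count = db.execute("SELECT COUNT(*) FROM symbols").fetchone()[0]
import_count = db.execute("SELECT COUNT(*) FROM imports").fetchone()[0]
print(symbol_count, import_count)
```

Letting an unknown data type raise `KeyError`, rather than skipping it, is in keeping with the Zero Fallback Policy described earlier.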

Performance: 30-60s indexing and 10-30s analysis for 5K-20K LOC projects; see Performance Characteristics for larger codebases.

Supported Languages & Frameworks

Language	Frameworks	Tables	Key Features
Python	Django, Flask, FastAPI, SQLAlchemy, Pydantic, Celery, Marshmallow, DRF, WTForms	59	ORM models, routes, decorators, async, pytest, signals, middleware, validators
JavaScript/TypeScript	React, Vue, Angular, Express, Next.js, Prisma, Sequelize, BullMQ	26	Components, hooks, TypeScript types, JSX, job queues
GraphQL	Apollo, graphql-core	8	Schema analysis, resolvers, execution edges, field mapping
Terraform	All providers	5	Resources, variables, outputs, data sources
Docker	Compose, Dockerfile	8	Images, services, env vars, healthchecks
AWS CDK	Python + TypeScript CDK	3	Constructs, properties, IAM policies
GitHub Actions	Workflows	7	Jobs, steps, permissions, dependencies
Rust	Generic	2	Functions, imports (tree-sitter)
SQL	DDL	1	Tables, indexes, views

Installation

Requirements

Python 3.14+ (required - uses modern type hints and PEP 695 syntax)
Git (for temporal analysis)
Node.js (for JavaScript analysis)

Install from PyPI

pip install theauditor

Install from Source

git clone https://github.com/TheAuditorTool/Auditor.git
cd Auditor
pip install -e ".[dev,linters]"

Setup AI Tools (Optional, ~500MB)

aud setup-ai --target .

Downloads the OSV vulnerability database, npm audit data, and a sandbox runtime.

Usage Examples

Basic Workflow

# Initialize (creates .pf/ with databases)
aud init

# Run complete audit
aud full

# View findings
cat .pf/readthis/summary.json

Incremental Analysis (10-100x faster)

# Create workset (changed files + dependencies)
aud workset --diff main..feature

# Analyze only changed code
aud taint-analyze --workset
aud lint --workset

Query Relationships

# Find function
aud query --symbol authenticate

# Show callers
aud query --symbol authenticate --show-callers

# Show API dependencies
aud query --api "/users" --show-dependencies

Graph Analysis

# Build graphs
aud graph build

# Detect cycles and hotspots
aud graph analyze

# Visualize
aud graph viz --view cycles --format svg

Machine Learning

# Train models
aud learn --enable-git

# Get predictions
aud suggest --topk 10

Exit Codes

Code	Meaning	Commands
0	Success, no critical issues	All commands
1	High severity findings	`aud full`, `aud taint-analyze`
2	Critical vulnerabilities	`aud full`, `aud deps --vuln-scan`
3	Analysis incomplete/failed	`aud full`, `aud impact`
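
In a CI script, the exit codes above can be given names so the gating logic reads clearly. The enum and the `should_block_merge` policy below are hypothetical; only the numeric codes and their meanings come from the table.

```python
from enum import IntEnum


class AuditExit(IntEnum):
    """Exit codes from the table above."""

    SUCCESS = 0
    HIGH_SEVERITY = 1
    CRITICAL = 2
    INCOMPLETE = 3


def should_block_merge(returncode: int) -> bool:
    """Hypothetical CI policy: block on anything except a clean run."""
    return AuditExit(returncode) is not AuditExit.SUCCESS


for code in (0, 1, 2, 3):
    print(code, AuditExit(code).name, should_block_merge(code))
```

The return code itself could come from something like `subprocess.run(["aud", "full"]).returncode`. Treating code 3 as blocking is a deliberate choice here, since an incomplete analysis says nothing about whether the code is safe.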

Comparison to Other SAST Tools

Feature	TheAuditor	Semgrep	Bandit	SonarQube
Offline-First	✅	❌	✅	❌
Database-Driven	✅ (SQLite)	❌	❌	✅ (PostgreSQL)
Cross-File Taint	✅ (5+ hops)	⚠️ (limited)	❌	✅
Framework-Aware	✅ (15+)	✅	⚠️	✅
AI-Optimized Output	✅ (<65KB chunks)	❌	❌	❌
Graph Analysis	✅ (hotspots, cycles)	❌	❌	✅
ML Risk Prediction	✅	❌	❌	⚠️
False Positive Rate	1-2%	2-5%	3-10%	1-3%
Query Language	SQL	Custom	N/A	Custom
Cost	Free (AGPL-3.0)	Free/Paid	Free	Paid

Key Advantage: TheAuditor's database-first design enables complex queries (e.g., "show all functions that process user input AND call SQL") in milliseconds, while other tools require multiple scans.

Configuration

Runtime Configuration

Create .pf/config.json:

{
  "limits": {
    "max_file_size": 2097152,
    "max_chunk_size": 65536
  },
  "timeouts": {
    "analysis_timeout": 1800,
    "lint_timeout": 300
  }
}

Environment Variables

export THEAUDITOR_LIMITS_MAX_FILE_SIZE=4194304
export THEAUDITOR_TIMEOUTS_ANALYSIS=3600
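
How the file and the environment combine is not spelled out here, so the sketch below makes two explicit assumptions: environment variables take precedence over .pf/config.json, and `THEAUDITOR_TIMEOUTS_ANALYSIS` corresponds to the `analysis_timeout` key. The `resolve_config` helper and the mapping table are hypothetical.

```python
# Values from the .pf/config.json example above.
FILE_CONFIG = {
    "limits": {"max_file_size": 2097152, "max_chunk_size": 65536},
    "timeouts": {"analysis_timeout": 1800, "lint_timeout": 300},
}

# ASSUMED mapping from environment variable to (section, key).
ENV_OVERRIDES = {
    "THEAUDITOR_LIMITS_MAX_FILE_SIZE": ("limits", "max_file_size"),
    "THEAUDITOR_TIMEOUTS_ANALYSIS": ("timeouts", "analysis_timeout"),
}


def resolve_config(file_config, environ):
    """Return a new config where ASSUMED env overrides replace file values."""
    # Copy each section so the caller's dictionaries are not mutated.
    config = {section: dict(values) for section, values in file_config.items()}
    for var, (section, key) in ENV_OVERRIDES.items():
        if var in environ:
            config.setdefault(section, {})[key] = int(environ[var])
    return config


resolved = resolve_config(
    FILE_CONFIG,
    {
        "THEAUDITOR_LIMITS_MAX_FILE_SIZE": "4194304",
        "THEAUDITOR_TIMEOUTS_ANALYSIS": "3600",
    },
)
print(resolved["limits"]["max_file_size"], resolved["timeouts"]["analysis_timeout"])
```

In real use the second argument would be `os.environ`; a plain dictionary is passed here so the example has no side effects.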

Project-Specific Config

Add to pyproject.toml:

[tool.theauditor]
exclude_patterns = ["tests/", "migrations/"]
severity_threshold = "high"

Performance Characteristics

Project Size	Indexing	Analysis	Database Size	Memory
Small (5K LOC)	~30s	~10s	~20MB	~200MB
Medium (20K LOC)	~60s	~30s	~80MB	~500MB
Large (100K LOC)	~180s	~90s	~400MB	~1.5GB
Monorepo (500K+ LOC)	~600s	~300s	~2GB	~4GB

Second Run: 5-10x faster due to AST caching (.pf/.ast_cache/)

Troubleshooting

"Schema mismatch" error

# Regenerate database
aud index --exclude-self

Out of memory

# Reduce batch size
export THEAUDITOR_LIMITS_BATCH_SIZE=100

Slow indexing

# Exclude test files
aud index --exclude-patterns "tests/" "node_modules/"

Windows path issues

Use absolute paths with backslashes:

cd C:\Users\YourName\Desktop\TheAuditor
aud index --root C:\Users\YourName\Desktop\TheAuditor

Contributing

See Contributing.md for development setup, coding standards, and testing guidelines.

Note: Contributions are temporarily paused while legal entity formation is completed. See Contributing.md for details.

Documentation

Architecture: Architecture.md - Complete system architecture and design
How to Use: HowToUse.md - Comprehensive command reference (43 commands)
Contributing: Contributing.md - Development guidelines
Developer Guide: CLAUDE.md - Coding standards and conventions (AI assistant context)
Taint Engine: docs/TAINT_ARCHITECTURE.md - IFDS-based flow analysis
CDK Analysis: docs/CDK_ARCHITECTURE.md - AWS CDK security scanning

License

AGPL-3.0 - See LICENSE file for details.

Credits

Built with:

tree-sitter - AST parsing
scikit-learn - Machine learning
NetworkX - Graph algorithms
Click - CLI framework
SQLite - Database engine

Roadmap

TypeScript/JavaScript CDK support (completed in v1.6.4)
GraphQL analysis and security rules (completed in v1.6.4)
Python framework parity (Django, Flask, FastAPI, Celery) (completed in v1.6.4)
IFDS-based taint analysis with field sensitivity (completed in v1.6.4)
Real-time analysis (file watcher mode)
VS Code extension
GitHub Action for CI/CD
Web UI for visualization
Plugin system for custom rules

Support

Issues: https://github.com/TheAuditorTool/Auditor/issues
Discussions: https://github.com/TheAuditorTool/Auditor/discussions
Documentation: https://github.com/TheAuditorTool/Auditor

Made with precision engineering for AI assistants and security engineers.
