A web application showcasing the advanced capabilities of Google's Gemini Vision API, including YouTube video analysis, structured data extraction, and multimodal reasoning.
This demo highlights three key Gemini Vision capabilities:
- Upload and analyze single images
- Custom prompts for specific questions
- Real-time AI-powered insights
- Support for JPEG, PNG, and WebP formats
- Analyze YouTube videos by URL
- Native Gemini video understanding (no frame extraction needed)
- Comprehensive video content analysis
- Scene detection and transition identification
- Educational content breakdown support
- Works with any public YouTube video
- Extract structured data from images as JSON
- Support for multiple document types:
  - Tables and spreadsheets
  - Receipts and invoices
  - Forms and documents
  - Charts and graphs
  - Business cards
- Automatic OCR with structured output
- JSON response format for easy integration
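Because the extracted data is meant to be consumed as JSON, the model's text reply has to be parsed before it can be used. The sketch below shows one way to do that defensively. It assumes the model may sometimes wrap its JSON in a Markdown code fence; `parseModelJson` is a hypothetical helper and is not taken from this project's `server.js`.

```javascript
// Hypothetical helper: turns a model's text reply into a parsed object.
// Assumption: the reply is either bare JSON or JSON wrapped in a Markdown fence.
const FENCE = "`".repeat(3);

function parseModelJson(text) {
  let body = text.trim();
  if (body.startsWith(FENCE)) {
    // Drop the opening fence line, including any language tag such as "json".
    const firstNewline = body.indexOf("\n");
    body = firstNewline === -1 ? "" : body.slice(firstNewline + 1);
    // Drop the closing fence if one is present.
    const closing = body.lastIndexOf(FENCE);
    if (closing !== -1) body = body.slice(0, closing);
  }
  try {
    return { ok: true, data: JSON.parse(body.trim()) };
  } catch (err) {
    // Keep the raw text so the caller can log or display it.
    return { ok: false, error: err.message, raw: text };
  }
}
```

Returning an `ok` flag instead of throwing lets a caller fall back to showing the raw text when the reply is not valid JSON.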
- Node.js (v14 or higher)
- A Google Gemini API key (available from Google AI Studio)
- Clone or navigate to this directory:
cd gemini-vision
- Install dependencies:
npm install
- Create a .env file in the root directory:
cp .env.example .env
- Add your Gemini API key to the .env file:
GEMINI_API_KEY=your_actual_api_key_here
- Start the server:
npm start
- Open your browser and navigate to:
http://localhost:3000
- Choose a capability to explore:
  - Image Analysis: Upload a single image for detailed analysis
  - Video Summarization: Provide a YouTube URL for comprehensive analysis
  - Structured Data Extraction: Extract structured information from documents
Analyzes an uploaded image using the Gemini Vision API.
Request:
- Content-Type: multipart/form-data
- Body:
  - image: Image file (JPEG, PNG, or WebP, max 10MB)
  - prompt: (Optional) Custom prompt for image analysis
Response:
{
  "success": true,
  "analysis": "Detailed description of the image...",
  "filename": "original-filename.jpg"
}
Analyzes a YouTube video using Gemini's native video understanding.
Request:
- Content-Type: application/json
- Body:
{
  "youtubeUrl": "https://www.youtube.com/watch?v=VIDEO_ID",
  "prompt": "Optional custom prompt for video analysis"
}
Response:
{
  "success": true,
  "summary": "Comprehensive video analysis...",
  "videoUrl": "https://www.youtube.com/watch?v=VIDEO_ID"
}
Extracts structured data from an image and returns it as JSON.
Request:
- Content-Type: multipart/form-data
- Body:
  - image: Image file (JPEG, PNG, or WebP, max 10MB)
  - extractionType: Type of extraction (general, table, receipt, form, chart, card)
Response:
{
  "success": true,
  "extractionType": "receipt",
  "data": {
    "merchant": "Store Name",
    "date": "2025-01-15",
    "total": "49.99",
    "items": [
      {"name": "Item 1", "price": "24.99"},
      {"name": "Item 2", "price": "25.00"}
    ]
  },
  "filename": "receipt.jpg"
}
Health check endpoint to verify API configuration.
Response:
{
  "status": "ok",
  "apiKeyConfigured": true
}
gemini-vision/
├── server.js # Express server with Gemini API integration
├── package.json # Project dependencies
├── .env # Environment variables (not in git)
├── .env.example # Environment variables template
├── .gitignore # Git ignore rules
├── README.md # This file
├── list-models.js # Utility to list available models
├── public/
│   └── index.html # Frontend UI with three capability tabs
└── uploads/ # Temporary upload directory (auto-created)
You can configure the following environment variables in your .env file:
- GEMINI_API_KEY (required): Your Google Gemini API key
- PORT (optional): Server port (default: 3000)
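As an illustration of how these two variables might be read at startup, here is a minimal sketch. `loadConfig` is a hypothetical helper; the project's `server.js` may handle this differently, and the port range check is an added assumption rather than documented behavior.

```javascript
// Hypothetical startup helper: reads the two documented variables from an
// environment object (process.env by default) and applies the documented default.
function loadConfig(env = process.env) {
  const apiKey = (env.GEMINI_API_KEY || "").trim();
  if (!apiKey) {
    // Mirrors the error message quoted in the troubleshooting notes.
    throw new Error("GEMINI_API_KEY not configured");
  }
  // PORT is optional and defaults to 3000.
  const port = env.PORT === undefined || env.PORT === "" ? 3000 : Number(env.PORT);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${env.PORT}`);
  }
  return { apiKey, port };
}
```

Passing the environment as an argument keeps the function easy to exercise without touching the real process environment, for example `loadConfig({ GEMINI_API_KEY: "test-key", PORT: "8080" })`.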
- Educational Content: Analyze tutorial videos and provide step-by-step breakdowns
- Meeting Summaries: Extract key points from recorded meetings
- Content Analysis: Identify scenes and transitions in video content
- Lecture Notes: Summarize educational lectures and presentations
- Works with any public YouTube video
- Invoice Processing: Extract line items, totals, and merchant info
- Form Digitization: Convert paper forms to structured data
- Chart Analysis: Extract data points from graphs and visualizations
- Business Card Scanning: Digitize contact information
- YouTube videos must be PUBLIC (unlisted and private videos won't work)
- The app leverages Gemini's native video understanding capabilities
- Structured data extraction works best with clear, high-resolution images
- The app automatically cleans up temporary files after processing
- File size limit for images: 10MB
"GEMINI_API_KEY not configured" error:
- Make sure you've created a .env file
- Verify your API key is correctly set in the .env file
- Restart the server after adding the API key
"Invalid YouTube URL" error or 403 Forbidden:
- Ensure the URL is from youtube.com or youtu.be
- Make sure the video is PUBLIC (not unlisted or private)
- Try a different public video URL
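A quick client-side check can catch malformed URLs before a request is sent. The sketch below is a hypothetical validator, since the app's own URL check is not shown in this README. The accepted host list (including `m.youtube.com`) is an assumption. It cannot tell whether a video is public, so a 403 is still possible for URLs that pass.

```javascript
// Hypothetical validator: accepts only http(s) URLs on known YouTube hosts.
// The host list is an assumption, not taken from the app's source.
const YOUTUBE_HOSTS = new Set([
  "youtube.com",
  "www.youtube.com",
  "m.youtube.com",
  "youtu.be",
]);

function isYouTubeUrl(input) {
  let url;
  try {
    url = new URL(input);
  } catch {
    return false; // not a parseable absolute URL
  }
  if (url.protocol !== "https:" && url.protocol !== "http:") return false;
  // Compare the exact hostname so look-alike domains are rejected.
  return YOUTUBE_HOSTS.has(url.hostname.toLowerCase());
}
```

Matching on the parsed hostname rather than a substring avoids accepting addresses such as `youtube.com.example.org`.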
Image upload fails:
- Check that your image is under 10MB
- Ensure the file format is JPEG, PNG, or WebP
- Try a different image to rule out corruption
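The first two checks can be expressed as a small pre-flight function. This is a sketch rather than the app's actual validation: `checkImageUpload` is a hypothetical name, the `mimetype` and `size` fields follow the shape of a Multer file object, and treating 10MB as 10 × 1024 × 1024 bytes is an assumption.

```javascript
// Hypothetical pre-flight check for an uploaded image.
// Assumption: "10MB" means 10 * 1024 * 1024 bytes.
const MAX_IMAGE_BYTES = 10 * 1024 * 1024;
const ALLOWED_MIME_TYPES = new Set(["image/jpeg", "image/png", "image/webp"]);

// `file` is expected to carry `mimetype` and `size`, as Multer file objects do.
function checkImageUpload(file) {
  const problems = [];
  if (!ALLOWED_MIME_TYPES.has(file.mimetype)) {
    problems.push(`Unsupported format: ${file.mimetype}`);
  }
  if (file.size > MAX_IMAGE_BYTES) {
    problems.push(`File too large: ${file.size} bytes`);
  }
  return problems; // an empty array means the file passed both checks
}
```

Returning a list of problems instead of a boolean makes it possible to report every failing check in a single error message.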
Port already in use:
- Change the PORT in your .env file
- Or stop the other application using port 3000
- Backend: Node.js, Express
- AI: Google Gemini 3 Pro Preview (latest multimodal model)
- File Upload: Multer
- Frontend: Vanilla HTML, CSS, JavaScript
- Go to Google AI Studio
- Sign in with your Google account
- Click "Create API Key"
- Copy the key and add it to your .env file
- "Summarize this video in 3-5 sentences."
- "Identify all the key scenes and transitions in this video."
- "What is this video teaching or demonstrating? Provide a step-by-step breakdown."
- "Describe the main actions and events that occur in this video."
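Any of these prompts can be sent along with a video URL. The sketch below builds the JSON request body using the `youtubeUrl` and `prompt` fields from the video analysis request format documented earlier. `buildVideoRequestBody` is a hypothetical helper, and omitting `prompt` when it is blank is an assumption about how the optional field should be handled.

```javascript
// Hypothetical helper: builds the JSON body for a video analysis request.
// Field names follow the documented request format; leaving out a blank
// prompt (so the server can apply its own default) is an assumption.
function buildVideoRequestBody(youtubeUrl, prompt) {
  const body = { youtubeUrl };
  if (typeof prompt === "string" && prompt.trim() !== "") {
    body.prompt = prompt.trim();
  }
  return JSON.stringify(body);
}

// Example: pairing a URL with one of the sample prompts.
const exampleBody = buildVideoRequestBody(
  "https://www.youtube.com/watch?v=VIDEO_ID",
  "Summarize this video in 3-5 sentences."
);
```

The resulting string is what would be sent with a `Content-Type: application/json` header.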
- Tutorial videos
- Product demonstrations
- Educational lectures
- Conference talks
- How-to guides
- Tables: Extract headers and row data from spreadsheet images
- Receipts: Parse merchant name, date, items, and totals
- Forms: Extract field labels and values
- Charts: Convert visual data into structured JSON
- Business Cards: Digitize contact information
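Extracted data is only as reliable as the OCR behind it, so a light sanity check on the result can be useful. The sketch below compares the sum of the line items with the reported total for a receipt shaped like the example response (`total` and `items[].price` as strings). `checkReceiptTotal` is a hypothetical helper. It assumes prices have at most two decimal places, and real receipts with tax or discounts may legitimately fail the comparison.

```javascript
// Hypothetical sanity check for a receipt extraction result.
// Assumption: `total` and each `price` are decimal strings, as in the example response.
function checkReceiptTotal(data) {
  // Work in integer cents to avoid floating-point drift when summing.
  const toCents = (value) => Math.round(Number(value) * 100);
  const itemsCents = (data.items || []).reduce(
    (sum, item) => sum + toCents(item.price),
    0
  );
  const totalCents = toCents(data.total);
  // Unparseable values become NaN, which never compares equal, so matches is false.
  return { itemsCents, totalCents, matches: itemsCents === totalCents };
}
```

A mismatch does not prove the extraction is wrong, but it is a cheap signal that a result may deserve a manual look.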
This app uses Gemini 3 Pro Preview (gemini-3-pro-preview), the latest preview model, which supports:
- Multimodal input (text + images + video)
- Large context window (1M tokens)
- JSON structured output
- Native YouTube video analysis (public videos only)
- Multiple image processing in a single request
Note: There is currently no stable Gemini 3 release - only the preview version is available.
To see all available models, run:
node list-models.js
License: MIT
This demo showcases Google Gemini's advanced vision capabilities:
- Native video understanding with YouTube URL support
- Structured data extraction with JSON schema support
- Large-context reasoning across complex visual content
- Educational applications for content breakdown and analysis
No frame extraction or video processing libraries required - Gemini handles video analysis natively!