Gemini Vision Capabilities Demo

A web application showcasing the advanced capabilities of Google's Gemini Vision API, including YouTube video analysis, structured data extraction, and multimodal reasoning.

Features

This demo highlights three key Gemini Vision capabilities:

1. Image Analysis

Upload and analyze single images
Custom prompts for specific questions
Real-time AI-powered insights
Support for JPEG, PNG, and WebP formats

2. Video Summarization (YouTube)

Analyze YouTube videos by URL
Native Gemini video understanding (no frame extraction needed)
Comprehensive video content analysis
Scene detection and transition identification
Educational content breakdown support
Works with any public YouTube video

3. Structured Data Extraction

Extract structured data from images as JSON
Support for multiple document types:
- Tables and spreadsheets
- Receipts and invoices
- Forms and documents
- Charts and graphs
- Business cards
Automatic OCR with structured output
JSON response format for easy integration

Prerequisites

Node.js (v14 or higher)
A Google Gemini API key (see Getting Your Gemini API Key below)

Installation

Clone or navigate to this directory:

cd gemini-vision

Install dependencies:

npm install

Create a .env file in the root directory:

cp .env.example .env

Add your Gemini API key to the .env file:

GEMINI_API_KEY=your_actual_api_key_here

Usage

Start the server:

npm start

Open your browser and navigate to:

http://localhost:3000

Choose a capability to explore:
- Image Analysis: Upload a single image for detailed analysis
- Video Summarization: Provide a YouTube URL for comprehensive analysis
- Structured Data Extraction: Extract structured information from documents

API Endpoints

POST /analyze

Analyzes an uploaded image using the Gemini Vision API.

Request:

Content-Type: multipart/form-data
Body:
- image: Image file (JPEG, PNG, or WebP, max 10MB)
- prompt: (Optional) Custom prompt for image analysis

Response:

{
  "success": true,
  "analysis": "Detailed description of the image...",
  "filename": "original-filename.jpg"
}
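A request to this endpoint could be assembled from a script roughly as follows. This is a client-side sketch, not code from this repository: the helper names `buildAnalyzeForm` and `analyzeImage` are hypothetical, the base URL assumes the default port, and the global `fetch`, `FormData`, and `Blob` assume Node.js 18 or newer. The shape of error responses is not documented here, so the check on `success` is an assumption.

```javascript
// Client-side sketch; assumes Node.js 18+ for the global fetch, FormData, and Blob.
// BASE_URL assumes the default port from this README.
const BASE_URL = 'http://localhost:3000';

// Builds the multipart body described above: an "image" file plus an optional "prompt".
function buildAnalyzeForm(imageBytes, filename, mimeType, prompt) {
  const form = new FormData();
  form.append('image', new Blob([imageBytes], { type: mimeType }), filename);
  if (prompt) form.append('prompt', prompt);
  return form;
}

// Sends the form to POST /analyze and returns the "analysis" string.
// Assumption: a failed request still returns JSON with "success" set to false.
async function analyzeImage(form) {
  const res = await fetch(`${BASE_URL}/analyze`, { method: 'POST', body: form });
  const body = await res.json();
  if (!body.success) throw new Error('Analysis failed');
  return body.analysis;
}

// Example (needs a running server and a real image file):
// const fs = require('fs');
// const form = buildAnalyzeForm(fs.readFileSync('photo.jpg'), 'photo.jpg',
//   'image/jpeg', 'What is in this image?');
// analyzeImage(form).then(console.log);
```

No `Content-Type` header is set by hand here; `fetch` derives the multipart boundary from the `FormData` body.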

POST /analyze-video

Analyzes a YouTube video using Gemini's native video understanding.

Request:

Content-Type: application/json
Body:

{
  "youtubeUrl": "https://www.youtube.com/watch?v=VIDEO_ID",
  "prompt": "Optional custom prompt for video analysis"
}

Response:

{
  "success": true,
  "summary": "Comprehensive video analysis...",
  "videoUrl": "https://www.youtube.com/watch?v=VIDEO_ID"
}
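The JSON request above might be built like this. The helper name `buildVideoRequest` is hypothetical, and the commented call assumes the server is running on the default port with Node.js 18 or newer for the global `fetch`.

```javascript
// Builds the fetch options for POST /analyze-video.
function buildVideoRequest(youtubeUrl, prompt) {
  const payload = { youtubeUrl };
  if (prompt) payload.prompt = prompt; // the prompt is optional
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  };
}

const options = buildVideoRequest(
  'https://www.youtube.com/watch?v=VIDEO_ID',
  'Summarize this video in 3-5 sentences.'
);
console.log(options.body);

// With the server running:
// fetch('http://localhost:3000/analyze-video', options)
//   .then((res) => res.json())
//   .then((body) => console.log(body.summary));
```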

POST /extract-data

Extracts structured data from an image and returns it as JSON.

Request:

Content-Type: multipart/form-data
Body:
- image: Image file (JPEG, PNG, or WebP, max 10MB)
- extractionType: Type of extraction (general, table, receipt, form, chart, card)

Response:

{
  "success": true,
  "extractionType": "receipt",
  "data": {
    "merchant": "Store Name",
    "date": "2025-01-15",
    "total": "49.99",
    "items": [
      {"name": "Item 1", "price": "24.99"},
      {"name": "Item 2", "price": "25.00"}
    ]
  },
  "filename": "receipt.jpg"
}
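Because the prices in this sample arrive as strings, a consumer may want to sanity-check the extracted receipt before using it. The sketch below is one possible check, not part of the app: `toCents` and `checkReceipt` are hypothetical helpers, and the field names (`items`, `price`, `total`) are taken from the sample response above and may differ for other receipts or extraction types.

```javascript
// Converts a price string such as "24.99" to integer cents to avoid float drift.
function toCents(price) {
  return Math.round(parseFloat(price) * 100);
}

// Compares the sum of the item prices with the reported total.
function checkReceipt(data) {
  const itemsCents = data.items.reduce((sum, item) => sum + toCents(item.price), 0);
  const totalCents = toCents(data.total);
  return { itemsCents, totalCents, matches: itemsCents === totalCents };
}

// The "data" object from the sample response above.
const sampleData = {
  merchant: 'Store Name',
  date: '2025-01-15',
  total: '49.99',
  items: [
    { name: 'Item 1', price: '24.99' },
    { name: 'Item 2', price: '25.00' },
  ],
};

console.log(checkReceipt(sampleData));
```

A mismatch does not necessarily mean the extraction failed; real receipts often include tax, tips, or discounts that are not listed as items.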

GET /health

Health check endpoint to verify API configuration.

Response:

{
  "status": "ok",
  "apiKeyConfigured": true
}
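A script that depends on the demo could poll this endpoint before sending work. The following is a minimal sketch; `isServerReady` is a hypothetical helper, and the commented call assumes the default port and Node.js 18 or newer.

```javascript
// Returns true only when the server is up and an API key has been configured.
function isServerReady(health) {
  return health.status === 'ok' && health.apiKeyConfigured === true;
}

console.log(isServerReady({ status: 'ok', apiKeyConfigured: true }));

// With the server running:
// fetch('http://localhost:3000/health')
//   .then((res) => res.json())
//   .then((health) => console.log('ready:', isServerReady(health)));
```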

Project Structure

gemini-vision/
├── server.js           # Express server with Gemini API integration
├── package.json        # Project dependencies
├── .env               # Environment variables (not in git)
├── .env.example       # Environment variables template
├── .gitignore        # Git ignore rules
├── README.md         # This file
├── list-models.js    # Utility to list available models
├── public/
│   └── index.html    # Frontend UI with three capability tabs
└── uploads/          # Temporary upload directory (auto-created)

Configuration

You can configure the following environment variables in your .env file:

GEMINI_API_KEY (required): Your Google Gemini API key
PORT (optional): Server port (default: 3000)
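As an illustration of how these two variables might be read at startup, here is a sketch that is not copied from server.js. `loadConfig` is a hypothetical helper, and it assumes the .env values have already been loaded into `process.env` (for example by the dotenv package).

```javascript
// Reads the two documented variables from an environment object.
// Hypothetical helper: server.js may handle configuration differently.
function loadConfig(env) {
  const apiKey = env.GEMINI_API_KEY;
  if (!apiKey) {
    throw new Error('GEMINI_API_KEY not configured');
  }
  // Fall back to the documented default when PORT is missing or not a number.
  const port = Number.parseInt(env.PORT, 10) || 3000;
  return { apiKey, port };
}

const config = loadConfig({ GEMINI_API_KEY: 'your_actual_api_key_here' });
console.log(config.port);

// In a real server this would typically be: loadConfig(process.env)
```

Passing the environment in as an argument keeps the function easy to test without touching the real process environment.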

Use Cases Demonstrated

Video Summarization (YouTube)

Educational Content: Analyze tutorial videos and provide step-by-step breakdowns
Meeting Summaries: Extract key points from recorded meetings
Content Analysis: Identify scenes and transitions in video content
Lecture Notes: Summarize educational lectures and presentations
Works with any public YouTube video

Structured Data Extraction

Invoice Processing: Extract line items, totals, and merchant info
Form Digitization: Convert paper forms to structured data
Chart Analysis: Extract data points from graphs and visualizations
Business Card Scanning: Digitize contact information

Limitations & Notes

YouTube videos must be PUBLIC (unlisted and private videos won't work)
The app leverages Gemini's native video understanding capabilities
Structured data extraction works best with clear, high-resolution images
The app automatically cleans up temporary files after processing
File size limit for images: 10MB
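The temporary-file cleanup mentioned above can be pictured as a try/finally wrapper around the processing step. This is only a sketch of the general pattern under that assumption; `withCleanup` is a hypothetical name, and the actual cleanup logic in server.js may look different.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Runs work(filePath) and deletes the file afterwards, even if work throws.
async function withCleanup(filePath, work) {
  try {
    return await work(filePath);
  } finally {
    // Ignore errors such as the file already being gone.
    await fs.promises.unlink(filePath).catch(() => {});
  }
}

// Demo with a throwaway file standing in for an upload.
const demoPath = path.join(os.tmpdir(), `upload-demo-${process.pid}.txt`);
fs.writeFileSync(demoPath, 'placeholder upload');

withCleanup(demoPath, async (p) => fs.readFileSync(p, 'utf8')).then((text) => {
  console.log('processed:', text);
  console.log('still on disk:', fs.existsSync(demoPath));
});
```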

Troubleshooting

"GEMINI_API_KEY not configured" error:

Make sure you've created a .env file
Verify your API key is correctly set in the .env file
Restart the server after adding the API key

"Invalid YouTube URL" error or 403 Forbidden:

Ensure the URL is from youtube.com or youtu.be
Make sure the video is PUBLIC (not unlisted or private)
Try a different public video URL
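To rule out a malformed URL before retrying, the host can be checked locally. The sketch below is a hypothetical check, not the validation the server performs, and it cannot tell whether a video is public, unlisted, or private.

```javascript
// Returns true when the input parses as an http(s) URL on youtube.com or youtu.be.
function isYouTubeUrl(input) {
  let url;
  try {
    url = new URL(input);
  } catch {
    return false; // not a parseable URL
  }
  if (url.protocol !== 'https:' && url.protocol !== 'http:') return false;
  const host = url.hostname.toLowerCase();
  return host === 'youtu.be' || host === 'youtube.com' || host.endsWith('.youtube.com');
}

console.log(isYouTubeUrl('https://www.youtube.com/watch?v=VIDEO_ID'));
console.log(isYouTubeUrl('https://example.com/watch?v=VIDEO_ID'));
```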

Image upload fails:

Check that your image is under 10MB
Ensure the file format is JPEG, PNG, or WebP
Try a different image to rule out corruption
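The first two checks can be expressed in code. This is a hypothetical helper rather than the server's own validation: the `mimetype` and `size` fields mirror what Multer attaches to an uploaded file, and "10MB" is assumed to mean 10 * 1024 * 1024 bytes.

```javascript
// Hypothetical helper: server.js may enforce these limits differently.
const ALLOWED_MIME_TYPES = ['image/jpeg', 'image/png', 'image/webp'];
const MAX_IMAGE_BYTES = 10 * 1024 * 1024; // assumption: "10MB" means 10 MiB

// Returns null when the upload is acceptable, otherwise an error message.
function validateImageUpload(file) {
  if (!file) return 'No image file was provided';
  if (!ALLOWED_MIME_TYPES.includes(file.mimetype)) {
    return `Unsupported format: ${file.mimetype}`;
  }
  if (file.size > MAX_IMAGE_BYTES) {
    return 'Image exceeds the 10MB limit';
  }
  return null;
}

console.log(validateImageUpload({ mimetype: 'image/png', size: 2048 }));
console.log(validateImageUpload({ mimetype: 'image/gif', size: 2048 }));
```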

Port already in use:

Change the PORT in your .env file
Or stop the other application using port 3000

Technologies Used

Backend: Node.js, Express
AI: Google Gemini 3 Pro Preview (latest multimodal model)
File Upload: Multer
Frontend: Vanilla HTML, CSS, JavaScript

Getting Your Gemini API Key

Go to Google AI Studio
Sign in with your Google account
Click "Create API Key"
Copy the key and add it to your .env file

Example Use Cases

Video Analysis Prompts

"Summarize this video in 3-5 sentences."
"Identify all the key scenes and transitions in this video."
"What is this video teaching or demonstrating? Provide a step-by-step breakdown."
"Describe the main actions and events that occur in this video."

Example YouTube Videos to Try

Tutorial videos
Product demonstrations
Educational lectures
Conference talks
How-to guides

Structured Data Extraction Types

Tables: Extract headers and row data from spreadsheet images
Receipts: Parse merchant name, date, items, and totals
Forms: Extract field labels and values
Charts: Convert visual data into structured JSON
Business Cards: Digitize contact information

Model Information

This app uses Gemini 3 Pro Preview (gemini-3-pro-preview), the latest preview model, which supports:

Multimodal input (text + images + video)
Large context window (1M tokens)
JSON structured output
Native YouTube video analysis (public videos only)
Multiple image processing in a single request

Note: There is currently no stable Gemini 3 release - only the preview version is available.

To see all available models, run:

node list-models.js

License

MIT

About Gemini Vision

This demo showcases Google Gemini's advanced vision capabilities:

Native video understanding with YouTube URL support
Structured data extraction with JSON schema support
Large-context reasoning across complex visual content
Educational applications for content breakdown and analysis

No frame extraction or video processing libraries required - Gemini handles video analysis natively!


argotdev/gemini-vision
