
argotdev/gemini-vision


Gemini Vision Capabilities Demo

A web application that demonstrates the vision capabilities of Google's Gemini API, including YouTube video analysis, structured data extraction, and multimodal reasoning.

Features

This demo highlights three key Gemini Vision capabilities:

1. Image Analysis

  • Upload and analyze single images
  • Custom prompts for specific questions
  • Real-time AI-powered insights
  • Support for JPEG, PNG, and WebP formats

2. Video Summarization (YouTube)

  • Analyze YouTube videos by URL
  • Native Gemini video understanding (no frame extraction needed)
  • Comprehensive video content analysis
  • Scene detection and transition identification
  • Educational content breakdown support
  • Works with any public YouTube video

3. Structured Data Extraction

  • Extract structured data from images as JSON
  • Support for multiple document types:
    • Tables and spreadsheets
    • Receipts and invoices
    • Forms and documents
    • Charts and graphs
    • Business cards
  • Automatic OCR with structured output
  • JSON response format for easy integration
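When consuming structured output, it can help to parse defensively. The sketch below assumes (this is not confirmed by the project) that a model reply might arrive either as bare JSON or wrapped in a Markdown code fence. `parseModelJson` is a hypothetical helper name, not part of this app.

```javascript
// Hypothetical helper: parse JSON from a model reply that may or may not
// be wrapped in a Markdown code fence.
function parseModelJson(text) {
  const trimmed = text.trim();
  // If the whole reply is one fenced block, keep only its body.
  const fenced = trimmed.match(/^```(?:json)?\s*\n([\s\S]*?)\n```$/);
  const candidate = fenced ? fenced[1] : trimmed;
  try {
    return { ok: true, data: JSON.parse(candidate) };
  } catch (err) {
    // Return the raw text so the caller can log or display it.
    return { ok: false, error: err.message, raw: text };
  }
}
```

Returning a result object instead of throwing lets the caller decide whether to retry, fall back to plain text, or report the failure.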

Prerequisites

  • Node.js (v14 or higher)
  • A Google Gemini API key (available from Google AI Studio; see "Getting Your Gemini API Key" below)

Installation

  1. Clone or navigate to this directory:
cd gemini-vision
  2. Install dependencies:
npm install
  3. Create a .env file in the root directory:
cp .env.example .env
  4. Add your Gemini API key to the .env file:
GEMINI_API_KEY=your_actual_api_key_here

Usage

  1. Start the server:
npm start
  2. Open your browser and navigate to:
http://localhost:3000
  3. Choose a capability to explore:
    • Image Analysis: Upload a single image for detailed analysis
    • Video Summarization: Provide a YouTube URL for comprehensive analysis
    • Structured Data Extraction: Extract structured information from documents

API Endpoints

POST /analyze

Analyzes an uploaded image using the Gemini Vision API.

Request:

  • Content-Type: multipart/form-data
  • Body:
    • image: Image file (JPEG, PNG, or WebP, max 10MB)
    • prompt: (Optional) Custom prompt for image analysis

Response:

{
  "success": true,
  "analysis": "Detailed description of the image...",
  "filename": "original-filename.jpg"
}
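A minimal client sketch for this endpoint, assuming Node.js 18 or newer (for the global `fetch`, `FormData`, and `Blob`) and a server reachable at a base URL such as http://localhost:3000. The function names are illustrative and not part of the project.

```javascript
// Build the multipart form described above: an "image" file part and an
// optional "prompt" text part.
function buildAnalyzeForm(imageBytes, filename, mimeType, prompt) {
  const form = new FormData();
  form.append("image", new Blob([imageBytes], { type: mimeType }), filename);
  if (prompt) form.append("prompt", prompt);
  return form;
}

// Send the form to a running server and return the parsed JSON reply.
// baseUrl is an assumption, for example "http://localhost:3000".
async function analyzeImage(baseUrl, imageBytes, filename, mimeType, prompt) {
  const res = await fetch(`${baseUrl}/analyze`, {
    method: "POST",
    // fetch sets the multipart Content-Type (with boundary) automatically.
    body: buildAnalyzeForm(imageBytes, filename, mimeType, prompt),
  });
  if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
  return res.json();
}
```

With the server running, a call such as `analyzeImage("http://localhost:3000", fs.readFileSync("cat.png"), "cat.png", "image/png")` should resolve to an object shaped like the response shown above.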

POST /analyze-video

Analyzes a YouTube video using Gemini's native video understanding.

Request:

  • Content-Type: application/json
  • Body:
{
  "youtubeUrl": "https://www.youtube.com/watch?v=VIDEO_ID",
  "prompt": "Optional custom prompt for video analysis"
}

Response:

{
  "success": true,
  "summary": "Comprehensive video analysis...",
  "videoUrl": "https://www.youtube.com/watch?v=VIDEO_ID"
}
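The request above could be issued from Node.js 18 or newer roughly as follows. This is a sketch: the helper names are illustrative, and the shape of error responses is not documented here, so the error handling only checks the HTTP status and the `success` flag.

```javascript
// Build fetch options for POST /analyze-video. The prompt is optional,
// so it is only included when provided.
function buildVideoRequest(youtubeUrl, prompt) {
  const payload = { youtubeUrl };
  if (prompt) payload.prompt = prompt;
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  };
}

// Call a running server and return the summary text.
// baseUrl is an assumption, for example "http://localhost:3000".
async function analyzeVideo(baseUrl, youtubeUrl, prompt) {
  const res = await fetch(`${baseUrl}/analyze-video`, buildVideoRequest(youtubeUrl, prompt));
  const json = await res.json();
  if (!res.ok || !json.success) {
    throw new Error(`Video analysis failed with status ${res.status}`);
  }
  return json.summary;
}
```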

POST /extract-data

Extracts structured data from an image and returns it as JSON.

Request:

  • Content-Type: multipart/form-data
  • Body:
    • image: Image file (JPEG, PNG, or WebP, max 10MB)
    • extractionType: Type of extraction (general, table, receipt, form, chart, card)

Response:

{
  "success": true,
  "extractionType": "receipt",
  "data": {
    "merchant": "Store Name",
    "date": "2025-01-15",
    "total": "49.99",
    "items": [
      {"name": "Item 1", "price": "24.99"},
      {"name": "Item 2", "price": "25.00"}
    ]
  },
  "filename": "receipt.jpg"
}
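Because extracted values come from OCR, a consumer may want to sanity-check them. The sketch below assumes the receipt shape shown in the example response (string prices, an `items` array, a `total`) and a receipt with no separate tax or discount lines. Real responses may use different fields, and both function names are hypothetical.

```javascript
// Convert a decimal string such as "24.99" to integer cents to avoid
// floating-point drift when summing.
function toCents(amount) {
  return Math.round(parseFloat(amount) * 100);
}

// Return true when the item prices add up to the stated total.
function receiptTotalMatches(data) {
  const itemsTotal = data.items.reduce((sum, item) => sum + toCents(item.price), 0);
  return itemsTotal === toCents(data.total);
}
```

For the example response above, the two items sum to 4999 cents, which matches the total of "49.99".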

GET /health

Health check endpoint to verify API configuration.

Response:

{
  "status": "ok",
  "apiKeyConfigured": true
}
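One way such a handler could be put together is sketched below. It is an assumption about the implementation rather than the project's actual code: the payload is built by a small pure function, and the Express wiring is shown only as a comment.

```javascript
// Build the health payload from an environment object.
// apiKeyConfigured reports only whether a key is present, never its value.
function healthPayload(env = process.env) {
  return {
    status: "ok",
    apiKeyConfigured: Boolean(env.GEMINI_API_KEY),
  };
}

// In an Express app this might be wired up as:
//   app.get("/health", (req, res) => res.json(healthPayload()));
```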

Project Structure

gemini-vision/
├── server.js           # Express server with Gemini API integration
├── package.json        # Project dependencies
├── .env               # Environment variables (not in git)
├── .env.example       # Environment variables template
├── .gitignore        # Git ignore rules
├── README.md         # This file
├── list-models.js    # Utility to list available models
├── public/
│   └── index.html    # Frontend UI with three capability tabs
└── uploads/          # Temporary upload directory (auto-created)

Configuration

You can configure the following environment variables in your .env file:

  • GEMINI_API_KEY (required): Your Google Gemini API key
  • PORT (optional): Server port (default: 3000)
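A sketch of how these two variables might be read, assuming something like the dotenv package has already copied the .env file into `process.env`. `loadConfig` is a hypothetical name, and the real server may handle a missing key differently.

```javascript
// Read configuration from an environment object.
// A missing key yields apiKey: null so the caller can decide how to react.
function loadConfig(env = process.env) {
  const apiKey = (env.GEMINI_API_KEY || "").trim() || null;
  const parsedPort = Number.parseInt(env.PORT ?? "", 10);
  // Fall back to the documented default when PORT is unset or invalid.
  const port = Number.isInteger(parsedPort) && parsedPort > 0 ? parsedPort : 3000;
  return { apiKey, port };
}
```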

Use Cases Demonstrated

Video Summarization (YouTube)

  • Educational Content: Analyze tutorial videos and provide step-by-step breakdowns
  • Meeting Summaries: Extract key points from recorded meetings
  • Content Analysis: Identify scenes and transitions in video content
  • Lecture Notes: Summarize educational lectures and presentations

Structured Data Extraction

  • Invoice Processing: Extract line items, totals, and merchant info
  • Form Digitization: Convert paper forms to structured data
  • Chart Analysis: Extract data points from graphs and visualizations
  • Business Card Scanning: Digitize contact information

Limitations & Notes

  • YouTube videos must be PUBLIC (unlisted and private videos won't work)
  • The app leverages Gemini's native video understanding capabilities
  • Structured data extraction works best with clear, high-resolution images
  • The app automatically cleans up temporary files after processing
  • File size limit for images: 10MB
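The temporary-file cleanup mentioned above could be done with a `try`/`finally` wrapper along these lines. This is a sketch of one common pattern, not the project's code, and `withCleanup` is a hypothetical name.

```javascript
const fs = require("node:fs/promises");

// Run an async task against an uploaded file, then delete the file
// whether the task succeeded or threw.
async function withCleanup(filePath, task) {
  try {
    return await task(filePath);
  } finally {
    // force: true turns "file already gone" into a no-op.
    await fs.rm(filePath, { force: true });
  }
}
```

Putting the deletion in `finally` means a failed model call does not leave files behind in the upload directory.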

Troubleshooting

"GEMINI_API_KEY not configured" error:

  • Make sure you've created a .env file
  • Verify your API key is correctly set in the .env file
  • Restart the server after adding the API key

"Invalid YouTube URL" error or 403 Forbidden:

  • Ensure the URL is from youtube.com or youtu.be
  • Make sure the video is PUBLIC (not unlisted or private)
  • Try a different public video URL
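To check a URL before submitting it, a host check like the following sketch can be used. The accepted host list is an assumption based on the domains named above (plus the common `www.` and `m.` variants), and the server's own validation may differ. It cannot tell whether a video is public.

```javascript
// Hosts treated as YouTube for this check (an assumed list).
const YOUTUBE_HOSTS = new Set([
  "youtube.com",
  "www.youtube.com",
  "m.youtube.com",
  "youtu.be",
]);

// Return true when value parses as an http(s) URL on a YouTube host.
function isYouTubeUrl(value) {
  let url;
  try {
    url = new URL(value);
  } catch {
    return false; // not a parseable URL at all
  }
  const isHttp = url.protocol === "https:" || url.protocol === "http:";
  return isHttp && YOUTUBE_HOSTS.has(url.hostname);
}
```

Comparing the parsed hostname, rather than searching the string for "youtube.com", avoids accepting look-alike hosts such as youtube.com.example.org.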

Image upload fails:

  • Check that your image is under 10MB
  • Ensure the file format is JPEG, PNG, or WebP
  • Try a different image to rule out corruption
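The first two checks can be reproduced before uploading. The sketch below assumes a file object with `mimetype` and `size` properties, as Multer provides, and interprets "10MB" as 10 × 1024 × 1024 bytes; the server's exact limit may be defined differently. `checkUpload` is a hypothetical name.

```javascript
const ALLOWED_TYPES = new Set(["image/jpeg", "image/png", "image/webp"]);
const MAX_BYTES = 10 * 1024 * 1024; // assumption: "10MB" means 10 MiB

// Return null when the file looks acceptable, otherwise a short reason.
function checkUpload(file) {
  if (!ALLOWED_TYPES.has(file.mimetype)) {
    return `Unsupported file type: ${file.mimetype}`;
  }
  if (file.size > MAX_BYTES) {
    return `File is too large: ${file.size} bytes (limit ${MAX_BYTES})`;
  }
  return null;
}
```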

Port already in use:

  • Change the PORT in your .env file
  • Or stop the other application using port 3000

Technologies Used

  • Backend: Node.js, Express
  • AI: Google Gemini 3 Pro Preview (gemini-3-pro-preview)
  • File Upload: Multer
  • Frontend: Vanilla HTML, CSS, JavaScript

Getting Your Gemini API Key

  1. Go to Google AI Studio
  2. Sign in with your Google account
  3. Click "Create API Key"
  4. Copy the key and add it to your .env file

Example Use Cases

Video Analysis Prompts

  • "Summarize this video in 3-5 sentences."
  • "Identify all the key scenes and transitions in this video."
  • "What is this video teaching or demonstrating? Provide a step-by-step breakdown."
  • "Describe the main actions and events that occur in this video."

Example YouTube Videos to Try

  • Tutorial videos
  • Product demonstrations
  • Educational lectures
  • Conference talks
  • How-to guides

Structured Data Extraction Types

  • Tables: Extract headers and row data from spreadsheet images
  • Receipts: Parse merchant name, date, items, and totals
  • Forms: Extract field labels and values
  • Charts: Convert visual data into structured JSON
  • Business Cards: Digitize contact information

Model Information

This app uses Gemini 3 Pro Preview (gemini-3-pro-preview), a preview model that supports:

  • Multimodal input (text + images + video)
  • Large context window (1M tokens)
  • JSON structured output
  • Native YouTube video analysis (public videos only)
  • Multiple image processing in a single request
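For illustration, a YouTube analysis request to the Gemini REST API might be assembled as in the sketch below. It follows the `generateContent` request shape as I understand it from Google's documentation (a text part plus a `file_data` part whose `file_uri` is the video URL). The app itself may use an SDK instead, so treat the function name and structure as assumptions.

```javascript
const MODEL = "gemini-3-pro-preview";

// Build the URL and fetch options for a generateContent call that pairs
// a text prompt with a public YouTube video.
function buildYouTubeRequest(youtubeUrl, prompt, apiKey) {
  return {
    url: `https://generativelanguage.googleapis.com/v1beta/models/${MODEL}:generateContent`,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "x-goog-api-key": apiKey, // sent as a header, not in the URL
      },
      body: JSON.stringify({
        contents: [
          {
            parts: [
              { text: prompt },
              { file_data: { file_uri: youtubeUrl } },
            ],
          },
        ],
      }),
    },
  };
}
```

The result can be passed to `fetch(request.url, request.options)` on Node.js 18 or newer. No request is sent until that call is made.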

Note: At the time of writing, there is no stable Gemini 3 release; only the preview version is available.

To see all available models, run:

node list-models.js

License

MIT

About Gemini Vision

This demo showcases Google Gemini's advanced vision capabilities:

  • Native video understanding with YouTube URL support
  • Structured data extraction with JSON schema support
  • Large-context reasoning across complex visual content
  • Educational applications for content breakdown and analysis

No frame extraction or video processing libraries are required: Gemini handles video analysis natively.
