User Guide

Project Overview

Welcome to AVA - Galactic StreamHub 🚀✨, a futuristic, voice-first, visually aware AI assistant designed to be a true companion for complex tasks. Created for the Agent Development Kit Hackathon, this project pushes the boundaries of human-AI interaction by combining multimodal models with a multi-agent system.

What Galactic StreamHub Does:

Sees & Understands Your World 👁️‍🗨️: AVA (Advanced Visual Assistant), the primary agent, utilizes your live video stream to analyze your environment and identify objects relevant to your goals.
Listens & Speaks 🗣️: Engage in real-time, bidirectional audio conversations. You can speak your requests, and AVA responds with synthesized voice and streaming text transcriptions.
Orchestrates a Team of Specialized Agents 🤖: AVA acts as the root agent, delegating complex tasks to the ProactiveContextOrchestratorAgent. This orchestrator manages a team:
- EnvironmentalMonitorAgent 🌿: Scans the visual context and user hints to spot opportunities for proactive help.
- ContextualPrecomputationAgent 🧠: If an opportunity is found, this agent formulates suggestions (e.g., "I see you have X, want a recipe?") and pre-fetches relevant information using tools.
- ReactiveTaskDelegatorAgent 🎯: Handles direct user requests or executes tasks based on accepted proactive suggestions, utilizing tools like CocktailDB 🍸, Google Maps 🗺️, or a general Google Search agent 🔍.
Proactive Assistance 💡: AVA can anticipate your needs based on visual context and general queries, offering timely suggestions even before you explicitly ask!
Tool Integration ⚙️: The system leverages external tools (Cocktail Database, Google Maps, Google Search) via the Model Context Protocol (MCP) or as AgentTools, allowing agents to access real-world information and services.
Responsive UI 🖥️: Features a sleek, dark-themed, "out of this world" user interface with 3D animated elements and a dynamic starfield background, providing controls for audio/video streaming and a "Comms Channel" for interactions.
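The agent team described above can be pictured as a small tree. The following is a minimal sketch in plain Python, not the project's actual ADK code: the AgentNode class and the build_team helper are hypothetical names introduced only for illustration, and the tool lists simply repeat what this guide names.

```python
from dataclasses import dataclass, field


@dataclass
class AgentNode:
    """Hypothetical node: one agent, the tools it uses, and its sub-agents."""
    name: str
    tools: list[str] = field(default_factory=list)
    sub_agents: list["AgentNode"] = field(default_factory=list)

    def walk(self, depth: int = 0):
        """Yield (depth, node) pairs in depth-first order."""
        yield depth, self
        for child in self.sub_agents:
            yield from child.walk(depth + 1)


def build_team() -> AgentNode:
    """Assemble the hierarchy named in this guide (illustrative only)."""
    orchestrator = AgentNode(
        name="ProactiveContextOrchestratorAgent",
        sub_agents=[
            AgentNode("EnvironmentalMonitorAgent"),
            # The guide does not say which tools this agent pre-fetches with,
            # so the list is left empty here.
            AgentNode("ContextualPrecomputationAgent"),
            AgentNode(
                "ReactiveTaskDelegatorAgent",
                tools=["CocktailDB", "Google Maps", "Google Search"],
            ),
        ],
    )
    return AgentNode(name="AVA", sub_agents=[orchestrator])


if __name__ == "__main__":
    # Print the tree with two spaces of indentation per level.
    for depth, node in build_team().walk():
        suffix = f" (tools: {', '.join(node.tools)})" if node.tools else ""
        print("  " * depth + node.name + suffix)
```

Running the script prints AVA at the top, the orchestrator one level below it, and the three specialized agents beneath the orchestrator.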

How It Was Built:

Galactic StreamHub is built with Python and FastAPI for the backend, and HTML/CSS/JavaScript for the frontend, with Google's Agent Development Kit (ADK) at its core.

Backend: Uses ADK for agent creation (LlmAgent, plus custom BaseAgent subclasses such as ProactiveContextOrchestratorAgent), tool management (AgentTool, MCPToolset), and session handling (InMemorySessionService). FastAPI serves the application and manages WebSocket connections for real-time communication, all running on asyncio.
Frontend: A custom-designed interface using HTML/CSS and JavaScript (app.js) manages WebSocket communication, user input (text, audio via Web Audio API, video frames via HTMLVideoElement/canvas), and renders agent responses. Vanta.js creates the dynamic animated background.
Multimodal Input: The client captures video frames, which are sent to the Root Agent (AVA). AVA, leveraging Gemini's multimodal capabilities, analyzes these frames to identify seen_items in your environment.
Multi-Agent Orchestration Flow:
1. The Root Agent (AVA) analyzes the user's query and visual context.
2. AVA delegates to the ProactiveContextOrchestratorTool (which wraps the ProactiveContextOrchestratorAgent).
3. The ProactiveContextOrchestratorAgent then manages its sub-agents (EnvironmentalMonitor, ContextualPrecomputation, ReactiveTaskDelegator) to determine if a proactive suggestion is warranted or if a reactive task needs execution, using tools as needed.
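The three steps above can be sketched as a single decision function. This is a simplified stand-in under stated assumptions, not the project's implementation: in the real system the decisions are made by LLM-backed agents, whereas the rules below are hard-coded placeholders. The TurnContext class, every function name, and the choice to handle a direct request before looking for proactive opportunities are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TurnContext:
    """Hypothetical snapshot of one turn, as handed over by the root agent."""
    seen_items: list[str] = field(default_factory=list)
    direct_request: Optional[str] = None
    accepted_suggestion: Optional[str] = None


def monitor_environment(ctx: TurnContext) -> Optional[list[str]]:
    """Stand-in for EnvironmentalMonitorAgent: any visible item is an opportunity."""
    return ctx.seen_items or None


def precompute_suggestion(items: list[str]) -> str:
    """Stand-in for ContextualPrecomputationAgent: phrase a suggestion."""
    return f"I see you have {', '.join(items)}. Want a suggestion?"


def delegate_task(task: str) -> str:
    """Stand-in for ReactiveTaskDelegatorAgent: pretend to run the task."""
    return f"Executing task: {task}"


def orchestrate(ctx: TurnContext) -> dict:
    """Decide between reactive execution, a proactive suggestion, or nothing."""
    # Assumption: an explicit request or an accepted suggestion takes priority.
    task = ctx.direct_request or ctx.accepted_suggestion
    if task:
        return {"mode": "reactive", "result": delegate_task(task)}
    opportunity = monitor_environment(ctx)
    if opportunity:
        return {"mode": "proactive", "suggestion": precompute_suggestion(opportunity)}
    return {"mode": "idle"}


if __name__ == "__main__":
    print(orchestrate(TurnContext(seen_items=["gin", "limes"])))
    print(orchestrate(TurnContext(direct_request="find a cocktail recipe")))
```

The first call yields a proactive suggestion because items are visible and nothing was requested. The second goes straight to reactive execution.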

System Flowchart

The following flowchart provides a visual representation of the system's architecture and the interaction pathways between different agents and components:

P.S. The Task Execution Agent has been replaced by the Proactive Context Orchestrator Agent.

Sample Queries & Interactions

You can type your requests into the message box, or use the voice input feature. Here are some ideas to get you started:

"I have gin, limes, and soda water. What cocktail can I make? If I'm missing something, where can I buy it near me?" (Try turning on your video for AVA to see the items!)

Biomedical Research & Visualization

"Synthesize the latest research on the diagnosis and treatment of non-small cell lung cancer. Find connections to ongoing clinical trials and show me a visual example of a lung nodule from a CT scan."
"Show me a chart of publication trends for 'immunotherapy for melanoma' over the last 10 years."
"What are the common adverse events reported for Ozempic?"

Tips for Interacting with AVA

Be clear and specific in your requests.
For visual tasks, ensure your camera has a clear view of the objects you want AVA to see.
Experiment! Try different combinations of visual cues and queries to see how AVA and its team of agents respond. The system is designed to be flexible and understand context.

Explore and have fun interacting with AVA! If you have any issues, please refer to the main README or contact the project maintainers.

AVA's Proactive Research: From Visual Cues to In-Depth Answers

AVA can proactively assist you by bridging the gap between your immediate environment and in-depth research. When you ask a general question about something AVA sees (for example, through your device's camera), it can intelligently formulate a more specific and powerful research query on your behalf. This means you get comprehensive information without needing to be a research expert.

Here are a few examples of how this works:

Example 1: Understanding Your Medication

The Scene: You have a prescription pill bottle, for instance, Metformin, on your desk.

Your Question: You ask, "Hey AVA, what's the latest research on this?"

AVA's Proactive Assistance:

AVA, using its visual recognition capabilities, identifies the medication as "Metformin" from the label.
Understanding that your goal is to find the "latest research," AVA doesn't just do a basic search. Instead, it formulates a detailed query for its advanced research agent, such as: "Find recent studies on the mechanisms and new therapeutic uses of Metformin beyond diabetes."

Benefit to You: This feature allows AVA to transform a simple, everyday question about a visible object into a targeted research task. You receive a comprehensive summary, potentially including data on efficacy or side effects, without needing to know the specific scientific terms or research areas to investigate.

Example 2: Checking Supplement Safety

The Scene: A container of a fitness supplement, like Creatine Monohydrate, is on your kitchen counter.

Your Question: You mention, "I take this for my workouts. Is it actually safe long-term?"

AVA's Proactive Assistance:

AVA identifies the supplement as "Creatine" and notes your concern about its long-term safety.
It then proactively creates a precise research query, like: "Summarize clinical trial data and FDA reports on the long-term safety profile and renal impact of Creatine Monohydrate supplementation."

Benefit to You: AVA translates your casual concern about safety into a scientifically relevant question. It seeks out information from reliable sources like clinical trials and FDA reports, providing you with a well-rounded answer based on scientific evidence.

Example 3: Investigating Health Food Claims

The Scene: A box of turmeric tea is next to your mug.

Your Question: You ponder, "People say this is good for inflammation. Is there any real science behind that?"

AVA's Proactive Assistance:

AVA recognizes "turmeric" and understands you're looking for scientific evidence regarding its effect on "inflammation."
Going a step further, AVA knows that the key active compound in turmeric is curcumin. It generates a sophisticated research query, such as: "Investigate the efficacy of curcumin (from turmeric) as an anti-inflammatory agent and compare its reported effectiveness against NSAIDs like ibuprofen."

Benefit to You: This demonstrates AVA's advanced capabilities. It not only identifies the item but also understands its biochemical context (turmeric contains curcumin). It can then initiate research that delves into the active compounds and even compares their effectiveness to other known treatments, giving you a much richer and more nuanced understanding.
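The pattern shared by the three examples above (recognized item plus casual question in, targeted research query out) can be illustrated with a toy template function. This is only a sketch: AVA itself relies on a language model for this step, while the keyword rules, the ACTIVE_COMPOUNDS table, and the formulate_research_query function below are hypothetical simplifications whose wording is adapted from the examples.

```python
# Hypothetical lookup from a recognized item to its key active compound.
# Only the turmeric entry comes from this guide.
ACTIVE_COMPOUNDS = {"turmeric": "curcumin"}


def formulate_research_query(seen_item: str, user_question: str) -> str:
    """Turn a recognized item and a casual question into a research query."""
    subject = seen_item
    compound = ACTIVE_COMPOUNDS.get(seen_item.lower())
    if compound:
        # Research the active compound, but keep the original item for context.
        subject = f"{compound} (from {seen_item.lower()})"

    question = user_question.lower()
    # Placeholder intent detection by keyword; a real system would use an LLM.
    if "safe" in question:
        return f"Summarize clinical trial data on the long-term safety profile of {subject}."
    if "inflammation" in question:
        return f"Investigate the efficacy of {subject} as an anti-inflammatory agent."
    if "latest research" in question:
        return f"Find recent studies on the mechanisms and new therapeutic uses of {subject}."
    return f"Summarize the current scientific evidence on {subject}."


if __name__ == "__main__":
    print(formulate_research_query("Metformin", "What's the latest research on this?"))
    print(formulate_research_query("Turmeric", "Is this really good for inflammation?"))
```

The design point is that the user's wording only selects the kind of query, while the visual context supplies the subject. That split is what lets a vague question like "what's the latest research on this?" become a specific one.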