Files
B.Tech-Project-III/thirdeye/docs/additional.md
2026-04-05 00:43:23 +05:30

1622 lines
59 KiB
Markdown

# ThirdEye — Additional Milestones (11→13)
> **Prerequisite: Milestone 10 must be COMPLETE and PASSING. These features layer on top of the existing working system.**
> **Same rule: Do NOT skip milestones. Do NOT skip tests. Every test must PASS before moving to the next milestone.**
---
## PRE-WORK: Dependencies & Config Updates
### Step 0.1 — Add new dependencies
Append to `thirdeye/requirements.txt`:
```
python-docx==1.1.2
PyPDF2==3.0.1
tavily-python==0.5.0
beautifulsoup4==4.12.3
```
Install:
```bash
cd thirdeye && pip install python-docx PyPDF2 tavily-python beautifulsoup4
```
### Step 0.2 — Add new env vars
Append to `thirdeye/.env`:
```bash
# Web Search (Milestone 12)
TAVILY_API_KEY=your_tavily_key_here
# Feature Flags
ENABLE_DOCUMENT_INGESTION=true
ENABLE_WEB_SEARCH=true
ENABLE_LINK_FETCH=true
```
**Get the key:** https://tavily.com → Sign up → Dashboard → API Keys (free tier: 1000 searches/month, no credit card)
### Step 0.3 — Update config.py
Add these lines at the bottom of `thirdeye/backend/config.py`:
```python
# Web Search
TAVILY_API_KEY = os.getenv("TAVILY_API_KEY")
# Feature Flags
ENABLE_DOCUMENT_INGESTION = os.getenv("ENABLE_DOCUMENT_INGESTION", "true").lower() == "true"
ENABLE_WEB_SEARCH = os.getenv("ENABLE_WEB_SEARCH", "true").lower() == "true"
ENABLE_LINK_FETCH = os.getenv("ENABLE_LINK_FETCH", "true").lower() == "true"
```
---
## MILESTONE 11: Document & PDF Ingestion into RAG (105%)
**Goal:** When a PDF, DOCX, or TXT file is shared in a Telegram group, the bot auto-downloads it, extracts text, chunks it, and stores the chunks in ChromaDB as `document_knowledge` signals — queryable alongside chat signals.
### Step 11.1 — Create the Document Ingestor
Create file: `thirdeye/backend/agents/document_ingestor.py`
```python
"""Document Ingestor — extracts text from PDFs, DOCX, TXT and chunks for RAG storage."""
import os
import logging
import uuid
from datetime import datetime
logger = logging.getLogger("thirdeye.agents.document_ingestor")
# --- Text Extraction ---
def extract_text_from_pdf(file_path: str) -> list[dict]:
"""Extract text from PDF, returns list of {page: int, text: str}."""
from PyPDF2 import PdfReader
pages = []
try:
reader = PdfReader(file_path)
for i, page in enumerate(reader.pages):
text = page.extract_text()
if text and text.strip():
pages.append({"page": i + 1, "text": text.strip()})
except Exception as e:
logger.error(f"PDF extraction failed for {file_path}: {e}")
return pages
def extract_text_from_docx(file_path: str) -> list[dict]:
"""Extract text from DOCX, returns list of {page: 1, text: str} (DOCX has no real pages)."""
from docx import Document
try:
doc = Document(file_path)
full_text = "\n".join([p.text for p in doc.paragraphs if p.text.strip()])
if full_text.strip():
return [{"page": 1, "text": full_text.strip()}]
except Exception as e:
logger.error(f"DOCX extraction failed for {file_path}: {e}")
return []
def extract_text_from_txt(file_path: str) -> list[dict]:
"""Extract text from plain text file."""
try:
with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
text = f.read().strip()
if text:
return [{"page": 1, "text": text}]
except Exception as e:
logger.error(f"TXT extraction failed for {file_path}: {e}")
return []
EXTRACTORS = {
".pdf": extract_text_from_pdf,
".docx": extract_text_from_docx,
".txt": extract_text_from_txt,
".md": extract_text_from_txt,
".csv": extract_text_from_txt,
".json": extract_text_from_txt,
".log": extract_text_from_txt,
}
def extract_text(file_path: str) -> list[dict]:
"""Route to correct extractor based on file extension."""
ext = os.path.splitext(file_path)[1].lower()
extractor = EXTRACTORS.get(ext)
if not extractor:
logger.warning(f"Unsupported file type: {ext} ({file_path})")
return []
return extractor(file_path)
# --- Chunking ---
def chunk_text(text: str, max_chars: int = 1500, overlap_chars: int = 200) -> list[str]:
"""
Split text into overlapping chunks.
Uses paragraph boundaries when possible, falls back to sentence boundaries,
then hard character splits. ~1500 chars ≈ ~375 tokens for embedding.
"""
if len(text) <= max_chars:
return [text]
# Split by paragraphs first
paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
chunks = []
current_chunk = ""
for para in paragraphs:
# If adding this paragraph stays under limit, add it
if len(current_chunk) + len(para) + 1 <= max_chars:
current_chunk = (current_chunk + "\n" + para).strip()
else:
# Save current chunk if it has content
if current_chunk:
chunks.append(current_chunk)
# If single paragraph is too long, split it by sentences
if len(para) > max_chars:
sentences = para.replace(". ", ".\n").split("\n")
sub_chunk = ""
for sent in sentences:
if len(sub_chunk) + len(sent) + 1 <= max_chars:
sub_chunk = (sub_chunk + " " + sent).strip()
else:
if sub_chunk:
chunks.append(sub_chunk)
sub_chunk = sent
if sub_chunk:
current_chunk = sub_chunk
else:
current_chunk = ""
else:
current_chunk = para
if current_chunk:
chunks.append(current_chunk)
# Add overlap: prepend last N chars of previous chunk to each subsequent chunk
if overlap_chars > 0 and len(chunks) > 1:
overlapped = [chunks[0]]
for i in range(1, len(chunks)):
prev_tail = chunks[i - 1][-overlap_chars:]
# Find a word boundary in the overlap
space_idx = prev_tail.find(" ")
if space_idx > 0:
prev_tail = prev_tail[space_idx + 1:]
overlapped.append(prev_tail + " " + chunks[i])
chunks = overlapped
return chunks
# --- Main Ingestion ---
def ingest_document(
file_path: str,
group_id: str,
shared_by: str = "Unknown",
filename: str = None,
) -> list[dict]:
"""
Full pipeline: extract text → chunk → produce signal dicts ready for ChromaDB.
Args:
file_path: Path to the downloaded file on disk
group_id: Telegram group ID
shared_by: Who shared the file
filename: Original filename (for metadata)
Returns:
List of signal dicts ready for store_signals()
"""
if filename is None:
filename = os.path.basename(file_path)
# Extract
pages = extract_text(file_path)
if not pages:
logger.warning(f"No text extracted from {filename}")
return []
# Chunk each page
signals = []
total_chunks = 0
for page_data in pages:
page_num = page_data["page"]
chunks = chunk_text(page_data["text"])
for chunk_idx, chunk_text_str in enumerate(chunks):
if len(chunk_text_str.strip()) < 30:
continue # Skip tiny chunks
signal = {
"id": str(uuid.uuid4()),
"type": "document_knowledge",
"summary": f"[{filename} p{page_num}] {chunk_text_str[:150]}...",
"entities": [f"@{shared_by}", filename],
"severity": "low",
"status": "reference",
"sentiment": "neutral",
"urgency": "none",
"raw_quote": chunk_text_str,
"timestamp": datetime.utcnow().isoformat(),
"group_id": group_id,
"lens": "document",
"keywords": [filename, f"page_{page_num}", "document", shared_by],
}
signals.append(signal)
total_chunks += 1
logger.info(f"Ingested {filename}: {len(pages)} pages → {total_chunks} chunks for group {group_id}")
return signals
```
### Step 11.2 — Add document handler to the Telegram bot
Open `thirdeye/backend/bot/bot.py` and add the following.
**Add import at the top (after existing imports):**
```python
import os
import tempfile
from backend.config import ENABLE_DOCUMENT_INGESTION
from backend.agents.document_ingestor import ingest_document
from backend.db.chroma import store_signals
```
**Add this handler function (after `handle_message`):**
```python
async def handle_document(update: Update, context: ContextTypes.DEFAULT_TYPE):
"""Process documents/files shared in groups."""
if not ENABLE_DOCUMENT_INGESTION:
return
if not update.message or not update.message.document:
return
if not update.message.chat.type in ("group", "supergroup"):
return
doc = update.message.document
filename = doc.file_name or "unknown_file"
ext = os.path.splitext(filename)[1].lower()
# Only process supported file types
supported = {".pdf", ".docx", ".txt", ".md", ".csv", ".json", ".log"}
if ext not in supported:
return
# Size guard: skip files over 10MB
if doc.file_size and doc.file_size > 10 * 1024 * 1024:
logger.warning(f"Skipping oversized file: {filename} ({doc.file_size} bytes)")
return
group_id = str(update.message.chat_id)
shared_by = update.message.from_user.first_name or update.message.from_user.username or "Unknown"
_group_names[group_id] = update.message.chat.title or group_id
try:
# Download file to temp directory
tg_file = await doc.get_file()
tmp_dir = tempfile.mkdtemp()
file_path = os.path.join(tmp_dir, filename)
await tg_file.download_to_drive(file_path)
logger.info(f"Downloaded {filename} from {shared_by} in {_group_names.get(group_id, group_id)}")
# Ingest into knowledge base
signals = ingest_document(file_path, group_id, shared_by=shared_by, filename=filename)
if signals:
store_signals(group_id, signals)
await update.message.reply_text(
f"📄 Ingested *{filename}* — {len(signals)} knowledge chunks stored.\n"
f"You can now `/ask` questions about this document.",
parse_mode=None
)
else:
logger.info(f"No extractable text in {filename}")
except Exception as e:
logger.error(f"Document ingestion failed for {filename}: {e}")
finally:
# Cleanup temp file
try:
if os.path.exists(file_path):
os.remove(file_path)
os.rmdir(tmp_dir)
except Exception:
pass
```
**Register the handler in `run_bot()` — add this line BEFORE the text message handler:**
```python
app.add_handler(MessageHandler(filters.Document.ALL, handle_document))
```
So the handler section in `run_bot()` now looks like:
```python
app.add_handler(CommandHandler("start", cmd_start))
app.add_handler(CommandHandler("ask", cmd_ask))
app.add_handler(CommandHandler("digest", cmd_digest))
app.add_handler(CommandHandler("lens", cmd_lens))
app.add_handler(MessageHandler(filters.Document.ALL, handle_document)) # NEW
app.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, handle_message))
```
### ✅ TEST MILESTONE 11
Create file: `thirdeye/scripts/test_m11.py`
```python
"""Test Milestone 11: Document & PDF ingestion into RAG."""
import os, sys, tempfile
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..'))
def test_text_extraction():
"""Test extraction from each supported file type."""
from backend.agents.document_ingestor import extract_text
# Test 1: Plain text file
print("Testing TXT extraction...")
tmp = tempfile.NamedTemporaryFile(suffix=".txt", mode="w", delete=False, encoding="utf-8")
tmp.write("This is a test document.\nIt has multiple lines.\nThird line about PostgreSQL decisions.")
tmp.close()
pages = extract_text(tmp.name)
assert len(pages) == 1, f"Expected 1 page, got {len(pages)}"
assert "PostgreSQL" in pages[0]["text"]
print(f" ✅ TXT extraction works ({len(pages[0]['text'])} chars)")
os.unlink(tmp.name)
# Test 2: DOCX file
print("Testing DOCX extraction...")
try:
from docx import Document
doc = Document()
doc.add_paragraph("Architecture Decision: We chose Redis for caching.")
doc.add_paragraph("Tech Debt: The API keys are hardcoded in config.py.")
doc.add_paragraph("Promise: Dashboard mockups will be ready by Friday March 21st.")
tmp_docx = tempfile.NamedTemporaryFile(suffix=".docx", delete=False)
doc.save(tmp_docx.name)
tmp_docx.close()
pages = extract_text(tmp_docx.name)
assert len(pages) == 1, f"Expected 1 page, got {len(pages)}"
assert "Redis" in pages[0]["text"]
print(f" ✅ DOCX extraction works ({len(pages[0]['text'])} chars)")
os.unlink(tmp_docx.name)
except ImportError:
print(" ⚠️ python-docx not installed, skipping DOCX test")
# Test 3: PDF file
print("Testing PDF extraction...")
try:
from PyPDF2 import PdfWriter
from io import BytesIO
# PyPDF2 can't easily create PDFs with text from scratch,
# so we test the extractor handles an empty/corrupt file gracefully
tmp_pdf = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
writer = PdfWriter()
writer.add_blank_page(width=612, height=792)
writer.write(tmp_pdf)
tmp_pdf.close()
pages = extract_text(tmp_pdf.name)
# Blank page = no text, should return empty gracefully
print(f" ✅ PDF extraction handles blank PDF gracefully ({len(pages)} pages with text)")
os.unlink(tmp_pdf.name)
except ImportError:
print(" ⚠️ PyPDF2 not installed, skipping PDF test")
# Test 4: Unsupported file type
print("Testing unsupported file type...")
tmp_bin = tempfile.NamedTemporaryFile(suffix=".exe", delete=False)
tmp_bin.write(b"binary data")
tmp_bin.close()
pages = extract_text(tmp_bin.name)
assert len(pages) == 0, "Should return empty for unsupported types"
print(f" ✅ Unsupported file type handled gracefully")
os.unlink(tmp_bin.name)
def test_chunking():
"""Test text chunking logic."""
from backend.agents.document_ingestor import chunk_text
print("\nTesting chunking...")
# Test 1: Short text — should NOT be split
short = "This is a short text that fits in one chunk."
chunks = chunk_text(short, max_chars=1500)
assert len(chunks) == 1, f"Short text should be 1 chunk, got {len(chunks)}"
print(f" ✅ Short text → 1 chunk")
# Test 2: Long text — should be split
long_text = "\n".join([f"This is paragraph {i} with enough content to fill the chunk. " * 5 for i in range(20)])
chunks = chunk_text(long_text, max_chars=500, overlap_chars=100)
assert len(chunks) > 1, f"Long text should produce multiple chunks, got {len(chunks)}"
print(f" ✅ Long text ({len(long_text)} chars) → {len(chunks)} chunks")
# Test 3: All chunks are within size limit (with some tolerance for overlap)
for i, c in enumerate(chunks):
# Overlap can push slightly over max_chars, that's fine
assert len(c) < 800, f"Chunk {i} too large: {len(c)} chars"
print(f" ✅ All chunks are within size bounds")
# Test 4: Empty text
chunks = chunk_text("")
assert len(chunks) == 1 and chunks[0] == "", "Empty text should return ['']"
print(f" ✅ Empty text handled")
def test_full_ingestion():
"""Test full ingestion pipeline: file → extract → chunk → signals → store → query."""
from backend.agents.document_ingestor import ingest_document
from backend.db.chroma import store_signals, query_signals
print("\nTesting full ingestion pipeline...")
# Create a realistic test document
tmp = tempfile.NamedTemporaryFile(suffix=".txt", mode="w", delete=False, encoding="utf-8")
tmp.write("""API Specification v2.0 — Acme Project
Authentication:
All endpoints require OAuth 2.0 Bearer tokens. The recommended flow for SPAs is Authorization Code with PKCE.
Tokens expire after 3600 seconds. Refresh tokens are valid for 30 days.
Endpoints:
POST /api/v2/orders — Create a new order. Requires 'orders:write' scope.
GET /api/v2/orders/{id} — Retrieve order details. Requires 'orders:read' scope.
DELETE /api/v2/orders/{id} — Cancel an order. Only allowed within 24 hours of creation.
Rate Limits:
Standard tier: 100 requests per minute.
Enterprise tier: 1000 requests per minute.
Rate limit headers (X-RateLimit-Remaining) are included in every response.
Compliance:
All data must be encrypted at rest using AES-256.
PII fields are redacted in logs automatically.
GDPR deletion requests must be processed within 72 hours.
The compliance deadline for the new data residency requirements is April 1st 2026.
""")
tmp.close()
group_id = "test_doc_m11"
# Ingest
signals = ingest_document(tmp.name, group_id, shared_by="Priya", filename="api_spec_v2.txt")
assert len(signals) > 0, f"Expected signals, got {len(signals)}"
print(f" ✅ Ingestion produced {len(signals)} signals")
# Verify signal structure
for s in signals:
assert s["type"] == "document_knowledge"
assert s["group_id"] == group_id
assert "@Priya" in s["entities"]
assert "api_spec_v2.txt" in s["entities"]
print(f" ✅ All signals have correct type and metadata")
# Store in ChromaDB
store_signals(group_id, signals)
print(f" ✅ Stored {len(signals)} document signals in ChromaDB")
# Query: can we find document content?
results = query_signals(group_id, "What authentication method is recommended?")
assert len(results) > 0, "No results for auth query"
found_auth = any("oauth" in r["document"].lower() or "auth" in r["document"].lower() for r in results)
assert found_auth, "Expected to find OAuth/auth info in results"
print(f" ✅ Query 'authentication method' returns relevant results")
results2 = query_signals(group_id, "What is the compliance deadline?")
assert len(results2) > 0, "No results for compliance query"
found_compliance = any("april" in r["document"].lower() or "compliance" in r["document"].lower() for r in results2)
assert found_compliance, "Expected to find compliance deadline in results"
print(f" ✅ Query 'compliance deadline' returns relevant results")
results3 = query_signals(group_id, "rate limits")
assert len(results3) > 0, "No results for rate limits query"
print(f" ✅ Query 'rate limits' returns {len(results3)} results")
# Cleanup
os.unlink(tmp.name)
import chromadb
from backend.config import CHROMA_DB_PATH
client = chromadb.PersistentClient(path=CHROMA_DB_PATH)
try:
client.delete_collection(f"ll_{group_id}")
print(f" ✅ Cleaned up test collection")
except:
pass
def test_mixed_query():
"""Test that document signals AND chat signals coexist and are both queryable."""
from backend.agents.document_ingestor import ingest_document
from backend.pipeline import process_message_batch, query_knowledge
from backend.db.chroma import store_signals
import asyncio
print("\nTesting mixed query (documents + chat signals)...")
group_id = "test_mixed_m11"
# 1. Ingest a document
tmp = tempfile.NamedTemporaryFile(suffix=".txt", mode="w", delete=False, encoding="utf-8")
tmp.write("Architecture Decision Record: The team has selected Redis for session caching due to sub-millisecond latency.")
tmp.close()
doc_signals = ingest_document(tmp.name, group_id, shared_by="Priya", filename="adr_001.txt")
store_signals(group_id, doc_signals)
os.unlink(tmp.name)
# 2. Process some chat messages (that mention a DIFFERENT topic)
chat_messages = [
{"sender": "Alex", "text": "The timeout bug on checkout is back. Third time this sprint.", "timestamp": "2026-03-20T10:00:00Z"},
{"sender": "Sam", "text": "I think it's a database connection pool issue.", "timestamp": "2026-03-20T10:05:00Z"},
]
chat_signals = asyncio.run(process_message_batch(group_id, chat_messages))
# 3. Query for document knowledge
answer1 = asyncio.run(query_knowledge(group_id, "What caching solution was selected?"))
assert "redis" in answer1.lower() or "caching" in answer1.lower(), f"Expected Redis/caching mention, got: {answer1[:100]}"
print(f" ✅ Document query works: {answer1[:80]}...")
# 4. Query for chat knowledge
answer2 = asyncio.run(query_knowledge(group_id, "What bugs have been reported?"))
assert "timeout" in answer2.lower() or "bug" in answer2.lower(), f"Expected timeout/bug mention, got: {answer2[:100]}"
print(f" ✅ Chat query works alongside documents: {answer2[:80]}...")
# Cleanup
import chromadb
from backend.config import CHROMA_DB_PATH
client = chromadb.PersistentClient(path=CHROMA_DB_PATH)
try:
client.delete_collection(f"ll_{group_id}")
except:
pass
print(f" ✅ Mixed query (document + chat) both return correct results")
test_text_extraction()
test_chunking()
test_full_ingestion()
test_mixed_query()
print("\n🎉 MILESTONE 11 PASSED — Document & PDF ingestion working")
```
Run: `cd thirdeye && python scripts/test_m11.py`
**Expected output:** All ✅ checks. Documents are extracted, chunked, stored in ChromaDB, and queryable alongside chat-extracted signals.
---
## MILESTONE 12: Tavily Web Search Tool (110%)
**Goal:** The Query Agent gains a web search fallback. When internal knowledge is insufficient OR the question is clearly about external/general topics, it calls Tavily for fresh web context. Also adds a `/search` command for explicit web search.
### Step 12.1 — Create the web search module
Create file: `thirdeye/backend/agents/web_search.py`
```python
"""Web Search Agent — Tavily integration for real-time web context."""
import logging
from backend.config import TAVILY_API_KEY, ENABLE_WEB_SEARCH
logger = logging.getLogger("thirdeye.agents.web_search")
_tavily_client = None
def _get_client():
global _tavily_client
if _tavily_client is None and TAVILY_API_KEY and len(TAVILY_API_KEY) > 5:
try:
from tavily import TavilyClient
_tavily_client = TavilyClient(api_key=TAVILY_API_KEY)
logger.info("Tavily client initialized")
except ImportError:
logger.error("tavily-python not installed. Run: pip install tavily-python")
except Exception as e:
logger.error(f"Tavily client init failed: {e}")
return _tavily_client
async def search_web(query: str, max_results: int = 5) -> list[dict]:
"""
Search the web using Tavily and return structured results.
Args:
query: Search query string
max_results: Max results to return (1-10)
Returns:
List of {title, url, content, score} dicts, sorted by relevance
"""
if not ENABLE_WEB_SEARCH:
logger.info("Web search is disabled via feature flag")
return []
client = _get_client()
if not client:
logger.warning("Tavily client not available (missing API key or install)")
return []
try:
response = client.search(
query=query,
max_results=max_results,
search_depth="basic", # "basic" is faster + free-tier friendly; "advanced" for deeper
include_answer=False,
include_raw_content=False,
)
results = []
for r in response.get("results", []):
results.append({
"title": r.get("title", ""),
"url": r.get("url", ""),
"content": r.get("content", ""),
"score": r.get("score", 0.0),
})
logger.info(f"Tavily returned {len(results)} results for: {query[:60]}")
return results
except Exception as e:
logger.error(f"Tavily search failed: {e}")
return []
def format_search_results_for_llm(results: list[dict]) -> str:
"""Format Tavily results into context string for the Query Agent."""
if not results:
return ""
parts = []
for i, r in enumerate(results):
content_preview = r["content"][:500] if r["content"] else "No content"
parts.append(
f"[Web Result {i+1}] {r['title']}\n"
f"Source: {r['url']}\n"
f"Content: {content_preview}"
)
return "\n\n".join(parts)
```
### Step 12.2 — Update `query_knowledge` in pipeline.py to use web search
Open `thirdeye/backend/pipeline.py` and **replace** the existing `query_knowledge` function with:
```python
async def query_knowledge(group_id: str, question: str, force_web_search: bool = False) -> str:
"""
Query the knowledge base with natural language, with optional web search fallback.
Flow:
1. Search internal knowledge base (ChromaDB)
2. If results are weak OR question is clearly external, also search the web
3. LLM synthesizes both sources into a final answer
"""
from backend.providers import call_llm
from backend.agents.web_search import search_web, format_search_results_for_llm
from backend.config import ENABLE_WEB_SEARCH
# Step 1: Internal RAG search
results = query_signals(group_id, question, n_results=8)
# Format internal context
internal_context = ""
if results:
context_parts = []
for i, r in enumerate(results):
meta = r["metadata"]
source_label = "Document" if meta.get("type") == "document_knowledge" else "Chat Signal"
context_parts.append(
f"[{source_label} {i+1}] Type: {meta.get('type', 'unknown')} | "
f"Severity: {meta.get('severity', 'unknown')} | "
f"Time: {meta.get('timestamp', 'unknown')}\n"
f"Content: {r['document']}\n"
f"Entities: {meta.get('entities', '[]')}"
)
internal_context = "\n\n".join(context_parts)
# Step 2: Decide whether to invoke web search
web_context = ""
used_web = False
# Determine if internal results are strong enough
has_strong_internal = (
len(results) >= 2
and results[0].get("relevance_score", 0) > 0.5
)
# Heuristics for when web search adds value
web_keywords = [
"latest", "current", "best practice", "industry", "how does",
"compare", "what is", "standard", "benchmark", "trend",
"security", "vulnerability", "update", "news", "release",
]
question_lower = question.lower()
wants_external = any(kw in question_lower for kw in web_keywords)
should_search_web = (
ENABLE_WEB_SEARCH
and (force_web_search or not has_strong_internal or wants_external)
)
if should_search_web:
web_results = await search_web(question, max_results=3)
if web_results:
web_context = format_search_results_for_llm(web_results)
used_web = True
# Step 3: Build combined prompt
if not internal_context and not web_context:
return "I don't have any information about that in the knowledge base yet, and web search didn't return relevant results. The group needs more conversation for me to learn from."
combined_context = ""
if internal_context:
combined_context += f"=== INTERNAL KNOWLEDGE BASE (from team conversations & documents) ===\n\n{internal_context}\n\n"
if web_context:
combined_context += f"=== WEB SEARCH RESULTS ===\n\n{web_context}\n\n"
system_prompt = """You are the Query Agent for ThirdEye. Answer questions using the provided context.
RULES:
1. PRIORITIZE internal knowledge base results — they come from the team's own conversations and documents.
2. Use web search results to SUPPLEMENT or provide additional context, not to override team decisions.
3. Clearly distinguish sources: "Based on your team's discussion..." vs "According to web sources..."
4. If info doesn't exist in any context, say so clearly.
5. Be concise — 2-4 sentences unless more is needed.
6. Format for Telegram (plain text, no markdown headers).
7. If you cite web sources, include the source name (not the full URL)."""
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": f"Context:\n\n{combined_context}\n\nQuestion: {question}"},
]
try:
result = await call_llm("fast_large", messages, temperature=0.3, max_tokens=600)
answer = result["content"]
# Append a subtle indicator of sources used
sources = []
if internal_context:
sources.append("knowledge base")
if used_web:
sources.append("web search")
answer += f"\n\n📌 Sources: {' + '.join(sources)}"
return answer
except Exception as e:
logger.error(f"Query agent failed: {e}")
return "Sorry, I encountered an error while searching. Please try again."
```
### Step 12.3 — Add `/search` command to the bot
Open `thirdeye/backend/bot/bot.py` and add:
**Add import at the top:**
```python
from backend.agents.web_search import search_web, format_search_results_for_llm
from backend.config import ENABLE_WEB_SEARCH
```
**Add this command handler (after `cmd_lens`):**
```python
async def cmd_search(update: Update, context: ContextTypes.DEFAULT_TYPE):
"""Handle /search [query] — explicit web search."""
if not ENABLE_WEB_SEARCH:
await update.message.reply_text("🔍 Web search is currently disabled.")
return
if not context.args:
await update.message.reply_text("Usage: /search [your query]\nExample: /search FastAPI rate limiting best practices")
return
query = " ".join(context.args)
await update.message.reply_text(f"🌐 Searching the web for: {query}...")
try:
results = await search_web(query, max_results=3)
if not results:
await update.message.reply_text("No web results found. Try a different query.")
return
parts = [f"🌐 Web Search: {query}\n"]
for i, r in enumerate(results):
snippet = r["content"][:200] + "..." if len(r["content"]) > 200 else r["content"]
parts.append(f"{i+1}. {r['title']}\n{snippet}\n🔗 {r['url']}\n")
await update.message.reply_text("\n".join(parts))
except Exception as e:
await update.message.reply_text(f"Search failed: {str(e)[:100]}")
```
**Register the handler in `run_bot()` — add this line with the other CommandHandlers:**
```python
app.add_handler(CommandHandler("search", cmd_search))
```
**Update the `/start` welcome message to include the new commands:**
```python
async def cmd_start(update: Update, context: ContextTypes.DEFAULT_TYPE):
"""Welcome message."""
await update.message.reply_text(
"👁️ *ThirdEye* — Conversation Intelligence Engine\n\n"
"I'm now listening to this group and extracting intelligence from your conversations.\n\n"
"Commands:\n"
"/ask [question] — Ask about your team's knowledge\n"
"/search [query] — Search the web for external info\n"
"/digest — Get an intelligence summary\n"
"/lens [mode] — Set detection mode (dev/product/client/community)\n"
"/alerts — View active warnings\n\n"
"📄 Share documents (PDF, DOCX, TXT) — I'll ingest them into the knowledge base.\n"
"🔗 Share links — I'll fetch and store their content.\n\n"
"I work passively — no need to tag me. I'll alert you when I spot patterns or issues.",
parse_mode=None
)
```
### ✅ TEST MILESTONE 12
Create file: `thirdeye/scripts/test_m12.py`
```python
"""Test Milestone 12: Tavily web search integration."""
import asyncio, os, sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..'))
async def test_tavily_connection():
"""Test that Tavily API is reachable and returns results."""
from backend.agents.web_search import search_web
print("Testing Tavily API connection...")
results = await search_web("FastAPI rate limiting best practices", max_results=3)
if not results:
print(" ⚠️ No results returned (check TAVILY_API_KEY in .env)")
print(" ⚠️ If key is missing, get one at: https://tavily.com")
return False
assert len(results) > 0, "Expected at least 1 result"
assert results[0]["title"], "Result missing title"
assert results[0]["url"], "Result missing URL"
assert results[0]["content"], "Result missing content"
print(f" ✅ Tavily returned {len(results)} results")
for r in results:
print(f" - {r['title'][:60]} ({r['url'][:50]}...)")
return True
async def test_format_results():
"""Test result formatting for LLM context."""
from backend.agents.web_search import search_web, format_search_results_for_llm
print("\nTesting result formatting...")
results = await search_web("Python async programming", max_results=2)
if results:
formatted = format_search_results_for_llm(results)
assert "[Web Result 1]" in formatted
assert "Source:" in formatted
assert len(formatted) > 50
print(f" ✅ Formatted context: {len(formatted)} chars")
else:
print(" ⚠️ Skipped (no results to format)")
async def test_query_with_web_fallback():
"""Test that query_knowledge uses web search when internal KB is empty."""
from backend.pipeline import query_knowledge
print("\nTesting query with web search fallback...")
# Use a group with no data — forces web search fallback
empty_group = "test_empty_web_m12"
answer = await query_knowledge(empty_group, "What is the latest version of Python?")
print(f" Answer: {answer[:150]}...")
# Should have used web search since internal KB is empty
assert len(answer) > 20, f"Answer too short: {answer}"
assert "sources" in answer.lower() or "web" in answer.lower() or "python" in answer.lower(), \
"Expected web-sourced answer about Python"
print(f" ✅ Web fallback produced a meaningful answer")
async def test_query_prefers_internal():
"""Test that internal knowledge is preferred over web when available."""
from backend.pipeline import process_message_batch, query_knowledge, set_lens
print("\nTesting internal knowledge priority over web...")
group_id = "test_internal_prio_m12"
set_lens(group_id, "dev")
# Seed some very specific internal knowledge
messages = [
{"sender": "Alex", "text": "Team decision: We are using Python 3.11 specifically, not 3.12, because of the ML library compatibility issue.", "timestamp": "2026-03-20T10:00:00Z"},
{"sender": "Priya", "text": "Confirmed, 3.11 is locked in. I've updated the Dockerfile.", "timestamp": "2026-03-20T10:05:00Z"},
]
await process_message_batch(group_id, messages)
answer = await query_knowledge(group_id, "What Python version are we using?")
print(f" Answer: {answer[:150]}...")
# Should reference internal knowledge (3.11) not latest web info
assert "3.11" in answer or "python" in answer.lower(), \
f"Expected internal knowledge about Python 3.11, got: {answer[:100]}"
print(f" ✅ Internal knowledge (Python 3.11) is prioritized in answer")
# Cleanup
import chromadb
from backend.config import CHROMA_DB_PATH
client = chromadb.PersistentClient(path=CHROMA_DB_PATH)
try:
client.delete_collection(f"ll_{group_id}")
except:
pass
async def test_explicit_search():
"""Test the /search style direct web search."""
from backend.agents.web_search import search_web
print("\nTesting explicit web search (for /search command)...")
results = await search_web("OWASP top 10 2025", max_results=3)
if results:
assert len(results) <= 3
print(f" ✅ Explicit search returned {len(results)} results")
for r in results:
print(f" - {r['title'][:60]}")
else:
print(" ⚠️ No results (Tavily key may be missing)")
async def main():
tavily_ok = await test_tavily_connection()
if tavily_ok:
await test_format_results()
await test_query_with_web_fallback()
await test_query_prefers_internal()
await test_explicit_search()
print("\n🎉 MILESTONE 12 PASSED — Web search integration working")
else:
print("\n⚠️ MILESTONE 12 PARTIAL — Tavily API key not configured")
print(" The code is correct but needs a valid TAVILY_API_KEY in .env")
print(" Get one free at: https://tavily.com")
asyncio.run(main())
```
Run: `cd thirdeye && python scripts/test_m12.py`
**Expected output:** All ✅ checks. Tavily returns results. Internal knowledge is prioritized over web results. Web search fills gaps when knowledge base is empty.
---
## MILESTONE 13: Link Fetch & Ingestion (115%)
**Goal:** When a URL is shared in a Telegram group, the bot attempts to fetch the page content, summarize it with an LLM, and store the summary as a `link_knowledge` signal in ChromaDB. Fails gracefully and silently if the link is inaccessible.
### Step 13.1 — Create the Link Fetcher
Create file: `thirdeye/backend/agents/link_fetcher.py`
```python
"""Link Fetcher — extracts, summarizes, and stores content from URLs shared in chat."""
import re
import uuid
import logging
import asyncio
from datetime import datetime
import httpx
from bs4 import BeautifulSoup
from backend.providers import call_llm
from backend.config import ENABLE_LINK_FETCH
logger = logging.getLogger("thirdeye.agents.link_fetcher")
# Patterns to skip (images, downloads, social media embeds, etc.)
SKIP_PATTERNS = [
r"\.(png|jpg|jpeg|gif|svg|webp|ico|bmp)(\?.*)?$",
r"\.(zip|tar|gz|rar|7z|exe|msi|dmg|apk|deb)(\?.*)?$",
r"\.(mp3|mp4|avi|mov|mkv|wav|flac)(\?.*)?$",
r"^https?://(www\.)?(twitter|x)\.com/.*/status/",
r"^https?://(www\.)?instagram\.com/p/",
r"^https?://(www\.)?tiktok\.com/",
r"^https?://(www\.)?youtube\.com/shorts/",
r"^https?://t\.me/", # Other Telegram links
]
SKIP_COMPILED = [re.compile(p, re.IGNORECASE) for p in SKIP_PATTERNS]
def extract_urls(text: str) -> list[str]:
"""Extract all HTTP/HTTPS URLs from a text string."""
url_pattern = re.compile(
r"https?://[^\s<>\"')\]},;]+"
)
urls = url_pattern.findall(text)
# Clean trailing punctuation
cleaned = []
for url in urls:
url = url.rstrip(".,;:!?)")
if len(url) > 10:
cleaned.append(url)
return cleaned
def should_fetch(url: str) -> bool:
"""Decide if a URL is worth fetching (skip images, downloads, social embeds)."""
for pattern in SKIP_COMPILED:
if pattern.search(url):
return False
return True
async def fetch_url_content(url: str, timeout: float = 15.0) -> dict | None:
"""
Fetch a URL and extract main text content.
Returns:
{title, text, url} or None if fetch fails
"""
try:
async with httpx.AsyncClient(
follow_redirects=True,
timeout=timeout,
headers={
"User-Agent": "Mozilla/5.0 (compatible; ThirdEye/1.0; +https://thirdeye.dev)",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
},
) as client:
response = await client.get(url)
if response.status_code != 200:
logger.info(f"URL returned {response.status_code}: {url[:80]}")
return None
content_type = response.headers.get("content-type", "")
if "text/html" not in content_type and "application/xhtml" not in content_type:
logger.info(f"Skipping non-HTML content ({content_type}): {url[:80]}")
return None
html = response.text
except httpx.TimeoutException:
logger.info(f"URL timed out: {url[:80]}")
return None
except Exception as e:
logger.info(f"URL fetch failed ({type(e).__name__}): {url[:80]}")
return None
# Parse HTML
try:
soup = BeautifulSoup(html, "html.parser")
# Extract title
title = ""
if soup.title and soup.title.string:
title = soup.title.string.strip()
# Remove script, style, nav, footer, header elements
for tag in soup(["script", "style", "nav", "footer", "header", "aside", "noscript", "form"]):
tag.decompose()
# Try to find main content area
main = soup.find("main") or soup.find("article") or soup.find("div", {"role": "main"})
if main:
text = main.get_text(separator="\n", strip=True)
else:
text = soup.get_text(separator="\n", strip=True)
# Clean up
lines = [line.strip() for line in text.split("\n") if line.strip()]
text = "\n".join(lines)
# Skip if too little content
if len(text) < 100:
logger.info(f"Too little text content ({len(text)} chars): {url[:80]}")
return None
# Truncate very long content
if len(text) > 8000:
text = text[:8000] + "\n\n[Content truncated]"
return {
"title": title or url,
"text": text,
"url": url,
}
except Exception as e:
logger.warning(f"HTML parsing failed for {url[:80]}: {e}")
return None
async def summarize_content(title: str, text: str, url: str) -> str:
"""Use LLM to create a concise summary of fetched content."""
# Limit text sent to LLM
text_preview = text[:3000]
messages = [
{"role": "system", "content": """You are a content summarizer for ThirdEye.
Given the title and text of a web page, produce a concise 2-4 sentence summary that captures the key information.
Focus on: main topic, key facts, any actionable insights, any deadlines or decisions mentioned.
Respond with ONLY the summary text, nothing else."""},
{"role": "user", "content": f"Title: {title}\nURL: {url}\n\nContent:\n{text_preview}"},
]
try:
result = await call_llm("fast_small", messages, temperature=0.2, max_tokens=300)
return result["content"].strip()
except Exception as e:
logger.warning(f"Link summarization failed: {e}")
# Fallback: use first 200 chars of text
return text[:200] + "..."
async def process_links_from_message(
text: str,
group_id: str,
shared_by: str = "Unknown",
) -> list[dict]:
"""
Full pipeline: extract URLs from message → fetch → summarize → produce signals.
Designed to be called in the background (non-blocking to the main message pipeline).
Returns:
List of signal dicts ready for store_signals()
"""
if not ENABLE_LINK_FETCH:
return []
urls = extract_urls(text)
fetchable = [u for u in urls if should_fetch(u)]
if not fetchable:
return []
signals = []
# Process up to 3 links per message to avoid overload
for url in fetchable[:3]:
try:
content = await fetch_url_content(url)
if not content:
continue
summary = await summarize_content(content["title"], content["text"], url)
signal = {
"id": str(uuid.uuid4()),
"type": "link_knowledge",
"summary": f"[Link: {content['title'][:80]}] {summary[:200]}",
"entities": [f"@{shared_by}", url[:100]],
"severity": "low",
"status": "reference",
"sentiment": "neutral",
"urgency": "none",
"raw_quote": summary,
"timestamp": datetime.utcnow().isoformat(),
"group_id": group_id,
"lens": "link",
"keywords": [content["title"][:50], "link", "web", shared_by],
}
signals.append(signal)
logger.info(f"Link ingested: {content['title'][:50]} ({url[:60]})")
except Exception as e:
logger.warning(f"Link processing failed for {url[:60]}: {e}")
continue
return signals
```
### Step 13.2 — Integrate link fetching into the Telegram bot
Open `thirdeye/backend/bot/bot.py` and add:
**Add import at the top:**
```python
from backend.agents.link_fetcher import extract_urls, process_links_from_message
from backend.config import ENABLE_LINK_FETCH
```
**Modify the existing `handle_message` function** to add link detection at the end. Replace the entire `handle_message` function with:
```python
async def handle_message(update: Update, context: ContextTypes.DEFAULT_TYPE):
"""Process every text message in groups."""
if not update.message or not update.message.text:
return
if not update.message.chat.type in ("group", "supergroup"):
return
group_id = str(update.message.chat_id)
_group_names[group_id] = update.message.chat.title or group_id
text = update.message.text
sender = update.message.from_user.first_name or update.message.from_user.username or "Unknown"
msg = {
"sender": sender,
"text": text,
"timestamp": update.message.date.isoformat(),
"message_id": update.message.message_id,
}
_buffers[group_id].append(msg)
# Process when buffer reaches batch size
if len(_buffers[group_id]) >= BATCH_SIZE:
batch = _buffers[group_id]
_buffers[group_id] = []
try:
signals = await process_message_batch(group_id, batch)
if signals:
logger.info(f"Processed batch: {len(signals)} signals from {_group_names.get(group_id, group_id)}")
except Exception as e:
logger.error(f"Pipeline error: {e}")
# Background: process links if message contains URLs
if ENABLE_LINK_FETCH and extract_urls(text):
asyncio.create_task(_process_links_background(text, group_id, sender))
async def _process_links_background(text: str, group_id: str, sender: str):
"""Process links from a message in the background (non-blocking)."""
try:
link_signals = await process_links_from_message(text, group_id, shared_by=sender)
if link_signals:
store_signals(group_id, link_signals)
logger.info(f"Stored {len(link_signals)} link signals for {group_id}")
except Exception as e:
logger.error(f"Background link processing failed: {e}")
```
### ✅ TEST MILESTONE 13
Create file: `thirdeye/scripts/test_m13.py`
```python
"""Test Milestone 13: Link fetch & ingestion."""
import asyncio, os, sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..'))
def test_url_extraction():
"""Test URL extraction from message text."""
from backend.agents.link_fetcher import extract_urls
print("Testing URL extraction...")
# Test 1: Simple URL
urls = extract_urls("Check this out https://example.com/article")
assert len(urls) == 1
assert urls[0] == "https://example.com/article"
print(f" ✅ Simple URL extracted")
# Test 2: Multiple URLs
urls = extract_urls("See https://github.com/issue/123 and also https://docs.python.org/3/library/asyncio.html for reference")
assert len(urls) == 2
print(f" ✅ Multiple URLs extracted: {len(urls)}")
# Test 3: URL with trailing punctuation
urls = extract_urls("Visit https://example.com/page.")
assert len(urls) == 1
assert not urls[0].endswith(".")
print(f" ✅ Trailing punctuation stripped")
# Test 4: No URLs
urls = extract_urls("This message has no links at all")
assert len(urls) == 0
print(f" ✅ No URLs returns empty list")
# Test 5: URL with query params
urls = extract_urls("https://example.com/search?q=test&page=2")
assert len(urls) == 1
assert "q=test" in urls[0]
print(f" ✅ URL with query params preserved")
def test_should_fetch():
"""Test URL filtering logic."""
from backend.agents.link_fetcher import should_fetch
print("\nTesting URL filter (should_fetch)...")
# Should fetch
assert should_fetch("https://github.com/org/repo/issues/347") == True
assert should_fetch("https://docs.python.org/3/library/asyncio.html") == True
assert should_fetch("https://blog.example.com/how-to-rate-limit") == True
print(f" ✅ Valid URLs pass filter")
# Should NOT fetch
assert should_fetch("https://example.com/photo.png") == False
assert should_fetch("https://example.com/image.jpg?size=large") == False
assert should_fetch("https://example.com/release.zip") == False
assert should_fetch("https://example.com/video.mp4") == False
print(f" ✅ Image/download/media URLs filtered out")
# Social media skips
assert should_fetch("https://t.me/somechannel/123") == False
print(f" ✅ Social media URLs filtered out")
async def test_fetch_content():
"""Test fetching actual web page content."""
from backend.agents.link_fetcher import fetch_url_content
print("\nTesting URL content fetch...")
# Test 1: Fetch a reliable public page
content = await fetch_url_content("https://httpbin.org/html")
if content:
assert content["text"], "Expected text content"
assert content["url"] == "https://httpbin.org/html"
print(f" ✅ Fetched httpbin.org/html: {len(content['text'])} chars, title='{content['title'][:40]}'")
else:
print(f" ⚠️ httpbin.org unreachable (network may be restricted)")
# Test 2: Graceful failure on non-existent page
content = await fetch_url_content("https://httpbin.org/status/404")
assert content is None, "Expected None for 404 page"
print(f" ✅ 404 page returns None (graceful failure)")
# Test 3: Graceful failure on timeout
content = await fetch_url_content("https://httpbin.org/delay/30", timeout=2.0)
assert content is None, "Expected None for timeout"
print(f" ✅ Timeout returns None (graceful failure)")
# Test 4: Graceful failure on invalid domain
content = await fetch_url_content("https://this-domain-definitely-does-not-exist-12345.com")
assert content is None, "Expected None for invalid domain"
print(f" ✅ Invalid domain returns None (graceful failure)")
async def test_summarization():
"""Test LLM summarization of fetched content."""
from backend.agents.link_fetcher import summarize_content
print("\nTesting content summarization...")
sample_title = "Understanding Rate Limiting in FastAPI"
sample_text = """Rate limiting is a technique to control the number of requests a client can make to an API.
In FastAPI, you can implement rate limiting using middleware or third-party packages like slowapi.
The most common approach is the token bucket algorithm, which allows burst traffic while maintaining
an average rate. For production systems, consider using Redis as a backend for distributed rate limiting
across multiple server instances. Key considerations include: setting appropriate limits per endpoint,
using different limits for authenticated vs anonymous users, and returning proper 429 status codes
with Retry-After headers."""
summary = await summarize_content(sample_title, sample_text, "https://example.com/rate-limiting")
assert len(summary) > 20, f"Summary too short: {summary}"
assert len(summary) < 1000, f"Summary too long: {len(summary)} chars"
print(f" ✅ Summary generated: {summary[:100]}...")
async def test_full_link_pipeline():
"""Test full pipeline: message with URL → fetch → summarize → store → query."""
from backend.agents.link_fetcher import process_links_from_message
from backend.db.chroma import store_signals, query_signals
print("\nTesting full link ingestion pipeline...")
group_id = "test_links_m13"
# Simulate a message with a URL
# Using httpbin.org/html which returns a simple HTML page
message_text = "Check out this page for reference: https://httpbin.org/html"
signals = await process_links_from_message(message_text, group_id, shared_by="Sam")
if signals:
assert len(signals) > 0
assert signals[0]["type"] == "link_knowledge"
assert signals[0]["group_id"] == group_id
assert "@Sam" in signals[0]["entities"]
print(f" ✅ Link pipeline produced {len(signals)} signals")
# Store and query
store_signals(group_id, signals)
results = query_signals(group_id, "what was shared from the web")
assert len(results) > 0, "Expected query results after storing link signals"
print(f" ✅ Link signals stored and queryable ({len(results)} results)")
# Cleanup
import chromadb
from backend.config import CHROMA_DB_PATH
client = chromadb.PersistentClient(path=CHROMA_DB_PATH)
try:
client.delete_collection(f"ll_{group_id}")
except:
pass
else:
print(f" ⚠️ No signals produced (httpbin.org may be unreachable in this environment)")
async def test_mixed_with_chat_and_docs():
"""Test that link signals coexist with chat and document signals."""
from backend.agents.link_fetcher import process_links_from_message
from backend.agents.document_ingestor import ingest_document
from backend.pipeline import process_message_batch, query_knowledge, set_lens
from backend.db.chroma import store_signals
import tempfile
print("\nTesting all three signal types together...")
group_id = "test_all_sources_m13"
set_lens(group_id, "dev")
# 1. Chat signals
chat_messages = [
{"sender": "Alex", "text": "We decided to use PostgreSQL for the main DB.", "timestamp": "2026-03-20T10:00:00Z"},
{"sender": "Priya", "text": "I'll set up the schema and run migrations today.", "timestamp": "2026-03-20T10:05:00Z"},
]
await process_message_batch(group_id, chat_messages)
print(f" ✅ Chat signals stored")
# 2. Document signals
tmp = tempfile.NamedTemporaryFile(suffix=".txt", mode="w", delete=False, encoding="utf-8")
tmp.write("Security Policy: All API endpoints must use OAuth 2.0. JWT tokens expire after 1 hour.")
tmp.close()
doc_signals = ingest_document(tmp.name, group_id, shared_by="Priya", filename="security_policy.txt")
store_signals(group_id, doc_signals)
os.unlink(tmp.name)
print(f" ✅ Document signals stored")
# 3. Link signals
link_signals = await process_links_from_message(
"Relevant: https://httpbin.org/html",
group_id,
shared_by="Sam"
)
if link_signals:
store_signals(group_id, link_signals)
print(f" ✅ Link signals stored")
else:
print(f" ⚠️ Link signals skipped (network restriction)")
# 4. Query across all sources
answer = await query_knowledge(group_id, "What database are we using?")
assert "postgres" in answer.lower() or "database" in answer.lower()
print(f" ✅ Chat knowledge queryable: {answer[:80]}...")
answer2 = await query_knowledge(group_id, "What is the security policy?")
assert "oauth" in answer2.lower() or "jwt" in answer2.lower() or "security" in answer2.lower()
print(f" ✅ Document knowledge queryable: {answer2[:80]}...")
# Cleanup
import chromadb
from backend.config import CHROMA_DB_PATH
client = chromadb.PersistentClient(path=CHROMA_DB_PATH)
try:
client.delete_collection(f"ll_{group_id}")
except:
pass
print(f" ✅ All three signal types coexist and are queryable")
async def main():
test_url_extraction()
test_should_fetch()
await test_fetch_content()
await test_summarization()
await test_full_link_pipeline()
await test_mixed_with_chat_and_docs()
print("\n🎉 MILESTONE 13 PASSED — Link fetch & ingestion working")
asyncio.run(main())
```
Run: `cd thirdeye && python scripts/test_m13.py`
**Expected output:** All ✅ checks. URLs are extracted, content is fetched (with graceful failures for 404/timeout/invalid), summaries are generated, signals are stored, and they're queryable alongside chat and document signals.
---
## MILESTONE SUMMARY (Updated)
| # | Milestone | What You Have | % |
|---|---|---|---|
| 0 | Scaffolding | Folders, deps, env vars, all API keys | 0% |
| 1 | Provider Router | Multi-provider LLM calls with fallback | 10% |
| 2 | ChromaDB + Embeddings | Store and retrieve signals with vector search | 20% |
| 3 | Core Agents | Signal Extractor + Classifier + Context Detector | 30% |
| 4 | Full Pipeline | Messages → Extract → Classify → Store → Query | 45% |
| 5 | Intelligence Layer | Pattern detection + Cross-group analysis | 60% |
| 6 | Telegram Bot | Live bot processing group messages | 70% |
| 7 | FastAPI + Dashboard API | REST API serving all data | 85% |
| 8 | Unified Runner | Bot + API running together | 90% |
| 9 | Demo Data | 3 groups seeded with realistic data | 95% |
| 10 | Polish & Demo Ready | README, rehearsed demo, everything working | 100% |
| **11** | **Document & PDF Ingestion** | **PDFs/DOCX/TXT shared in groups → chunked → stored in RAG** | **105%** |
| **12** | **Tavily Web Search** | **Query Agent searches web when KB is empty or question is external** | **110%** |
| **13** | **Link Fetch & Ingestion** | **URLs in messages → fetched → summarized → stored as signals** | **115%** |
---
## FILE CHANGE SUMMARY
### New Files Created
```
thirdeye/backend/agents/document_ingestor.py # Milestone 11
thirdeye/backend/agents/web_search.py # Milestone 12
thirdeye/backend/agents/link_fetcher.py # Milestone 13
thirdeye/scripts/test_m11.py # Milestone 11 test
thirdeye/scripts/test_m12.py # Milestone 12 test
thirdeye/scripts/test_m13.py # Milestone 13 test
```
### Existing Files Modified
```
thirdeye/requirements.txt # Pre-work: 4 new deps
thirdeye/.env # Pre-work: TAVILY_API_KEY + feature flags
thirdeye/backend/config.py # Pre-work: new config vars
thirdeye/backend/bot/bot.py # M11: handle_document, M12: cmd_search, M13: link detection
thirdeye/backend/pipeline.py # M12: updated query_knowledge with web search
```
### Updated Repo Structure (additions only)
```
thirdeye/
├── backend/
│ ├── agents/
│ │ ├── document_ingestor.py # NEW — PDF/DOCX/TXT extraction + chunking
│ │ ├── web_search.py # NEW — Tavily web search integration
│ │ └── link_fetcher.py # NEW — URL extraction, fetch, summarize
│ └── bot/
│ └── bot.py # MODIFIED — document handler, /search cmd, link detection
└── scripts/
├── test_m11.py # NEW — document ingestion tests
├── test_m12.py # NEW — web search tests
└── test_m13.py # NEW — link fetch tests
```
---
## UPDATED COMMANDS REFERENCE
```
/start — Welcome message (updated with new features)
/ask [q] — Query knowledge base (now with web search fallback)
/search [q] — NEW: Explicit web search via Tavily
/digest — Intelligence summary
/lens [mode] — Set/check detection lens
/alerts — View active warnings
PASSIVE (no command needed):
• Text messages → batched → signal extraction (existing)
• Document drops → downloaded → chunked → stored (NEW)
• URLs in messages → fetched → summarized → stored (NEW)
```
---
*Every milestone has a test. Every test must pass. No skipping.*