Document Question Answering

Overview

One of the most popular use cases for LLMs is question answering (Q/A) over a large corpus of unstructured data.

Fixie makes it possible to quickly build Q/A Agents by automatically crawling a set of developer-provided URLs, chunking the data, generating embeddings, and storing them in a vector database for efficient retrieval. For data sets that aren't easily available to a web crawler, Fixie also supports custom corpora.

To get started, let's look at a simple CodeShotAgent that answers questions about Python:

import fixieai

BASE_PROMPT = """I am a helpful assistant that can answer questions \
  about Python. I try to be as concise as possible \
  in my answers while still effectively answering the question."""

URLS = [
    "https://docs.python.org/3.11/*"
]

DOCUMENTS = [fixieai.DocumentCorpus(urls=URLS)]
agent = fixieai.CodeShotAgent(BASE_PROMPT, [], DOCUMENTS, conversational=True)

You'll notice that we created a new variable, URLS, that contains a list of URL patterns that we would like Fixie to automatically crawl. (See Specifying URLs for details about how this works.)

With the URLs defined, we instantiate a new CodeShotAgent with our BASE_PROMPT, an empty FewShots list, a set of DOCUMENTS that point to our URLs, and a flag for whether we want a conversational Agent. Conversational agents keep previous questions and answers in memory, so we recommend enabling this for Q/A Agents.

When a DocumentCorpus is provided, FewShots are optional (hence the empty array that we passed in). If you need FewShots, see Using Fewshots with Docs.

Once ready, you can deploy your agent using fixie deploy. The agent will deploy immediately, but will not answer questions until indexing is complete. See Monitoring Indexing Status for more.

Specifying URLs

You can specify URLs in two forms: static or wildcard.

Kind     | Example                                                 | Use case
Static   | https://docs.python.org/3.11/tutorial/introduction.html | Index only the specified URL.
Wildcard | https://docs.python.org/3.11/*                          | Index the specified URL and all of its subpages.

When you specify a wildcard, Fixie will:

  • Crawl the site, respecting a sitemap if it exists.
  • Limit the crawl to pages that start with the specified pattern. For example:
# Wildcard pattern: https://foo.com/bar/*

https://foo.com/bar/index.html

    Links to -> https://foo.com/bar/nested/index.html
    Links to -> https://foo.com/bar/deep/nested/index.html
        These pages will be indexed.

    Links to -> https://foo.com/index.html
        This page will not be indexed, because it doesn't start with the wildcard pattern.

    Links to -> https://different-domain.com
        This page will not be indexed, because it doesn't start with the wildcard pattern.
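
The prefix rule above can be sketched in a few lines of Python. This is only an illustration of the matching behavior described here, not Fixie's actual crawler code; the helper name in_crawl_scope is hypothetical.

```python
def in_crawl_scope(url: str, pattern: str) -> bool:
    """Return True if `url` falls under `pattern` (hypothetical helper)."""
    if pattern.endswith("*"):
        # Wildcard: keep anything that starts with the text before the '*'.
        return url.startswith(pattern[:-1])
    # Static: only the exact URL matches.
    return url == pattern


pattern = "https://foo.com/bar/*"
for url in [
    "https://foo.com/bar/index.html",
    "https://foo.com/bar/deep/nested/index.html",
    "https://foo.com/index.html",
    "https://different-domain.com",
]:
    print(url, in_crawl_scope(url, pattern))
```

The first two URLs are in scope; the last two are not, because they don't start with https://foo.com/bar/.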

Accessing Private URLs with Bearer Token

In many cases, the documents that you want to index will be behind authentication. Fixie supports sending a Bearer token with each request. To enable this, define a function that returns the token; Fixie passes it in the Authorization HTTP header as follows: Authorization: Bearer {your_token}.

Here's an example:

import fixieai

BASE_PROMPT = """I can answer questions from secret, non-public docs."""

URLS = [
    "https://private.myamazingsite.com/*"
]

CORPORA = [fixieai.DocumentCorpus(urls=URLS, auth_token_func="auth")]
agent = fixieai.CodeShotAgent(BASE_PROMPT, [], CORPORA, conversational=True)

# Don't forget to call @agent.register_func on your auth function
@agent.register_func
def auth():
    return "12345"
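
Hard-coding a token is fine for a demo, but you may prefer to keep it out of source control. The sketch below assumes the environment your agent runs in exposes environment variables, which this page does not confirm. The variable name DOCS_BEARER_TOKEN and both helper functions are hypothetical. It also shows the header shape described above.

```python
import os


def load_token(env_var: str = "DOCS_BEARER_TOKEN") -> str:
    """Read the token from the environment (hypothetical variable name)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} before deploying the agent")
    return token


def authorization_header(token: str) -> dict:
    """Build the header in the form: Authorization: Bearer {your_token}."""
    return {"Authorization": f"Bearer {token}"}
```

Under that assumption, your registered auth function could simply return load_token().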

Custom Corpora

If web crawling isn't a good fit for your use case, you can register a corpus function to load documents any way you need to. For example, this lets you read from a private database.

import fixieai
from dataclasses import dataclass
from datetime import datetime

BASE_PROMPT = """I am a helpful assistant that can answer questions \
  about the contents of MyDatabase. I try to be as concise as possible \
  in my answers while still effectively answering the question."""

agent = fixieai.CodeShotAgent(BASE_PROMPT, [], conversational=True)

# A dataclass gives the token a constructor that accepts read_timestamp and offset.
@dataclass
class MyDatabasePageToken:
    read_timestamp: datetime
    offset: int = 0

    def encode(self) -> str:
        # Serialize to a utf-8 encoded string
        pass

    @staticmethod
    def decode(token: str) -> "MyDatabasePageToken":
        # Undo encode
        pass

PAGE_SIZE = 20

# db_client below stands in for your own database client; it is not part of fixieai.
@agent.register_corpus_func
def load_my_database(request: fixieai.CorpusRequest) -> fixieai.CorpusResponse:
    if not request.partition:
        initial_read = db_client.read("MyTopLevelTable", keys_only=True)
        first_page_token = MyDatabasePageToken(initial_read.timestamp())
        partition_names = [row.key for row in initial_read.results]
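        # Encode the token so it can be decoded from request.page_token on later calls.
        partitions = [fixieai.CorpusPartition(name, first_page_token.encode()) for name in partition_names]
        return fixieai.CorpusResponse(partitions=partitions)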
    else:
        top_level_key = request.partition
        token = MyDatabasePageToken.decode(request.page_token)
        offset = token.offset * PAGE_SIZE
        read = db_client.read("MyNestedTable", parent_key=top_level_key, limit=PAGE_SIZE, offset=offset, read_timestamp=token.read_timestamp)
        docs = [fixieai.CorpusDocument(row.key, row.my_column) for row in read.results]
        if read.has_more_results():
            token.offset += 1
            next_page_token = token.encode()
        else:
            next_page_token = None
        return fixieai.CorpusResponse(page=fixieai.CorpusPage(docs, next_page_token))

This example assumes an easily partitionable database with columns that already contain text data, but many more options are available. See CorpusRequest for more details.
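
The encode and decode methods in the example are left as placeholders. One possible way to fill them in, assuming a JSON string is an acceptable page token format for your use case, is sketched below. The JSON field names are illustrative choices, not something Fixie requires.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class MyDatabasePageToken:
    read_timestamp: datetime
    offset: int = 0

    def encode(self) -> str:
        # Store the timestamp as ISO 8601 text so it survives the round trip.
        payload = {
            "read_timestamp": self.read_timestamp.isoformat(),
            "offset": self.offset,
        }
        return json.dumps(payload)

    @staticmethod
    def decode(token: str) -> "MyDatabasePageToken":
        payload = json.loads(token)
        return MyDatabasePageToken(
            read_timestamp=datetime.fromisoformat(payload["read_timestamp"]),
            offset=payload["offset"],
        )


# Round trip: decoding an encoded token yields an equal token.
token = MyDatabasePageToken(datetime(2023, 1, 1, tzinfo=timezone.utc), offset=2)
assert MyDatabasePageToken.decode(token.encode()) == token
```

Keeping the read timestamp inside the token means every page of a partition is read against the same snapshot, which is what the example's read_timestamp argument relies on.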

Using Fewshots with Docs

By default, you don't need to use FewShots with Docs. Fixie's automatic behavior will produce a Q/A agent that meets the general need.

If you do use FewShots, you must explicitly tell Fixie to query the corpus by calling the built-in fixie_query_corpus function.

Here's an example for an Agent that has access to docs about the primary plot points from HBO's Silicon Valley:

FEW_SHOTS = """
Q: Who was Gilfoyle played by?
Ask Func[fixie_query_corpus]: Who was Gilfoyle played by?
Func[fixie_query_corpus] says: Gilfoyle was played by Martin Starr.
A: Gilfoyle was played by Martin Starr.
"""

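If you write several corpus-backed FewShots, a small formatter can keep them consistent. The sketch below only reproduces the four-line layout shown above; the helper name corpus_few_shot is hypothetical and not part of fixieai.

```python
def corpus_few_shot(question: str, answer: str) -> str:
    """Format one FewShot that routes the question through fixie_query_corpus."""
    return "\n".join(
        [
            f"Q: {question}",
            f"Ask Func[fixie_query_corpus]: {question}",
            f"Func[fixie_query_corpus] says: {answer}",
            f"A: {answer}",
        ]
    )


FEW_SHOTS = corpus_few_shot(
    "Who was Gilfoyle played by?",
    "Gilfoyle was played by Martin Starr.",
)
print(FEW_SHOTS)
```
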
Excluding documents during crawl

In some cases, you don't want Fixie to crawl certain webpages or types of webpages when using the * wildcard pattern. To do this, supply exclude_patterns, an array of regular expressions:

import fixieai

BASE_PROMPT = """I answer questions based on the supplied corpus"""

URLS = [
    "https://public.myamazingsite.com/*"
]

# Don't index any PDFs (written as a regular expression, not a glob)
EXCLUDE_PATTERNS = [
  r".*\.pdf$",
]

CORPORA = [fixieai.DocumentCorpus(urls=URLS, exclude_patterns=EXCLUDE_PATTERNS)]
agent = fixieai.CodeShotAgent(BASE_PROMPT, [], CORPORA, conversational=True)
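
Because exclude_patterns are described as regular expressions, it can help to check them locally before deploying. The sketch below assumes the patterns follow Python's re syntax and are matched anywhere in the URL. Neither detail is confirmed on this page, and both helper names are hypothetical. Note that a glob-style string such as *.pdf is not a valid regular expression in Python.

```python
import re
from typing import Iterable


def is_valid_regex(pattern: str) -> bool:
    """Return True if `pattern` compiles as a Python regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def is_excluded(url: str, exclude_patterns: Iterable[str]) -> bool:
    """Return True if any pattern matches somewhere in `url` (assumed semantics)."""
    return any(re.search(pattern, url) for pattern in exclude_patterns)


print(is_valid_regex("*.pdf"))       # glob syntax, rejected by re.compile
print(is_valid_regex(r".*\.pdf$"))   # regex that matches URLs ending in .pdf
print(is_excluded("https://public.myamazingsite.com/report.pdf", [r"\.pdf$"]))
```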

Supported file types

File Type | Extensions
Documents | .doc, .docx, .ppt, .pptx
PDFs      | .pdf
Webpages  | .html
Text      | .md, .txt, .rtf
Email     | .msg, .eml
E-Books   | .epub
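
If you assemble URL lists programmatically, you can pre-check extensions against the table above. This is a minimal sketch with a hypothetical helper name. It looks only at the file extension in the URL path, so it says nothing about pages served without an extension.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Extensions copied from the supported file types table.
SUPPORTED_EXTENSIONS = {
    "Documents": {".doc", ".docx", ".ppt", ".pptx"},
    "PDFs": {".pdf"},
    "Webpages": {".html"},
    "Text": {".md", ".txt", ".rtf"},
    "Email": {".msg", ".eml"},
    "E-Books": {".epub"},
}


def file_type_for(url: str) -> Optional[str]:
    """Return the table's file type for `url`, or None if the extension is not listed."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    for file_type, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return file_type
    return None


print(file_type_for("https://docs.python.org/3.11/tutorial/introduction.html"))
print(file_type_for("https://example.com/archive.zip"))
```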

Monitoring Indexing Status

You can check the indexing status of your Agent by asking it any question. If it's still indexing, your agent will reply "I'm still starting up. Please try again in a few minutes". Indexing can take a few hours or more for very large data sets (smaller sets can be done in just a few minutes).
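
If you want to automate that check, a simple polling loop is one option. The sketch below takes a hypothetical ask callable that sends a question to your agent and returns its reply as text. How you implement ask depends on how you talk to your agent and is not covered here.

```python
import time
from typing import Callable

# Start of the reply the agent gives while it is still indexing.
STARTING_UP = "I'm still starting up"


def wait_until_indexed(
    ask: Callable[[str], str],
    question: str = "Are you ready?",
    poll_seconds: float = 60.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the agent until it stops returning the starting-up message."""
    for _ in range(max_attempts):
        if STARTING_UP not in ask(question):
            return True
        sleep(poll_seconds)
    return False


# Demo with a fake agent that finishes indexing on the third question.
replies = iter([STARTING_UP + ".", STARTING_UP + ".", "Yes, ask away."])
print(wait_until_indexed(lambda q: next(replies), sleep=lambda s: None))
```

The sleep parameter is injectable so the loop can be exercised without real delays.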

The ability to view and manage indexing directly in the Fixie dev console is coming soon.