Contents
- Introduction
- About self-querying
- Code implementation
- Summary
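Introduction
A popular approach to RAG retrieval today is to embed your data with a large model's embedding algorithm and store it in a vector database, then vectorize the user's query, recall similar data from the vector database, build it into a context template, and pass that to the LLM to answer the query.
Converting the user's query directly into a vector query does not always give good results. Suppose we have stored product vectors in the vector database and a user says: "I want a black wool sweater priced under 20 yuan." With a conventional embedding, the vector query can lose part of the request, much like a dropped frame, and it effectively becomes a search for "black wool sweater".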
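The failure mode can be reproduced without any embedding model. The sketch below is a toy: the product list, the `similarity` function (plain word overlap standing in for vector similarity), and the two search helpers are all assumptions made for illustration. None of them come from DashVector or langchain.

```python
# Toy illustration: word overlap stands in for embedding similarity.
# The products and helper names below are hypothetical.

PRODUCTS = [
    {"title": "black wool sweater", "price": 59},
    {"title": "black wool sweater", "price": 18},
    {"title": "white cotton shirt", "price": 15},
]


def similarity(query: str, title: str) -> float:
    """Jaccard overlap between the word sets of query and title."""
    q, t = set(query.lower().split()), set(title.lower().split())
    return len(q & t) / len(q | t)


def semantic_search(query: str, k: int = 2) -> list:
    """Rank by similarity only; the price constraint is invisible here."""
    ranked = sorted(PRODUCTS, key=lambda p: similarity(query, p["title"]), reverse=True)
    return ranked[:k]


def filtered_search(query: str, max_price: float, k: int = 2) -> list:
    """Apply the metadata filter first, then rank what is left."""
    candidates = [p for p in PRODUCTS if p["price"] < max_price]
    ranked = sorted(candidates, key=lambda p: similarity(query, p["title"]), reverse=True)
    return ranked[:k]


user_query = "black wool sweater under 20"
plain = semantic_search(user_query)
filtered = filtered_search("black wool sweater", max_price=20)

print([p["price"] for p in plain])     # the 59-yuan sweater is still returned
print([p["price"] for p in filtered])  # only items cheaper than 20 remain
```

The similarity-only search has no way to know that "under 20" is a constraint on a price field, so the expensive sweater scores just as well as the cheap one. Splitting the request into a text part and a filter part is what fixes that.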
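For cases like this, we can improve RAG retrieval by rewriting the query before it is run. langchain's self-querying is one good way to do this. Here I try it with Alibaba Cloud's DashVector vector database and DashScope LLM, and the optimized queries work quite well.
Most material online uses OpenAI's embeddings and LLM, but I personally think Alibaba's LLM and vector database are already very good, and OpenAI has blocked API calls from mainland China. Domestic cloud services are cheap and easy to use, so why not give them a try? I have previously written a few hands-on posts about DashVector and DashScope; if you are interested, see:
LLM: text chunking (langchain) and vectorized storage (Alibaba Cloud DashVector), embedding with an LLM in practice
LLM: Alibaba Cloud DashVector + ModelScope multimodal vectorization, a hands-on summary of real-time text-to-image search
LLM: using langchain with Alibaba DashScope (Tongyi Qianwen LLM) and DashVector (vector database), a summary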
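Prerequisites
- Make sure you have activated a Tongyi Qianwen API key and a vector retrieval service (DashVector) API key
- Install the dependencies:
pip install langchain
pip install langchain-community
pip install dashvector
pip install dashscope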
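About self-querying
Put simply, self-querying converts the user's query into a structured query with two parts:
- Query: the text used for the semantic vector search
- Filter: conditions on the metadata of the documents being searched
Take the example in the figure above: the user asks, "What did bar say about foo?" After the self-querying conversion, this becomes two parts:
- search for data about foo
- where the author is bar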
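One way to picture the two parts is as a small data structure. The sketch below uses `Comparison` and `StructuredQuery` dataclasses written by hand for this example; they are simplified stand-ins and do not claim to reproduce langchain's internal representation. The field name `author` is taken from the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Comparison:
    """One metadata condition, e.g. author == "bar" (hypothetical type)."""
    attribute: str
    comparator: str  # "eq", "gt", "lt", ...
    value: object


@dataclass(frozen=True)
class StructuredQuery:
    """The two parts produced from a natural-language question (hypothetical type)."""
    query: str                     # text used for the semantic (vector) search
    filter: Optional[Comparison]   # condition applied to document metadata


# "What did bar say about foo?" split by hand into its two parts.
structured = StructuredQuery(
    query="foo",
    filter=Comparison(attribute="author", comparator="eq", value="bar"),
)

print(structured.query)
print(structured.filter)
```

In the real retriever this split is produced by the LLM rather than by hand, which is why the LLM needs to be told which metadata fields exist.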
Code implementation
Replace DASHSCOPE_API_KEY, DASHVECTOR_API_KEY, and DASHVECTOR_ENDPOINT with the values you have set up on Alibaba Cloud.
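If you would rather not paste keys into source code, one option is to export the three variables in your shell and check for them at startup. This is an assumption about workflow, not something the code below does; `missing_settings` is a hypothetical helper written for this sketch.

```python
import os

# The three settings named above.
REQUIRED_SETTINGS = ("DASHSCOPE_API_KEY", "DASHVECTOR_API_KEY", "DASHVECTOR_ENDPOINT")


def missing_settings(env=None) -> list:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
# The key value is a placeholder, not a real credential.
example_env = {"DASHSCOPE_API_KEY": "sk-placeholder", "DASHVECTOR_API_KEY": ""}
print(missing_settings(example_env))
```

Calling `missing_settings()` with no argument checks the real environment, so a script could stop early with a clear message when the returned list is not empty.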
import os

from langchain_core.documents import Document
from langchain_community.vectorstores.dashvector import DashVector
from langchain_community.embeddings.dashscope import DashScopeEmbeddings
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_community.chat_models.tongyi import ChatTongyi
from langchain_core.vectorstores import VectorStore


class SelfQuerying:
    def __init__(self):
        # Both DASHSCOPE_API_KEY and DASHVECTOR_API_KEY must be activated
        os.environ["DASHSCOPE_API_KEY"] = ""
        os.environ["DASHVECTOR_API_KEY"] = ""
        os.environ["DASHVECTOR_ENDPOINT"] = ""
        self.llm = ChatTongyi(temperature=0)

    def handle_embeddings(self) -> 'VectorStore':
        docs = [
            Document(
                page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
                metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
            ),
            Document(
                page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
                metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
            ),
            Document(
                page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
                metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
            ),
            Document(
                page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
                metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
            ),
            Document(
                page_content="Toys come alive and have a blast doing so",
                metadata={"year": 1995, "genre": "animated"},
            ),
            Document(
                page_content="Three men walk into the Zone, three men walk out of the Zone",
                metadata={
                    "year": 1979,
                    "director": "Andrei Tarkovsky",
                    "genre": "thriller",
                    "rating": 9.9,
                },
            ),
        ]
        # Name of the Collection in the vector database
        vectorstore = DashVector.from_documents(docs, DashScopeEmbeddings(), collection_name="langchain")
        return vectorstore

    def build_querying_retriever(self, vectorstore: 'VectorStore', enable_limit: bool = False) -> 'SelfQueryRetriever':
        """
        Build the optimized retriever
        :param vectorstore: the vector database
        :param enable_limit: whether the query may limit the number of results (top k)
        :return:
        """
        metadata_field_info = [
            AttributeInfo(
                name="genre",
                description="The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
                type="string",
            ),
            AttributeInfo(
                name="year",
                description="The year the movie was released",
                type="integer",
            ),
            AttributeInfo(
                name="director",
                description="The name of the movie director",
                type="string",
            ),
            AttributeInfo(name="rating", description="A 1-10 rating for the movie", type="float"),
        ]
        document_content_description = "Brief summary of a movie"
        retriever = SelfQueryRetriever.from_llm(
            self.llm,
            vectorstore,
            document_content_description,
            metadata_field_info,
            enable_limit=enable_limit
        )
        return retriever

    def handle_query(self, query: str):
        """
        Return the retrieval results for the optimized query
        :param query:
        :return:
        """
        # Use the LLM to optimize the query and build the optimized retriever.
        # Note: handle_embeddings() runs on every call, so the documents are embedded again each time.
        retriever = self.build_querying_retriever(self.handle_embeddings())
        response = retriever.invoke(query)
        return response


if __name__ == '__main__':
    q = SelfQuerying()
    # Filter on a metadata attribute only
    print(q.handle_query("I want to watch a movie rated higher than 8.5"))
    # Filter on a metadata attribute and on semantic content
    print(q.handle_query("Has Greta Gerwig directed any movies about women"))
    # Composite filter query
    print(q.handle_query("What's a highly rated (above 8.5) science fiction film?"))
    # Composite semantic and filter query
    print(q.handle_query("What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated"))
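The code above has three main steps:
- Run the embedding: store the Docs, together with their metadata, in DashVector
- Build the self-querying retriever. This requires providing, up front, information about the metadata fields our documents support and a short description of the document content.
- Run the queries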
Running the code prints the following results:
# "I want to watch a movie rated higher than 8.5"
[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'year': 1979}),
 Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006})]

# "Has Greta Gerwig directed any movies about women"
[Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'director': 'Greta Gerwig', 'rating': 8.3, 'year': 2019})]

# "What's a highly rated (above 8.5) science fiction film?"
[Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'director': 'Greta Gerwig', 'rating': 8.3, 'year': 2019})]

# "What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated"
[Document(page_content='Toys come alive and have a blast doing so', metadata={'genre': 'animated', 'year': 1995})]
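The filter part of the first and last queries can be cross-checked by hand against the sample metadata. The sketch below applies the conditions in plain Python; the `matches` helper and the `(field, comparator, value)` tuple format are hypothetical and exist only for this example. The real retriever hands the filter to the vector store rather than filtering in Python.

```python
import operator

# Metadata of the six sample documents from the code above (page content omitted).
MOVIES = [
    {"year": 1993, "rating": 7.7, "genre": "science fiction"},
    {"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    {"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    {"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    {"year": 1995, "genre": "animated"},
    {"year": 1979, "director": "Andrei Tarkovsky", "genre": "thriller", "rating": 9.9},
]

COMPARATORS = {"eq": operator.eq, "gt": operator.gt, "lt": operator.lt}


def matches(metadata: dict, conditions: list) -> bool:
    """True if every (field, comparator, value) condition holds.

    A document that lacks the field never matches.
    """
    for field, comparator, value in conditions:
        if field not in metadata:
            return False
        if not COMPARATORS[comparator](metadata[field], value):
            return False
    return True


# "I want to watch a movie rated higher than 8.5"
high_rated = [m for m in MOVIES if matches(m, [("rating", "gt", 8.5)])]
print([m["director"] for m in high_rated])

# "after 1990 but before 2005 ... animated"
toy_conditions = [("year", "gt", 1990), ("year", "lt", 2005), ("genre", "eq", "animated")]
toys = [m for m in MOVIES if matches(m, toy_conditions)]
print([m["year"] for m in toys])
```

Both filters select the same documents as the retriever output above: the Satoshi Kon and Andrei Tarkovsky films for the rating query, and the 1995 animated film for the toys query.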
Summary
This post showed how to use langchain's self-query to optimize vector retrieval. The demo code uses Alibaba Cloud's DashVector and DashScope LLM; readers can activate both services and try it out.