A Brief Look at RAG and GraphRAG | 湛蓝与蔚蓝

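Preprocessing

Split the documents into chunks and compute an embedding for each chunk. Open question: are the chunks still stored in text form as well?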
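To make the chunking step concrete, here is a minimal sketch in plain Python. It assumes a word-based sliding window and a toy hashed bag-of-words embedding; `Chunk`, `split_with_overlap`, `toy_embed` and `build_chunks` are hypothetical names for illustration, not part of any library. It also shows one common answer to the question above: keep the raw text next to the vector, since the vector is only good for similarity search while the text is what eventually gets passed to the LLM.

```python
from dataclasses import dataclass
import hashlib
import math


@dataclass
class Chunk:
    text: str               # raw text, kept so it can be handed to the LLM later
    embedding: list[float]  # vector, used only for similarity search


def split_with_overlap(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Slide a window of `size` words forward by `size - overlap` words each step."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return windows


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Assumed stand-in for a real embedding model: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_chunks(document: str, size: int = 4, overlap: int = 1) -> list[Chunk]:
    """Chunk the document and store text and embedding side by side."""
    pieces = [" ".join(w) for w in split_with_overlap(document.split(), size, overlap)]
    return [Chunk(text=p, embedding=toy_embed(p)) for p in pieces]


chunks = build_chunks("a b c d e f g h i j")
```

With `size=4` and `overlap=1`, neighbouring chunks share one word, which is the same idea as `chunk_overlap` in the splitters used later in these notes.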

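Building the graph

Use an LLM to extract entities, relations, and claims.

Open question: is the chronological order of events in the document recorded (for example through positional embeddings)?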
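The extraction step can be sketched as "prompt the LLM for structured lines, then parse them". The version below is only an illustration: `fake_llm` is a hypothetical stub that returns a canned answer instead of calling a model, and the `ENTITY|…`, `RELATION|…`, `CLAIM|…` line format is an assumption made for this example, not the format GraphRAG actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class Extraction:
    entities: set[str] = field(default_factory=set)
    relations: list[tuple[str, str, str]] = field(default_factory=list)  # (source, relation, target)
    claims: list[tuple[str, str]] = field(default_factory=list)          # (subject, statement)


def fake_llm(chunk_text: str) -> str:
    """Hypothetical stub standing in for a real LLM call; ignores its input."""
    return (
        "ENTITY|Alice\n"
        "ENTITY|Acme\n"
        "RELATION|Alice|works_at|Acme\n"
        "CLAIM|Acme|Acme was founded in 1999\n"
    )


def parse_extraction(raw: str) -> Extraction:
    """Turn the pipe-delimited LLM output into entities, relations and claims."""
    result = Extraction()
    for line in raw.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if parts[0] == "ENTITY" and len(parts) == 2:
            result.entities.add(parts[1])
        elif parts[0] == "RELATION" and len(parts) == 4:
            result.relations.append((parts[1], parts[2], parts[3]))
        elif parts[0] == "CLAIM" and len(parts) == 3:
            result.claims.append((parts[1], parts[2]))
        # Lines in any other shape are silently skipped.
    return result


extraction = parse_extraction(fake_llm("Alice works at Acme, which was founded in 1999."))
```

The entities become nodes, the relations become edges, and the claims can be attached to their subject node as extra attributes.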

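Building communities

Use clustering to group entity nodes that sit close together.

The grouping can also be driven by the relationships (edges).

Communities are built so that there are many edges inside a cluster and few edges between different clusters.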
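The "many edges inside, few edges between" criterion is usually formalised as modularity. The sketch below does not implement a community detection algorithm (the GraphRAG paper reports using Leiden for that); it only scores a given partition of an undirected, unweighted graph, so that a good grouping can be compared with a bad one. The graph and the two partitions are made-up examples.

```python
from collections import defaultdict


def modularity(edges: list[tuple[str, str]], community_of: dict[str, int]) -> float:
    """Q = sum over communities of (internal_edges / m) - (degree_sum / 2m)^2."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Two triangles (A, B, C) and (D, E, F) joined by a single bridge edge C-D.
edges = [("A", "B"), ("B", "C"), ("A", "C"),
         ("D", "E"), ("E", "F"), ("D", "F"),
         ("C", "D")]

good_partition = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
bad_partition = {"A": 0, "D": 0, "B": 1, "E": 1, "C": 2, "F": 2}

good_score = modularity(edges, good_partition)
bad_score = modularity(edges, bad_partition)
```

Grouping each triangle into its own community gives a clearly higher score than the partition that mixes nodes from both triangles, which is exactly the property a community detection algorithm tries to maximise.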

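What still needs to be done afterwards

Query modes

1. Global query

Summarizing the theme of the article → this requires building a macro-level view of the information.

Map + Reduce: each text block (taken per community? at different hierarchy levels, each capturing information at a different granularity) is compared against the query

and given a numeric rating,

then the results are reduced to the few most important points.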
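The map and reduce steps above can be sketched as follows. This is a simplification under stated assumptions: `rate_points` is a hypothetical stand-in for the LLM "map" call and rates each sentence by keyword overlap with the question, and the reduce step simply keeps the top-rated points, whereas a real system would normally pass those points to an LLM to write the final answer.

```python
def rate_points(report: str, question: str) -> list[tuple[str, int]]:
    """Assumed stand-in for the LLM map call: rate each sentence 0-100 by keyword overlap."""
    keywords = set(question.lower().split())
    rated = []
    for sentence in report.split("."):
        sentence = sentence.strip()
        if not sentence:
            continue
        hits = len(keywords & set(sentence.lower().split()))
        rated.append((sentence, min(100, hits * 25)))
    return rated


def global_query(reports: list[str], question: str, top_k: int = 3) -> list[tuple[str, int]]:
    # Map: every community report is rated independently of the others.
    all_points = []
    for report in reports:
        all_points.extend(rate_points(report, question))
    # Reduce: drop irrelevant points, then keep only the highest-rated ones.
    useful = [point for point in all_points if point[1] > 0]
    useful.sort(key=lambda point: point[1], reverse=True)
    return useful[:top_k]


community_reports = [
    "The main theme is trade policy. Weather was mild",
    "Trade policy dominated the election. Sports scores were reported",
]
top_points = global_query(community_reports, "main theme trade policy")
```

Because each report is rated on its own, the map step can run in parallel over all communities, and running it at a different level of the community hierarchy only changes which reports are passed in.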

2. Local query

Extract entities from the question, then use these entities to search the community reports, entities, relationships…
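A local query can be sketched as a lookup that starts from the entities mentioned in the question. The data below is invented, and entity matching is done with a plain substring test, which is an assumption made to keep the example short; real systems typically match entities with embeddings or an LLM.

```python
def local_search(
    question: str,
    entities: dict[str, int],                    # entity name -> community id
    relationships: list[tuple[str, str, str]],   # (source, relation, target)
    reports: dict[int, str],                     # community id -> community report
) -> dict[str, list]:
    """Collect the entities, relationships and community reports relevant to a question."""
    lowered = question.lower()
    matched = sorted(name for name in entities if name.lower() in lowered)
    related = [r for r in relationships if r[0] in matched or r[2] in matched]
    community_ids = sorted({entities[name] for name in matched})
    context_reports = [reports[c] for c in community_ids if c in reports]
    return {"entities": matched, "relationships": related, "reports": context_reports}


entities = {"Alice": 0, "Acme": 0, "Bob": 1}
relationships = [("Alice", "works_at", "Acme"), ("Bob", "lives_in", "Paris")]
reports = {0: "Community 0: Alice and Acme.", 1: "Community 1: Bob."}

context = local_search("Where does Alice work?", entities, relationships, reports)
```

The returned context would then be placed into the prompt, so the LLM answers from a small, entity-centred slice of the graph rather than from the whole corpus.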

Generating the knowledge graph: parquet → csv → Neo4j visualization
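The csv → Neo4j half of this path can be sketched with the standard library alone. The parquet step is left out here to avoid extra dependencies (in practice it is commonly done with `pandas.read_parquet` followed by `DataFrame.to_csv`), so the rows are hard-coded. The column names `name` and `type`, the `Entity` label and the file name `entities.csv` are assumptions for illustration, not the real GraphRAG output schema.

```python
import csv
import io


def rows_to_csv(rows: list[dict[str, str]], fieldnames: list[str]) -> str:
    """Serialise rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def load_csv_cypher(file_name: str) -> str:
    """Build a Cypher statement that imports the CSV as Entity nodes."""
    return (
        f"LOAD CSV WITH HEADERS FROM 'file:///{file_name}' AS row\n"
        "MERGE (e:Entity {name: row.name})\n"
        "SET e.type = row.type"
    )


# Hypothetical rows, standing in for what would be read from a parquet file.
entity_rows = [
    {"name": "Alice", "type": "PERSON"},
    {"name": "Acme", "type": "ORGANIZATION"},
]

csv_text = rows_to_csv(entity_rows, ["name", "type"])
cypher = load_csv_cypher("entities.csv")
```

The CSV text would be written into Neo4j's import directory and the Cypher statement run in the Neo4j Browser; a second, similar statement for the relationships file would create the edges.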

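The above is fairly abstract, so let's now take a quick look at a concrete implementation.

Tools used: LlamaIndex

Full pipeline diagram:

Building the ingestion pipeline:

Creating nodes (the aim is to ensure that semantically related segments of text remain together?)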

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

# Split on sentence boundaries, up to 1024 tokens per chunk with 20 tokens of overlap.
node_parser = SentenceSplitter(chunk_size=1024, chunk_overlap=20)

# Turn the documents into nodes (chunks).
nodes = node_parser.get_nodes_from_documents(
    [Document(text="long text")], show_progress=False
)

from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import TokenTextSplitter

documents = SimpleDirectoryReader("./data").load_data()

pipeline = IngestionPipeline(transformations=[TokenTextSplitter(), ...])

nodes = pipeline.run(documents=documents)

Storage and creating embeddings:

vector store → storage context → index → query_engine = index.as_query_engine()

response = query_engine.query("What did the author do growing up?")
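To show what happens between "vector store" and "query", here is a toy in-memory version in plain Python. `TinyVectorIndex` is a made-up class for illustration and is not the LlamaIndex API; it uses a bag-of-words vector instead of a real embedding model, and its query function only returns the retrieved text, whereas a real query engine would also pass that text to an LLM to write the answer.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a sparse bag-of-words vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorIndex:
    """Hypothetical minimal index: stores (text, vector) pairs in a list."""

    def __init__(self, texts: list[str]):
        self._store = [(text, embed(text)) for text in texts]  # the "vector store"

    def as_query_engine(self, top_k: int = 1):
        def query(question: str) -> list[str]:
            q = embed(question)
            ranked = sorted(self._store, key=lambda item: cosine(q, item[1]), reverse=True)
            return [text for text, _ in ranked[:top_k]]
        return query


index = TinyVectorIndex([
    "the author wrote short stories growing up",
    "ollama serves local models",
])
toy_query_engine = index.as_query_engine(top_k=1)
retrieved = toy_query_engine("What did the author do growing up?")
```

The question is embedded with the same function as the stored chunks, and the chunk with the highest cosine similarity is returned as context.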

from llama_index.embeddings.ollama import OllamaEmbedding

# Embedding model served by a local Ollama instance.
ollama_embedding = OllamaEmbedding(
    model_name="llama2",
    base_url="http://localhost:11434",
    ollama_additional_kwargs={"mirostat": 0},
)

# Embed a batch of passages.
pass_embedding = ollama_embedding.get_text_embedding_batch(
    ["This is a passage!", "This is another passage"], show_progress=True
)
print(pass_embedding)

query_embedding = ollama_embedding.get_query_embedding("Where is blue?")
print(query_embedding)

References:

https://juejin.cn/post/7392115478561325083

https://juejin.cn/post/7362173600344801321