Custom Chunker#
Please follow these instructions to create a custom chunker.
Create folder structure for your custom chunker#
Place your custom chunker .py file in the rag_colls/processors/chunkers directory.
Let’s say you want to create a custom chunker called MyChunker, you would create a file named my_chunker.py in the rag_colls/processors/chunkers directory.
The file structure should look like this:
rag_colls/
├── processors/
│ ├── chunkers/
│ │ ├── my_chunker.py
│ │ └── ...
│ └── ...
└── ...
Implement your custom chunker#
Your custom chunker must inherit from the BaseChunker class. Here’s the code for BaseChunker:
from abc import ABC, abstractmethod
from rag_colls.types.core.document import Document
class BaseChunker(ABC):
@abstractmethod
def _chunk(self, documents: list[Document], **kwargs) -> list[Document]:
"""
Chunk the documents.
Args:
documents (list[Document]): List of documents to be chunked.
`kwargs: Additional keyword arguments for the chunking function.
Returns:
list[Document]: List of chunked documents.
"""
raise NotImplementedError("This method should be overridden by subclasses.")
@abstractmethod
async def _achunk(self, documents: list[Document], **kwargs) -> list[Document]:
"""
Asynchronously chunk the documents.
Args:
documents (list[Document]): List of documents to be chunked.
`kwargs: Additional keyword arguments for the chunking function.
Returns:
list[Document]: List of chunked documents.
"""
raise NotImplementedError("This method should be overridden by subclasses.")
You must implement _chunk and _achunk. With _achunk method, you can call _chunk asynchronously using asyncio.to_thread.
Example: MyChunker#
import asyncio
from rag_colls.core.base.chunkers.base import BaseChunker
from rag_colls.types.core.document import Document
class MyChunker(BaseChunker):
def _chunk(self, documents: list[Document], **kwargs) -> list[Document]:
# Implement your chunking logic here
chunked_documents = []
for doc in documents:
# Example chunking logic
chunks = [doc.document[i:i + 100] for i in range(0, len(doc.document), 100)]
for chunk in chunks:
chunked_documents.append(Document(document=chunk, metadata=doc.metadata))
return chunked_documents
async def _achunk(
self, documents: list[Document], **kwargs
):
return await asyncio.to_thread(self._chunk, documents, **kwargs)
Usage#
You can use your custom chunker like any built-in chunker:
from rag_colls.types.core.document import Document
from rag_colls.processors.chunkers.my_chunker import MyChunker
chunker = MyChunker()
documents = [Document(document="This is a long document that needs to be chunked.")]
chunked_documents = chunker.chunk(documents)
print(chunked_documents)
Or use it while initializing a RAG instance:
from rag_colls.rags.basic_rag import BasicRAG
from rag_colls.processors.chunkers.my_chunker import MyChunker
rag = BasicRAG(
...,
chunker=MyChunker(),
...
)
Create a test for your custom chunker#
Remember to create test case for your custom chunker. You can refer to tests/chunkers/test_semantic_chunker.py for more information.
In tests/chunkers directory, create a file named test_my_chunker.py and implement your test case.
from rag_colls.types.core.document import Document
def test_my_chunker():
"""
Test the custom chunker.
"""
from rag_colls.processors.chunkers.my_chunker import MyChunker
chunker = MyChunker()
documents = [Document(document="This is a long document that needs to be chunked.")]
chunked_documents = chunker.chunk(documents)
assert len(chunked_documents) > 0, "No chunked documents found"
first_chunk = chunked_documents[0]
assert hasattr(first_chunk, "document"), "Chunk does not have document attribute."
assert hasattr(first_chunk, "metadata"), "Chunk does not have metadata attribute."
Add to the documentation (Optional)#
Update later.