Custom Reader#
Please follow these instructions to create a custom reader.
Create folder structure for your custom reader#
Place your custom reader .py file in a subdirectory of rag_colls/processors/readers named after the file extension it handles.
Let’s say you want to create a custom reader called MyReader which is used to read <ext> files.
You would create a file named my_reader.py in the rag_colls/processors/readers/<ext>/ directory.
The file structure should look like this:
rag_colls/
├── processors/
│ ├── readers/
│ │ ├── <ext>/
│ │ │ ├── __init__.py
│ │ │ ├── my_reader.py
│ │ │ └── ...
│ │ └── ...
│ └── ...
└── ...
For example: PyMuPDFReader is used to read .pdf files, so the file structure would look like this:
rag_colls/
├── processors/
│ ├── readers/
│ │ ├── pdf/
│ │ │ ├── __init__.py
│ │ │ ├── pymupdf_reader.py
│ │ │ └── ...
│ │ └── ...
│ └── ...
└── ...
Implement your custom reader#
In your custom reader file, you need to create a class that inherits from the BaseReader class.
Here’s the code for BaseReader:
from abc import ABC, abstractmethod
from pathlib import Path

from rag_colls.types.core.document import Document


class BaseReader(ABC):
    @abstractmethod
    def _load_data(
        self,
        file_path: str | Path,
        should_split: bool = True,
        extra_info: dict | None = None,
    ) -> list[Document]:
        """
        Loads data from the specified file path and returns a list of Document objects.

        Args:
            file_path (str | Path): The path to the file to be loaded.
            should_split (bool): Whether to split the data into smaller chunks.
            extra_info (dict | None): Additional information to be passed to the loader.

        Returns:
            list[Document]: A list of Document objects.
        """
        raise NotImplementedError("This method should be overridden by subclasses.")
Note: You must add should_split and extra_info to the metadata of each Document object.
Example: MyCustomTxtReader#
Here is an example of a custom reader that reads .txt files and splits the content into smaller chunks.
First, create a directory for your custom reader if it doesn’t exist. The directory structure should look like this:
rag_colls/
├── processors/
│ ├── readers/
│ │ ├── txt/
│ │ │ ├── __init__.py
│ │ │ ├── my_custom_txt_reader.py
│ │ │ └── ...
│ │ └── ...
│ └── ...
└── ...
Then, create a file named my_custom_txt_reader.py in the rag_colls/processors/readers/txt/ directory.
In this file, you can implement your custom reader class like this:
from pathlib import Path

from rag_colls.core.base.readers.base import BaseReader
from rag_colls.types.core.document import Document


class MyCustomTxtReader(BaseReader):
    def _load_data(
        self,
        file_path: str | Path,
        should_split: bool = True,
        extra_info: dict | None = None,
    ) -> list[Document]:
        """
        Loads data from a .txt file and returns its documents.

        Args:
            file_path (str | Path): The path to the file to be loaded.
            should_split (bool): Whether to split the data into smaller chunks.
            extra_info (dict | None): Additional information to be passed to the loader.

        Returns:
            list[Document]: A list of Document objects.
        """
        documents = []

        # Read the whole file; an explicit encoding avoids platform-dependent defaults.
        with open(file_path, "r", encoding="utf-8") as file:
            content = file.read()

        chunks = content.split("\n\n")  # Example: split by double newlines

        for chunk in chunks:
            documents.append(
                Document(
                    document=chunk,
                    # Store should_split and extra_info in the metadata, as required above.
                    metadata={"should_split": should_split, **(extra_info or {})},
                )
            )

        return documents
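Note that the example above always splits on blank lines, whatever value `should_split` has, and it keeps empty chunks. If you want the flag to control the behavior, one option is the sketch below. `split_text` is a hypothetical helper, and treating `should_split=False` as "return the whole file as a single chunk" is an assumption about the intended semantics, not something rag_colls guarantees.

```python
def split_text(content: str, should_split: bool = True) -> list[str]:
    """Return paragraph chunks, or the whole text as one chunk (hypothetical helper)."""
    if not should_split:
        # Assumption: without splitting, the file becomes a single chunk.
        stripped = content.strip()
        return [stripped] if stripped else []
    # Split on blank lines and drop chunks that contain only whitespace.
    return [chunk.strip() for chunk in content.split("\n\n") if chunk.strip()]
```

Inside `_load_data`, `chunks = split_text(content, should_split)` could then replace the unconditional `content.split("\n\n")`.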
Finally, export it from the rag_colls/processors/readers/txt/__init__.py file:
...
from .my_custom_txt_reader import MyCustomTxtReader
__all__ = [..., "MyCustomTxtReader"]
Usage#
You can use your custom reader in the same way as the built-in readers.
from rag_colls.processors.readers.txt import MyCustomTxtReader

reader = MyCustomTxtReader()
documents = reader.load_data(file_path="path/to/your/file.txt")

for doc in documents:
    print(doc.document)
    print(doc.metadata)
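The usage above calls `load_data`, while the custom reader only implements `_load_data`. This suggests that `BaseReader` exposes a public `load_data` method that delegates to the subclass hook. The sketch below illustrates that pattern with stand-in classes. `SketchDocument`, `SketchBaseReader`, `SketchTxtReader`, and the file-existence check inside `load_data` are all assumptions for illustration, not the actual rag_colls implementation.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SketchDocument:
    """Stand-in for rag_colls' Document (assumed fields: document, metadata)."""

    document: str
    metadata: dict = field(default_factory=dict)


class SketchBaseReader(ABC):
    """Stand-in base class: a public entry point plus a subclass hook."""

    def load_data(self, file_path, should_split=True, extra_info=None):
        path = Path(file_path)
        # Assumed shared validation; the real base class may behave differently.
        if not path.is_file():
            raise FileNotFoundError(f"File not found: {path}")
        # Delegate the format-specific work to the subclass.
        return self._load_data(path, should_split=should_split, extra_info=extra_info)

    @abstractmethod
    def _load_data(self, file_path, should_split=True, extra_info=None):
        ...


class SketchTxtReader(SketchBaseReader):
    def _load_data(self, file_path, should_split=True, extra_info=None):
        text = Path(file_path).read_text(encoding="utf-8")
        metadata = {"should_split": should_split, **(extra_info or {})}
        return [SketchDocument(document=text, metadata=metadata)]
```

With this split, checks shared by all readers live in one place, and each custom reader only has to implement the parsing for its own file type.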
Create a test for your custom reader#
Remember to create a test case for your custom reader. You can refer to tests/readers/test_pdf_reader.py for an example.
In the tests/readers directory, create a file named test_txt_reader.py and implement your test case.
from rag_colls.processors.readers.txt import MyCustomTxtReader


def test_custom_txt_reader():
    reader = MyCustomTxtReader()
    documents = reader.load_data(file_path="samples/data/test.txt")

    assert len(documents) > 0, "No documents found"

    first_document = documents[0]
    assert hasattr(first_document, "document"), "Missing `document` attribute"
    assert hasattr(first_document, "metadata"), "Missing `metadata` attribute"
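Because the note earlier on this page requires should_split and extra_info to end up in each document's metadata, a test can check that as well. The helper below is a hypothetical sketch (`assert_valid_documents` is not part of rag_colls). It assumes the keys of `extra_info` are merged into the top level of `metadata`, as in the `MyCustomTxtReader` example, and it uses a `SimpleNamespace` stand-in instead of a real `Document` so that it runs on its own.

```python
from types import SimpleNamespace


def assert_valid_documents(documents, should_split=True, extra_info=None):
    """Check the reader contract described on this page (hypothetical test helper)."""
    assert len(documents) > 0, "No documents found"
    for doc in documents:
        assert hasattr(doc, "document"), "Missing `document` attribute"
        assert hasattr(doc, "metadata"), "Missing `metadata` attribute"
        assert doc.metadata.get("should_split") == should_split, "`should_split` not in metadata"
        for key, value in (extra_info or {}).items():
            assert doc.metadata.get(key) == value, f"`{key}` from extra_info not in metadata"


# Stand-in object with the same attributes as a Document, for demonstration only.
sample = SimpleNamespace(document="hello", metadata={"should_split": True, "source": "demo"})
assert_valid_documents([sample], should_split=True, extra_info={"source": "demo"})
```

In a real test you would pass the list returned by `reader.load_data(...)` together with the same `should_split` and `extra_info` values you gave the reader.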
Register as default reader (Optional)#
To add your custom reader to the default readers list, modify the rag_colls/processors/file_processor.py file.
Find the _get_default_processors method in the FileProcessor class and add your custom reader to it.
class FileProcessor:
    ...

    def _get_default_processors(self) -> dict[str, BaseReader]:
        """
        Initialize default file processors.

        Returns:
            dict[str, BaseReader]: A dictionary of default file processors.
        """
        from .readers.txt import MyCustomTxtReader

        return {
            ...
            ".txt": MyCustomTxtReader(),
            ...
        }
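The returned dictionary maps a file extension to a reader instance. As an illustration of how such a mapping can be used to pick a reader for a given file, here is a minimal sketch. `pick_reader` and the case-insensitive suffix lookup are assumptions for illustration; `FileProcessor` may resolve readers differently.

```python
from pathlib import Path


def pick_reader(processors: dict, file_path):
    """Return the reader registered for the file's extension (hypothetical helper)."""
    # Path.suffix includes the leading dot, e.g. ".txt"; lowercase it so ".TXT" also matches.
    suffix = Path(file_path).suffix.lower()
    if suffix not in processors:
        raise ValueError(f"No reader registered for '{suffix}' files")
    return processors[suffix]
```

Note that the keys in the mapping include the leading dot (".txt"), so a lookup key built from `Path.suffix` matches them directly.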
Add to the documentation (Optional)#
Update later.