Category: Python

  • A Simple Implementation of the ‘Talk to Your Document’ Principle with AI using OpenAI & Python

    AI is changing how we work with information. With “talk to your documents,” files like PDFs or spreadsheets are no longer static — you can ask them questions directly and get instant, meaningful answers, instead of wasting time searching or scrolling.

    Why “Talk to Your Document”?

    • Save time: No more scanning through 50-page reports to find a single answer.
    • Make knowledge accessible: Anyone in your team can query technical or legal documents without being an expert.
    • Improve productivity: Teams spend less time searching and more time acting.

    From HR policies and financial reports to research papers and contracts, the principle applies everywhere.

    How It Works

    At a high level, the process is straightforward:

    1. Upload Documents – PDFs, Word files, or spreadsheets are ingested into the system.
    2. Chunking – The text is split into manageable sections (e.g., 500–1000 characters) so that AI can handle context effectively.
    3. Embedding & Indexing – Each chunk is converted into a vector (a numerical representation of meaning) using tools like SentenceTransformers. These vectors are stored in a search index such as FAISS.
    4. User Query – When a user asks a question, the query is also converted into a vector and matched with the most relevant chunks.
    5. AI Response – A language model uses the retrieved chunks to generate an accurate and conversational answer.

    Simple Retrieval Example in Python

    import fitz  # PyMuPDF for PDFs
    from sentence_transformers import SentenceTransformer
    import faiss
    
    # Load PDF and extract text
    doc = fitz.open("document.pdf")
    text = " ".join([page.get_text() for page in doc])
    
    # Split into chunks
    chunks = [text[i:i+500] for i in range(0, len(text), 500)]
    
    # Create embeddings
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(chunks)
    
    # Build FAISS index
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(embeddings)
    
    # Embed the query and retrieve the single closest chunk
    query = "What are the key benefits in this document?"
    query_embedding = model.encode([query])
    D, I = index.search(query_embedding, k=1)
    
    # No LLM yet: the "answer" is simply the most relevant chunk
    print("Answer:", chunks[I[0][0]])
    

    This script is greatly simplified (it retrieves the best-matching chunk rather than generating a new answer), but it demonstrates the talk-to-your-document principle:

    • Load a file
    • Create vector representations
    • Retrieve the most relevant text chunk
    • Return the answer

    We can build a more complete implementation with a short Python script using:

    • LangChain (to load, split, and search documents)
    • FAISS (for efficient vector storage & retrieval)
    • OpenAI GPT models (to generate conversational answers)

    How It Works

    1. Load the document (PDF or TXT).
    2. Split the text into chunks to make it manageable for embeddings.
    3. Generate embeddings using OpenAI models.
    4. Store embeddings in FAISS, a fast vector database.
    5. Retrieve the most relevant chunks based on the user’s question.
    6. Pass the context + question to an LLM (e.g., GPT-4o-mini).
    7. Get a natural language answer as if the document were talking back.

    Project Setup

    First, install the dependencies (pypdf backs LangChain's PyPDFLoader; sentence-transformers and PyMuPDF are only needed for the minimal example above):

    pip install langchain langchain-community langchain-openai langchain-text-splitters faiss-cpu python-dotenv pypdf sentence-transformers PyMuPDF

    And create a .env file for your API key and model:

    OPENAI_API_KEY=your_openai_api_key_here
    OPENAI_MODEL=gpt-4o-mini

    Example Code

    Here’s a complete script you can try:

    # talk_to_doc.py
    
    import os
    from dotenv import load_dotenv
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_community.document_loaders import PyPDFLoader, TextLoader
    from langchain_community.vectorstores import FAISS
    from langchain_openai import OpenAIEmbeddings, ChatOpenAI
    
    # Load environment variables
    load_dotenv()
    OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")
    
    # ---- Load Document ----
    def load_document(path: str):
        if path.endswith(".pdf"):
            loader = PyPDFLoader(path)
        else:
            loader = TextLoader(path, encoding="utf-8")
        return loader.load()
    
    # ---- Build Vector Store ----
    def build_vectorstore(docs):
        splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
        chunks = splitter.split_documents(docs)
        embeddings = OpenAIEmbeddings()
        return FAISS.from_documents(chunks, embeddings)
    
    # ---- Ask questions ----
    def query_vectorstore(vectorstore, question: str):
        retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})
        llm = ChatOpenAI(model=OPENAI_MODEL, temperature=0)
        # Retrieve the top-k chunks and pass them to the model as context
        docs = retriever.invoke(question)
        context = "\n\n".join([d.page_content for d in docs])
        prompt = f"Answer the question based only on the context:\n\n{context}\n\nQuestion: {question}"
        return llm.invoke(prompt).content
    
    if __name__ == "__main__":
        # Example usage
        path = "docs/example.pdf"  # or "example.txt"
        docs = load_document(path)
        vectorstore = build_vectorstore(docs)
    
        print("Chat with your document! (type 'exit' to quit)")
        while True:
            q = input("\nYour question: ")
            if q.lower() in ["exit", "quit"]:
                break
            answer = query_vectorstore(vectorstore, q)
            print(f"\nAI: {answer}")

    Example Run

    Suppose you point the script at a PDF (in the run below, a paper about the Docling document conversion tool):

    python talk_to_doc.py
    Chat with your document! (type 'exit' to quit)
    
    Your question: What is this document about?
    AI: The document discusses the development and features of an open-source document conversion tool called Docling, which focuses on ensuring that documents are free to use. It highlights the sources of data used for the tool, the challenges of annotating scanned documents, and the preparation work involved in using a cloud-native platform for visual annotation. Additionally, it mentions the gap in the market for open-source tools compared to commercial software for document understanding and conversion, emphasizing the capabilities of Docling in layout analysis and table structure recognition.

    Conclusion

    The “talk to your document” principle is about making information conversational and accessible. With just a few open-source libraries and a language model, you can transform static files into interactive knowledge companions.

  • Refactoring a Monolithic Django Application – Before/After and Performance Gains

    Refactoring a monolithic Django application can significantly improve maintainability, scalability, and performance. This article explores the before and after of such a refactor, the strategies used, and the measurable gains in performance.

    Why Refactor a Monolithic Django App?

    • Maintainability: As the codebase grows, a monolith can become difficult to maintain.
    • Performance: Tight coupling between modules may lead to slow responses and high memory usage.
    • Scalability: Monolithic apps are harder to scale horizontally compared to microservices.
    • Agility: Introducing new features is slower due to interdependencies.

    Common Challenges in Monolithic Django Applications

    • Tightly Coupled Code: Models, views, and templates are heavily interdependent.
    • Single Database Bottleneck: All modules access the same database schema, leading to contention.
    • Long Build and Deployment Times: Even minor changes require redeploying the entire application.
    • Testing Difficulties: Running tests can be slow and complex due to the large codebase.

    Refactoring Strategy

    Modularization

    • Split the monolith into reusable Django apps with clearly defined responsibilities.
    • Example: separate users, orders, and products apps, as sketched below.
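
    As a minimal sketch (app and model names here are hypothetical), each domain becomes its own Django app, and apps reference one another only through foreign keys rather than by importing internals:

    # settings.py: one app per domain
    INSTALLED_APPS = [
        "django.contrib.admin",
        "django.contrib.auth",
        "django.contrib.contenttypes",
        "users",     # accounts and profiles
        "products",  # catalog and pricing
        "orders",    # carts, checkout, fulfillment
    ]

    # orders/models.py: cross-domain references go through foreign keys
    from django.conf import settings
    from django.db import models

    class Order(models.Model):
        customer = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.PROTECT)
        created_at = models.DateTimeField(auto_now_add=True)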

    Decouple Services

    • Move non-critical or resource-intensive features into separate services or microservices.
    • Use Django REST Framework (DRF) to expose APIs for inter-service communication, as in the sketch below.
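
    For instance, a reporting feature can sit behind a small DRF API so other services talk to it over HTTP instead of importing its code (the Report model and its fields are hypothetical):

    # reports/serializers.py
    from rest_framework import serializers
    from .models import Report

    class ReportSerializer(serializers.ModelSerializer):
        class Meta:
            model = Report
            fields = ["id", "title", "created_at", "status"]

    # reports/views.py
    from rest_framework import viewsets
    from .models import Report
    from .serializers import ReportSerializer

    class ReportViewSet(viewsets.ReadOnlyModelViewSet):
        queryset = Report.objects.all()
        serializer_class = ReportSerializer

    # reports/urls.py
    from rest_framework.routers import DefaultRouter
    from .views import ReportViewSet

    router = DefaultRouter()
    router.register("reports", ReportViewSet)
    urlpatterns = router.urls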

    Optimize Database Access

    • Use the Django ORM efficiently: reduce N+1 queries with select_related and prefetch_related (see the sketch after this list).
    • Introduce caching for frequently accessed data with Redis or Memcached.
    • Consider read replicas for high-traffic tables.
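
    A minimal sketch of the N+1 fix plus a simple cache layer (the Order and Product models are hypothetical, and a Redis or Memcached backend is assumed to be configured as Django's default cache):

    from django.core.cache import cache
    from orders.models import Order      # hypothetical apps from the
    from products.models import Product  # modularization step above

    # Before: Order.objects.all() issues one extra query per order
    # when accessing order.customer (the classic N+1 pattern).

    # After: a single JOIN pulls customers in with the orders
    orders = Order.objects.select_related("customer")

    # Reverse and many-to-many relations use prefetch_related instead
    products = Product.objects.prefetch_related("categories")

    # Cache a frequently read, rarely changed queryset for 5 minutes
    def get_active_products():
        result = cache.get("active_products")
        if result is None:
            result = list(Product.objects.filter(active=True))
            cache.set("active_products", result, timeout=300)
        return result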

    Asynchronous Tasks

    • Offload heavy operations to background tasks using Celery or Django-Q; a minimal Celery sketch follows this list.
    • Examples: sending emails, processing images, generating reports.
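
    A minimal Celery sketch, assuming a broker such as Redis is configured in settings and using a hypothetical send_report_email task:

    # myproject/celery.py: standard Celery bootstrap for Django
    import os
    from celery import Celery

    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")
    app = Celery("myproject")
    app.config_from_object("django.conf:settings", namespace="CELERY")
    app.autodiscover_tasks()

    # reports/tasks.py: the heavy work runs in a worker process
    from celery import shared_task
    from django.core.mail import send_mail

    @shared_task
    def send_report_email(recipient, report_id):
        send_mail(
            subject="Your report is ready",
            message=f"Report {report_id} has been generated.",
            from_email="noreply@example.com",
            recipient_list=[recipient],
        )

    # In a view: enqueue and return immediately instead of blocking
    # send_report_email.delay("user@example.com", report.id)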

    Frontend Optimization

    • Minimize server-side rendering for static content.
    • Use client-side frameworks such as React for interactive components.

    Before/After Comparison

    Aspect                       Before             After
    Response Time                Avg. 1.2s          Avg. 0.5s
    Database Queries per Page    45                 12
    CPU Usage                    High under load    Moderate
    Deployment Time              15 min             4 min
    Test Suite Duration          45 min             15 min

    Lessons Learned

    • Incremental Refactoring: Avoid a complete rewrite. Refactor in stages to reduce risk.
    • Monitoring is Key: Use metrics (CPU, memory, response time) to measure performance gains.
    • Automated Testing: Ensure all refactored components are thoroughly tested.
    • Team Collaboration: Maintain clear documentation and consistent coding standards.
    • Use Modern Django Features: Leverage async views, QuerySet optimizations, and built-in caching mechanisms (a minimal async view sketch follows).
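
    As a closing sketch, an async view (supported since Django 3.1) can fan out slow I/O concurrently; the upstream URLs are placeholders and httpx is an assumed extra dependency:

    import asyncio
    import httpx
    from django.http import JsonResponse

    async def dashboard(request):
        # Fetch two slow upstream APIs concurrently instead of serially
        async with httpx.AsyncClient() as client:
            stats, alerts = await asyncio.gather(
                client.get("https://metrics.example.com/stats"),
                client.get("https://metrics.example.com/alerts"),
            )
        return JsonResponse({"stats": stats.json(), "alerts": alerts.json()})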