Data Chunker Pro – Quick Start Guide
Getting Started with Data Chunker Pro: Your Quick Guide to Data Preparation
Data Chunker Pro is designed to make preparing your data for AI applications simpler than ever. While our tool is powerful, understanding the initial steps can make a big difference in getting the most out of your data. This guide will walk you through the core concepts and provide a quick reference for different use cases.
Your First Data Chunk:
A Simple Guide to Getting Started with Data Chunker Pro
Welcome to Data Chunker Pro! This guide will show you how to convert your first file or directory and get started with preparing your data for AI. Don’t worry, it’s easier than you think!
Step 0: Installation (Skip if Already Installed!)
- Download the Data Chunker Pro installer from [Link to Main Page].
- Run the installer and follow the on-screen instructions.
Step 1: Choosing Your File or Directory
- Launch Data Chunker Pro.
- Select either ‘Single File’ or ‘Directory Scan (Recursive)’.
- Single File processes only the file you choose.
- Directory Scan processes every file in a directory, including its subfolders.
- Click ‘Select Source’ and choose your source file or directory.
Step 2: Selecting Your Chunking Method & Size
- Choose a Chunking Method from the Available Options (complete list at bottom of page):
- Context-Aware – Automatically picks the best method for the file type.
- Self-Contained – Creates chunks that function independently, with all necessary metadata included.
- Single File – Processes each file as a single chunk.
- Smart Size – Automatically sizes chunks to a target size (e.g., 2KB).
- By CSV Columns – Splits CSV files by columns.
- By Data Records – Separates by records (useful for structured data).
- By Function – Splits based on defined functions (ideal for code).
- By Region – Separates based on defined regions in the file.
- By Page Break – Splits based on page breaks (ideal for documents).
- By Section Header – Divides the file based on section headers.
- By Logical Block – Chunks based on naturally-occurring logical divisions.
- By Comment Group – Groups code chunks based on the comments included.
Step 3: Defining Your Export Settings
- Export Format: This is the format of the files Data Chunker Pro will create.
- Individual Text Files: Each chunk becomes a separate text file. Good for manual review.
- Combined JSON: All chunks are combined into a structured JSON file with proper indexing.
- Hybrid Export Text + JSON: Works for most platforms, combining the readability of text with the structure of JSON.
- Markdown: Perfect for integrating with AI tools.
- Include Headers: Adds a header at the top of every chunk. Useful for some databases, but not ideal for large-scale AI ingestion.
- Include IDs in names: Can help with organization.
- Open After: Automatically opens the export directory when the conversion is complete.
Step 4: Running the Conversion
- Once you’re happy with your settings, click the “Start Processing” button.
- Data Chunker Pro will now start processing your file/directory. You’ll see a progress indicator.
- When the process is complete, you’ll see a confirmation message showing the total number of files read, the total number of chunks created, and how many chunks remain.
That’s it! You’ve successfully converted your first file/directory with Data Chunker Pro.
Pro Tip: Experiment with different settings to find the best chunking options for your specific use case.
Quick Start Settings Guide
For AI Code Analysis:
- Select Markdown export format
- Choose Context-Aware or By Token Count method
- Set chunk size to 1000 (tokens) or 75 (lines)
- Process your code
- Upload the .md file to your AI tool
For Documentation:
- Select Markdown export format
- Choose By Documentation Block method
- Set any chunk size
- Process your code
- Use the generated documentation
For Custom Integration:
- Select Combined JSON + Text format
- Choose Smart Size (2KB) or By Token Count
- Set appropriate size
- Process your code
- Use JSON metadata for your custom tools
Using Data In Your LLM
To use the data in your LLM, you need to tell the LLM to look for the index so it knows how to sort the information.
Use the following prompt to help the AI know everything about your data!
You are an analysis assistant. (Add any other directives you need here.)
When analyzing uploaded projects and documents:
1. ALWAYS look for and read the “index.json” file first. If “index.json” is not available, then look for “index.md” or “index.txt” instead. These files will contain the project structure and chunk metadata.
2. Use the index to understand the database, documents, or codebase organization before answering questions.
3. When referencing code, text, or documentation, mention the source file and chunk ID for easy lookup.
4. If asked about specific functionality, section, or text, search for relevant information in the index first.
5. The index contains chunk_id, source_file, file_name, and content_preview fields to help you locate relevant code.
When displaying retrieved content, always include:
– Source file path
– Function/method/title/name
– Chunk ID (if available)
If no “index.json”, “index.md” or “index.txt” is found, analyze the documents directly but mention that an index would improve analysis.
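For reference, the index fields named above (chunk_id, source_file, file_name, content_preview) can be pictured as a small JSON document. The layout below is illustrative only — the file paths and values are hypothetical stand-ins, not the tool's exact output:

```python
import json

# Hypothetical index.json layout built from the fields named above.
index = {
    "project": "ProjectName",
    "chunks": [
        {
            "chunk_id": "chunk_001",
            "source_file": "src/utils.py",
            "file_name": "utils.py",
            "content_preview": "def parse_config(path): ...",
        },
    ],
}

# Serialize it the way an index.json on disk would look.
index_json = json.dumps(index, indent=2)

# A script (or an LLM following the prompt above) can resolve a chunk by ID.
by_id = {c["chunk_id"]: c for c in index["chunks"]}
print(by_id["chunk_001"]["source_file"])  # src/utils.py
```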
Quick Start Export Formats
Individual Text Files
What it does: Creates separate .txt files for each code chunk
Best for:
- Manual review of individual code sections
- Importing into systems that need separate files
- Organizing code by specific functions or classes
- Traditional documentation workflows
Output: chunk_001.txt, chunk_002.txt, etc. in individual folders
Combined JSON
What it does: Creates structured JSON metadata
Best for:
- Databases requiring JSON structured data
- Systems requiring structured metadata
- Automated processing
- Advanced search and indexing systems
Output:
- ProjectName.json (structured metadata)
- Contents-of_ProjectName.json (all content combined)
- Contents_ProjectName/index.json (detailed index)
Hybrid Export JSON + Text
What it does: Creates structured JSON metadata + combined text file
Best for:
- Developers building custom AI tools
- Systems requiring structured metadata
- API integrations and automated processing
- Advanced search and indexing systems
Output:
- ProjectName.json (structured metadata)
- Contents-of_ProjectName.txt (all content combined)
- Contents_ProjectName/index.json (detailed index)
Markdown Format
What it does: Creates AI-optimized Markdown files with syntax highlighting
Best for:
- Ollama + Open WebUI + RAG systems
- ChatGPT, Claude, and other AI assistants
- GitHub documentation
- Technical documentation and wikis
- Any AI-powered code analysis
Output:
- Contents-of_ProjectName.md (beautifully formatted, AI-ready)
- Contents_ProjectName/index.md (clickable table of contents)
- Proper code syntax highlighting for 20+ languages
Complete User Guide: AI Code Chunking & Export Tool
What This Tool Does
This application processes your source code files and breaks them into optimized chunks for AI analysis, documentation, and knowledge management. It’s designed to make your codebase AI-friendly for tools like ChatGPT, Claude, Ollama, and RAG (Retrieval Augmented Generation) systems.
Export Formats
Individual Text Files
What it does: Creates separate .txt files for each code chunk
Best for:
- Manual review of individual code sections
- Importing into systems that need separate files
- Organizing code by specific functions or classes
- Traditional documentation workflows
Output: chunk_001.txt, chunk_002.txt, etc. in individual folders
Combined JSON + Text
What it does: Creates structured JSON metadata + combined text file
Best for:
- Developers building custom AI tools
- Systems requiring structured metadata
- API integrations and automated processing
- Advanced search and indexing systems
Output:
- ProjectName.json (structured metadata)
- Contents-of_ProjectName.txt (all content combined)
- Contents_ProjectName/index.json (detailed index)
Markdown Format – RECOMMENDED for AI
What it does: Creates AI-optimized Markdown files with syntax highlighting
Best for:
- Ollama + Open WebUI + RAG systems
- ChatGPT, Claude, and other AI assistants
- GitHub documentation
- Technical documentation and wikis
- Any AI-powered code analysis
Output:
- Contents-of_ProjectName.md (beautifully formatted, AI-ready)
- Contents_ProjectName/index.md (clickable table of contents)
- Proper code syntax highlighting for 20+ languages
Chunking Methods
By Function/Method – RECOMMENDED FOR CODE
What it does: Splits code at function and method boundaries
Best for:
- Analyzing individual functions
- Code review and optimization
- Understanding specific algorithms
- Function-level documentation
Use case: “I want to analyze each function separately for optimization”
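As a rough sketch of what function-level chunking does (shown here for Python source; this is an illustration of the concept, not Data Chunker Pro's own implementation), each top-level function becomes its own chunk:

```python
import ast

def chunk_by_function(source: str) -> list[str]:
    """Split Python source into one chunk per top-level function."""
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]

code = """
def add(a, b):
    return a + b

def sub(a, b):
    return a - b
"""
chunks = chunk_by_function(code)
print(len(chunks))  # 2
```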
By Class
What it does: Groups entire classes together, including all methods
Best for:
- Object-oriented analysis
- Understanding class relationships
- Refactoring entire classes
- Architecture review
Use case: “I want to understand how each class works as a complete unit”
By Documentation Block
What it does: Keeps comments and documentation with their related code
Best for:
- Maintaining context between docs and code
- Understanding developer intent
- Code explanation and teaching
- Preserving design decisions
Use case: “I want AI to understand not just what code does, but why it was written”
Smart Size (2KB)
What it does: Intelligently splits code at logical boundaries near the specified KB size
Best for:
- Consistent chunk sizes for processing
- Memory-constrained systems
- Balanced content distribution
- General-purpose chunking
Use case: “I need consistent-sized chunks that respect code structure”
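The idea can be sketched as follows — accumulate blocks (here, blank-line-separated paragraphs stand in for "logical boundaries") until the next block would push the chunk past the target size. This illustrates the concept only; it is not the tool's actual algorithm:

```python
def smart_size_chunks(text: str, target_kb: float = 2.0) -> list[str]:
    """Split text near target_kb, preferring blank-line boundaries."""
    target = int(target_kb * 1024)  # target chunk size in bytes (ASCII assumed)
    chunks, current = [], ""
    for block in text.split("\n\n"):
        if current and len(current) + len(block) > target:
            chunks.append(current)  # close the chunk at a logical boundary
            current = block
        else:
            current = f"{current}\n\n{block}" if current else block
    if current:
        chunks.append(current)
    return chunks

parts = smart_size_chunks(("x" * 500 + "\n\n") * 10, target_kb=2.0)
# Each chunk stays at or under the ~2 KB target
```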
Context-Aware – RECOMMENDED FOR GENERAL USE
What it does: Includes necessary imports, namespaces, and class definitions in each chunk
Best for:
- AI code analysis and suggestions
- Understanding dependencies
- Code modification assistance
- Complete context preservation
Use case: “I want AI to understand my code with full context for better suggestions”
By Region
What it does: Splits on #Region blocks (VB.NET) or similar organizational markers
Best for:
- Well-organized codebases using regions
- Logical code organization
- Team-based development
- Maintaining developer-intended groupings
Use case: “My code uses regions to organize functionality”
By Token Count – RECOMMENDED FOR AI
What it does: Splits based on AI model token limits (optimal for AI processing)
Best for:
- Ollama, ChatGPT, Claude integration
- RAG systems and vector databases
- AI model optimization
- Consistent AI processing
Use case: “I want perfect-sized chunks for my AI model’s token limits”
By Markdown Header
What it does: Splits Markdown files at header boundaries (#, ##, ###)
Best for:
- Processing existing documentation
- README files and wikis
- Technical documentation
- Structured document analysis
Use case: “I have Markdown documentation I want to process”
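A minimal sketch of header-based splitting (illustrative only, not the tool's implementation):

```python
import re

def chunk_by_header(md: str) -> list[str]:
    """Split Markdown text at every #, ##, or ### header line."""
    starts = [m.start() for m in re.finditer(r"^#{1,3} ", md, re.MULTILINE)]
    if not starts:
        return [md]
    if starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first header
    starts.append(len(md))
    return [md[a:b].rstrip() for a, b in zip(starts, starts[1:])]

doc = "# Intro\nHello.\n## Usage\nRun it.\n"
print(chunk_by_header(doc))  # ['# Intro\nHello.', '## Usage\nRun it.']
```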
Self-Contained
What it does: Ensures each chunk can be understood independently
Best for:
- Independent code analysis
- Parallel processing
- Distributed AI analysis
- Standalone code review
Use case: “Each chunk needs to make sense without other chunks”
Single File
What it does: Processes entire files as single chunks (ignores size settings)
Best for:
- Small files or scripts
- Complete file analysis
- Simple processing needs
- Configuration files
Use case: “I want to keep entire files together”
By Logical Block
What it does: Splits based on pre-defined logical blocks within a file (e.g., between chapters in a book, or sections in a report)
Best for:
- Processing long documents with clear section boundaries.
- Maintaining document structure for human readability
Use case: “I have a large document divided into distinct sections.”
By Database Record
What it does: Splits files based on individual records within a database format (e.g., CSV, SQL dump).
Best for:
- Processing large datasets stored in structured formats.
- Extracting specific records for analysis.
Use case: “I need to process data from a database in manageable chunks.”
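For CSV input, record-based chunking can be pictured like this (a sketch using Python's standard csv module, not the tool's actual code):

```python
import csv
import io

def chunk_csv_records(csv_text: str, records_per_chunk: int = 2) -> list[list[dict]]:
    """Group CSV rows into fixed-size batches, keyed by the header row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return [rows[i:i + records_per_chunk]
            for i in range(0, len(rows), records_per_chunk)]

data = "id,name\n1,alice\n2,bob\n3,carol\n"
batches = chunk_csv_records(data)
print(len(batches))  # 2 batches: rows 1-2, then row 3
```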
Chunk Size Settings
Dynamic Label System
Our intelligent interface adapts to your chosen chunking method, providing contextually relevant controls:
- “Chunk Size” – For line-based methods (By Function, By Class, etc.) – Displays line count limits
- “Chunk Size (KB)” – For Smart Size method – Shows file size in kilobytes
- “Chunk Size (TK)” – For By Token Count method – Indicates token limits for AI processing
Recommended Sizes by Use Case
For AI Analysis (Token Count) – MOST POPULAR
- 500 tokens: Small, focused chunks ideal for specific functions, variable definitions, or targeted code snippets. Perfect for precise code search and minimal context requirements.
- 1000 tokens: ⭐ OPTIMAL – Complete methods with surrounding context. Provides the best balance of specificity and comprehensiveness for most AI applications.
- 1500 tokens: Larger chunks suitable for full classes, complex logic blocks, or comprehensive documentation sections. Ideal for understanding broader code relationships.
- 2000 tokens: Maximum recommended for most AI models. Use for complete modules or extensive documentation that must stay together.
- 2500+ tokens: Advanced use only – may exceed context windows of smaller AI models.
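As a rough illustration of token-count chunking (whitespace-separated words stand in for real model tokens here; a production version would use the target model's tokenizer, such as tiktoken for GPT models):

```python
def chunk_by_tokens(text: str, max_tokens: int = 1000) -> list[str]:
    # Words approximate tokens; swap in a real tokenizer for exact counts.
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

chunks = chunk_by_tokens("token " * 2500, max_tokens=1000)
print([len(c.split()) for c in chunks])  # [1000, 1000, 500]
```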
For File Size (KB) – BALANCED APPROACH
- 1-2 KB: Small, focused sections perfect for individual functions or specific code blocks. Fastest processing and most granular search results.
- 2-4 KB: ⭐ OPTIMAL – Balanced content size that typically contains complete functions with context, comments, and related code.
- 4-8 KB: Large sections ideal for complete classes, modules, or comprehensive documentation pages. Good for understanding system architecture.
- 8+ KB: Very large chunks for extensive files that should remain intact. Use sparingly, as it may impact AI processing speed.
- 16+ KB: Maximum recommended – only for critical files that must stay complete.
For Line Count – DEVELOPER FRIENDLY
- 25-50 lines: Small functions, getters/setters, utility methods. Perfect for microservice architectures and focused code analysis.
- 50-100 lines: ⭐ OPTIMAL – Complete methods with full logic, error handling, and documentation. Most natural for code review and analysis.
- 100-200 lines: Large methods, complete classes, or complex algorithms. Suitable for understanding comprehensive business logic.
- 200+ lines: Very large sections, complete modules, or legacy code blocks. Use when code structure requires keeping large sections together.
- 500+ lines: Maximum practical limit – consider refactoring code or using different chunking strategy.
Perfect Use Cases
🤖 AI Coding Assistants
- Train AI on your specific codebase and patterns
- Context-aware code suggestions and completions
- Intelligent code review and optimization
- Custom AI models for domain-specific development
📚 RAG Knowledge Bases
- Power Ollama, Open WebUI, and vector databases
- Searchable, AI-accessible code documentation
- Intelligent code search and retrieval
- Context-aware technical support systems
🏢 Enterprise Legacy Modernization
- Process COBOL, FORTRAN, and legacy systems
- Document undocumented legacy codebases
- Prepare legacy code for AI-assisted migration
- Create searchable knowledge bases from old systems
🔬 LLM Fine-Tuning & Research
- Prepare domain-specific training data from your projects
- Create high-quality code datasets for model training
- Research code patterns and architectural decisions
- Academic research on software engineering
📖 Technical Knowledge Management
- Convert repositories into intelligent, searchable systems
- Onboard new team members with AI-powered code exploration
- Create comprehensive technical documentation automatically
- Build institutional knowledge preservation systems
🔍 Code Analysis & Review
- AI-powered code understanding and optimization
- Automated code quality assessment
- Security vulnerability analysis preparation
- Architecture and design pattern analysis
🎓 Education & Training
- Create AI-powered coding tutorials and examples
- Build interactive code learning systems
- Prepare code examples for educational AI assistants
- Research and academic code analysis
Complete File Type Support List
Here you will find a complete list of all 829 currently supported formats.
Why Chunk Size Matters
The size of your chunks directly impacts AI performance and retrieval accuracy. Smaller chunks provide precise, focused content but may lack context. Larger chunks offer comprehensive information but can overwhelm AI models or exceed context windows. Our recommendations are based on extensive testing with popular AI models and RAG systems.
Advanced Chunking Strategies
Hybrid Approach (Enterprise Feature)
Combine multiple chunking methods for optimal results:
- Use Function-based chunking for code files
- Use Token-based chunking for documentation
- Use Size-based chunking for data files
Context Preservation
Our chunking engine automatically:
- Preserves function signatures across chunks
- Maintains class hierarchy information
- Includes relevant imports and dependencies
- Adds contextual headers for AI understanding
AI Model Compatibility
Different AI models have varying optimal chunk sizes:
- GPT-3.5/4: 1000-1500 tokens optimal
- Claude: 1500-2000 tokens optimal
- Local models (7B-13B): 500-1000 tokens optimal
- Larger models (70B+): 2000+ tokens acceptable
Performance Impact
Processing Speed by Chunk Size:
- Small chunks (500 tokens/1KB/50 lines): Fastest processing, most granular results
- Medium chunks (1000 tokens/2-4KB/100 lines): Balanced speed and comprehensiveness
- Large chunks (2000+ tokens/8KB+/200+ lines): Slower processing, comprehensive context
Storage Efficiency:
- Smaller chunks = More files, better organization, faster searches
- Larger chunks = Fewer files, less overhead, complete context
Best Practices
- Start with recommended optimal sizes for your use case
- Test with your specific AI model to find the sweet spot
- Consider your content type – code vs documentation vs data
- Monitor AI response quality and adjust accordingly
- Use consistent sizing within the same project for predictable results
Expert Tips
- For code analysis: Function-based chunking with 1000 token limit works best 90% of the time
- For documentation: Size-based chunking at 2-4KB preserves formatting and context
- For mixed projects: Start with token-based at 1000 tokens, then fine-tune based on results
- For enterprise use: Consider multiple exports with different chunk sizes for different use cases
Chunking Methods Deep Dive
Context-Aware Chunking ⭐ RECOMMENDED FOR AI
What it does: Includes necessary imports, namespaces, and class definitions in each chunk
Best for: AI code analysis, code suggestions, modification assistance, complete context preservation
Use case: “I want AI to understand my code with full context for better suggestions”
Output: Each chunk contains all dependencies needed for AI to understand the code independently
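Conceptually (shown for Python source; a simplified sketch, not the tool's implementation), context-aware chunking prepends the file's import lines to every chunk so each one stands alone:

```python
def make_context_aware(chunks: list[str], source: str) -> list[str]:
    """Prepend the source file's import lines to every chunk."""
    imports = [ln for ln in source.splitlines()
               if ln.startswith(("import ", "from "))]
    header = "\n".join(imports)
    return [f"{header}\n\n{c}" if header else c for c in chunks]

src = "import os\n\ndef cwd():\n    return os.getcwd()\n"
out = make_context_aware(["def cwd():\n    return os.getcwd()"], src)
print(out[0].splitlines()[0])  # import os
```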
Token Count Chunking ⭐ RECOMMENDED FOR AI
What it does: Splits based on AI model token limits (optimal for AI processing)
Best for: Ollama, ChatGPT, Claude integration, RAG systems, vector databases, AI model optimization
Token Ranges:
- 500 tokens: Small, focused chunks for specific functions and targeted analysis
- 1000 tokens: ⭐ OPTIMAL – Complete methods with context, best balance for most AI applications
- 1500 tokens: Larger chunks for full classes and complex logic blocks
- 2000 tokens: Maximum for most AI models, complete modules and extensive documentation
- 2500+ tokens: Advanced use only, may exceed smaller AI model limits
Use case: “I want perfect-sized chunks for my AI model’s token limits”
Function/Method Level Chunking
What it does: Splits code at function and method boundaries
Best for: Analyzing individual functions, code review, optimization, algorithm understanding
Output: Each chunk contains one complete function with its documentation
Use case: “I want to analyze each function separately for optimization”
Class-Based Chunking
What it does: Groups entire classes together, including all methods and properties
Best for: Object-oriented analysis, understanding class relationships, refactoring, architecture review
Output: Complete class definitions with all members and methods
Use case: “I want to understand how each class works as a complete unit”
Documentation Block Chunking
What it does: Keeps comments, docstrings, and documentation with their related code
Best for: Maintaining context between docs and code, understanding developer intent, code explanation, preserving design decisions
Output: Code sections with their associated documentation and comments
Use case: “I want AI to understand not just what code does, but why it was written”
Smart Size Chunking
What it does: Intelligently splits content at logical boundaries near specified KB size
Size Ranges:
- 1-2 KB: Small, focused sections for individual functions
- 2-4 KB: ⭐ OPTIMAL – Balanced content with complete functions and context
- 4-8 KB: Large sections for complete classes and comprehensive documentation
- 8+ KB: Very large chunks for extensive files that should remain intact
Best for: Consistent chunk sizes, memory-constrained systems, balanced content distribution
Use case: “I need consistent-sized chunks that respect code structure”
Region-Based Chunking
What it does: Splits on #Region blocks (VB.NET), #pragma region (C++), or similar organizational markers
Best for: Well-organized codebases using regions, logical code organization, team-based development
Output: Chunks based on developer-defined logical groupings
Use case: “My code uses regions to organize functionality”
Self-Contained Chunking
What it does: Ensures each chunk can be understood independently with necessary context
Best for: Independent code analysis, parallel processing, distributed AI analysis, standalone code review
Output: Chunks that make complete sense without requiring other chunks
Use case: “Each chunk needs to make sense without other chunks”
Markdown Header Chunking
What it does: Splits Markdown files at header boundaries (#, ##, ###)
Best for: Processing existing documentation, README files, wikis, technical documentation, structured document analysis
Output: Document sections organized by header hierarchy
Use case: “I have Markdown documentation I want to process”
Single File Chunking
What it does: Processes entire files as single chunks (ignores size settings)
Best for: Small files, scripts, complete file analysis, simple processing needs, configuration files
Output: One chunk per file, regardless of file size
Use case: “I want to keep entire files together”
Line Count Chunking
What it does: Splits based on number of lines with intelligent boundary detection
Line Ranges:
- 25-50 lines: Small functions, utility methods, microservice architectures
- 50-100 lines: ⭐ OPTIMAL – Complete methods with full logic and documentation
- 100-200 lines: Large methods, complete classes, complex algorithms
- 200+ lines: Very large sections, complete modules, legacy code blocks
Best for: Developer-friendly chunking, code review, maintaining readable chunk sizes
Use case: “I want chunks that feel natural to developers”
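Stripped of the boundary detection mentioned above, the core of line-count chunking is just fixed-size slicing (a naive sketch, not the tool's implementation):

```python
def chunk_by_lines(text: str, max_lines: int = 100) -> list[str]:
    # Naive version: slice every max_lines lines. The real method would also
    # nudge each cut toward an intelligent boundary (e.g. a blank line).
    lines = text.splitlines()
    return ["\n".join(lines[i:i + max_lines])
            for i in range(0, len(lines), max_lines)]

pieces = chunk_by_lines("line\n" * 250, max_lines=100)
print([len(p.splitlines()) for p in pieces])  # [100, 100, 50]
```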
Optimal Chunking Strategies by Use Case
🤖 AI Code Assistant Setup (MOST POPULAR)
- Export Format: Markdown
- Chunking Method: Context-Aware OR Token Count
- Chunk Size: 1000 tokens OR 75-100 lines
- Perfect for: Ollama, ChatGPT, Claude, Open WebUI RAG
- Why it works: Provides complete context while staying within AI model limits
📚 Code Documentation Generation
- Export Format: Markdown
- Chunking Method: Documentation Block
- Chunk Size: Any size (preserves natural documentation boundaries)
- Perfect for: Technical docs, code explanation, wikis, team knowledge bases
- Why it works: Maintains the relationship between code and its documentation
🔍 Code Review & Analysis
- Export Format: Individual Text Files OR Markdown
- Chunking Method: Function Level OR Class-Based
- Chunk Size: 50-100 lines
- Perfect for: Manual review, optimization, refactoring, quality assessment
- Why it works: Provides focused, reviewable code sections
⚙️ API Integration & Custom Tools
- Export Format: Combined JSON + Text
- Chunking Method: Smart Size (2KB) OR Token Count
- Chunk Size: 1000 tokens OR 2KB
- Perfect for: Custom tools, automated processing, integration with existing systems
- Why it works: Structured metadata enables programmatic processing
🎓 Learning & Teaching
- Export Format: Markdown
- Chunking Method: Self-Contained OR Documentation Block
- Chunk Size: 75-150 lines
- Perfect for: Code education, examples, tutorials, training materials
- Why it works: Each chunk is understandable independently
🏗️ Architecture Analysis
- Export Format: Markdown
- Chunking Method: Class-Based OR Context-Aware
- Chunk Size: 100-200 lines
- Perfect for: System design, refactoring, architectural planning
- Why it works: Preserves complete architectural units and their relationships
🔬 Research & Data Science
- Export Format: JSON + Text
- Chunking Method: Token Count
- Chunk Size: 500-1000 tokens
- Perfect for: Code analysis research, pattern recognition, academic studies
- Why it works: Consistent, measurable chunks with rich metadata
🏢 Enterprise Legacy Processing
- Export Format: All formats (comprehensive coverage)
- Chunking Method: Smart Size OR Documentation Block
- Chunk Size: 2-4KB OR preserve natural boundaries
- Perfect for: Legacy system documentation, modernization planning
- Why it works: Handles diverse legacy formats while preserving business logic context
Tested AI Model Compatibility Guide
Large Language Models
OpenAI GPT Models
- GPT-3.5-Turbo: 1000-1500 tokens optimal, 4096 token limit
- GPT-4: 1500-2000 tokens optimal, 8192 token limit
- GPT-4-Turbo: 2000-4000 tokens optimal, 128k token limit
- Recommended Settings: Token Count chunking, Context-Aware method
Anthropic Claude
- Claude-1: 1500-2000 tokens optimal, 9k token limit
- Claude-2: 2000-3000 tokens optimal, 100k token limit
- Claude-3: 2000-4000 tokens optimal, 200k token limit
- Recommended Settings: Token Count chunking, Documentation Block method
Local Models (Ollama)
- 7B Models (CodeLlama, Mistral, Gemma3): 500-1000 tokens optimal
- 13B Models (Devstral, Gemma3): 1000-1500 tokens optimal
- 34B+ Models: 1500-2000 tokens optimal
- Recommended Settings: Token Count chunking, Self-Contained method
RAG Systems
Vector Databases
- Pinecone: 1000-2000 tokens per chunk, JSON export recommended
- Weaviate: 500-1500 tokens per chunk, Markdown export optimal
- Chroma: 1000-2000 tokens per chunk, flexible format support
- Qdrant: 500-2000 tokens per chunk
This tool transforms your codebase into AI-ready, well-organized chunks perfect for modern development workflows, AI assistance, and comprehensive code analysis.

