vector-text-fixer

# Vector Text Fixer

Fixes garbled text in PDF/SVG vector graphics so that it can be edited in AI tools.

## Features

- **Garbled Text Detection**: Automatically identifies garbled text in PDF/SVG files
- **Smart Repair**: Infers original text content based on context
- **Batch Processing**: Supports batch processing of multiple files in a folder
- **Format Preservation**: Repaired files maintain original vector format and layout
- **AI-assisted Editing**: Outputs an intermediate format that can be imported into AI editors

## Supported Scenarios

### 1. PDF Garbled Text Repair

- Box/question mark issues caused by font embedding problems
- Garbled text caused by encoding conversion errors
- Abnormal characters produced by substitution for missing fonts
- Multi-language mixed encoding issues

### 2. SVG Garbled Text Repair

- Text entity encoding errors
- Special character escaping issues
- Display abnormalities caused by invalid font references
- XML encoding declaration errors

## Usage

### Command Line

```bash
# Fix a single PDF file
python scripts/main.py --input document.pdf --output fixed.pdf

# Fix a single SVG file
python scripts/main.py --input diagram.svg --output fixed.svg

# Batch process a folder
python scripts/main.py --batch ./input_folder --output ./output_folder

# Interactive repair (manually specify replacement content)
python scripts/main.py --input doc.pdf --interactive

# Export as editable format (JSON)
python scripts/main.py --input doc.pdf --export-json editable.json
```

### Python API

```python
from scripts.main import VectorTextFixer

# Create fixer instance
fixer = VectorTextFixer()

# Fix PDF
result = fixer.fix_pdf("input.pdf", "output.pdf")

# Fix SVG
result = fixer.fix_svg("input.svg", "output.svg")

# Batch processing
results = fixer.batch_fix("./input_folder", "./output_folder")

# Get text map (for AI editing)
text_map = fixer.extract_text_map("input.pdf")
```

## Input Parameters

| Parameter | Type | Required | Description |
|------|------|------|------|
| `--input` | str | Yes* | Input file path (PDF or SVG) |
| `--batch` | str | No | Batch processing input folder |
| `--output` | str | Yes* | Output file/folder path |
| `--interactive` | bool | No | Enable interactive repair mode |
| `--export-json` | str | No | Export editable JSON format |
| `--encoding` | str | No | Specify source file encoding (default: auto-detect) |
| `--font-substitution` | dict | No | Font replacement mapping |
| `--repair-level` | str | No | Repair level: minimal, standard, aggressive (default: standard) |

\* At least one of `--input` and `--batch` is required.

## Output Format

### Repaired PDF/SVG

- Maintains original vector format
- Garbled text replaced with readable content
- Fonts and layout remain unchanged

### JSON Export Format

```json
{
  "file_type": "pdf",
  "pages": [
    {
      "page_num": 1,
      "text_blocks": [
        {
          "id": "tb_001",
          "bbox": [100, 200, 300, 220],
          "original_text": "��",
          "detected_encoding": "UTF-8",
          "confidence": 0.3,
          "suggested_fix": "Sample Text"
        }
      ]
    }
  ],
  "fonts_used": ["Arial", "SimSun"],
  "repair_summary": {
    "total_blocks": 15,
    "fixed_blocks": 12,
    "skipped_blocks": 3
  }
}
```

## Garbled Text Detection Rules

The tool uses the following rules to detect garbled text:

1. **Replacement Character Detection**: Identifies U+FFFD (�) and box characters
2. **Control Character Filtering**: Excludes non-printing control characters
3. **Encoding Consistency**: Detects anomalies caused by mixed encodings
4. **Font Fallback Detection**: Identifies substitution characters generated due to missing fonts
5. **Probability Model**: Estimates the probability that text is garbled from character frequencies

## Repair Strategies

### Minimal

- Only repairs obvious errors (replacement characters, null bytes)
- Maintains maximum integrity of original text
- Suitable for minor garbled text issues

### Standard

- Repairs common encoding issues
- Smart font replacement
- Balances repair rate and accuracy

### Aggressive

- Comprehensive text re-encoding
- Uses OCR-assisted recognition
- Suitable for severely garbled documents

## Examples

### Fix Single Page PDF

**Input**:

```bash
python scripts/main.py --input report.pdf --output fixed_report.pdf
```

**Output**:

```
✓ Processing: report.pdf
✓ Detected 5 garbled text blocks
✓ Fixed 4 blocks automatically
⚠ 1 block requires manual review
✓ Output saved: fixed_report.pdf
✓ Report saved: fixed_report_repair_log.json
```

### Export Editable JSON

**Input**:

```bash
python scripts/main.py --input diagram.svg --export-json editable.json
```

**Output JSON Structure**:

```json
{
  "file_type": "svg",
  "svg_info": {
    "width": 800,
    "height": 600,
    "viewBox": "0 0 800 600"
  },
  "text_elements": [
    {
      "id": "text_1",
      "x": 100,
      "y": 200,
      "font_family": "Arial",
      "font_size": 14,
      "original": "��",
      "user_editable": "",
      "confidence": 0.25
    }
  ]
}
```

## Dependencies

```
pdfplumber>=0.10.0      # PDF parsing
PyMuPDF>=1.23.0         # PDF processing (fitz)
cairosvg>=2.7.0         # SVG conversion
beautifulsoup4>=4.12.0  # SVG parsing
fonttools>=4.40.0       # Font processing
chardet>=5.0.0          # Encoding detection
Pillow>=10.0.0          # Image processing
```

## Limitations

- Encrypted PDFs must be unlocked with their password before processing
- Severely damaged vector files may not be fully repairable
- Some rare fonts may not map correctly
- Scanned PDFs must be run through OCR first

## Version Information

- **Version**: 1.0.0
- **Last Updated**: 2026-02-06
- **Status**: Ready for use

## Risk Assessment

| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

## Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited

## Prerequisites

```bash
# Python dependencies
pip install -r requirements.txt
```

## Evaluation Criteria

### Success Metrics

- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

### Test Cases

1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time

## Lifecycle Status

- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**:
  - Performance optimization
  - Additional feature support
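
Rules 1 and 2 under Garbled Text Detection Rules can be illustrated with a minimal sketch. The helper names `garbled_ratio` and `is_garbled`, the set of box characters, and the 0.3 threshold are assumptions chosen for illustration; they are not taken from the tool's code and may differ from what `VectorTextFixer` does internally.

```python
import unicodedata

# Hypothetical helpers for illustration; not part of the documented API.
REPLACEMENT_CHAR = "\ufffd"
# Assumed set of "box" glyphs: white square and white vertical rectangle.
BOX_CHARS = {"\u25a1", "\u25af"}


def garbled_ratio(text):
    """Return the fraction of visible characters that look garbled."""
    # Rule 2: drop non-printing control characters (Unicode category Cc)
    # so that newlines and tabs do not dilute or inflate the score.
    visible = [ch for ch in text if unicodedata.category(ch) != "Cc"]
    if not visible:
        return 0.0
    # Rule 1: count U+FFFD replacement characters and box characters.
    bad = sum(1 for ch in visible if ch == REPLACEMENT_CHAR or ch in BOX_CHARS)
    return bad / len(visible)


def is_garbled(text, threshold=0.3):
    """Flag a text block when its garbled ratio reaches the assumed threshold."""
    return garbled_ratio(text) >= threshold


print(is_garbled("\ufffd\ufffd"))     # block made only of replacement characters
print(is_garbled("Sample Text"))      # ordinary readable text
```

A ratio rather than a simple presence check keeps a long, mostly readable block from being flagged because of a single stray character.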
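
The PDF structure shown under JSON Export Format can be post-processed with the standard library alone. The sketch below lists blocks that may need manual review. It assumes that a lower `confidence` value means the block is less trustworthy (the source does not define the field), and the function name `blocks_needing_review` and the 0.5 cutoff are hypothetical.

```python
import json


def blocks_needing_review(export, max_confidence=0.5):
    """Collect text blocks whose confidence is at or below the cutoff."""
    flagged = []
    for page in export.get("pages", []):
        for block in page.get("text_blocks", []):
            # Blocks without a confidence field are treated as trusted.
            if block.get("confidence", 1.0) <= max_confidence:
                flagged.append({
                    "page": page.get("page_num"),
                    "id": block.get("id"),
                    "suggested_fix": block.get("suggested_fix"),
                })
    return flagged


# Inline sample mirroring the documented export; in practice the data
# would come from json.load() on the file written by --export-json.
sample = json.loads("""
{
  "file_type": "pdf",
  "pages": [
    {
      "page_num": 1,
      "text_blocks": [
        {"id": "tb_001", "confidence": 0.3, "suggested_fix": "Sample Text"},
        {"id": "tb_002", "confidence": 0.9, "suggested_fix": "Title"}
      ]
    }
  ]
}
""")

for item in blocks_needing_review(sample):
    print(item["page"], item["id"], item["suggested_fix"])
```

The second block, `tb_002`, is an invented entry added only to show that blocks above the cutoff are skipped.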
