# Vector Text Fixer
Fixes garbled text in PDF and SVG vector graphics so that the text can be edited in AI tools.
## Features
- **Garbled Text Detection**: Automatically identifies garbled text in PDF/SVG files
- **Smart Repair**: Infers original text content based on context
- **Batch Processing**: Supports batch processing of multiple files in a folder
- **Format Preservation**: Repaired files maintain original vector format and layout
- **AI-assisted Editing**: Outputs an intermediate JSON format that can be imported into AI editors
## Supported Scenarios
### 1. PDF Garbled Text Repair
- Box/question mark issues caused by font embedding problems
- Garbled text caused by encoding conversion errors
- Abnormal characters produced when a missing font is substituted
- Multi-language mixed encoding issues
### 2. SVG Garbled Text Repair
- Text entity encoding errors
- Special character escaping issues
- Display abnormalities caused by invalid font references
- XML encoding declaration errors
## Usage
### Command Line
```bash
# Fix a single PDF file
python scripts/main.py --input document.pdf --output fixed.pdf
# Fix a single SVG file
python scripts/main.py --input diagram.svg --output fixed.svg
# Batch process folder
python scripts/main.py --batch ./input_folder --output ./output_folder
# Interactive repair (manually specify replacement content)
python scripts/main.py --input doc.pdf --interactive
# Export as editable format (JSON)
python scripts/main.py --input doc.pdf --export-json editable.json
```
### Python API
```python
from scripts.main import VectorTextFixer
# Create fixer instance
fixer = VectorTextFixer()
# Fix PDF
result = fixer.fix_pdf("input.pdf", "output.pdf")
# Fix SVG
result = fixer.fix_svg("input.svg", "output.svg")
# Batch processing
results = fixer.batch_fix("./input_folder", "./output_folder")
# Get text map (for AI editing)
text_map = fixer.extract_text_map("input.pdf")
```
## Input Parameters
| Parameter | Type | Required | Description |
|------|------|------|------|
| `--input` | str | Yes* | Input file path (PDF or SVG) |
| `--batch` | str | Yes* | Input folder for batch processing |
| `--output` | str | Yes* | Output file/folder path |
| `--interactive` | bool | No | Enable interactive repair mode |
| `--export-json` | str | No | Export editable JSON format |
| `--encoding` | str | No | Specify source file encoding (default: auto-detect) |
| `--font-substitution` | dict | No | Font replacement mapping |
| `--repair-level` | str | No | Repair level: minimal, standard, aggressive (default: standard) |

*At least one of `--input` and `--batch` is required.
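
The table above could be mirrored with `argparse` roughly as follows. This is a sketch under assumptions, not the parser in `scripts/main.py`: the `build_parser` and `parse_args` names are hypothetical, and `--font-substitution` is left out because the source does not say how a dict is passed on the command line.

```python
import argparse


def build_parser():
    """Build a parser covering most flags from the parameter table."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--input", help="Input file path (PDF or SVG)")
    parser.add_argument("--batch", help="Batch processing input folder")
    parser.add_argument("--output", help="Output file/folder path")
    parser.add_argument("--interactive", action="store_true")
    parser.add_argument("--export-json", help="Path for the editable JSON export")
    parser.add_argument("--encoding", default=None, help="None means auto-detect")
    parser.add_argument(
        "--repair-level",
        choices=["minimal", "standard", "aggressive"],
        default="standard",
    )
    return parser


def parse_args(argv):
    """Parse argv and enforce the 'at least one of --input/--batch' rule."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not (args.input or args.batch):
        parser.error("at least one of --input and --batch is required")
    return args


args = parse_args(["--input", "doc.pdf", "--export-json", "editable.json"])
print(args.repair_level)  # standard
```

Calling `parser.error` prints a usage message and exits with status 2, which keeps the footnote's rule in one place instead of scattering checks through the script.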
## Output Format
### Repaired PDF/SVG
- Maintains original vector format
- Garbled text replaced with readable content
- Fonts and layout remain unchanged
### JSON Export Format
```json
{
  "file_type": "pdf",
  "pages": [
    {
      "page_num": 1,
      "text_blocks": [
        {
          "id": "tb_001",
          "bbox": [100, 200, 300, 220],
          "original_text": "�����",
          "detected_encoding": "UTF-8",
          "confidence": 0.3,
          "suggested_fix": "Sample Text"
        }
      ]
    }
  ],
  "fonts_used": ["Arial", "SimSun"],
  "repair_summary": {
    "total_blocks": 15,
    "fixed_blocks": 12,
    "skipped_blocks": 3
  }
}
```
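
One possible use of this export is to list the blocks that still need a human look. The sketch below assumes the field names shown above and assumes that `confidence` scores the suggested fix. The `blocks_needing_review` helper and the 0.5 threshold are illustrative and not part of the tool.

```python
def blocks_needing_review(export, threshold=0.5):
    """Return (page_num, block_id, suggested_fix) for low-confidence blocks."""
    flagged = []
    for page in export.get("pages", []):
        for block in page.get("text_blocks", []):
            if block.get("confidence", 0.0) < threshold:
                flagged.append(
                    (page["page_num"], block["id"], block.get("suggested_fix", ""))
                )
    return flagged


# Trimmed copy of the export shown above.
sample = {
    "file_type": "pdf",
    "pages": [
        {
            "page_num": 1,
            "text_blocks": [
                {
                    "id": "tb_001",
                    "bbox": [100, 200, 300, 220],
                    "original_text": "\ufffd" * 5,
                    "confidence": 0.3,
                    "suggested_fix": "Sample Text",
                }
            ],
        }
    ],
}

print(blocks_needing_review(sample))  # [(1, 'tb_001', 'Sample Text')]
```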
## Garbled Text Detection Rules
The tool uses the following rules to detect garbled text:
1. **Replacement Character Detection**: Identifies U+FFFD (�) and box characters
2. **Control Character Filtering**: Excludes non-printing control characters
3. **Encoding Consistency**: Detects anomalies caused by mixed encodings
4. **Font Fallback Detection**: Identifies substitute characters produced when a font is missing
5. **Probability Model**: Estimates the probability that text is garbled from its character frequencies
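
Rules 1 and 2 can be sketched in a few lines. This is a minimal illustration, not the tool's detector: the function names, the choice of box characters (U+25A1 and U+25AF), and the 0.3 threshold are all assumptions.

```python
import unicodedata

REPLACEMENT_CHAR = "\ufffd"
# Assumed "box" glyphs: white square and white vertical rectangle.
SUSPICIOUS = {REPLACEMENT_CHAR, "\u25a1", "\u25af"}


def strip_control_chars(text):
    """Rule 2: drop non-printing control characters, keeping tab/newline/CR."""
    return "".join(
        ch for ch in text
        if ch in "\t\n\r" or unicodedata.category(ch) != "Cc"
    )


def garbled_ratio(text):
    """Rule 1: fraction of remaining characters that are replacement/box glyphs."""
    visible = strip_control_chars(text)
    if not visible:
        return 0.0
    return sum(1 for ch in visible if ch in SUSPICIOUS) / len(visible)


def looks_garbled(text, threshold=0.3):
    """Flag text whose suspicious-character ratio reaches the threshold."""
    return garbled_ratio(text) >= threshold


print(garbled_ratio("\ufffd" * 5))   # 1.0
print(looks_garbled("Sample Text"))  # False
```

A ratio rather than a simple presence check avoids flagging a long paragraph because of one stray replacement character.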
## Repair Strategies
### Minimal
- Only repairs obvious errors (replacement characters, null bytes)
- Maintains maximum integrity of original text
- Suitable for minor garbled text issues
### Standard
- Repairs common encoding issues
- Smart font replacement
- Balances repair rate and accuracy
### Aggressive
- Comprehensive text re-encoding
- Uses OCR-assisted recognition
- Suitable for severely garbled documents
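
As an illustration of one "common encoding issue", the sketch below undoes a single round of mis-decoding, where UTF-8 bytes were read as Latin-1. It is a generic technique shown under assumptions and is not taken from the tool's source; the `redecode` name and its default encodings are hypothetical.

```python
def redecode(text, wrong="latin-1", right="utf-8"):
    """Undo one round of mis-decoding; return the input unchanged on failure."""
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not produced by this particular mix-up, so leave it alone.
        return text


# Simulate the damage: UTF-8 bytes read as Latin-1.
mojibake = "café".encode("utf-8").decode("latin-1")
print(redecode(mojibake))  # café
```

Returning the input unchanged on failure matches the spirit of the Minimal level: text that does not fit the assumed mix-up is never altered.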
## Examples
### Fix Single Page PDF
**Input**:
```bash
python scripts/main.py --input report.pdf --output fixed_report.pdf
```
**Output**:
```
✓ Processing: report.pdf
✓ Detected 5 garbled text blocks
✓ Fixed 4 blocks automatically
⚠ 1 block requires manual review
✓ Output saved: fixed_report.pdf
✓ Report saved: fixed_report_repair_log.json
```
### Export Editable JSON
**Input**:
```bash
python scripts/main.py --input diagram.svg --export-json editable.json
```
**Output JSON Structure**:
```json
{
"file_type": "svg",
"svg_info": {
"width": 800,
"height": 600,
"viewBox": "0 0 800 600"
},
"text_elements": [
{
"id": "text_1",
"x": 100,
"y": 200,
"font_family": "Arial",
"font_size": 14,
"original": "�����",
"user_editable": "",
"confidence": 0.25
}
]
}
```
## Dependencies
```
pdfplumber>=0.10.0 # PDF parsing
PyMuPDF>=1.23.0 # PDF processing (fitz)
cairosvg>=2.7.0 # SVG conversion
beautifulsoup4>=4.12.0 # SVG parsing
fonttools>=4.40.0 # Font processing
chardet>=5.0.0 # Encoding detection
Pillow>=10.0.0 # Image processing
```
## Limitations
- Encrypted PDFs must be unlocked with their password before processing
- Severely damaged vector files may not be fully repairable
- Some rare fonts may not map correctly
- Scanned PDFs must be run through OCR first
## Version Information
- **Version**: 1.0.0
- **Last Updated**: 2026-02-06
- **Status**: Ready for use
## Risk Assessment
| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
## Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
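
The two path-related items in the checklist (no `../` traversal, output restricted to the workspace) could be checked along these lines. This is a sketch, not code from the tool: `resolve_inside` is a hypothetical helper, and the workspace root is whatever directory the caller passes in.

```python
from pathlib import Path


def resolve_inside(workspace, user_path):
    """Resolve user_path against workspace; raise ValueError if it escapes."""
    root = Path(workspace).resolve()
    # Joining with an absolute user_path discards root, so that case is caught too.
    candidate = (root / user_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"path escapes workspace: {user_path}")
    return candidate
```

Resolving both paths before comparing them means that `..` segments and symlinks are normalized first, so a string check for `../` is not needed.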
## Prerequisites
```bash
# Python dependencies
pip install -r requirements.txt
```
## Evaluation Criteria
### Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
### Test Cases
1. **Basic Functionality**: Standard input → Expected output
2. **Edge Case**: Invalid input → Graceful error handling
3. **Performance**: Large dataset → Acceptable processing time
## Lifecycle Status
- **Current Stage**: Draft
- **Next Review Date**: 2026-03-06
- **Known Issues**: None
- **Planned Improvements**:
  - Performance optimization
  - Additional feature support