Parsing PDFs

PDFs are the most common input to the Bitparse API. The parser handles single-page and multi-page documents up to 2,000 pages.

How It Works

1. You upload a PDF via POST /parse.
2. Bitparse splits the document into individual pages.
3. Each page is analyzed for text, tables, equations, figures, and charts.
4. The response contains one entry in the pages array per page, each with enriched XML.

Credit Cost

Each page in a PDF costs 1 credit. A 12-page PDF costs 12 credits. Credits are deducted when processing completes successfully.

If your account does not have enough credits for the entire document, the API returns 402 Payment Required before processing begins:

{
  "error": "insufficient credits: need 12, have 2"
}
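Client code can report the shortfall by reading the two numbers out of the error string. The sketch below assumes the message always follows the `insufficient credits: need N, have M` pattern seen in the example above, which this page does not guarantee. `parse_credit_shortfall` is a hypothetical helper, not part of any Bitparse SDK.

```python
import re
from typing import Optional, Tuple

# Pattern inferred from the single example message above (an assumption).
_SHORTFALL = re.compile(r"insufficient credits: need (\d+), have (\d+)")


def parse_credit_shortfall(error_message: str) -> Optional[Tuple[int, int]]:
    """Return (needed, available) if the message matches, else None."""
    match = _SHORTFALL.search(error_message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Example with the 402 body shown above.
body = {"error": "insufficient credits: need 12, have 2"}
shortfall = parse_credit_shortfall(body["error"])
if shortfall is not None:
    needed, available = shortfall
    print(f"Need {needed - available} more credits")
```

If the message does not match, the helper returns None, so callers can fall back to showing the raw error text.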

Limits

Limit	Value
Max file size	10 MB
Max pages	2,000
Timeout	Up to 25 minutes

Large PDFs (100+ pages) may take several minutes to process. Set your HTTP client timeout high enough to cover the 25-minute limit so the request is not cut off on the client side.
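The limits in the table can be checked locally before uploading, which avoids sending a request that is likely to be rejected. This is a minimal sketch: `limit_violations` is a hypothetical helper, the page count is assumed to come from whatever PDF library you already use, and "10 MB" is assumed to mean 10 × 1024 × 1024 bytes (the page does not say whether the limit is binary or decimal).

```python
from typing import List

MAX_FILE_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" interpreted as 10 MiB
MAX_PAGES = 2000


def limit_violations(size_bytes: int, page_count: int) -> List[str]:
    """Return human-readable limit violations; an empty list means OK."""
    problems = []
    if size_bytes > MAX_FILE_BYTES:
        problems.append(
            f"file is {size_bytes} bytes, limit is {MAX_FILE_BYTES}"
        )
    if page_count > MAX_PAGES:
        problems.append(
            f"document has {page_count} pages, limit is {MAX_PAGES}"
        )
    return problems


# A 5 MB, 12-page document passes both checks.
print(limit_violations(5_000_000, 12))
```

For a file on disk, `os.path.getsize(path)` supplies `size_bytes`.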

Example

curl -X POST https://api.bitparse.ai/parse \
  -H "X-API-Key: bp_YOUR_API_KEY" \
  -F "file=@research-paper.pdf"

import requests

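# Open the file in a context manager so the handle is closed after the upload.
with open("research-paper.pdf", "rb") as pdf:
    resp = requests.post(
        "https://api.bitparse.ai/parse",
        headers={"X-API-Key": "bp_YOUR_API_KEY"},
        files={"file": pdf},
        timeout=1500,  # seconds (25 minutes) for large docs
    )

# Raise on 4xx/5xx responses, such as 402 when credits are insufficient.
resp.raise_for_status()

data = resp.json()
print(f"Processed {data['total_pages']} pages")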

Multi-Page Response

{
  "pages": [
    {
      "page_number": 1,
      "text": "<title id=\"page1_elem0\">Introduction to Machine Learning</title>\n\n<text id=\"page1_elem1\">Machine learning is a subset of artificial intelligence that enables systems to learn from data...</text>",
      "elements": [
        {"id": "page1_elem0", "type": "title", "content": "Introduction to Machine Learning", "image_data": ""},
        {"id": "page1_elem1", "type": "text", "content": "Machine learning is a subset of artificial intelligence that enables systems to learn from data...", "image_data": ""}
      ]
    },
    {
      "page_number": 2,
      "text": "<sub_title id=\"page2_elem0\">Supervised Learning</sub_title>\n\n<text id=\"page2_elem1\">In supervised learning, the model trains on labeled examples...</text>\n\n<equation id=\"page2_elem2\">\\[\\hat{y} = \\sigma(Wx + b)\\]</equation>\n\n<table id=\"page2_elem3\">\n| Algorithm       | Use Case         |\n|-----------------|------------------|\n| Linear Reg.     | Continuous output|\n| Random Forest   | Classification   |\n| SVM             | High-dim data    |\n</table>",
      "elements": [
        {"id": "page2_elem0", "type": "sub_title", "content": "Supervised Learning", "image_data": ""},
        {"id": "page2_elem1", "type": "text", "content": "In supervised learning, the model trains on labeled examples...", "image_data": ""},
        {"id": "page2_elem2", "type": "equation", "content": "\\[\\hat{y} = \\sigma(Wx + b)\\]", "image_data": ""},
        {"id": "page2_elem3", "type": "table", "content": "| Algorithm       | Use Case         |\n|-----------------|------------------|\n| Linear Reg.     | Continuous output|\n| Random Forest   | Classification   |\n| SVM             | High-dim data    |", "image_data": ""}
      ]
    },
    {
      "page_number": 3,
      "text": "<sub_title id=\"page3_elem0\">Results</sub_title>\n\n<text id=\"page3_elem1\">Figure 1 shows the training loss over 50 epochs...</text>\n\n<figure id=\"page3_elem2\"/>\n\n<chart id=\"page3_elem3\"/>",
      "elements": [
        {"id": "page3_elem0", "type": "sub_title", "content": "Results", "image_data": ""},
        {"id": "page3_elem1", "type": "text", "content": "Figure 1 shows the training loss over 50 epochs...", "image_data": ""},
        {"id": "page3_elem2", "type": "figure", "content": "", "image_data": "iVBORw0KGgoAAAANSUhEUg..."},
        {"id": "page3_elem3", "type": "chart", "content": "", "image_data": "iVBORw0KGgoAAAANSUhEUg..."}
      ]
    }
  ],
  "total_pages": 3,
  "processing_time_ms": 4512,
  "credits_used": 3,
  "credits_remaining": 497
}
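A response shaped like the one above can be walked page by page to count element types and pull out embedded images. The sketch below assumes that a non-empty `image_data` value is a base64-encoded image (the truncated sample values look like the start of a base64 PNG, but this page does not state the encoding). `summarize_elements` is a hypothetical helper name.

```python
import base64
from collections import Counter
from typing import Dict, Tuple


def summarize_elements(response: dict) -> Tuple[Counter, Dict[str, bytes]]:
    """Count elements by type and decode any non-empty image_data."""
    counts: Counter = Counter()
    images: Dict[str, bytes] = {}
    for page in response["pages"]:
        for element in page["elements"]:
            counts[element["type"]] += 1
            if element["image_data"]:
                # Assumption: image_data holds base64-encoded image bytes.
                images[element["id"]] = base64.b64decode(element["image_data"])
    return counts, images


# Small stand-in response using the same field names as the example above.
png_header = b"\x89PNG\r\n\x1a\n"
sample = {
    "pages": [
        {
            "page_number": 1,
            "elements": [
                {"id": "page1_elem0", "type": "title",
                 "content": "Intro", "image_data": ""},
                {"id": "page1_elem1", "type": "figure", "content": "",
                 "image_data": base64.b64encode(png_header).decode("ascii")},
            ],
        }
    ]
}

counts, images = summarize_elements(sample)
print(dict(counts), sorted(images))
```

Keying the decoded bytes by element id keeps them linked to the matching tag in the page's text field, such as `<figure id="page3_elem2"/>`.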

Tips

Long timeouts: Set your HTTP client timeout to at least 25 minutes for documents over 100 pages.
Check credits first: Compare credits_remaining from your most recent response against the expected page count to avoid 402 errors.
Page ordering: Pages are always returned in document order. The page_number field is 1-indexed.
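The credit check can be sketched as a small function, using the rate of 1 credit per page given under Credit Cost. `credit_shortfall` is a hypothetical helper, and the balance is assumed to be the `credits_remaining` value saved from an earlier response, so it can be stale if other requests have run since.

```python
CREDITS_PER_PAGE = 1  # per the Credit Cost section


def credit_shortfall(credits_remaining: int, page_count: int) -> int:
    """Return how many credits are missing (0 if the balance is enough)."""
    required = page_count * CREDITS_PER_PAGE
    return max(0, required - credits_remaining)


# With 497 credits left, a 12-page document is covered.
print(credit_shortfall(497, 12))
# With 2 credits left, the same document would be rejected.
print(credit_shortfall(2, 12))
```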