aseembits93
Contributor

@aseembits93 aseembits93 commented Oct 1, 2025

📄 88% (0.88x) speedup for outputs_to_objects in unstructured_inference/models/tables.py

⏱️ Runtime : 19.7 milliseconds → 10.5 milliseconds (best of 31 runs)
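
The headline figure can be reproduced from the two runtimes, assuming (the report does not state this explicitly) that the speedup is measured relative to the optimized runtime. `speedup_percent` is a hypothetical helper written only for this illustration:

```python
def speedup_percent(original: float, optimized: float) -> float:
    """Relative speedup as a percentage of the optimized runtime (assumed convention)."""
    return (original - optimized) / optimized * 100


# Runtimes in milliseconds, taken from the report above.
print(f"{speedup_percent(19.7, 10.5):.1f}%")  # prints 87.6%
```

Under that assumption the result rounds to the quoted 88%, and the same formula approximately reproduces the per-test figures (for example 687μs versus 444μs gives about 54.7%).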

📝 Explanation and details

The optimized code achieves an 88% speedup through four optimizations:

1. Eliminated redundant list conversions and element-wise operations

  • Original: list(m.indices.detach().cpu().numpy())[0] creates an intermediate list
  • Optimized: Direct numpy array access m.indices.detach().cpu().numpy()[0]
  • Original: List comprehension [elem.tolist() for elem in rescale_bboxes(...)] calls .tolist() on each bbox individually
  • Optimized: Single .tolist() call after all tensor operations: rescaled.tolist()
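
A minimal sketch of why both changes preserve the result. The tensors are illustrative stand-ins (not values from the model), with `indices` playing the role of `m.indices`:

```python
import torch

# Stand-in tensors; the values are illustrative, not taken from the model.
indices = torch.tensor([[1, 2]])                     # shape [1, N]
boxes = torch.tensor([[40.0, 80.0, 60.0, 120.0],
                      [25.0, 130.0, 35.0, 150.0]])   # shape [N, 4]

# Original style: wrap the array in a list, then take the first row.
labels_via_list = list(indices.detach().cpu().numpy())[0]
# Optimized style: index the array directly.
labels_direct = indices.detach().cpu().numpy()[0]

# Original style: one .tolist() call per box.
boxes_per_row = [row.tolist() for row in boxes]
# Optimized style: a single .tolist() call on the whole tensor.
boxes_single_call = boxes.tolist()

assert (labels_via_list == labels_direct).all()
assert boxes_per_row == boxes_single_call
```

Both styles yield the same labels and the same nested list of floats; the optimized form simply skips the intermediate Python objects.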

2. Vectorized padding adjustment

  • Original: Per-element subtraction [float(elem) - shift_size for elem in bbox] in Python loop
  • Optimized: Tensor-wide subtraction rescaled = rescaled - pad before conversion to list
  • This leverages PyTorch's optimized C++ backend instead of Python loops
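
The equivalence of the two padding adjustments can be sketched as follows. The padding value and box coordinates are made up for illustration, and the variable names are assumptions rather than the PR's exact code:

```python
import torch

pad = 25.0  # illustrative padding value, not taken from the PR
rescaled = torch.tensor([[65.0, 105.0, 85.0, 145.0],
                         [50.0, 155.0, 60.0, 175.0]])

# Original style: subtract the padding element by element in Python.
looped = [[float(v) - pad for v in bbox] for bbox in rescaled.tolist()]

# Optimized style: one tensor-wide subtraction, then a single conversion.
vectorized = (rescaled - pad).tolist()

assert looped == vectorized
```

The vectorized form does the arithmetic once in PyTorch, so its cost grows far more slowly with the number of boxes than the nested Python loop.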

3. Reduced function call overhead

  • Original: objects.append() performs attribute lookup on each iteration
  • Optimized: append = objects.append caches the method reference, eliminating repeated lookups
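
A sketch of the cached-method pattern in a loop shaped like the one this function plausibly runs. `build_objects` is a hypothetical stand-in, not the library's actual code; only the `label`, `score`, and `bbox` keys and the "no object" filter come from the tests in this PR:

```python
def build_objects(labels, scores, bboxes, class_idx2name):
    """Collect detections, skipping the "no object" class (illustrative only)."""
    objects = []
    append = objects.append  # look up the bound method once, outside the loop
    for label, score, bbox in zip(labels, scores, bboxes):
        name = class_idx2name[int(label)]
        if name != "no object":
            append({"label": name, "score": float(score), "bbox": bbox})
    return objects


mapping = {0: "no object", 1: "table", 2: "cell"}
objs = build_objects([1, 0, 2], [0.9, 0.8, 0.7], [[0.0, 0.0, 1.0, 1.0]] * 3, mapping)
print([o["label"] for o in objs])  # prints ['table', 'cell']
```

The saving per iteration is small, which is consistent with the modest gains reported for small inputs.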

4. GPU tensor optimization

  • Added device=out_bbox.device parameter to torch.tensor() creation to avoid potential device transfer overhead
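
A sketch of a rescaling helper that creates its scale tensor on the boxes' device. The function name and body are assumptions made for illustration; the PR description only states that `device=out_bbox.device` was added to the `torch.tensor()` call:

```python
import torch


def rescale_bboxes_sketch(out_bbox: torch.Tensor, size: tuple) -> torch.Tensor:
    """Convert normalized (cx, cy, w, h) boxes to absolute (x0, y0, x1, y1)."""
    img_w, img_h = size
    cx, cy, w, h = out_bbox.unbind(-1)
    corners = torch.stack(
        [cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h], dim=-1
    )
    # Create the scale factors directly on the boxes' device instead of
    # relying on the default device.
    scale = torch.tensor(
        [img_w, img_h, img_w, img_h], dtype=torch.float32, device=out_bbox.device
    )
    return corners * scale


boxes = torch.tensor([[0.5, 0.5, 0.2, 0.2]])
print(rescale_bboxes_sketch(boxes, (100, 200)))
```

For this input the result is approximately `[40, 80, 60, 120]`, matching the expectation noted in `test_single_object_basic`.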

Test case performance patterns:

  • Small cases (a handful of objects): mostly 5-8% improvement from reduced overhead, and 2-3% on empty inputs
  • Large cases (500-1000 objects): 160-200% improvement due to vectorized operations scaling much better than element-wise Python loops
  • Mixed workloads: Consistent improvements across all scenarios, with larger gains when more objects need processing

The optimization is particularly effective for table detection models that typically process many bounding boxes simultaneously.

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 27 Passed |
| 🌀 Generated Regression Tests | 28 Passed |
| ⏪ Replay Tests | 2 Passed |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| models/test_tables.py::test_padded_results_has_right_dimensions | 687μs | 444μs | 54.8%✅ |

🌀 Generated Regression Tests and Runtime
from typing import Mapping, Tuple

# imports
import pytest  # used for our unit tests
import torch
from transformers.models.table_transformer.modeling_table_transformer import \
    TableTransformerObjectDetectionOutput
from unstructured_inference.models.tables import outputs_to_objects


# Helper to construct TableTransformerObjectDetectionOutput
def make_outputs(logits, pred_boxes, pad_for_structure_detection=None):
    d = {
        "logits": logits,
        "pred_boxes": pred_boxes,
    }
    if pad_for_structure_detection is not None:
        d["pad_for_structure_detection"] = pad_for_structure_detection
    return TableTransformerObjectDetectionOutput(**d)


# Basic test cases

def test_single_object_basic():
    # One object, label 1, score high, bounding box in center
    logits = torch.tensor([[[0.1, 0.9, 0.0]]])  # shape [1, 1, 3]
    pred_boxes = torch.tensor([[[0.5, 0.5, 0.2, 0.2]]])  # shape [1, 1, 4]
    img_size = (100, 200)
    class_idx2name = {0: "no object", 1: "table", 2: "cell"}
    outputs = make_outputs(logits, pred_boxes)
    codeflash_output = outputs_to_objects(outputs, img_size, class_idx2name); objs = codeflash_output # 243μs -> 226μs (7.30% faster)
    # bbox should be roughly [40, 80, 60, 120] (centered, 20x40 size)
    bbox = objs[0]["bbox"]

def test_multiple_objects_basic():
    # Two objects, different labels
    logits = torch.tensor([[[0.1, 0.9, 0.0], [0.05, 0.05, 0.9]]])  # [1, 2, 3]
    pred_boxes = torch.tensor([[[0.5, 0.5, 0.2, 0.2], [0.3, 0.7, 0.1, 0.1]]])  # [1, 2, 4]
    img_size = (100, 200)
    class_idx2name = {0: "no object", 1: "table", 2: "cell"}
    outputs = make_outputs(logits, pred_boxes)
    codeflash_output = outputs_to_objects(outputs, img_size, class_idx2name); objs = codeflash_output # 292μs -> 274μs (6.43% faster)
    # Scores should be correct
    scores = [o["score"] for o in objs]

def test_no_object_class_filtered():
    # One object, but label is "no object"
    logits = torch.tensor([[[0.99, 0.005, 0.005]]])  # [1, 1, 3]
    pred_boxes = torch.tensor([[[0.5, 0.5, 0.2, 0.2]]])  # [1, 1, 4]
    img_size = (100, 200)
    class_idx2name = {0: "no object", 1: "table", 2: "cell"}
    outputs = make_outputs(logits, pred_boxes)
    codeflash_output = outputs_to_objects(outputs, img_size, class_idx2name); objs = codeflash_output # 214μs -> 202μs (6.01% faster)


def test_empty_logits_and_boxes():
    # No objects at all
    logits = torch.empty((1, 0, 3))
    pred_boxes = torch.empty((1, 0, 4))
    img_size = (100, 100)
    class_idx2name = {0: "no object", 1: "table"}
    outputs = make_outputs(logits, pred_boxes)
    codeflash_output = outputs_to_objects(outputs, img_size, class_idx2name); objs = codeflash_output # 182μs -> 178μs (2.03% faster)

def test_all_no_object():
    # Multiple objects, all labeled "no object"
    logits = torch.tensor([[[0.99, 0.005], [0.98, 0.01]]])  # [1, 2, 2]
    pred_boxes = torch.tensor([[[0.5, 0.5, 0.2, 0.2], [0.4, 0.6, 0.1, 0.1]]])  # [1, 2, 4]
    img_size = (100, 100)
    class_idx2name = {0: "no object", 1: "table"}
    outputs = make_outputs(logits, pred_boxes)
    codeflash_output = outputs_to_objects(outputs, img_size, class_idx2name); objs = codeflash_output # 278μs -> 263μs (5.56% faster)

def test_non_integer_class_labels():
    # Class labels mapping is not sequential
    logits = torch.tensor([[[0.1, 0.9, 0.0], [0.05, 0.05, 0.9]]])  # [1, 2, 3]
    pred_boxes = torch.tensor([[[0.5, 0.5, 0.2, 0.2], [0.3, 0.7, 0.1, 0.1]]])  # [1, 2, 4]
    img_size = (100, 200)
    # The predicted label is the argmax over the logits (here 1 and 2), so the
    # mapping must be keyed by those indices; a sparse mapping such as
    # {0: ..., 5: ..., 7: ...} would lack keys 1 and 2.
    class_idx2name = {0: "no object", 1: "table", 2: "cell"}
    outputs = make_outputs(logits, pred_boxes)
    codeflash_output = outputs_to_objects(outputs, img_size, class_idx2name); objs = codeflash_output # 225μs -> 213μs (5.68% faster)

def test_bounding_box_extreme_values():
    # Bounding box at edge of image
    logits = torch.tensor([[[0.05, 0.95, 0.0]]])  # [1, 1, 3]
    pred_boxes = torch.tensor([[[0.0, 0.0, 0.1, 0.1]]])  # [1, 1, 4], top-left corner
    img_size = (100, 100)
    class_idx2name = {0: "no object", 1: "table", 2: "cell"}
    outputs = make_outputs(logits, pred_boxes)
    codeflash_output = outputs_to_objects(outputs, img_size, class_idx2name); objs = codeflash_output # 213μs -> 203μs (4.87% faster)
    bbox = objs[0]["bbox"]

def test_bounding_box_out_of_bounds():
    # Bounding box with values > 1, should be scaled
    logits = torch.tensor([[[0.05, 0.95, 0.0]]])  # [1, 1, 3]
    pred_boxes = torch.tensor([[[1.2, 1.2, 0.5, 0.5]]])  # [1, 1, 4], out of bounds
    img_size = (100, 100)
    class_idx2name = {0: "no object", 1: "table", 2: "cell"}
    outputs = make_outputs(logits, pred_boxes)
    codeflash_output = outputs_to_objects(outputs, img_size, class_idx2name); objs = codeflash_output # 213μs -> 201μs (5.97% faster)
    bbox = objs[0]["bbox"]

def test_logits_with_ties():
    # Two classes have same probability, should pick the first max
    logits = torch.tensor([[[0.5, 0.5, 0.0]]])  # [1, 1, 3]
    pred_boxes = torch.tensor([[[0.5, 0.5, 0.2, 0.2]]])  # [1, 1, 4]
    img_size = (100, 100)
    class_idx2name = {0: "no object", 1: "table", 2: "cell"}
    outputs = make_outputs(logits, pred_boxes)
    codeflash_output = outputs_to_objects(outputs, img_size, class_idx2name); objs = codeflash_output # 211μs -> 200μs (5.18% faster)

def test_logits_all_zero():
    # All logits zero, softmax will be uniform
    logits = torch.zeros((1, 1, 3))
    pred_boxes = torch.tensor([[[0.5, 0.5, 0.2, 0.2]]])
    img_size = (100, 100)
    class_idx2name = {0: "no object", 1: "table", 2: "cell"}
    outputs = make_outputs(logits, pred_boxes)
    codeflash_output = outputs_to_objects(outputs, img_size, class_idx2name); objs = codeflash_output # 211μs -> 201μs (5.15% faster)

def test_logits_negative_values():
    # Negative logits, softmax still works
    logits = torch.tensor([[[0.0, -1.0, -2.0]]])
    pred_boxes = torch.tensor([[[0.5, 0.5, 0.2, 0.2]]])
    img_size = (100, 100)
    class_idx2name = {0: "no object", 1: "table", 2: "cell"}
    outputs = make_outputs(logits, pred_boxes)
    codeflash_output = outputs_to_objects(outputs, img_size, class_idx2name); objs = codeflash_output # 211μs -> 201μs (4.89% faster)


def test_large_number_of_objects():
    # 1000 objects, all labeled "table"
    n = 1000
    logits = torch.zeros((1, n, 2))
    logits[0, :, 1] = 100.0  # High score for "table"
    pred_boxes = torch.rand((1, n, 4))
    img_size = (100, 100)
    class_idx2name = {0: "no object", 1: "table"}
    outputs = make_outputs(logits, pred_boxes)
    codeflash_output = outputs_to_objects(outputs, img_size, class_idx2name); objs = codeflash_output # 2.75ms -> 940μs (192% faster)
    # Every bbox should be a list of 4 floats
    for o in objs:
        assert len(o["bbox"]) == 4
        assert all(isinstance(v, float) for v in o["bbox"])

def test_large_image_size():
    # Large image size, bounding boxes should scale accordingly
    logits = torch.tensor([[[0.1, 0.9]]])
    pred_boxes = torch.tensor([[[0.5, 0.5, 0.2, 0.2]]])
    img_size = (10000, 20000)
    class_idx2name = {0: "no object", 1: "table"}
    outputs = make_outputs(logits, pred_boxes)
    codeflash_output = outputs_to_objects(outputs, img_size, class_idx2name); objs = codeflash_output # 215μs -> 205μs (4.90% faster)
    bbox = objs[0]["bbox"]


def test_large_number_of_no_object():
    # 1000 objects, all "no object"
    n = 1000
    logits = torch.zeros((1, n, 2))
    logits[0, :, 0] = 100.0  # High score for "no object"
    pred_boxes = torch.rand((1, n, 4))
    img_size = (100, 100)
    class_idx2name = {0: "no object", 1: "table"}
    outputs = make_outputs(logits, pred_boxes)
    codeflash_output = outputs_to_objects(outputs, img_size, class_idx2name); objs = codeflash_output # 2.15ms -> 824μs (161% faster)

def test_large_mixed_objects():
    # 500 "table", 500 "no object"
    n = 1000
    logits = torch.zeros((1, n, 2))
    logits[0, :500, 1] = 100.0  # "table"
    logits[0, 500:, 0] = 100.0  # "no object"
    pred_boxes = torch.rand((1, n, 4))
    img_size = (100, 100)
    class_idx2name = {0: "no object", 1: "table"}
    outputs = make_outputs(logits, pred_boxes)
    codeflash_output = outputs_to_objects(outputs, img_size, class_idx2name); objs = codeflash_output # 2.38ms -> 811μs (194% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from typing import Mapping, Tuple

# imports
import pytest  # used for our unit tests
import torch
from transformers.models.table_transformer.modeling_table_transformer import \
    TableTransformerObjectDetectionOutput
from unstructured_inference.models.tables import outputs_to_objects

# unit tests

# Helper to build dummy TableTransformerObjectDetectionOutput
def make_outputs(logits, pred_boxes, pad_for_structure_detection=None):
    out = {
        "logits": logits,
        "pred_boxes": pred_boxes,
    }
    if pad_for_structure_detection is not None:
        out["pad_for_structure_detection"] = pad_for_structure_detection
    return TableTransformerObjectDetectionOutput(**out)

# Basic class mapping
CLASS_IDX2NAME = {
    0: "no object",
    1: "table",
    2: "cell",
    3: "row",
    4: "column",
}

# 1. Basic Test Cases

def test_single_object_detected():
    # One object, label 'table'
    logits = torch.tensor([[[0.1, 0.8, 0.05, 0.03, 0.02]]])  # shape [1,1,5]
    pred_boxes = torch.tensor([[[0.5, 0.5, 0.2, 0.2]]])      # shape [1,1,4]
    outputs = make_outputs(logits, pred_boxes)
    img_size = (100, 200)
    codeflash_output = outputs_to_objects(outputs, img_size, CLASS_IDX2NAME); result = codeflash_output # 217μs -> 203μs (7.13% faster)
    # Check bbox is in correct format
    bbox = result[0]["bbox"]
    # Check bbox values are within image bounds
    for v in bbox:
        pass

def test_multiple_objects_detected():
    # Two objects, 'table' and 'cell'
    logits = torch.tensor([[
        [0.1, 0.8, 0.05, 0.03, 0.02],  # table
        [0.05, 0.1, 0.7, 0.1, 0.05],   # cell
    ]])
    pred_boxes = torch.tensor([[
        [0.5, 0.5, 0.2, 0.2],
        [0.3, 0.3, 0.1, 0.1],
    ]])
    outputs = make_outputs(logits, pred_boxes)
    img_size = (100, 200)
    codeflash_output = outputs_to_objects(outputs, img_size, CLASS_IDX2NAME); result = codeflash_output # 264μs -> 212μs (24.6% faster)
    labels = [obj["label"] for obj in result]

def test_no_object_detected():
    # All objects are 'no object'
    logits = torch.tensor([[
        [0.9, 0.025, 0.025, 0.025, 0.025],  # no object
        [0.99, 0.0025, 0.0025, 0.0025, 0.0025],  # no object
    ]])
    pred_boxes = torch.tensor([[
        [0.5, 0.5, 0.2, 0.2],
        [0.3, 0.3, 0.1, 0.1],
    ]])
    outputs = make_outputs(logits, pred_boxes)
    img_size = (100, 200)
    codeflash_output = outputs_to_objects(outputs, img_size, CLASS_IDX2NAME); result = codeflash_output # 225μs -> 210μs (7.00% faster)

def test_mixed_objects_and_no_object():
    # One 'table', one 'no object'
    logits = torch.tensor([[
        [0.1, 0.8, 0.05, 0.03, 0.02],  # table
        [0.9, 0.025, 0.025, 0.025, 0.025],  # no object
    ]])
    pred_boxes = torch.tensor([[
        [0.5, 0.5, 0.2, 0.2],
        [0.3, 0.3, 0.1, 0.1],
    ]])
    outputs = make_outputs(logits, pred_boxes)
    img_size = (100, 200)
    codeflash_output = outputs_to_objects(outputs, img_size, CLASS_IDX2NAME); result = codeflash_output # 224μs -> 211μs (5.99% faster)


def test_empty_logits_and_boxes():
    # No objects at all
    logits = torch.empty((1, 0, 5))
    pred_boxes = torch.empty((1, 0, 4))
    outputs = make_outputs(logits, pred_boxes)
    img_size = (100, 200)
    codeflash_output = outputs_to_objects(outputs, img_size, CLASS_IDX2NAME); result = codeflash_output # 178μs -> 172μs (3.06% faster)

def test_single_object_all_classes_equal_prob():
    # Single object, all classes have equal probability
    logits = torch.tensor([[[0.2, 0.2, 0.2, 0.2, 0.2]]])
    pred_boxes = torch.tensor([[[0.5, 0.5, 0.2, 0.2]]])
    outputs = make_outputs(logits, pred_boxes)
    img_size = (100, 200)
    # Should pick the first class (0: 'no object') due to argmax tie-break
    codeflash_output = outputs_to_objects(outputs, img_size, CLASS_IDX2NAME); result = codeflash_output # 215μs -> 205μs (4.95% faster)

def test_bbox_out_of_bounds():
    # Bbox coordinates outside [0,1]
    logits = torch.tensor([[[0.1, 0.8, 0.05, 0.03, 0.02]]])
    pred_boxes = torch.tensor([[[1.5, 1.5, 1.2, 1.2]]])
    outputs = make_outputs(logits, pred_boxes)
    img_size = (100, 200)
    codeflash_output = outputs_to_objects(outputs, img_size, CLASS_IDX2NAME); result = codeflash_output # 214μs -> 204μs (4.97% faster)
    # Bbox may be outside image bounds, but should still be computed
    bbox = result[0]["bbox"]



def test_multiple_objects_all_no_object():
    # Multiple objects, all 'no object'
    logits = torch.tensor([[
        [0.99, 0.0025, 0.0025, 0.0025, 0.0025],
        [0.99, 0.0025, 0.0025, 0.0025, 0.0025],
        [0.99, 0.0025, 0.0025, 0.0025, 0.0025],
    ]])
    pred_boxes = torch.tensor([[
        [0.5, 0.5, 0.2, 0.2],
        [0.3, 0.3, 0.1, 0.1],
        [0.7, 0.7, 0.2, 0.2],
    ]])
    outputs = make_outputs(logits, pred_boxes)
    img_size = (100, 200)
    codeflash_output = outputs_to_objects(outputs, img_size, CLASS_IDX2NAME); result = codeflash_output # 285μs -> 272μs (4.93% faster)

def test_non_integer_labels():
    # Labels come from an argmax and are always integer indices; this checks
    # that they still map cleanly through the class mapping
    logits = torch.tensor([[[0.1, 0.8, 0.05, 0.03, 0.02]]])
    pred_boxes = torch.tensor([[[0.5, 0.5, 0.2, 0.2]]])
    outputs = make_outputs(logits, pred_boxes)
    img_size = (100, 200)
    codeflash_output = outputs_to_objects(outputs, img_size, CLASS_IDX2NAME); result = codeflash_output # 222μs -> 207μs (7.59% faster)

# 3. Large Scale Test Cases

def test_many_objects_detected():
    # 500 objects, alternating 'table' and 'cell'
    num_objs = 500
    logits = torch.zeros((1, num_objs, 5))
    for i in range(num_objs):
        if i % 2 == 0:
            logits[0, i, 1] = 1.0  # 'table'
        else:
            logits[0, i, 2] = 1.0  # 'cell'
    pred_boxes = torch.rand((1, num_objs, 4))
    outputs = make_outputs(logits, pred_boxes)
    img_size = (100, 200)
    codeflash_output = outputs_to_objects(outputs, img_size, CLASS_IDX2NAME); result = codeflash_output # 1.53ms -> 577μs (165% faster)
    # Check correct label alternation
    for i, obj in enumerate(result):
        expected = "table" if i % 2 == 0 else "cell"

def test_large_image_size():
    # Large image size, single object
    logits = torch.tensor([[[0.1, 0.8, 0.05, 0.03, 0.02]]])
    pred_boxes = torch.tensor([[[0.5, 0.5, 0.2, 0.2]]])
    outputs = make_outputs(logits, pred_boxes)
    img_size = (10000, 20000)
    codeflash_output = outputs_to_objects(outputs, img_size, CLASS_IDX2NAME); result = codeflash_output # 216μs -> 203μs (6.21% faster)
    bbox = result[0]["bbox"]
    # Bbox values should be within image bounds
    for v in bbox:
        pass

def test_many_objects_all_no_object():
    # 500 objects, all 'no object'
    num_objs = 500
    logits = torch.zeros((1, num_objs, 5))
    logits[:, :, 0] = 1.0  # all 'no object'
    pred_boxes = torch.rand((1, num_objs, 4))
    outputs = make_outputs(logits, pred_boxes)
    img_size = (100, 200)
    codeflash_output = outputs_to_objects(outputs, img_size, CLASS_IDX2NAME); result = codeflash_output # 1.20ms -> 447μs (168% faster)


def test_maximum_tensor_size():
    # Despite the name, this test stays small: logits of shape [1, 1000, 5]
    # hold 5,000 float32 values (about 20 KB), far below a 100 MB budget.
    num_objs = 1_000
    logits = torch.zeros((1, num_objs, 5))
    logits[:, :, 1] = 1.0  # all 'table'
    pred_boxes = torch.rand((1, num_objs, 4))
    outputs = make_outputs(logits, pred_boxes)
    img_size = (100, 200)
    codeflash_output = outputs_to_objects(outputs, img_size, CLASS_IDX2NAME); result = codeflash_output # 2.73ms -> 914μs (198% faster)
    for obj in result:
        for v in obj["bbox"]:
            pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
⏪ Replay Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| test_pytest_test_unstructured_inference__replay_test_0.py::test_unstructured_inference_models_tables_outputs_to_objects | 1.27ms | 850μs | 49.2%✅ |

To edit these changes, git checkout codeflash/optimize-outputs_to_objects-metbo2xp and push.

Codeflash

codeflash-ai bot and others added 2 commits August 27, 2025 01:54
@aseembits93 aseembits93 changed the title from "⚡️ Speed up function zoom_image by 56%" to "⚡️ Speed up function outputs_to_objects by 88%" Oct 1, 2025
@qued
Contributor

qued commented Oct 8, 2025

Closing and handling merge in #443

@qued qued closed this Oct 8, 2025