Presentation Master's Thesis - Marek Lazár - PML

Last modified on 18-09-2025 13:53
Evaluating the Impact of Surrounding Textual Context on Chart Understanding and Retrieval with Vision-Language Models
Start date: 22-09-2025 14:30
End date: 22-09-2025 15:30
Location: Roeterseilandcampus, Building G, Nieuwe Achtergracht 129-B, room GS.01

Due to limited room capacity, attendance is on a first-come, first-served basis; this also applies to teachers.

Chart and infographic interpretation remains a persistent challenge for multimodal large language models (MLLMs), especially in applied settings such as multimodal retrieval-augmented generation (RAG), where knowledge bases include complex visualizations. This thesis investigates whether incorporating surrounding textual context enhances model performance, through two complementary investigations.

First, 61 charts were extracted from open-access research articles spanning diverse domains, and each chart was paired with its surrounding textual context. A high-capacity MLLM generated interpretations both with and without context, and a blinded human expert rated each output on a 7-point Likert scale across accuracy, clarity, relevance, and completeness. Interpretations with context were preferred in 45 of 61 cases and scored significantly higher on accuracy (M = 6.59 vs. 6.05, W = 88.00, p < .001), relevance (M = 6.82 vs. 6.56, W = 77.00, p = .03), and completeness (M = 6.80 vs. 6.15, W = 43.50, p < .001), though clarity differences were not significant.
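The W statistics reported above suggest a paired Wilcoxon signed-rank test on the two sets of ratings. A minimal sketch of such a comparison is below; the Likert values are illustrative placeholders, not the study's data:

```python
from scipy.stats import wilcoxon

# Hypothetical paired 7-point Likert ratings for the same charts,
# interpreted with and without surrounding context (placeholder values).
with_context    = [7, 6, 7, 5, 7, 6, 6, 7, 6, 7]
without_context = [6, 6, 5, 6, 6, 7, 5, 6, 6, 6]

# Wilcoxon signed-rank test on the paired differences; zero differences
# are dropped under the default zero_method="wilcox".
stat, p = wilcoxon(with_context, without_context)
print(f"W = {stat:.2f}, p = {p:.4f}")
```

A paired nonparametric test is appropriate here because Likert ratings are ordinal and each chart contributes one rating under each condition.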

In a follow-up investigation, the same charts were interpreted by a smaller vision-language model, and the resulting narratives were embedded in a vector store to simulate retrieval in knowledge bases. Manually generated queries were used to evaluate retrieval performance independently of the model outputs. Interpretations with context achieved slightly higher cosine similarity (M = 0.34 vs. 0.31) and Top-1 hit rate (93.4% vs. 90.2%), though differences were not statistically significant (W = 785.00, p = .25). These results suggest that context may provide modest retrieval benefits even for lightweight models, supporting efficient, context-aware design for resource-constrained multimodal systems.
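The retrieval evaluation described above (embed interpretations, match queries by cosine similarity, score Top-1 hits) can be sketched as follows. The vectors are toy stand-ins for real sentence embeddings, and all identifiers are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings standing in for encoded chart interpretations.
corpus = {
    "chart_a": [0.9, 0.1, 0.0],
    "chart_b": [0.1, 0.8, 0.2],
    "chart_c": [0.0, 0.2, 0.9],
}

# Each query maps to the chart it should retrieve (its gold label).
queries = {
    "q1": ([0.8, 0.2, 0.1], "chart_a"),
    "q2": ([0.0, 0.1, 1.0], "chart_c"),
}

# Top-1 hit rate: fraction of queries whose most similar chart is the gold one.
hits = 0
for _, (vec, gold) in queries.items():
    top1 = max(corpus, key=lambda cid: cosine(vec, corpus[cid]))
    hits += (top1 == gold)
print(f"Top-1 hit rate: {hits / len(queries):.1%}")
```

In the study itself the embeddings would come from an embedding model over the generated narratives, and the queries were written manually rather than constructed from the corpus.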

Together, the investigations demonstrate how context can bridge the gap between model capability and applied utility, informing both research and deployment of multimodal AI systems.