Abstract:
Blind Image Quality Assessment (BIQA) aims to emulate human judgment of image distortion levels and to provide quality scores. However, existing unimodal BIQA methods have limited representational ability when faced with complex content and diverse distortion types, and their predicted scores come without explanatory descriptions, which undermines the credibility of the predictions. To address these challenges, we propose eXplainable Blind Image Quality Assessment (xBIQA), a BIQA method guided by a Large Language Model (LLM). Our method uses the image's distortion and overall descriptions to generate global quality text, and produces local quality text that describes specific regions in detail. The global text, local text, and prompts are then jointly fed into an LLM to generate detailed semantic features. We then align and fuse the text semantic features with the image texture features and regress the image quality score, while also outputting a corresponding explanatory quality description. Compared with traditional BIQA methods based on a single image modality, our approach demonstrates that LLMs can effectively produce text descriptions highly correlated with image quality, thereby enhancing the performance of multimodal BIQA models. Experimental results show that xBIQA achieves the best performance on the KonIQ-10k and LIVE Challenge datasets, improving the SRCC metric by 1.64% and 2.60%, respectively.
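The results above are reported in SRCC. As a minimal sketch, assuming SRCC denotes the Spearman rank-order correlation coefficient between predicted quality scores and subjective scores, it can be computed as the Pearson correlation of the two rank sequences. The function names `average_ranks` and `srcc` and the sample scores are illustrative assumptions, not taken from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def srcc(predicted, subjective):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    if len(predicted) != len(subjective) or len(predicted) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx = average_ranks(predicted)
    ry = average_ranks(subjective)
    n = len(rx)
    mean_x = sum(rx) / n
    mean_y = sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical model outputs and subjective scores, for illustration only.
predicted_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
subjective_scores = [2.0, 1.0, 4.0, 3.0, 5.0]
print(round(srcc(predicted_scores, subjective_scores), 4))
```

Because SRCC depends only on rank order, any monotonic rescaling of the predicted scores leaves it unchanged, which is why it suits comparing models whose outputs lie on different scales.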