January 3, 2025

Judging Image Safety with Multimodal Large Language Models: A Zero-Shot Approach

Multimodal Large Language Models as Judges for Image Safety: A New Approach without Human Labels

The safety of image content has become a central challenge with the rise of visual media on online platforms. Especially in the age of AI-generated content (AIGC), many image generation models can produce harmful material, including images with sexual or violent depictions. Identifying such unsafe images against established safety rules is therefore crucially important.

Pretrained Multimodal Large Language Models (MLLMs) offer potential for this task due to their strong pattern recognition capabilities. Previous approaches typically rely on finetuning MLLMs with human-labeled datasets, which comes with several disadvantages: annotating data according to complex guidelines is both costly and time-consuming, and safety guidelines frequently need to be updated, which makes finetuning on human annotations even harder to maintain.

This raises the research question: Can unsafe images be detected by querying MLLMs in a zero-shot setting using a predefined safety constitution (a set of safety rules)? Research shows that simply querying pretrained MLLMs does not yield satisfactory results. This lack of effectiveness is due to factors such as the subjectivity of safety rules, the complexity of extensive constitutions, and inherent biases in the models.
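
To make concrete what such a "simple query" looks like, the sketch below passes an illustrative constitution fragment and an image to a generic MLLM call in a single prompt. The helper `call_mllm` and the rules shown are assumptions for illustration, not the exact setup used in the paper.

```python
# Minimal sketch of a naive zero-shot baseline: the full safety constitution
# and the image are handed to an MLLM in a single query.
# `call_mllm` is a hypothetical stand-in for whatever multimodal chat API is
# available; the constitution is an illustrative fragment.

from typing import Callable

CONSTITUTION = """\
Rule 1: The image should not depict graphic violence or gore.
Rule 2: The image should not contain sexually explicit content.
Rule 3: The image should not show people being tortured or humiliated.
"""

def naive_zero_shot_judgment(
    image_path: str,
    call_mllm: Callable[[str, str], str],
) -> bool:
    """Return True if the MLLM labels the image unsafe under the constitution."""
    prompt = (
        "You are an image-safety judge. Safety constitution:\n"
        f"{CONSTITUTION}\n"
        "Does the attached image violate any rule above? Answer 'safe' or 'unsafe'."
    )
    answer = call_mllm(prompt, image_path)
    return answer.strip().lower().startswith("unsafe")
```

In practice, this naive setup tends to struggle exactly where the paper points: subjective rule wording, long constitutions, and model biases.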

A novel approach addresses these challenges with an MLLM-based method that objectifies safety rules, assesses the relevance between rules and images, and makes fast judgments based on debiased token probabilities over logically complete yet simplified precondition chains for each rule. When this quick check is inconclusive, deeper reasoning is carried out through cascaded chain-of-thought processes.
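
The sketch below outlines how such a pipeline could be wired together. The helper functions (`relevance_score`, `token_probability`, `chain_of_thought_judgment`), the thresholds, and the `Rule` data structure are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of the judging pipeline described above, with hypothetical helpers
# standing in for model calls. Thresholds and structure are illustrative.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rule:
    text: str                    # original, possibly subjective rule
    objectified: str             # rewritten into an objectively checkable form
    preconditions: List[str]     # simplified, logically complete precondition chain

def judge_image(
    image_path: str,
    rules: List[Rule],
    relevance_score: Callable[[str, str], float],         # how relevant a rule is to the image
    token_probability: Callable[[str, str], float],       # P(precondition holds | image), debiased
    chain_of_thought_judgment: Callable[[str, str], bool], # deeper cascaded CoT check
    relevance_threshold: float = 0.5,
    confident_low: float = 0.2,
    confident_high: float = 0.8,
) -> bool:
    """Return True if any relevant rule is judged violated."""
    for rule in rules:
        # Step 1: skip rules that are clearly irrelevant to this image.
        if relevance_score(rule.objectified, image_path) < relevance_threshold:
            continue
        # Step 2: quick judgment from token probabilities over the precondition
        # chain; all preconditions must hold for the rule to be violated.
        probs = [token_probability(p, image_path) for p in rule.preconditions]
        if all(p > confident_high for p in probs):
            return True                      # confidently unsafe under this rule
        if any(p < confident_low for p in probs):
            continue                         # confidently not violated
        # Step 3: uncertain case, fall back to deeper cascaded chain-of-thought.
        if chain_of_thought_judgment(rule.objectified, image_path):
            return True
    return False
```

The quick probability-based check handles the clear-cut cases cheaply, so the more expensive chain-of-thought reasoning is only invoked for borderline images.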

Experimental results show that this method is highly effective for zero-shot image safety assessments. By avoiding human labels and leveraging the reasoning capabilities of MLLMs, this approach offers a promising solution for scalable and efficient moderation of image content.

The Challenges of Image Safety Assessment

Assessing image safety is a complex task that presents various challenges:

The subjectivity of safety rules: What is considered "safe" or "unsafe" can vary depending on cultural context, personal opinion, and the specific application. This makes it difficult to develop universal safety guidelines.

The complexity of image content: Images can contain multi-layered meanings and contexts that are difficult for machines to interpret. Irony, sarcasm, and cultural references can further complicate automatic image analysis.

The scalability of moderation: With the explosive increase in online images, a scalable solution for image moderation is essential. Manual moderation is time-consuming and resource-intensive.

MLLMs as a Solution

MLLMs offer the potential to overcome these challenges. Their ability to process both text and images and perform complex reasoning allows for a more nuanced assessment of image content. By combining zero-shot learning with chain-of-thought prompting, MLLMs can interpret safety guidelines and apply them to new images without relying on extensive training data.
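
As a rough illustration, a chain-of-thought prompt for applying a single safety rule zero-shot might be built as follows. The wording and the example rule are assumptions; how the prompt and image are passed to a concrete MLLM depends on the API in use.

```python
# Illustrative chain-of-thought prompt for applying one safety rule zero-shot.

def build_cot_prompt(rule: str) -> str:
    return (
        f"Safety rule: {rule}\n"
        "Think step by step:\n"
        "1. Describe what is visible in the image.\n"
        "2. Decide whether each condition of the rule applies to the image.\n"
        "3. Conclude with exactly one word: 'safe' or 'unsafe'."
    )

print(build_cot_prompt("The image should not depict graphic violence or gore."))
```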

Outlook

Research on MLLM-based methods for image safety assessment is promising. Future work could focus on improving the robustness and accuracy of these models, especially in dealing with subtle or ambiguous content. The development of more transparent and explainable MLLMs is also of great importance to strengthen trust in automated moderation systems.

Bibliography:

Wang, Z. et al. (2024). MLLM-as-a-Judge for Image Safety without Human Labeling. arXiv preprint arXiv:2501.00192.
Ying, Z. et al. (2024). SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models. arXiv preprint arXiv:2410.18927.
Li, D. et al. (2024). From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge. arXiv preprint arXiv:2411.16594.
Srivastava, A. et al. (2024). Beyond the Imitation Game: Quantifying and extrapolating capabilities of language models. OpenReview.
Cobbe, K. et al. (2024). LIMA: Less Is More for Alignment. arXiv preprint arXiv:2407.04842.
Zhou, K. et al. (2024). Multimodal Situational Safety. arXiv preprint arXiv:2410.06172.
Srivastava, A. et al. (2023). Beyond the Imitation Game Benchmark. NeurIPS Datasets and Benchmarks.
DAIR.AI. ML-Papers-of-the-Week. GitHub repository.
Wolfe, C. R. (2024). Using LLMs for Evaluation. Deep (Learning) Focus, Substack.
Chen, Z. et al. (2024). MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?. arXiv preprint.