Generating Multimodal (Text and Image) Muslimwear Designs Using Stable Diffusion v1.5

Authors

  • Assyfa Febriwanti, Universitas Islam Sultan Agung
  • Sam Farisa Chaerul Haviana, Universitas Islam Sultan Agung

DOI:

https://doi.org/10.65310/abf5c845

Keywords:

Multimodal Learning, Stable Diffusion, Muslimwear Design, Low-Rank Adaptation, Image Generation.

Abstract

The rapid advancement of Generative Artificial Intelligence has accelerated the adoption of diffusion models in fashion design applications. However, conventional text-to-image approaches often struggle to maintain visual consistency and controllability during image generation. This study proposes a multimodal Muslimwear design generation system based on Stable Diffusion v1.5 that integrates textual prompts and reference images through a cross-attention fusion mechanism. The training data combine DeepFashion1 with a curated Muslimwear dataset, both preprocessed through image normalization, resolution standardization, and automated caption generation using BLIP. Domain adaptation was performed with Low-Rank Adaptation (LoRA) to enable computationally efficient fine-tuning. Performance was evaluated with Fréchet Inception Distance (FID, lower is better) to assess visual quality and Structural Similarity Index Measure (SSIM, higher is better) to assess structural consistency. Experimental results indicate that the female model achieved an FID score of 176.77 and an SSIM score of 0.311, outperforming the male model, which achieved an FID score of 256.22 and an SSIM score of 0.275. The findings indicate that multimodal conditioning enhances visual distribution learning and structural preservation, supporting the development of controllable and efficient AI-assisted fashion design systems.
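
To help read the SSIM scores reported above, the following is a minimal sketch of the SSIM formula in Python. It is a simplified illustration, not the authors' evaluation code: it computes a single global window over flat grayscale pixel lists, whereas common SSIM implementations average the same formula over local sliding windows. The function name `global_ssim`, the `data_range` parameter, and the sample pixel values are assumptions made for this example; the constants 0.01 and 0.03 are the conventional SSIM stabilizers.

```python
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float], data_range: float = 255.0) -> float:
    """Single-window SSIM between two equally sized grayscale images.

    Images are given as flat sequences of pixel intensities. This is a
    simplified, hypothetical helper: it uses one global window instead of
    averaging over local windows.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("images must be non-empty and have the same size")

    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n

    # Population variances and covariance of the pixel intensities.
    var_x = sum((a - mu_x) * (a - mu_x) for a in x) / n
    var_y = sum((b - mu_y) * (b - mu_y) for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n

    # Conventional stabilizing constants, scaled by the intensity range.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Illustrative 2x2 "images" flattened to lists (made-up values).
reference = [0.0, 64.0, 128.0, 255.0]
inverted = [255.0 - p for p in reference]

same_score = global_ssim(reference, reference)   # identical images score 1 (up to rounding)
diff_score = global_ssim(reference, inverted)    # structurally opposite image scores lower
```

Under this formula an image compared with itself scores 1, and the score drops as luminance, contrast, or structure diverge, which is why the higher SSIM reported for the female model is read as better structural preservation.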

Published

2026-04-29

How to Cite

Febriwanti, A., & Haviana, S. F. C. (2026). Generating Multimodal (Text and Image) Muslimwear Designs Using Stable Diffusion v1.5. Journal of Science, Technology, and Innovation, 1(3), 363-373. https://doi.org/10.65310/abf5c845