Multimodal (Text and Image) Muslimwear Design Generation Using Stable Diffusion v1.5
DOI: https://doi.org/10.65310/abf5c845
Keywords: Multimodal Learning, Stable Diffusion, Muslimwear Design, Low-Rank Adaptation, Image Generation.
Abstract
The rapid advancement of generative artificial intelligence has accelerated the adoption of diffusion models in fashion design. Conventional text-to-image approaches, however, often struggle to maintain visual consistency and controllability during generation. This study proposes a multimodal Muslimwear design generation system based on Stable Diffusion v1.5 that integrates textual prompts and reference images through a cross-attention fusion mechanism. The training data combine DeepFashion1 with a curated Muslimwear dataset, preprocessed through image normalization, resolution standardization, and automated caption generation with BLIP. Domain adaptation uses Low-Rank Adaptation (LoRA) to keep fine-tuning computationally efficient. Performance is evaluated with the Fréchet Inception Distance (FID, lower is better) for visual quality and the Structural Similarity Index Measure (SSIM, higher is better) for structural consistency. The female model achieved an FID of 176.77 and an SSIM of 0.311, outperforming the male model, which scored an FID of 256.22 and an SSIM of 0.275. The findings indicate that multimodal conditioning enhances visual distribution learning and structural preservation, supporting the development of controllable and efficient AI-assisted fashion design systems.
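To illustrate why LoRA makes fine-tuning computationally efficient, the following is a minimal sketch of the general low-rank update, not the authors' implementation. A frozen weight matrix W is augmented with a trainable product B·A scaled by alpha / rank, so only the two small factors are trained. The layer size (768 × 768), the rank (4), and all function names are illustrative assumptions; the abstract does not report the configuration actually used.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / rank) * B @ A.

    w: frozen weight, shape (d_out, d_in)
    a: trainable factor, shape (rank, d_in)
    b: trainable factor, shape (d_out, rank)
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)  # low-rank update, shape (d_out, d_in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, rank):
    """Compare parameter counts for full fine-tuning and a LoRA adapter."""
    return {"full": d_out * d_in, "lora": rank * (d_in + d_out)}


# Hypothetical layer size and rank, chosen only for illustration.
counts = trainable_params(d_out=768, d_in=768, rank=4)
print(counts)  # {'full': 589824, 'lora': 6144}
```

With B initialized to zeros, the effective weight equals W at the start of training, so the adapted model begins from the pretrained behavior. In this hypothetical configuration the adapter trains about 1% of the parameters of the full matrix.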
License
Copyright (c) 2026 Assyfa Febriwanti, Sam Farisa Chaerul Haviana (Author)

This article is licensed under a Creative Commons Attribution 4.0 International License.