Albert Mundu

I’m a PhD scholar at CVBL, IIIT-Allahabad researching multimodal AI — how vision and language models can be combined to understand images and videos. My recent work spans multi-level feature fusion for image captioning, distilling foundational VLMs into task-specific models through staged pretraining and finetuning, and optimizing captioning with recent state-of-the-art RL methods.

I also teach graduate courses in Machine Learning and Algorithms at Galgotias University, and previously interned at Spyne AI working on e-commerce shadow generation with conditional VAEs, GANs, and diffusion models.

Always open to research collaborations or a good conversation about CV, NLP, or generative models — feel free to reach out.

news

Dec 14, 2025	Attended and presented ThreatNet at IEEE UPCON 2025.
Oct 10, 2025	ThreatNet: Multimodal Firearm Threat Assessment Network accepted at IEEE UPCON 2025.
Aug 27, 2024	ETransCap: Efficient Transformer for Image Captioning is in press at Applied Intelligence (Springer).
Aug 24, 2023	Joined Galgotias University, Greater Noida as Assistant Professor.

selected publications

UPCON
ThreatNet: Multimodal Firearm Threat Assessment Network

Albert Mundu, Satish Kumar Singh, and Shiv Ram Dubey

In IEEE Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering, 2025

Firearm-related violence poses a persistent threat to public safety, often challenging even law enforcement. In this paper, we introduce ThreatNet, a multimodal firearm threat assessment network that detects weapons in unconstrained environments and evaluates scene severity. ThreatNet integrates a YOLOv10-based object detector with a dual-branched transformer-based image captioner to generate descriptive scene narratives and classify threat levels. We present the YoutubeGDD caption dataset, an extension of YoutubeGDD, featuring real-world weapon images with five captions per image to support multimodal analysis. We finetune the pretrained detector on the YoutubeGDD dataset for weapon recognition, and train the captioner and threat classifier on both the YoutubeGDD caption and MS-COCO caption datasets. We evaluate model performance using captioning metrics and threat classification accuracy, and benchmark YOLOv10 variants on the YoutubeGDD dataset and our captioner on the MS-COCO caption dataset.
@inproceedings{upcon25_threatnet,
  title = {ThreatNet: Multimodal Firearm Threat Assessment Network},
  author = {Mundu, Albert and Kumar Singh, Satish and Ram Dubey, Shiv},
  booktitle = {IEEE Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering},
  year = {2025},
  publisher = {IEEE},
}

AI, Springer
ETransCap: Efficient Transformer for Image Captioning

Albert Mundu, Satish Kumar Singh, and Shiv Ram Dubey

Applied Intelligence, Aug 2024

Image captioning is a challenging task in computer vision: it automatically generates a textual description of an image by integrating visual and linguistic information, and the generated captions must accurately describe the image’s content while also adhering to the conventions of natural language. We adopt the encoder-decoder framework employed by various CNN-RNN-based models for image captioning in the past few years. Recently, CNN-Transformer-based models have achieved great success and surpassed traditional CNN-RNN-based models in this area, and many researchers have focused on Transformers to explore their possibilities. Unlike conventional CNN-RNN-based models, transformer-based models also handle longer input sequences more efficiently. However, they are resource-intensive to train and deploy, particularly for large-scale tasks or for tasks that require real-time processing. In this work, we introduce a lightweight and efficient transformer-based model called the Efficient Transformer Captioner (ETransCap), which consumes fewer computational resources to generate captions. Our model operates in linear complexity and has been trained and tested on the MS-COCO dataset. Comparisons with existing state-of-the-art models show that ETransCap achieves promising results. Our results support the potential of ETransCap as a good approach for image captioning tasks in real-time applications. Code for this project will be available at https://github.com/albertmundu/etranscap.
@article{Mundu2024,
  author = {Mundu, Albert and Singh, Satish Kumar and Dubey, Shiv Ram},
  title = {ETransCap: Efficient Transformer for Image Captioning},
  journal = {Applied Intelligence},
  year = {2024},
  month = aug,
  day = {27},
  publisher = {Springer},
  issn = {1573-7497},
  doi = {10.1007/s10489-024-05739-w},
  url = {https://doi.org/10.1007/s10489-024-05739-w},
}