Food computing has emerged as a prominent multidisciplinary field of research, with the ambitious goal of developing intelligent systems capable of autonomously generating recipe information from food images. While traditional image-to-recipe methods rely on retrieval-based systems that are limited by dataset diversity, modern multimodal architectures offer a more generalizable approach. This project, developed at Meerut Institute of Engineering and Technology (MIET), proposes an upgraded implementation of the FIRE (Food Image to REcipe generation) framework. Our system shifts from older ensemble models to a high-speed, unified Three-Tier Architecture leveraging the Gemini 2.5 Flash multimodal engine.
By utilizing this state-of-the-art model, FIRE performs a "visual-to-symbolic" transformation, identifying ingredients and generating structured recipes (including titles and cooking instructions) in a single inference cycle. This methodology eliminates the latency and error propagation common in modular systems that use separate models for vision and text. The system is integrated into a reactive Streamlit interface, keeps API credentials out of source code through environment variables loaded with python-dotenv, and is optimized for real-time performance with an average latency of under three seconds. Furthermore, the architecture demonstrates contextual intelligence by recognizing the specific physical state of ingredients to ensure procedural accuracy. Experimental results validate that the unified "Flash" architecture provides a superior balance of accuracy and speed, underscoring its potential as a real-time AI kitchen assistant.
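The single-inference-cycle pipeline described above can be sketched in a few lines. The sketch below is illustrative, not the authors' exact implementation: the prompt wording and the `parse_recipe` helper are assumptions, and the commented-out lines stand in for the actual python-dotenv and Gemini API calls, which require a configured API key.

```python
import json

# Tier 1 (presentation) would be a Streamlit page; Tier 2 is the logic below;
# Tier 3 is the Gemini multimodal API. The dotenv call keeps the API key in a
# local .env file rather than in source control, as the paper describes:
#
#   from dotenv import load_dotenv; load_dotenv()
#   import os, google.generativeai as genai
#   genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

# Hypothetical single prompt covering both ingredient recognition and
# recipe generation in one inference cycle.
PROMPT = (
    "Identify the ingredients in this food image and return JSON with "
    "keys 'title', 'ingredients', and 'instructions'."
)

def parse_recipe(raw: str) -> dict:
    """Parse the model's JSON reply into a structured recipe,
    tolerating an optional markdown code fence around the payload."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence line and the closing fence.
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    data = json.loads(text)
    return {
        "title": data.get("title", ""),
        "ingredients": data.get("ingredients", []),
        "instructions": data.get("instructions", []),
    }
```

In a deployed system, the Streamlit layer would pass the uploaded image and `PROMPT` to the Gemini model in one call and hand the reply to `parse_recipe` for rendering, avoiding the separate vision and text stages of modular pipelines.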
Conclusion
This research presented an advanced implementation of the FIRE (Food Image to REcipe generation) framework, engineered to solve the complex task of autonomous culinary synthesis from unstructured visual data. By replacing fragmented, multi-model pipelines with a unified Three-Tier Architecture powered by the Gemini 2.5 Flash multimodal engine, we achieved a significant improvement in both perceptual accuracy and computational efficiency.
Deployed via a reactive Streamlit interface, with API credentials managed through python-dotenv environment variables, the system successfully bridges the gap between theoretical food computing and practical, real-world utility. Empirical evaluations confirmed that the proposed framework outperforms traditional retrieval-based and unimodal baselines, delivering procedurally sound, contextually grounded recipes with an average inference latency of 1.95 seconds. Ultimately, this MIET-developed system demonstrates the transformative potential of unified generative AI in mitigating domestic food waste, automating kitchen tasks, and lowering the barrier to personalized nutrition.
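Latency figures such as the 1.95-second average reported above come down to straightforward wall-clock timing over a batch of inputs. A minimal sketch of such a measurement harness follows; the `infer` callable is a stand-in for the end-to-end image-to-recipe pipeline, not the authors' actual benchmarking code.

```python
import time
from statistics import mean

def measure_latency(infer, inputs):
    """Return the mean wall-clock latency (seconds) of calling
    `infer` once per item in `inputs`."""
    times = []
    for x in inputs:
        start = time.perf_counter()  # monotonic, high-resolution clock
        infer(x)
        times.append(time.perf_counter() - start)
    return mean(times)
```

Using `time.perf_counter()` rather than `time.time()` avoids distortion from system clock adjustments, which matters when individual inferences complete in a few seconds or less.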
How to Cite This Paper
Abhinav Chaudhary, Aditya Taliyan, Nikhil Taliyan, Akshit (2026). FOOD TO RECIPE GENERATION. International Journal of Computer Techniques, 13(2). ISSN: 2394-2231.