Simone Rossetti

⬆️ Rainbow Mountain, Peru

📍 Rome, Italy

👋 Hi, I’m Simone, an AI Research Engineer working at the intersection of Computer Vision and Natural Language Processing. My work focuses on representation learning, vision-language alignment, and weakly, semi-, and self-supervised learning for semantic and instance-level visual understanding.

📚 My research interests lie in Multimodal Learning, Vision-Language Models (VLMs), and Vision-Language-Action Models (VLAMs). I am particularly interested in grounding language into dense visual predictions (segmentation, tracking, affordances) and leveraging foundation models for zero- and few-shot transfer in structured vision tasks. A recurring theme is uncertainty modeling and probabilistic priors to improve robustness, calibration, and data efficiency under limited or noisy supervision.

🚀 My goal is to continue doing research and development in embodied AI. Since March 2026, I have been in a deliberate transition toward roles where I can do exactly that, building high-impact, production-grade intelligent systems. My focus is on multimodal foundation models (Vision-Language and Vision-Language-Action architectures), large-scale Transformer systems, and probabilistic modeling for sequential decision processes. I combine this with structured analysis of research trends in embodied AI and generative modeling, and with hands-on prototyping to assess architectural trade-offs, scalability, and deployment feasibility.

🔙 I co-founded DeepPlants (Sep 2021 – Feb 2026), where I led research and engineering teams building production-grade, agentic AI systems for micro-farming management, plant phenotyping, and agri-tech automation. My experience spans the full research-to-production pipeline, from dataset design and large-scale multi-GPU training to model optimization and real-world deployment.

🔙 Previously, I was an AI Research Fellow at ALCOR Lab (Sapienza University of Rome), contributing to peer-reviewed research in computer vision, with a focus on instance segmentation, tracking, and activity recognition.

🎓 I earned a PhD in Computer Science Engineering from DIAG, Sapienza University of Rome. My doctoral research focused on reducing supervision in semantic segmentation through Bayesian prior modeling and structured regularization. I hold an MSc in AI & Robotics and a BSc in Computer Engineering, with a background in automation and perception-action systems.

📄 My work has been presented at NeurIPS, ECCV, and ICCV. Selected publications and highlights are available on the Publications page.

📮 For collaborations, reach out at simone[dot]rossetti[at]live[dot]com.

news

Mar 01, 2026	Leaving DeepPlants — pursuing R&D in embodied AI
Oct 01, 2025	CABBO applying to COSMIC and SmarTerra open calls
Jan 29, 2025	Lessons learned while designing a multimodal benchmark for agricultural decision support
Jan 20, 2025	CABBO – multimodal AI agent for EU micro-farming
Sep 15, 2024	Since September 2024 I have been leading the multimodal learning team at DeepPlants, focusing on combining vision, language and agronomic signals to build robust, data-efficient models for agricultural applications.

latest posts

Mar 16, 2026	VerifIoU: Verified Object Detection and the Next Frontier in Safe CV
Mar 15, 2026	DeepG: Certifying Geometric Robustness in Computer Vision
Mar 14, 2026	Why Intervals Break: DPNEURIFYFV and the Dependency Problem

selected publications

ECCV
Max Pooling with Vision Transformers Reconciles Class and Shape in Weakly Supervised Semantic Segmentation

Simone Rossetti^†*, Damiano Zappia^*, Marta Sanzari^*, Marco Schaerf, and Fiora Pirri

In Computer Vision – ECCV, 2022

Weakly Supervised Semantic Segmentation (WSSS) research has explored many directions to improve the typical pipeline CNN plus class activation maps (CAM) plus refinements, given the image-class label as the only supervision. Though the gap with the fully supervised methods is reduced, further abating the spread seems unlikely within this framework. On the other hand, WSSS methods based on Vision Transformers (ViT) have not yet explored valid alternatives to CAM. ViT features have been shown to retain a scene layout, and object boundaries in self-supervised learning. To confirm these findings, we prove that the advantages of transformers in self-supervised methods are further strengthened by Global Max Pooling (GMP), which can leverage patch features to negotiate pixel-label probability with class probability. This work proposes a new WSSS method dubbed ViT-PCM (ViT Patch-Class Mapping), not based on CAM. The end-to-end presented network learns with a single optimization process, refined shape and proper localization for segmentation masks. Our model outperforms the state-of-the-art on baseline pseudo-masks (BPM), where we achieve 69.3% mIoU on PascalVOC 2012 val set. We show that our approach has the least set of parameters, though obtaining higher accuracy than all other approaches. In a sentence, quantitative and qualitative results of our method reveal that ViT-PCM is an excellent alternative to CNN-CAM based architectures.
@inproceedings{rossetti2022max,
  author = {Rossetti, Simone and Zappia, Damiano and Sanzari, Marta and Schaerf, Marco and Pirri, Fiora},
  editor = {Avidan, Shai and Brostow, Gabriel and Ciss{\'e}, Moustapha and Farinella, Giovanni Maria and Hassner, Tal},
  title = {Max Pooling with Vision Transformers Reconciles Class and Shape in Weakly Supervised Semantic Segmentation},
  booktitle = {Computer Vision -- ECCV},
  year = {2022},
  publisher = {Springer Nature Switzerland},
  address = {Cham},
  pages = {446--463},
  isbn = {978-3-031-20056-4},
  doi = {10.1007/978-3-031-20056-4_26},
}

ICCV
A new Large Dataset and a Transfer Learning Methodology for Plant Phenotyping in Vertical Farms

Nico Samà^*, Etienne David°, Simone Rossetti^†*, Alessandro Antona°, Benjamin Franchetti°, and Fiora Pirri

In IEEE/CVF International Conference on Computer Vision Workshops, Oct 2023

Vertical farming has emerged as a solution to enhance crop cultivation efficiency and overcome limitations in conventional farming methods. Yet, abiotic stresses significantly impact crop quality and increase the risk of food loss. The integration of advanced automation, sensor technology, and deep learning models offers a promising solution for quality monitoring addressing the limitations of stress-specific approaches. Due to the large range of possible quality issues, there is a need for a general method. This study proposes a new plant canopy dataset, dubbed AGM of 1M images, annotated with 18 classes, an in-depth analysis of its quality for its use in transfer learning, and a methodology for detecting canopy stresses in vertical farming. The present study trains ViTbase8, ViTsmall8, and ResNet50 both on ImageNet and the proposed dataset on crop classification. Features from AGM and ImageNet are used for a downstream task on healthy and stress detection using a small annotated validation dataset obtaining 0.97%, 0.93%, and 0.92% best accuracy with the AGM features. We compare with standard datasets like Cassava, PlantDoc, and RicePlant obtaining significant accuracy. This research contributes to improved crop quality, prolonged shelf life, and optimized nutrient content in vertical farming, enhancing our understanding of abiotic stress management.
@inproceedings{sama2023a,
  author = {Samà, Nico and David°, Etienne and Rossetti, Simone and Antona°, Alessandro and Franchetti°, Benjamin and Pirri, Fiora},
  title = {A new Large Dataset and a Transfer Learning Methodology for Plant Phenotyping in Vertical Farms},
  booktitle = {IEEE/CVF International Conference on Computer Vision Workshops},
  month = oct,
  year = {2023},
  pages = {540--551},
  doi = {10.1109/ICCVW60793.2023.00061},
}

NeurIPS
Hierarchy-Agnostic Unsupervised Segmentation: Parsing Semantic Image Structure

Simone Rossetti^†* and Fiora Pirri^†*

In Advances in Neural Information Processing Systems, 2024

Unsupervised semantic segmentation aims to discover groupings within images, capturing objects’ view-invariance without external supervision. Moreover, this task is inherently ambiguous due to the varying levels of semantic granularity. Existing methods often bypass this ambiguity using dataset-specific priors. In our research, we address this ambiguity head-on and provide a universal tool for pixel-level semantic parsing of images guided by the latent representations encoded in self-supervised models. We introduce a novel algebraic approach that recursively decomposes an image into nested subgraphs, dynamically estimating their count and ensuring clear separation. The innovative approach identifies scene-specific primitives and constructs a hierarchy-agnostic tree of semantic regions from the image pixels. The model captures fine and coarse semantic details, producing a nuanced and unbiased segmentation. We present a new metric for estimating the quality of the semantic segmentation of discovered elements on different levels of the hierarchy. The metric validates the intrinsic nature of the compositional relations among parts, objects, and scenes in a hierarchy-agnostic domain. Our results prove the power of this methodology, uncovering semantic regions without prior definitions and scaling effectively across various datasets. This robust framework for unsupervised image segmentation proves more accurate semantic hierarchical relationships between scene elements than traditional algorithms. The experiments underscore its potential for broad applicability in image analysis tasks, showcasing its ability to deliver a detailed and unbiased segmentation that surpasses existing unsupervised methods.
@inproceedings{rossetti2024hierarchy,
  author = {Rossetti, Simone and Pirri, Fiora},
  title = {Hierarchy-Agnostic Unsupervised Segmentation: Parsing Semantic Image Structure},
  booktitle = {Advances in Neural Information Processing Systems},
  year = {2024},
  doi = {10.52202/079017-3139},
}