cv
Applied Scientist / Research Engineer: Multimodal & Foundation Models · Research-to-System Translation. Curriculum vitae.
Basics
| Name | Simone Rossetti |
| Label | Applied Scientist / Research Engineer |
| Email | simone[dot]rossetti[at]live[dot]com |
| Phone | (+39)[space]349[space]105[space]9384 |
| URL | https://rossettisimone.github.io/ |
| Summary | Multimodal and foundation model specialist with first-author publications at NeurIPS, ICCV, and ECCV. Translates research into robust systems: from methodology design (data-efficient training, cross-modal alignment, evaluation frameworks) to implementation (distributed training pipelines, model serving, CI-based validation). Combines evaluation rigor with system robustness. Industrial R&D orientation with four years leading applied research from prototype to field-validated AI systems. |
Work
-
2026.03 - Present Independent
Transition — R&D in embodied AI
Focus: multimodal foundation models (Vision-Language, VLA), large-scale Transformers, and probabilistic modeling; embodied AI and generative modeling; system-level pipelines and production-grade intelligent systems.
- Goal: continue R&D in embodied AI; contribute to next-generation AI platforms and intelligent embodied systems.
-
2021.10 - 2026.02 Co-Founder and Lead Applied Researcher
DeepPlants S.r.l.
Defined technical vision for multimodal AI-driven decision-support systems and led a 5-person R&D team, advancing prototypes from TRL1 to TRL5 within 12 months.
- Designed data-efficient training strategies and implemented distributed PyTorch pipelines (Hugging Face, OpenCLIP, xFormers), reducing annotation requirements by 70% while maintaining ≥85% performance across heterogeneous conditions.
- Developed large-scale multimodal evaluation suites (~50K VQA, ~25K structured QA) and established reproducible training and evaluation frameworks with CI-based validation for internal R&D standards.
- Conducted masked pretraining and fine-tuned vision-language models (Qwen3-VL, LLaMA 3); implemented containerised serving (Ollama, vLLM, SGLang) and agentic pipelines (LangGraph, RAG) instrumented for latency profiling.
- Defined technical methodologies and experimental validation strategies for EU-funded research initiatives in AI-driven sustainable agriculture.
-
2021.01 - 2021.10 Research Fellow
AlcorLAB – Sapienza University (DIAG)
Designed and implemented distributed multi-GPU training pipelines for spatiotemporal models on the AVA and YouTube-VIS benchmarks.
- Conducted systematic ablation studies to quantify architectural trade-offs under memory and latency constraints, establishing evaluation protocols for cross-dataset robustness.
Education
-
2021.11 - 2025.01 Rome, Italy
PhD
Sapienza Università di Roma
Computer Science Engineering
- Advisors: Pirri F.; Amerini I.
- Thesis: Reducing supervision in semantic segmentation through advancements in Bayesian prior modelling (UNITesi 2025)
-
2019.10 - 2021.10 Rome, Italy
MSc
Sapienza Università di Roma
Artificial Intelligence and Robotics
- Master's thesis on fast instance segmentation and tracking for YouTube-VIS 2021
-
2015.10 - 2019.03 Rome, Italy
BSc
Università degli Studi Roma Tre
Computer Engineering
- Bachelor's thesis on an iterative learning control (ILC) algorithm for a 2-DOF robotic arm in MATLAB/Simulink
Certificates
| DeepLearn '22 | Advanced Training | 2022-01-01 |
| ICVSS '22 | Advanced Training | 2022-01-01 |
Publications
-
2026.01.01 CABBAGE: Comprehensive Agricultural Benchmark Backed by AI-Guided Evaluation
Ongoing Work
Rossetti, Simone et al. (2026).
-
2024.01.01 Hierarchy-agnostic unsupervised segmentation: parsing semantic image structure
Advances in Neural Information Processing Systems 37 (NeurIPS)
Rossetti, Simone and Fiora Pirri (2024). pp. 98898–98935.
-
2023.01.01 A new large dataset and a transfer learning methodology for plant phenotyping in Vertical Farms
IEEE/CVF International Conference on Computer Vision (ICCV)
Sama, Nico et al. (2023). pp. 540–551.
-
2022.01.01 Max pooling with vision transformers reconciles class and shape in weakly supervised semantic segmentation
European Conference on Computer Vision (ECCV). Springer
Rossetti, Simone et al. (2022). pp. 446–463.
Skills
| Multimodal & Vision-Language | Vision-Language Models (CLIP, BLIP, Qwen-VL/LLaMA/GPT); Vision-Language-Action (PaLM-E, RT-X, GR00T-N1); Segment Anything (SAM), Object Detection, Instance Segmentation, Action Recognition, Multimodal Fusion |
| Learning Paradigms | Weakly- and Unsupervised Segmentation; Self-Supervised (DINO, SwAV, SeLA), Contrastive (SimCLR, MoCo, BYOL); Masked Modeling (BERT, MAE), Diffusion, VAEs, GANs |
| Training & Distributed Optimization | Multi-GPU (PyTorch DDP, FSDP, DeepSpeed ZeRO, Megatron-LM TP); LoRA, PyTorch Lightning, Hugging Face, xFormers |
| Evaluation & Benchmarking | Benchmark Design, Ablation Studies, Cross-Dataset Robustness; Weights & Biases, MLflow, Hydra, OmegaConf |
| Model Serving & Agentic Integration | LangGraph, LangChain, LlamaIndex, SGLang, Ollama, vLLM; Milvus, Qdrant, Multimodal RAG, Docker, FastAPI, CI/CD |
| Tooling & Workflow Automation | Python, PyTorch, TIMM, OpenCLIP, Torchvision, OpenCV, Albumentations; TensorBoard, Git, Linux |
Languages
| Italian | Native (C2) |
| English | Fluent (C1) |