Quark-versus-gluon tagging in CMS Open Data with CWoLa and TopicFlow

Bibliographic Details
Main Authors: Dolan, Matthew J. (Author), Gargalionis, John (Author), Ore, Ayodele (Author)
Format: Article (Journal)
Language: English
Published: August 4, 2025
In: Journal of high energy physics
Year: 2025, Issue: 8, Pages: 1-37
ISSN: 1029-8479
DOI: 10.1007/JHEP08(2025)024
Online Access: Publisher, free of charge, full text: https://doi.org/10.1007/JHEP08(2025)024
Publisher, free of charge, full text: https://link.springer.com/article/10.1007/JHEP08(2025)024
Author Notes: Matthew J. Dolan, John Gargalionis and Ayodele Ore
Description
Summary: We use the CMS Open Data to examine the performance of weakly-supervised learning for tagging quark and gluon jets at the LHC. We target Z+jet and dijet events as respective quark- and gluon-enriched mixtures and derive samples both from data taken in 2011 at 7 TeV, and from Monte Carlo. CWoLa and TopicFlow models are trained on real data and compared to fully-supervised classifiers trained on simulation. In order to obtain estimates for the discrimination power in real data, we consider three different estimates of the quark/gluon mixture fractions in the data. Compared to when the models are evaluated on simulation, we find reversed rankings for the fully- and weakly-supervised approaches. Further, these rankings based on data are robust to the estimate of the mixture fraction in the test set. Finally, we use TopicFlow to smooth statistical fluctuations in the small testing set, and to provide uncertainty on the performance in real data.
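Illustrative note: the summary says that estimating discrimination power in real data requires estimates of the quark/gluon mixture fractions. The sketch below is not taken from the article. It shows one generic way such fractions can enter, assuming a tagger's pass rate in each mixed sample is a linear combination of the pure quark and pure gluon pass rates. The function name `demix_efficiencies`, the fractions and the pass rates are all hypothetical toy values.

```python
def demix_efficiencies(eff_m1, eff_m2, f1, f2):
    """Recover pure-class pass rates from two mixed samples.

    Assumed model (an illustration, not the article's procedure):
        eff_m1 = f1 * eff_q + (1 - f1) * eff_g   # e.g. quark-enriched sample
        eff_m2 = f2 * eff_q + (1 - f2) * eff_g   # e.g. gluon-enriched sample
    where f1 and f2 are the quark fractions of the two mixtures.
    """
    det = f1 - f2  # determinant of the 2x2 mixing matrix
    if det == 0:
        raise ValueError("mixtures with equal quark fractions cannot be demixed")
    eff_q = ((1 - f2) * eff_m1 - (1 - f1) * eff_m2) / det
    eff_g = (f1 * eff_m2 - f2 * eff_m1) / det
    return eff_q, eff_g


# Toy numbers only: quark fractions 0.7 and 0.2, measured pass rates 0.65 and 0.40.
eff_q, eff_g = demix_efficiencies(0.65, 0.40, f1=0.7, f2=0.2)
print(f"quark efficiency ~ {eff_q:.2f}, gluon efficiency ~ {eff_g:.2f}")
```

Because the result depends on `f1` and `f2`, repeating the calculation with several fraction estimates is a simple way to see how sensitive a performance figure is to that choice.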
Item Description: Viewed on 8 December 2025
Physical Description: Online Resource