At train time, we embed each pixel of the ground truth image SI as the mean of the predefined guide functions f over the pixels of the instance it belongs to, resulting in embeddings e(S, Ψ). We then train the neural network E to reproduce the ground truth embedding given the input image I. To simplify learning, the guide functions f are fed into intermediate representations of the network using SinConv layers. The learning objective is a simple pixelwise L1 loss between the ground truth embedding e(S, Ψ) and the neural network prediction E(I, θ). At test time, instances are retrieved from the predicted embedding E(I, θ) using mean shift clustering.
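The construction of the ground truth embedding and the loss can be illustrated with a minimal NumPy sketch. It rests on several assumptions not stated in the text: the guide functions are taken to be sinusoids of the pixel coordinates with hypothetical parameters `(a, b, phase)`, every distinct label (including any background label) is treated as an instance, and the helper names `guide_functions`, `ground_truth_embedding`, and `l1_loss` are illustrative only. The network E, the SinConv layers, and the mean shift step are omitted.

```python
import numpy as np

def guide_functions(height, width, params):
    """Evaluate assumed sinusoidal guides sin(a*x + b*y + phase) on the pixel grid.

    params is a list of hypothetical (a, b, phase) triples; returns shape (K, H, W).
    """
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    return np.stack([np.sin(a * xs + b * ys + p) for a, b, p in params], axis=0)

def ground_truth_embedding(labels, guides):
    """Embed each pixel as the mean of the guides over the instance it belongs to.

    labels: (H, W) integer instance map; guides: (K, H, W); returns (K, H, W).
    """
    emb = np.zeros_like(guides)
    for inst in np.unique(labels):
        mask = labels == inst
        # Per-guide mean over this instance's pixels, broadcast back to them.
        emb[:, mask] = guides[:, mask].mean(axis=1, keepdims=True)
    return emb

def l1_loss(pred, target):
    """Pixelwise L1 loss averaged over all guides and pixels."""
    return np.abs(pred - target).mean()

# Toy example: a 2x3 label map with two instances and two guides.
labels = np.array([[1, 1, 2],
                   [1, 2, 2]])
guides = guide_functions(2, 3, [(0.5, 0.0, 0.0), (0.0, 0.5, 1.0)])
target = ground_truth_embedding(labels, guides)
prediction = np.zeros_like(target)  # stand-in for a network output E(I, θ)
loss = l1_loss(prediction, target)
```

By construction, every pixel of an instance shares one embedding vector, so a clustering method applied to a good prediction should find roughly one mode per instance.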