Gold-standard sleep staging, as performed by human experts, reaches only limited agreement between scorers (83%), which leaves sleep statistics such as the number of awakenings uncertain. Current automatic scoring methods match human experts in per-epoch scoring performance but cannot model this uncertainty about sleep statistics. We propose a novel method based on generative deep learning that can model it. We evaluate our method on a dataset of 70 subjects, each scored by 6 human experts, and show a good fit in terms of both per-epoch sleep stage scoring and the uncertainty about the sleep statistics.
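To illustrate how disagreement between scorers turns into uncertainty about a sleep statistic, the following minimal sketch counts awakenings in three toy hypnograms of the same night. It is not the proposed generative method. The data are invented, the names `count_awakenings` and `epoch_agreement` are hypothetical, and an awakening is assumed here to be any transition from a sleep stage into wake.

```python
from itertools import combinations

WAKE = "W"  # stage codes (assumed): W = wake, 1/2/3 = N1/N2/N3, R = REM


def count_awakenings(hypnogram):
    """Count transitions from any sleep stage into wake (assumed definition)."""
    return sum(
        1
        for prev, cur in zip(hypnogram, hypnogram[1:])
        if prev != WAKE and cur == WAKE
    )


def epoch_agreement(a, b):
    """Fraction of epochs on which two scorers assign the same stage."""
    if len(a) != len(b):
        raise ValueError("hypnograms must cover the same epochs")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Invented toy data: one string per hypothetical scorer, one character per epoch.
scorings = {
    "scorer_a": "WW122W2233RRW22",
    "scorer_b": "WW12222233RRW22",
    "scorer_c": "WW1W2W2233RRW2W",
}

# Per-scorer value of the sleep statistic.
awakenings = {name: count_awakenings(h) for name, h in scorings.items()}

# Pairwise per-epoch agreement between scorers.
agreement = {
    (m, n): epoch_agreement(scorings[m], scorings[n])
    for m, n in combinations(scorings, 2)
}

if __name__ == "__main__":
    print("awakenings per scorer:", awakenings)
    print("range of the statistic:", min(awakenings.values()), "to", max(awakenings.values()))
    for pair, value in agreement.items():
        print(pair, f"{value:.2f}")
```

In this toy example the scorers agree on at least 80% of epochs in every pair, yet the number of awakenings ranges from 1 to 4. A handful of disputed epochs is enough to shift the statistic, which is why high per-epoch agreement does not by itself pin down statistics of this kind.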