Abstract

In this article, we discuss how computational social symbol grounding (i.e., how shared sets of symbols are grounded in multi-agent models) can be used to study children's acquisition of word-meaning mappings. We argue that, for multi-agent modelling to be a reliable tool for studying human language acquisition, simulations need to be anchored in observations of the social interactions that children encounter "in the wild" and in different cultures. We discuss which aspects of such social interactions and which cognitive mechanisms can and should be modelled, and how we intend to anchor the model to corpora containing features of children's social behaviour as observed "in the wild", so as to mimic children's (social) environment as reliably as possible. We also discuss some challenges that need to be solved in order to construct the computational model. The resulting SCAFFOLD model will provide a benchmark for investigating the socio-cognitive mechanisms of human social symbol grounding using computer simulations.