Introduction

The ONTOLISST Babel Machine Project uses the CLOSER repository’s major topics to identify the topic of survey question texts and variables. The source of our data is the following: CLOSER Discovery. (2025). https://discovery.closer.ac.uk. Accessed Feb 10 2025.

The codebook distinguishes 16 major topic areas:

  • 0. Demographics
  • 1. Housing and local environment (Housing and environment)
  • 2. Physical health
  • 3. Mental health and mental processes
  • 4. Healthcare
  • 5. Health behaviour (Health and lifestyle)
  • 6. Family and social networks
  • 7. Education
  • 8. Employment and income (Employment and pensions)
  • 9. Expectation, attitudes and beliefs (Attitudes and beliefs)
  • 10. Child development
  • 11. Life events
  • 12. Omics
  • 13. Pregnancy
  • 14. Administration
  • 15. COVID-19

Our model was trained on English-language question texts, but we encourage you to also submit datasets with variable labels (texts).

You can upload your datasets here for automated coding. The upload requires filling out a form with metadata regarding the dataset. The datasets should contain an id and a text column, with column names in row 1. You are free to add supplementary variables beyond the compulsory ones in the columns following them.

An example dataset is available for download here.

If the files you wish to upload are larger than 1 GB, we recommend splitting your dataset into multiple parts. If you wish to submit multiple datasets consecutively, please wait 5-10 minutes between each submission.
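One way to split a large CSV file into smaller parts is sketched below. It is an illustration only, not a tool provided by the Babel Machine: the function name `split_csv`, the `rows_per_part` parameter, and the `_partN` file naming are all hypothetical. The sketch splits by row count rather than by size, so pick a row count that keeps each part under 1 GB. The header row is repeated in every part so that each file still has its column names in row 1.

```python
import csv
from pathlib import Path


def split_csv(src: Path, out_dir: Path, rows_per_part: int) -> list[Path]:
    """Write src as several CSV parts, each starting with the original header row."""
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: list[Path] = []
    out = None
    writer = None
    with src.open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:  # empty input file: nothing to split
            return parts
        try:
            for i, row in enumerate(reader):
                # Open a new part every rows_per_part data rows.
                if i % rows_per_part == 0:
                    if out is not None:
                        out.close()
                    path = out_dir / f"{src.stem}_part{len(parts) + 1}.csv"
                    out = path.open("w", newline="", encoding="utf-8")
                    writer = csv.writer(out)
                    writer.writerow(header)
                    parts.append(path)
                writer.writerow(row)
        finally:
            if out is not None:
                out.close()
    return parts
```

Each resulting part can then be submitted as a separate dataset, leaving the recommended pause between submissions.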

After your dataset is uploaded and successfully processed, you will receive the coded dataset via email.

If you have any questions or feedback regarding the ONTOLISST Babel Machine, please reach out to us using our contact form. Please note that we can only respond on Hungarian business days.

This service was created within the framework of the OSCARS-ONTOLISST project.

Submit a dataset:


The non-coded datasets should contain an id and text column. The column names must be in row 1. You are free to add supplementary variables to the dataset beyond the compulsory ones in the columns following them. All datasets must be uploaded in a CSV file format with UTF-8 encoding.
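Before uploading, the requirements above can be checked locally. The following is a minimal sketch, not the validation the service itself performs: the function name `check_dataset` and its messages are hypothetical, and it assumes the column names are spelled exactly `id` and `text` in lowercase and occupy the first two columns (our reading of "in the columns following them").

```python
import csv
import io

REQUIRED = ("id", "text")  # compulsory columns named on this page


def check_dataset(csv_bytes: bytes) -> list[str]:
    """Return a list of problems found; an empty list means the basic checks pass."""
    try:
        # "utf-8-sig" accepts UTF-8 with or without a byte order mark.
        content = csv_bytes.decode("utf-8-sig")
    except UnicodeDecodeError:
        return ["file is not valid UTF-8"]

    # Column names must be in row 1.
    header = next(csv.reader(io.StringIO(content, newline="")), None)
    if not header:
        return ["file is empty"]

    names = [name.strip() for name in header]
    problems = [
        f"missing compulsory column: {col}" for col in REQUIRED if col not in names
    ]
    # Assumption: supplementary variables come after the compulsory columns.
    if not problems and set(names[:2]) != set(REQUIRED):
        problems.append("supplementary columns should follow id and text")
    return problems
```

For a file on disk, the bytes can be read with `Path("my_dataset.csv").read_bytes()` (the file name is a placeholder) and passed to the function.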

    Troubleshooting

    If you are experiencing problems with the upload form, or your submission returns an error message (particularly "Something unexpected happened during upload. Please try again later."), please try performing the following steps:

    • If you use an adblocker browser extension, please turn it off for our site. Adblockers may interfere with legitimate functionality, such as the dropdowns on the upload form. (We do not serve ads on the site.)
    • Try turning off your VPN.
    • Try submitting your data from another browser, preferably with default settings.

    If you are still receiving the "Something unexpected..." error message, please get in touch with us via our email address or the contact form. Please include as much information as possible, e.g., which browser you are using, any notable browser extensions, whether you are using a VPN, and exactly how you tried to submit the data (for example, you filled out everything but waited 10 minutes before pressing submit).


    The research was supported by the Ministry of Innovation and Technology NRDI Office within the RRF-2.3.1-21-2022-00004 Artificial Intelligence National Laboratory project and received additional funding from the European Union's Horizon 2020 program under grant agreement No. 101008468. We also thank the Babel Machine project and HUN-REN Cloud (Héder et al. 2022; https://science-cloud.hu) for their support. We used the machine learning service of the Slices RI infrastructure (https://www.slices-ri.eu/).


    HOW TO CITE: If you use the Babel Machine for your work or research, please cite this paper:

    Sebők, M., Máté, Á., Ring, O., Kovács, V., & Lehoczki, R. (2024). Leveraging Open Large Language Models for Multilingual Policy Topic Classification: The Babel Machine Approach. Social Science Computer Review, 0(0). https://doi.org/10.1177/08944393241259434