The ONTOLISST Babel Machine

Introduction

The ONTOLISST Babel Machine is designed to classify survey questions and variable texts into predefined categories.

CLOSER model

The CLOSER model applies the major level of the CLOSER codebook, which distinguishes 16 topic areas:

0. Demographics
1. Housing and local environment (Housing and environment)
2. Physical health
3. Mental health and mental processes
4. Healthcare
5. Health behaviour (Health and lifestyle)
6. Family and social networks
7. Education
8. Employment and income (Employment and pensions)
9. Expectation, attitudes and beliefs (Attitudes and beliefs)
10. Child development
11. Life events
12. Omics
13. Pregnancy
14. Administration
15. COVID-19
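
For downstream analysis, the topic list above can be kept as a simple lookup table. The sketch below is only an illustration: the names `CLOSER_TOPICS` and `closer_label` are hypothetical, and it assumes that codes are available as the integers 0 to 15 shown in the list. The format of the returned coded dataset is not specified here.

```python
# Lookup table for the 16 major CLOSER topic areas listed above.
# Assumption: codes are the integers 0-15 as numbered in the list.
CLOSER_TOPICS = {
    0: "Demographics",
    1: "Housing and local environment",
    2: "Physical health",
    3: "Mental health and mental processes",
    4: "Healthcare",
    5: "Health behaviour",
    6: "Family and social networks",
    7: "Education",
    8: "Employment and income",
    9: "Expectation, attitudes and beliefs",
    10: "Child development",
    11: "Life events",
    12: "Omics",
    13: "Pregnancy",
    14: "Administration",
    15: "COVID-19",
}


def closer_label(code):
    """Return the topic name for a code, or 'unknown code' if out of range.

    Accepts ints or numeric strings; non-numeric input raises ValueError.
    """
    return CLOSER_TOPICS.get(int(code), "unknown code")
```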

MAJOR model

The MAJOR model operates with the major level of the LiSST thesaurus, developed within the ONTOLISST project. Its categories are:

1. Background: BACKGROUND INFORMATIONS ON THE RESPONDENT
2. Environment: ENVIRONMENT AND RESIDENCE
3. Health: HEALTH, CARE, SOCIAL SERVICES
4. Work: EMPLOYMENT, WORK AND DUTIES
5. Education: EDUCATION AND QUALIFICATION
6. Family: FAMILY AND SOCIAL NETWORK
7. Values: VALUES, ATTITUDES AND PARTICIPATION
8. Leisure: LEISURE, MEDIA AND CULTURAL CONSUMPTIONS
9. Income: ECONOMY, INCOME AND EXPENDITURE (WEALTH)
10. Legality: LAW AND LEGALITY
11. Technical: TECHNICAL
12. Other: CANNOT BE CLASSIFIED

Training data

The models were trained on English-language survey question texts and variables. The primary data source is:

CLOSER Discovery. (2025). https://discovery.closer.ac.uk. Accessed Feb 10 2025.

For the MAJOR model, additional metadata from other research archives was also incorporated.

Uploading

You can upload your datasets here for automated coding. The upload requires filling out a form with metadata about the dataset. Each dataset must contain an id and a text column, with column names in row 1. You are free to add supplementary variables in the columns that follow the compulsory ones.
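
The layout described above (an id and a text column, header in row 1, optional extra columns afterwards), together with the CSV and UTF-8 requirement stated in the form description, can be prepared and checked locally before uploading. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and it assumes the compulsory columns are named exactly `id` and `text`, in that order, which the instructions do not state explicitly.

```python
import csv

# Assumption: exact, lowercase column names in this order.
REQUIRED = ["id", "text"]


def write_upload_csv(path, rows, extra_columns=()):
    """Write rows (a list of dicts) to a UTF-8 CSV with the header in row 1.

    Compulsory columns come first; supplementary columns follow them.
    """
    fieldnames = REQUIRED + list(extra_columns)
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def header_problems(path):
    """Return a list of problems found in the header row (empty if none).

    Opening the file as UTF-8 raises UnicodeDecodeError if the header
    is not valid UTF-8.
    """
    with open(path, encoding="utf-8", newline="") as handle:
        header = next(csv.reader(handle), [])
    problems = []
    found = header[:len(REQUIRED)]
    if found != REQUIRED:
        problems.append(f"expected first columns {REQUIRED}, found {found}")
    return problems
```

For example, `write_upload_csv("upload.csv", [{"id": "q1", "text": "How old are you?"}])` produces a two-column file, and `header_problems("upload.csv")` then returns an empty list.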

An example dataset is available for download here.

If the files you wish to upload are larger than 1 GB, we recommend splitting your dataset into multiple parts. If you wish to submit multiple datasets consecutively, please wait 5-10 minutes between each submission.
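
One possible way to split an oversized file is sketched below. It is an illustration rather than an official tool: the function name `split_csv` and the part naming scheme are hypothetical, the 1 GB threshold is taken as 1,000,000,000 bytes, and the source file is assumed to be a UTF-8 CSV with at least a header row. Each part repeats the header so that it can be submitted on its own.

```python
import csv
import io


def _encode_row(row):
    """Serialise one CSV row to UTF-8 bytes, including the line ending."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(row)
    return buffer.getvalue().encode("utf-8")


def split_csv(src_path, part_prefix, max_bytes=1_000_000_000):
    """Split a CSV into parts of at most about max_bytes each.

    Every part starts with the original header row. A single row larger
    than the limit still gets written, alone, into its own part.
    Returns the list of part file paths.
    """
    part_paths = []
    out = None
    written = 0
    with open(src_path, encoding="utf-8", newline="") as src:
        reader = csv.reader(src)
        header_bytes = _encode_row(next(reader))
        for row in reader:
            row_bytes = _encode_row(row)
            # Start a new part when this row would push the current one over the limit.
            if out is None or written + len(row_bytes) > max_bytes:
                if out is not None:
                    out.close()
                path = f"{part_prefix}_{len(part_paths) + 1}.csv"
                out = open(path, "wb")
                out.write(header_bytes)
                written = len(header_bytes)
                part_paths.append(path)
            out.write(row_bytes)
            written += len(row_bytes)
    if out is not None:
        out.close()
    return part_paths
```

The parts can then be submitted one after another, observing the waiting time between submissions mentioned above.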

After your dataset is uploaded and successfully processed, you will receive the coded dataset via email.

If you have any questions or feedback regarding the ONTOLISST Babel Machine, please reach out to us using our contact form. Please note that we can only respond on Hungarian business days.

This service was created within the framework of the OSCARS-ONTOLISST project.

Submit a dataset:

Processing Unit:

Name*

E-mail address*

Institution name*

Institution country*

Dataset name*

Level of dataset*

Geographical unit*

Dataset country*

Dataset language (language of data in text column)*

ONTOLISST Domain*

Unit of observation*

Period (from)*

Period (to)*

Use case*

Description

The non-coded datasets must contain an id and a text column. The column names must be in row 1. You are free to add supplementary variables in the columns that follow the compulsory ones. All datasets must be uploaded as CSV files with UTF-8 encoding.

Dataset in .CSV format:

Codebook:

I have read the instructions and rules.

By ticking the checkbox I declare that this data does not contain any personal data and I am responsible for the legitimacy of the origin of the data. The uploaded data is suitable for text mining and I voluntarily provided the uploaded file(s) to poltextLAB. I also grant permission to poltextLAB researchers to make the data visualisation of the data I uploaded publicly available. In addition, I agree that the uploaded data may be stored on servers within the European Union and be used for data analysis and training purposes without time limit.

By ticking this checkbox, I consent to the processing of the personal data I provide in connection with the file upload (e.g., name, email address) for the sole purposes of returning the processed files and identifying the ownership of the uploaded data. I understand that my personal data will be securely stored and used exclusively for these purposes, in compliance with the purpose limitation principle (Article 5(1)(b)) and the data minimisation principle (Article 5(1)(c)) of the GDPR. I am also aware of my rights under Article 17 (Right to Erasure (‘right to be forgotten’)) and Article 15 (Right of access by the data subject) of the GDPR, and I may request the deletion of, or access to, my personal data at any time (see full GDPR Compliance Statement below).

Troubleshooting

If you are experiencing problems with the upload form, or your submission returns an error message (particularly "Something unexpected happened during upload. Please try again later."), please try performing the following steps:

If you use an adblocker browser extension, please turn it off for our site. Adblockers may interfere with legitimate functionality, such as the dropdowns on the upload form. (We do not serve ads on the site.)
Try turning off your VPN.
Try submitting your data from another browser, preferably with default settings.

If you are still receiving the "Something unexpected..." error message, please get in touch with us via our email address or the contact form. Please include as much information as possible, e.g., which browser you are using, any notable browser extensions, whether you are using a VPN, and exactly how you tried to submit the data (for example, you filled out everything but waited 10 minutes before pressing submit).

This project was supported by the Ministry of Innovation and Technology NRDI Office within the RRF-2.3.1-21-2022-00004 Artificial Intelligence National Laboratory project; the V-Shift Momentum Project of the Hungarian Academy of Sciences; Miklós Sebők's Excellence project (identifier: 151324), which is funded by the Hungarian National Research, Development and Innovation Office's National Research Excellence Programme; and received additional funding from the European Union's Horizon 2020 program under grant agreement no 101008468. We also thank the Babel Machine project and HUN-REN Cloud (Héder et al. 2022; https://science-cloud.hu) for their support. We used the machine learning service of the Slices RI infrastructure (https://www.slices-ri.eu/).

HOW TO CITE: If you use the Babel Machine for your work or research, please cite this paper:

Sebők, M., Máté, Á., Ring, O., Kovács, V., & Lehoczki, R. (2025). Leveraging Open Large Language Models for Multilingual Policy Topic Classification: The Babel Machine Approach. Social Science Computer Review, 43(2), 295–317. https://doi.org/10.1177/08944393241259434

GDPR Compliance Statement

Nature of the Uploaded Data: The files uploaded by users to the tool do not contain personal data as defined in Article 4(1) of the GDPR, which specifies personal data as "any information relating to an identified or identifiable natural person ('data subject')".
Data Processing: The files submitted to our tool are stored in a secure cloud environment to allow processing and generation of the output (the coded CSV file). Personal data provided in connection with the file upload (such as the submitter's name, email address, and similar details) are used exclusively for the purpose of sending the coded files back to the user and identifying the organisation of our users. This processing is conducted in compliance with the purpose limitation principle (Article 5(1)(b)) and the data minimisation principle (Article 5(1)(c)) of the GDPR. By submitting the files, the user consents to this data processing, which is strictly limited to returning the results and identifying the file owner. The personal data is stored securely and retained solely for these purposes. In accordance with Article 17 of the GDPR (Right to Erasure, or "Right to be Forgotten"), users may request the deletion of their personal data at any time. Such requests will be processed promptly, and all related personal data will be permanently deleted from our systems.
Training Purposes: We do not use personal data to train machine learning models or perform any other type of analysis. When submitting files, the submitter must declare that the uploaded CSV files do not contain any personal data, as stated in the consent agreement. This approach aligns with the purpose limitation principle (Article 5(1)(b)) of the GDPR, which requires data to be collected for "specified, explicit, and legitimate purposes" and not further processed in a manner incompatible with those purposes.
Google Cloud Platform Compliance: The files submitted to our tool are stored in a secure cloud environment provided by Google Cloud Platform, with configurations ensuring that all processing occurs on servers located within the European Union (EU). This guarantees compliance with GDPR requirements related to data residency and cross-border data transfers. The use of Google Cloud Platform as our processing environment ensures high levels of data security and compliance with GDPR, including the application of the Standard Contractual Clauses (SCCs) for any necessary data transfers. Google Cloud's infrastructure is certified under internationally recognised standards, such as ISO 27001, ISO 27017, and ISO 27018, further ensuring the security and confidentiality of uploaded data.