Cléa Aumont - Symposium Presentation on Survey Methods

A comprehensive presentation on modern survey methodology and statistical analysis techniques presented at an academic symposium. This work covers survey design principles, sampling methods, questionnaire development, response bias mitigation, and advanced statistical analysis approaches for survey data. The presentation includes practical examples and case studies demonstrating best practices in survey research methodology.

A three-level approach to classifying ethnicity

1.0 Introduction

The classification of ethnicity is a live issue that remains both complex and controversial. Ethnicity data is an important variable in medical research for several reasons. First, it is used as a descriptor of socioeconomic background, which in turn plays a role in forming social stratification. Second, it can be used to monitor health disparities, both directly and indirectly, in ethnic subgroups and ethnic minorities. Third, it can be used to tailor the delivery of health services, as a proxy for certain unmeasured social factors, and as a risk indicator for health status and outcomes [1]. Overall, ethnicity data contributes to improving public health and policy and reveals that disparities in health outcomes do exist.

The study of race and ethnicity is not an exact science, and simply defining these terms is not easy. The terms race and ethnicity are often used interchangeably, but they differ in nuance. Race is a much more controversial term, and a socially constructed one; it describes a person's physical features and skin color. Ethnicity, on the other hand, uses a set of cultural markers such as language, religion, tribal belonging, ethnic community, or territorial origins to describe a person's ethnic background [UN ref]. This paper will use the term ethnicity to refer to "race and ethnicity", as ethnicity encompasses the meaning of race. Ethnicity is a much more general term: it carries a different meaning for different groups of people due to cultural differences. It is important to note that the American perspective on ethnicity may not apply to others in a global context, so when conducting international research, ethnicity must be framed in a way that individuals around the globe can identify with.

It is important to realise that ethnic classification should be a dynamic process rather than a static one. As immigration increases and more individuals have multiethnic backgrounds, this task becomes increasingly complex [1]. The few simple categories that respondents once identified with are no longer representative, so the classification must be constantly redefined to reflect the times. To understand this complexity and reform these classifications, free format text reporting is ideal for collecting ethnicity data. As Kaufman puts it, "people are who they say they are […] and may not correspond to any of the set of choices that researchers have fixed in advance" [2]. Free format text reporting allows one to truly self-identify, without the constraints of a predetermined set of choices that could be arbitrarily picked or rooted in bias. However, categorizing free format text responses is difficult: comparison is complicated by the variable nature of the responses.

In order to use ethnicity data most ethically and effectively, we must address two key questions. First, how can self-reported ethnicity data be classified in a way that is both useful and ethical? Second, is there a better way to approach survey-based ethnicity reporting? This project was conducted in the realm of COVID-19 research and uses data captured from self-reported survey responses. This paper proposes a three-level classification of ethnicity that uses multiple levels of granularity to describe a person's ethnic background, and it also addresses survey design. Both the taxonomy and the programming aspects of this task are addressed.

2.0 Methods & Analysis

Making good use of ethnicity data relies heavily on the quality of the data available [8]. Ethnicity data collection varies in three respects: questionnaire format, variables used, and allowance of multiple ethnicities. Some surveys or questionnaires offer a pre-defined set of categories to choose from, some use free format text responses, and some combine both. In terms of variables, race and ethnicity are often reported as a single variable, but some questionnaires report them as two separate variables, and some even include a third language variable [4]. Some questionnaires allow multiple ethnicities to be selected, while others limit this choice to one or default to the "minority" ethnicity. This paper processes ethnicity data collected through free format text responses, encodes this data as one variable, and allows for multiple ethnicities. The guideline on race and ethnicity reporting from the JAMA Network provides ethical considerations, controversies, examples, sensitivities, and concerns on the topic, and is used extensively throughout this paper [5].

2.1 Dataset Used

The dataset used to build this classification tool came from the International COVID-19 Awareness and Responses Evaluation Study, referred to as the iCARE study (www.icarestudy.com). The iCARE study assesses public behaviour in response to COVID-19 public health policies, and how these impact people around the globe. More specifically, it assesses "[t]o what extent are COVID-19 attitudes, beliefs, and concerns associated with adherence, and how does this vary across key subgroups […] such as ethnic groups" [BMJ ref]. The iCARE global dataset was built from a collection of surveys that began in May 2020 and will continue until January 2022. The survey was offered in 34 languages and collected data from over 100,000 individuals across 191 countries. These surveys collect self-reported information about COVID-19 related variables, as well as personal information, including country and ethnicity, from individuals across the world. Specifically, the ethnicity variable was collected as an open question, offered in all of the available survey languages. This data was captured using convenience sampling. More information can be found elsewhere [BMJ ref].

2.2 Ethnicity Classification Process

Essentially, the goal of this classification tool is to normalize ethnicity data in a hierarchical manner to make it simpler to navigate and analyse. Instead of a single ethnicity variable, there are three, which hold values (where available) at three incremental levels of granularity. This provides a high-level generalization of a person's ethnicity while still preserving the information they reported. When a researcher focusses on a specific ethnic group, they can easily navigate through the ethnicity variables, selecting certain categories and rejecting others.

We leveraged the United Nations classification of countries to form our three-level classification [UN ref]. The UN statistics division provides a three-level geographical breakdown of continents, sub-continents, and countries. Any reference to a country or an ethnic community native to a specific region could be generalized twice: to a broader sub-continent, and then to a continent.

Our approach for this classification was to begin with no initial categories and create them as they were detected. This allows a person to be classified as they expressed themselves instead of being fitted into a set of pre-determined categories. We accomplished this by grouping into a new category all the responses that indicated the same concept. By this, we mean responses that are essentially the same but with slight variations: in different languages (Hispanic or Hispanique); in masculine (Latino), feminine (Latina) or gender neutral (Latinx) form; with spelling mistakes; or as synonymous words (Caucasian or White). We refer to this as a safe assumption, where the responses are not being generalized, and nothing is being assumed or inferred from the given response. Every time the classification algorithm runs, it returns a list of ethnicities that have not yet been categorized. This list is then manually processed, and either new categories are created or the values are added to existing categories.
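The review loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the project's actual code: the table `CATEGORY_VARIANTS`, the function `find_uncategorized`, and the sample responses are hypothetical names and values chosen for the example, and keeping Hispanic and Latino as separate categories is an assumption made here.

```python
# Hypothetical category table: each category maps to the spellings that are
# treated as the same concept (translations, gendered forms, synonyms).
CATEGORY_VARIANTS = {
    "HISPANIC": ["HISPANIC", "HISPANIQUE"],
    "LATINO": ["LATINO", "LATINA", "LATINX"],
    "WHITE": ["WHITE", "CAUCASIAN"],
}


def find_uncategorized(responses, category_variants):
    """Return the distinct responses that match no known variant, in first-seen order."""
    known = {v for variants in category_variants.values() for v in variants}
    unseen = []
    for response in responses:
        key = response.strip().upper()
        if key and key not in known and key not in unseen:
            unseen.append(key)
    return unseen


responses = ["Latina", "caucasian", "Bamileke", "Hispanique", "bamileke"]

# First run: one value has no category yet and is reported for manual review.
todo = find_uncategorized(responses, CATEGORY_VARIANTS)

# Manual review step: a new category is created for the unmatched value.
CATEGORY_VARIANTS["BAMILEKE"] = ["BAMILEKE"]

# Second run: nothing is left to review.
remaining = find_uncategorized(responses, CATEGORY_VARIANTS)
```

Because categories are only ever added in response to values that actually occur, the table grows with the data rather than being fixed in advance.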

The next step is to generalise these answers by applying the sub-continent level and the continent level of the UN classification. Any reference to a country (e.g. Cameroon), nationality (e.g. Cameroonian), or tribe native to a specific country (e.g. Bamileke) would be generalized to a sub-continent (Middle African), and then to its continent (African). In this process we ensured that we were generalizing without making assumptions. For instance, we do not generalize a response such as Bantu to Middle African, because this would assume the person's background. In other words, the lower-level granular observations must still be encompassed by their high-level generalization: although Bantu peoples are an ethnic group commonly found in Middle Africa, this ethnic group is actually spread throughout various sub-continents of Africa. Since there is no way of safely assuming which sub-continent someone with a Bantu response is from, the safest generalization is African. Responses that refer to regions overlapping several sub-continents or continents, such as Caucasus or Middle East, do not fit into the existing UN categories, so new categories are simply made for them.
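One way to picture this rule is a lookup that fills in only the levels that can be safely inferred and leaves the others empty. The sketch below is an illustration under stated assumptions: the tables `SUBCONTINENT_OF`, `CONTINENT_OF`, and `CONTINENT_ONLY`, the `Levels` tuple, and the function `generalize` are hypothetical names, and the entries are limited to the examples given in the text.

```python
from typing import NamedTuple, Optional


class Levels(NamedTuple):
    level1: str            # most specific value, as reported
    level2: Optional[str]  # sub-continent, if it can be safely inferred
    level3: Optional[str]  # continent, if it can be safely inferred


# Hypothetical tables built from the examples in the text.
SUBCONTINENT_OF = {
    "CAMEROONIAN": "MIDDLE AFRICAN",
    "BAMILEKE": "MIDDLE AFRICAN",
}
CONTINENT_OF = {"MIDDLE AFRICAN": "AFRICAN"}
# Groups spread over several sub-continents are generalized to the continent only.
CONTINENT_ONLY = {"BANTU": "AFRICAN"}


def generalize(level1):
    """Fill in the broader levels without assuming more than the response states."""
    sub = SUBCONTINENT_OF.get(level1)
    if sub is not None:
        return Levels(level1, sub, CONTINENT_OF.get(sub))
    # No safe sub-continent: skip level 2 and use a continent only if one is known.
    return Levels(level1, None, CONTINENT_ONLY.get(level1))


bamileke = generalize("BAMILEKE")
bantu = generalize("BANTU")
middle_eastern = generalize("MIDDLE EASTERN")
```

Here a Bamileke response receives all three levels, a Bantu response skips the sub-continent level, and a Middle Eastern response keeps only its own category.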

Responses referring to ethnic groups or skin color, such as Black, White, or Hispanic, had to be given their own categories as well. These responses, however, could not be given a three-level classification, as they were already general and did not relate to a geographical location. Finally, any responses referring to religion (e.g. Catholic, Muslim) or language (e.g. Urdu Speaking, Francophone) were given their own categories. Certain religious responses could be generalized, such as Catholic, Christian, and Evangelist, which all stem from Christianity; however, religious classification is not the focus of this paper.

Figure: Categorization process for the ethnicity variable, showing how the categories were made.

2.3 Programming tools

In terms of programming, this application was developed in Python, with the help of a number of libraries including Pandas, NumPy, and Fuzzy Matching. This code can be found at https://github.com/cleaaum/iCARE_Sorting_Ethnicity_Country. The main data structures used were Python dictionaries, which map an ethnic group or category to a list of strings that represent different ways of referring to that category. The dataset containing the ethnicity data is stored in a dataframe, and a series of functions is applied to the ethnicity-response column to transform the data into a classified three-level format. This is done through vectorization, which applies each operation to the whole column at once rather than through explicit Python loops, improving the code's runtime.

The following flowchart illustrates the programming process. In step one, the ethnicity response is capitalized and stripped of all punctuation and non-alphabetical characters. In step two, the normalized response is looked up in the ethnicity data dictionary, returning a list of all matches found. In step three, if nothing was found, the response is looked up again with fuzzy matching (using the Levenshtein distance algorithm, with the match ratio threshold set to 92%) in order to account for similar words and spelling mistakes. This is done only when the exact lookup finds no match, since fuzzy matching is computationally expensive. Step four adjusts the list of ethnicities found. For example, a response that reads "I am Taiwanese (Asian)" will be matched with an ethnicity list of [Taiwanese, Asian] by the algorithm. The second element (Asian) is removed from the list, since Taiwanese is more descriptive than Asian. This occurs in a number of cases and removes redundancies. Once step four is complete, the value [Taiwanese] becomes the first level of granularity. Step five generalizes this value to obtain the second level of granularity, in this case [Eastern Asian]. Step six generalizes once more to obtain the third level of granularity, [Asian]. Finally, step seven prints a list of all the ethnicity responses that could not be categorized. These can then be added manually to existing categories, or new categories can be created.
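Steps one to four can be sketched with the standard library alone. This is a simplified illustration, not the repository's code: `difflib`'s similarity ratio stands in for the Levenshtein-based ratio named above (the two measures are related but not identical), matching is done word by word, and the tables `LOOKUP` and `IMPLIED_BY` as well as all function names are hypothetical.

```python
import difflib

# Hypothetical lookup from a known spelling to its category.
LOOKUP = {
    "TAIWANESE": "TAIWANESE",
    "TAIWANAIS": "TAIWANESE",
    "ASIAN": "ASIAN",
    "CAMEROONIAN": "CAMEROONIAN",
}
# Hypothetical table of the broader categories each specific category implies.
IMPLIED_BY = {"TAIWANESE": {"EASTERN ASIAN", "ASIAN"}}

FUZZY_THRESHOLD = 0.92  # mirrors the 92% match ratio mentioned in the text


def normalize(response):
    """Step 1: upper-case and replace every non-letter character with a space."""
    letters = "".join(ch if ch.isalpha() else " " for ch in response.upper())
    return " ".join(letters.split())


def exact_matches(normalized):
    """Step 2: look up each word of the response in the dictionary."""
    found = []
    for word in normalized.split():
        category = LOOKUP.get(word)
        if category is not None and category not in found:
            found.append(category)
    return found


def fuzzy_matches(normalized):
    """Step 3: approximate lookup, used only when the exact lookup finds nothing."""
    found = []
    for word in normalized.split():
        close = difflib.get_close_matches(
            word, list(LOOKUP), n=1, cutoff=FUZZY_THRESHOLD
        )
        if close and LOOKUP[close[0]] not in found:
            found.append(LOOKUP[close[0]])
    return found


def drop_redundant(categories):
    """Step 4: remove any category already implied by a more specific one."""
    implied = set()
    for category in categories:
        implied |= IMPLIED_BY.get(category, set())
    return [c for c in categories if c not in implied]


def classify(response):
    normalized = normalize(response)
    found = exact_matches(normalized) or fuzzy_matches(normalized)
    return drop_redundant(found)


example = classify("I am Taiwanese (Asian)")
misspelled = classify("taiwanesse")
```

The `or` in `classify` reflects the ordering in the text: the more expensive fuzzy lookup runs only when the exact lookup returns an empty list.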

3.0 Results

Looking at the variability in the nature of the iCARE survey responses, ethnicity means one of four things to most respondents: a territorial origin or nationality, an ethnonym, a religion, a language, or any mix thereof. It becomes evident that the definition of ethnicity varies amongst individuals, and in some cases responses indicated that the term ethnicity did not make sense to the respondent.

Case Study: example of 4 responses that are different in nature.

Overall, this classification provides 350 distinct categories at its first level, around 60 categories at its second level, and around 30 categories at its third. It is important to note that this process is dynamic, and that these levels expand as more data is processed and as ethnic diversity increases. This classification was based on 35,000 ethnicity responses in 34 languages and from 191 countries. Within this dataset, 29% of respondents are male and 71% are female, and the average age of respondents is 43. In terms of country demographics, 34.5% of the respondents are from Canada, 7.6% are from France, and 4.7% are from the US. The following radial tree diagram visually represents this hierarchical structure, with its three distinct levels.

4.0 Discussion

4.1 Using results

The classification proposed allows any person analysing the data to pick and choose certain categories and group them together. This means that an ethnic subgroup must be defined by the researcher, and the categories they choose to include are based on the researcher's own assumptions. As an example, to study an Arab or Middle Eastern ethnic group, the categories labelled Western Asian, Northern African, and Middle Eastern may be grouped together and studied as a whole. In this way, the assumptions can be clearly stated and made individually by each researcher. These three levels provide a high-level generalization of a person's ethnicity while still keeping most of the information reported. Addressing ethnicity on three levels of granularity prevents the loss of a person's true identity, and also makes it easier for researchers to navigate the data and state their assumptions.
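As a rough illustration of how such a subgroup could be selected, the sketch below filters records by checking whether any of their three levels falls in a researcher-chosen set. The records, the `SUBGROUP` set, and the function `in_subgroup` are hypothetical and exist only for this example.

```python
# Hypothetical classified records: (level 1, level 2, level 3), with None where
# a level could not be safely inferred.
records = [
    {"id": 1, "levels": ("LEBANESE", "WESTERN ASIAN", "ASIAN")},
    {"id": 2, "levels": ("MOROCCAN", "NORTHERN AFRICAN", "AFRICAN")},
    {"id": 3, "levels": ("MIDDLE EASTERN", None, None)},
    {"id": 4, "levels": ("TAIWANESE", "EASTERN ASIAN", "ASIAN")},
]

# The researcher states the grouping assumption explicitly as a set of categories.
SUBGROUP = {"WESTERN ASIAN", "NORTHERN AFRICAN", "MIDDLE EASTERN"}


def in_subgroup(levels, chosen):
    """True if any available level of the record is one of the chosen categories."""
    return any(level in chosen for level in levels if level is not None)


selected = [r["id"] for r in records if in_subgroup(r["levels"], SUBGROUP)]
```

Keeping the chosen categories in a single named set makes the researcher's assumption visible and easy to report alongside the results.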

4.2 Survey Design

Now that the importance of ethnicity data in identifying health disparities has been established and a classification has been proposed, we can turn to survey design and focus on how ethnicity should be collected. This classification process has revealed that ethnicity has more dimensions than a simple checkbox can capture. By definition, ethnicity uses several cultural markers to describe a person's ethnic background. This is made apparent in the classification process, as the responses have referred to an individual's geographical location, religion, native language, ethnic group, or a combination of these. In order for ethnicity to be representative of individuals across the globe, it could be useful to break down the ethnicity question into a set of four questions that capture the full essence of a person's ethnic background. This may help to normalize ethnicity, guide those who may not know what is meant by ethnicity, and turn an ambiguous question that carries different meanings for different people into a set of precise questions. This becomes crucial when conducting research on a global scale.

4.3 Conclusion

The classification proposed is not a complete solution to the classification of ethnicity, as the data collected could not offer equal comparisons between responses. However, adapting survey design to collect all dimensions of a person's ethnicity, paired with this classification process, could be promising. It could capture a fuller picture of a person's ethnic background, allow fairer comparisons between responses, and make it easier to navigate ethnicity data and focus on specific ethnic subgroups. The key to understanding a person's ethnic background comes from looking at all facets of ethnicity, especially for global-level research, and maintaining a high-level overview of their background while still preserving who they truly identify as.

References

Mays, Vickie M., Ninez A. Ponce, Donna L. Washington, and Susan D. Cochran. "Classification of Race and Ethnicity: Implications for Public Health." Annual Review of Public Health 24 (2003): 83–110. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1361221/.
Kaufman, Jay S. "How Inconsistencies in Racial Classification Demystify the Race Construct in Public Health Statistics." Epidemiology 10, no. 2 (1999): 101–3.
Centers for Disease Control and Prevention. "COVID-19 Hospitalization and Death by Race/Ethnicity." Last modified August 18, 2021. https://www.cdc.gov/coronavirus/2019-ncov/community/health-equity/racial-ethnic-disparities/disparities-hospitalization.html.
Flanagin, Annette, Tracy Frey, and Stacy L. Christiansen. "Updated Guidance on the Reporting of Race and Ethnicity in Medical and Science Journals." JAMA 326, no. 7 (2021): 621–27. https://jamanetwork.com/journals/jama/fullarticle/2776936.
Ahuja, Tejinder S., and Kevin C. Abbott. "Race and Ethnicity in Kidney Disease." Kidney International 98, no. 3 (2020): 545–46. https://www.kidney-international.org/article/S0085-2538(20)30532-9/fulltext.
Mitchell, Kelly W., Lisa A. Carey, and Jeffrey Peppercorn. "Reporting of Race and Ethnicity in Breast Cancer Research: Room for Improvement." Journal of Clinical Oncology 25, no. 24 (2007): 3577–78.
Ma, Irene W. Y., Nadia A. Khan, Anna Kang, Nadia Zalunardo, and Anita Palepu. "Systematic Review Identified Suboptimal Reporting and Use of Race/Ethnicity in General Medical Journals." Journal of Clinical Epidemiology 60, no. 6 (2007): 572–78.
