Page last revised: July 19, 2010

BEST PRACTICES - MAPPING HEALTH DATA

Working with area data

Data about individuals are often available only at an aggregated level in order to protect personal information. For example, average income levels for census tracts are readily available, but the income of an individual person in that census tract is usually not available. Similarly, the total number of people with asthma in a health service area might be known, but not each person’s individual location within that area.

When using area-based data, there are a number of issues to consider:


Ecological fallacy and data analysis

An ecological fallacy occurs whenever a researcher makes assumptions about individuals based on data that have been summarized for areas.

For example, a researcher might examine the relationship between low income and heart disease using data for health regions across Canada. They might find a positive association suggesting that areas with higher percentages of people with low income also have higher percentages of people with heart disease. Concluding that low income is a causative factor for heart disease would be an ecological fallacy. It may be that the people with heart disease are not the same people with low income in any given area, and there is no way of determining this without individual-level data. Studies that use only area-based data are called ‘ecological studies’ and should always be considered as exploratory in nature.
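The gap between area-level and individual-level associations can be shown with a small sketch. The four regions and all counts below are invented for illustration and are not real data. In this deliberately extreme case, the percentages of low income and heart disease rise together across regions, yet no person with low income has heart disease.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical health regions (invented numbers):
# (population, people with low income, people with heart disease, people with both)
regions = [
    (1000, 100, 50, 0),
    (1000, 200, 100, 0),
    (1000, 300, 150, 0),
    (1000, 400, 200, 0),
]

# Area-level (ecological) view: only the regional percentages are visible.
pct_low_income = [100 * low / pop for pop, low, _, _ in regions]
pct_disease = [100 * dis / pop for pop, _, dis, _ in regions]
area_r = pearson(pct_low_income, pct_disease)

# Individual-level view: needs the cross-classified counts ("both"),
# which area-based data normally do not provide.
total_low = sum(low for _, low, _, _ in regions)
total_other = sum(pop - low for pop, low, _, _ in regions)
both = sum(b for _, _, _, b in regions)
disease_other = sum(dis - b for _, _, dis, b in regions)
risk_low_income = both / total_low
risk_other = disease_other / total_other

print(f"Area-level correlation: {area_r:.2f}")
print(f"Heart disease risk, people with low income: {risk_low_income:.3f}")
print(f"Heart disease risk, everyone else: {risk_other:.3f}")
```

The area-level correlation is perfect, while the individual-level risks point the other way. Real data are rarely this stark, but the regional percentages alone cannot rule such a pattern out.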

The case examples by Robinson (1950) and Openshaw (1984) are the best-known in-depth reviews of the effects of the ecological fallacy when using Census or other health registry data.

There are also studies that combine individual-level data with area-based data. These are commonly called multi-level studies and rely on multi-level modelling (MLM). They are meant to include effects at the individual level (for example, heart disease) and ecological level (for example, average income for the neighbourhood) for each person in the study. MLM separately analyzes the variance at different levels of data (e.g., individual vs. neighbourhood) when analyzing the effects on health outcomes. Theoretically, MLM also allows the researcher to analyze at what level, or scale, variations in individual-level health outcomes are best explained.
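As a rough illustration of how variance is separated between levels, a two-level random-intercept model for a continuous outcome can be written as below. The notation is an assumption made for this sketch and is not taken from the papers cited here. A binary outcome such as heart disease would normally use a logistic link instead.

```latex
% Person i living in neighbourhood j
% y_{ij}: individual health outcome
% x_{ij}: individual-level variable (e.g., personal income)
% z_j:    area-level variable (e.g., average neighbourhood income)
y_{ij} = \beta_0 + \beta_1 x_{ij} + \beta_2 z_j + u_j + e_{ij},
\qquad u_j \sim N(0, \sigma_u^2), \quad e_{ij} \sim N(0, \sigma_e^2)

% Share of the unexplained variance that lies between neighbourhoods
\rho = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2}
```

A value of \rho close to 0 suggests that most of the remaining variation lies between individuals. A larger value suggests that the neighbourhood level explains a meaningful share.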

Research papers by Diez-Roux (2000) and Stafford et al. (2001) contain excellent reviews of MLM and provide theoretical as well as analytical summaries of its strengths and limitations for population health research.

In many instances, however, data restrictions or availability limit the ability to obtain individual-level data. For example, many microdata sources may not contain all of the ‘causal’ variables that are required for analyzing a particular health condition. In these instances, it may be necessary to link the microdata with a separate dataset that contains a population average that can be used as a surrogate indicator that is otherwise known to be related to the particular health condition.

Census data are some of the most frequently used surrogate measures in health research. Case examples of how to construct proxy measures of individual socio-economic status can be found in the following papers:

When using area-based data, the following points should be considered:

  • Ecological studies can be very useful and informative, especially for developing hypotheses, but to properly assess any associations found in an ecological study, there must be a follow up study using individual-level data.
  • Using area-based (ecological) variables in combination with individual-level data may also create opportunities for ecological fallacies or introduce error to analyses. The only time this does not occur is when the area-based variable truly represents all the people in the area (i.e., there is no individual variation within the area for that particular characteristic).
  • Most area-based data are subject to the modifiable areal unit problem (MAUP) (see below).

For more information on multi-level modelling: http://www.paho.org/English/DD/AIS/be_v24n3-multilevel.htm

Modifiable areal unit problem

A wide range of health-related data are available only in a summarized form in order to protect individual confidentiality. Examples include Census data, health outcome rates for health services jurisdictions, and vital statistics data.

Whenever individual data are summarized for areas, the statistic of interest (total count, percent of low income, and so on) depends on the area boundaries used, and if different boundaries are used, even for the same individual-level data, different statistics can result. This is commonly referred to as the modifiable areal unit problem (MAUP).

This means that analysis results might change, depending on the area boundaries used!
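A toy example makes the effect concrete. The 4 x 4 grid of individuals below is invented for illustration, with 1 marking a person with low income and 0 marking everyone else. The same 16 people are summarized twice, using two different pairs of area boundaries.

```python
# Hypothetical individuals on a 4 x 4 grid (1 = low income, 0 = not low income).
grid = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]

coords = [(r, c) for r in range(4) for c in range(4)]

def percent_low_income(cells):
    """Percent of people with low income among the given grid cells."""
    values = [grid[r][c] for r, c in cells]
    return 100 * sum(values) / len(values)

# Zoning scheme 1: split the study region into a north and a south area.
north_south = {
    "north": [(r, c) for r, c in coords if r < 2],
    "south": [(r, c) for r, c in coords if r >= 2],
}

# Zoning scheme 2: split the same region into a west and an east area.
west_east = {
    "west": [(r, c) for r, c in coords if c < 2],
    "east": [(r, c) for r, c in coords if c >= 2],
}

ns = {name: percent_low_income(cells) for name, cells in north_south.items()}
we = {name: percent_low_income(cells) for name, cells in west_east.items()}

print(ns)  # {'north': 50.0, 'south': 25.0}
print(we)  # {'west': 75.0, 'east': 0.0}
```

Both schemes use two areas of eight people each and the same underlying individuals, yet one suggests a moderate difference between areas and the other a very large one.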

Analyses using area-based data may also lead to ecological fallacies.

There are no solutions for MAUP, but the following approaches can be useful for minimizing or understanding MAUP effects in analyses using area-based data.

  • Use the smallest possible areas - The underlying reason for MAUP effects is that by summarizing individual-level data, some of the true variability is smoothed away. The larger the areas, the smoother the data will be, so using the smallest possible areas will help to minimize the effects of zoning somewhat. This still does not address the basic aggregation issue, which relates to how the area boundaries are drawn, rather than to the number of areas used.

  • Conduct a sensitivity analysis and report the results - The best scenario would be to test out a number of aggregation or zoning schemes on the individual-level data, but if the individual-level data are available for this kind of analysis, there is no need to summarize! Just use the individual-level data.

    If you only have area-based data, you could do a sensitivity analysis using different scales. For example, do your study with the smallest areas, then repeat it using fewer, larger areas to see how the results change. This could mean using data from census dissemination areas, census tracts, and so on. Report whether or not the analysis results were affected.
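A scale sensitivity analysis of this kind might be sketched as follows. The small areas, their parent tracts, and all counts are hypothetical. The statistic compared here, the correlation between percent low income and the disease rate, is only one possible choice.

```python
from collections import defaultdict
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical small areas (invented numbers):
# (area id, parent tract, population, people with low income, disease cases)
small_areas = [
    ("DA1", "T1", 500, 50, 5),
    ("DA2", "T1", 500, 150, 4),
    ("DA3", "T2", 400, 40, 2),
    ("DA4", "T2", 600, 240, 9),
    ("DA5", "T3", 500, 100, 6),
    ("DA6", "T3", 500, 200, 5),
    ("DA7", "T4", 300, 30, 3),
    ("DA8", "T4", 700, 350, 8),
]

def summarize(rows):
    """Return (% low income, cases per 1,000) lists for (pop, low, cases) rows."""
    pct = [100 * low / pop for pop, low, _ in rows]
    rate = [1000 * cases / pop for pop, _, cases in rows]
    return pct, rate

# Scale 1: the smallest available areas.
da_rows = [(pop, low, cases) for _, _, pop, low, cases in small_areas]

# Scale 2: fewer, larger areas, built by summing counts within each parent tract.
totals = defaultdict(lambda: [0, 0, 0])
for _, tract, pop, low, cases in small_areas:
    totals[tract][0] += pop
    totals[tract][1] += low
    totals[tract][2] += cases
tract_rows = [tuple(counts) for counts in totals.values()]

r_da = pearson(*summarize(da_rows))
r_tract = pearson(*summarize(tract_rows))

print(f"Correlation using {len(da_rows)} small areas: {r_da:.2f}")
print(f"Correlation using {len(tract_rows)} larger areas: {r_tract:.2f}")
```

With these invented numbers, the association looks noticeably stronger at the coarser scale, which is the kind of difference that should be reported alongside the main results.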

Rate instability

Comparing rates of health outcomes among different areas is commonly used for disease surveillance purposes, i.e., to identify areas where disease rates are higher or lower than expected. However, incidence rates computed for areas can be highly unreliable or ‘unstable’, especially when calculated for sparsely populated rural or remote areas, or for rare diseases. Spatial patterns of rates may vary for a number of reasons:

  • The way areas are defined geographically can influence the calculated rates (i.e., the number of cases and the population included in an area can change if the boundary changes). See more on the modifiable areal unit problem.
  • Data errors, including incorrect disease coding, incorrect geocoding, or incorrect estimates of the population at risk (especially between census years), may create false variations (see note 1).

When working with rate data for different areas which may be unstable due to small populations, researchers and analysts should consider the following:

  1. Conduct a sensitivity analysis. This may be done simply by investigating how the addition or subtraction of one case affects the rates.
  2. Use areas with populations large enough to produce stable results if possible. For example, Alberta Health developed regions for analysis with at least 20,000 people. For rare diseases, combine data from multiple years (e.g., average over five years).
  3. When smaller areas are preferred and there is rate instability, spatial smoothing may reduce rate instability. Simply put, this process averages or combines information from neighbouring areas to create larger counts of cases and populations, which produce more stable rates. Use caution, as there are a number of methods for spatial smoothing, and over-smoothing may hide areas with truly excessive rates.

    • Explore rate smoothing with GeoDa (http://geodacenter.asu.edu/). This program runs on Windows XP. Check out the excellent documentation.
    • Richardson et al. (2004) suggest that smoothing maps of relative risk may perform better when relative risks are on the order of 2 and expected numbers of cases per area are at least 20.
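The one-case sensitivity check and the idea of pooling neighbouring areas can be sketched as follows. The area names, counts, and neighbour lists are hypothetical. The pooling shown is only one very simple form of smoothing, and it is not presented as the method used by GeoDa or by Richardson et al. (2004).

```python
# Hypothetical areas (invented counts of cases and population).
areas = {
    "rural_1": {"cases": 2, "pop": 800},
    "rural_2": {"cases": 0, "pop": 500},
    "town": {"cases": 12, "pop": 9000},
    "city": {"cases": 150, "pop": 120000},
}

# Hypothetical adjacency: which areas share a boundary with each area.
neighbours = {
    "rural_1": ["rural_2", "town"],
    "rural_2": ["rural_1", "town"],
    "town": ["rural_1", "rural_2", "city"],
    "city": ["town"],
}

def rate_per_100k(cases, pop):
    """Crude rate per 100,000 population."""
    return 100_000 * cases / pop

def one_case_range(cases, pop):
    """Rates obtained by subtracting or adding a single case."""
    low = rate_per_100k(max(cases - 1, 0), pop)
    high = rate_per_100k(cases + 1, pop)
    return low, high

def pooled_rate(name):
    """Rate after pooling an area's cases and population with its neighbours."""
    group = [name] + neighbours[name]
    cases = sum(areas[a]["cases"] for a in group)
    pop = sum(areas[a]["pop"] for a in group)
    return rate_per_100k(cases, pop)

for name, data in areas.items():
    crude = rate_per_100k(data["cases"], data["pop"])
    low, high = one_case_range(data["cases"], data["pop"])
    print(
        f"{name}: crude {crude:.1f}, one case fewer/more {low:.1f} to {high:.1f}, "
        f"pooled with neighbours {pooled_rate(name):.1f}"
    )
```

In this sketch a single case moves the rate for the smallest areas by more than 100 per 100,000, while the rate for the large area barely changes. Pooling pulls the small-area rates toward their neighbours, which stabilizes them but can also mask a genuinely high local rate.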

Useful links:

Population data for BC (includes by age/sex): http://www.bcstats.gov.bc.ca/data/pop/popstart.asp

References:

Carstairs, V. and R. Morris (1989). "Deprivation and Mortality - an Alternative to Social-Class." Community Medicine 11(3): 210-219.

Diez-Roux, A. V. (2000). "Multilevel analysis in public health research." Annual Review of Public Health 21: 171-192.

Macintyre, S., S. Maciver, et al. (1993). "Area, Class and Health - Should We Be Focusing on Places or People." Journal of Social Policy 22: 213-234.

Openshaw, S. (1984). "Ecological Fallacies and the Analysis of Areal Census-Data." Environment and Planning A 16(1): 17-31.

Pampalon, R. and G. Raymond (2000). "A Deprivation Index for Health and Welfare Planning in Quebec." Chronic Diseases in Canada 21(3): 104-113.

Robinson, W. S. (1950). "Ecological Correlations and the Behavior of Individuals." American Sociological Review 15: 351-357.

Stafford, M., M. Bartley, et al. (2001). "Characteristics of individuals and characteristics of areas: investigating their influence on health in the Whitehall II study." Health & Place 7(2): 117-129.

Krieger, N., J. T. Chen, P. D. Waterman, M. J. Soobader, S. V. Subramanian and R. Carson (2002). "Geocoding and monitoring of US socioeconomic inequalities in mortality and cancer incidence: Does the choice of area-based measure and geographic level matter? The Public Health Disparities Geocoding Project." American Journal of Epidemiology 156(5): 471-482.

Nakaya, T. (2000). "An information statistical approach to the modifiable areal unit problem in incidence rate maps." Environment and Planning A 32(1): 91-109.

Soobader, M. J., F. B. LeClere, W. Hadden and B. Maury (2001). "Using aggregate geographic data to proxy individual socioeconomic status: Does size matter?" American Journal of Public Health 91(4): 632-636.

Richardson, S., A. Thomson, N. Best and P. Elliott (2004). "Interpreting posterior relative risk estimates in disease-mapping studies." Environmental Health Perspectives 112: 1016-1025.



1. Population counts are extrapolated between census years, which can significantly alter rates if the extrapolation is erroneous. For example, a census tract may have experienced significant immigration that increased observed disease rates, but the denominator (population) used to calculate risk may not have reflected this immigration because the count was extrapolated between census years.

