STATISTICS TUTORIAL

STARTING WITH A RESEARCH QUESTION

Statistical tables begin with a research question based on what we would like to know about a particular relationship. In the case of this database, the kinds of questions we asked had to do with relationships of gender and work. For example, we might ask:

“Are part-time workers in Canada more likely to be male or female?”

This tells us that:

1. gender is an important factor here (male or female)
2. we are looking for a form of employment, in this case part-time workers
3. we are only looking at Canadian data
4. we are trying to draw a conclusion based on the relationship between them, in this case who is more (or less) likely to be working part-time

BUILDING A TABLE

To move from a research question to a statistical table, we begin by establishing variables. A variable is a property of something that can take different values. For example, one variable needed to answer our research question is gender (Statistics Canada uses the term “sex” for this variable). “Sex” can be broken down into two categories: male and female. The categories of each variable should be exhaustive (in other words, they cover all the ways the variable can be expressed: male, female, and both sexes) and mutually exclusive (males can be differentiated from females).

So, our variables to answer the above question would be:

example of a basic statistical table in the database, employment form by sex

SURVEY USE

Researchers then turn to existing Statistics Canada surveys to answer their research question. There are many Statistics Canada surveys available on a variety of topics (Statistics Canada provides an alphabetical list). Some of the Statistics Canada surveys used to create the GWD tables are:

Census of Population
General Social Survey (GSS)
Labour Force Survey (LFS)
Survey of Labour and Income Dynamics (SLID)
Survey of Self-Employment (SSE)
Survey of Work Arrangements (SWA)
Workplace and Employee Survey (WES)

Each survey uses a different sample size. A sample is a smaller group selected from the whole. In this case, the whole is every person (or workplace) in Canada, and a sample is a smaller portion of that. Samples, which are randomly selected by Statistics Canada, are used because of the difficulty of surveying every single person in Canada. Results from a sample can be generalized to the population as a whole, but they are estimates rather than perfectly accurate measurements. In general, the bigger the sample, the more accurately the results reflect the population.

Each survey has particular strengths and weaknesses. For example, some surveys are one-time surveys, which means they were conducted only once. Other surveys are conducted regularly. Some surveys offer in-depth information about areas of interest. For example, the Survey of Self-Employment (SSE) provides detailed data about self-employment, whereas the Workplace and Employee Survey (WES) provides detailed information about changes in employer business strategies and in the organization of work.

CHOOSING VARIABLES

The GWD includes many variables related to gender and work, including:

Social location and demographic variables (e.g. sex, age, immigrant status)
Domestic sphere and family variables (e.g. marital status, presence or absence of children)
Time and place (e.g. year, area of residence such as province or Census metropolitan area (CMA))
Workplace variables (e.g. industry, occupation)
Form of employment (e.g. part-time, full-time, temporary, permanent, self-employed)

Variables are cross-tabulated to produce tables that illustrate relationships. Our research question for this tutorial is: are part-time workers in Canada more likely to be male or female? Below is a simple table with data related to this research question:

example of a multidimensional table in the database listing year, geography, sex and employment form
source: Labour Force Survey, 2008

In this example, the figures in the table represent the total number of employees, both full-time and part-time, male and female. The total is necessary in order to calculate a percentage (such as the percentage of female part-time employees). Out of 2,585,600 part-time workers, 789,700 are male, and 1,795,900 are female.

To calculate percentages from these numbers, click the “Total and Percentages” icon on the Table tab, located at the top left corner of the webpage, and scroll down to “Percentage of row total”. This calculates the percentage of men and women within full-time and within part-time employment.

example of obtaining percentage of row totals in the Beyond 20/20 statistical tables

The table below demonstrates the result with percentages calculated:

example of output table displaying the percentage of men and women by employment form

We find that about 69% of part-time employees are female and about 31% are male. From this we can propose that part-time employees are more likely to be female. Breaking down labour force data by gender is important because it reveals these distinctive patterns of work.
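The same row percentages can be checked by hand. The sketch below uses the part-time counts quoted earlier in this section (Labour Force Survey, 2008). The function name `row_percentages` is our own, chosen for illustration; it is not part of Beyond 20/20.

```python
def row_percentages(row):
    """Return each cell as a percentage of the row total, rounded to one decimal."""
    total = sum(row.values())
    if total == 0:
        raise ValueError("row total is zero; percentages are undefined")
    return {label: round(100 * count / total, 1) for label, count in row.items()}


# Part-time employees by sex, as quoted in the text.
part_time = {"male": 789_700, "female": 1_795_900}

shares = row_percentages(part_time)
print(sum(part_time.values()))  # 2585600, the part-time total
print(shares)                   # {'male': 30.5, 'female': 69.5}
```

Dividing each cell by its own row total is what “Percentage of row total” means here: the denominator is all part-time workers, not all workers.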

The multidimensional tables in the GWD are quite complicated: each table contains up to eight variables, or dimensions. The dimensions are subdivided into smaller groups, called items. For example, in the dimension “sex”, the items are “total”, “male” and “female”. It is therefore useful to have a clear research question before exploring the statistical tables. What relationship(s) do you want the data to highlight?
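To make the ideas of dimensions, items and totals concrete, here is a minimal sketch of a cross-tabulation. The six records are invented for illustration only and are not survey data, and the helper names `cross_tabulate` and `collapse_to_total` are our own. This is not how Beyond 20/20 stores its tables; it only shows the principle.

```python
from collections import Counter

# Hypothetical respondent records, invented for illustration only.
records = [
    {"sex": "female", "employment_form": "part-time"},
    {"sex": "female", "employment_form": "part-time"},
    {"sex": "female", "employment_form": "full-time"},
    {"sex": "male", "employment_form": "part-time"},
    {"sex": "male", "employment_form": "full-time"},
    {"sex": "male", "employment_form": "full-time"},
]


def cross_tabulate(rows, dimensions):
    """Count records for every combination of items across the given dimensions."""
    return Counter(tuple(row[d] for d in dimensions) for row in rows)


def collapse_to_total(table, position):
    """Sum over one dimension (by its position in the key), as if it were set to its total."""
    collapsed = Counter()
    for key, count in table.items():
        collapsed[key[:position] + key[position + 1:]] += count
    return collapsed


table = cross_tabulate(records, ["employment_form", "sex"])
print(table[("part-time", "female")])  # 2

by_form = collapse_to_total(table, 1)  # collapse "sex" to its total
print(by_form[("part-time",)])         # 3
```

Each key in `table` is one combination of items, one per dimension. Collapsing a dimension adds up its items, which is what a “total” item represents.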

THE DEFAULT VIEW

GWD tables are organized so that when users open them, the topic of the table is displayed in the most concise format. Below is a sample default view of a table for employment form, including year, geography, age, immigrant status, industry, occupation, union coverage and sex. In the default view, most of the variables are collapsed to their totals.

example of a Beyond 20/20 multidimensional table in default view including nine variables

Once you have opened the table to its default view, you can then modify it as you would like.

MODIFYING THE TABLE LAYOUT

Each statistical table can be modified to highlight certain dimensions and items of interest, in relation to the research question. This section demonstrates how to modify tables.

Rearranging Dimensions

By default, most of the dimensions are organized on the left side of the table. The order of these dimensions can be rearranged to highlight what is of interest in the analysis. There must always be at least one dimension in the row bar and at least one in the column bar. To rearrange the dimensions, click the “Tasks” tab, scroll down and click “Set dimension order”. This allows users to move and modify the order of dimensions.

example of rearranging the order of variables in a multidimensional table in columns, rows and other

You may also want to select or deselect items to view within a table. Simply click on the dimension of interest, and select (or deselect) the desired items.

example of selecting and deselecting items for table view

Nesting Dimensions

In some cases it may be useful to nest dimensions to achieve a concise visual representation. Instead of comparing laterally across the cells, the “sex” dimension can be nested within the “employment form” dimension.

1. Start by hovering over the “sex” dimension until the cross-arrows cursor appears, then click and hold the dimension until it becomes a yellow bar.

example of nesting the dimension sex in between employment form and union coverage

2. While holding the highlighted dimension with your cursor, drag it to the dimension you want to nest it within. The yellow bar will highlight the tip of the other dimension, showing where the dragged dimension will be placed. Release the dimension to set it in place.

example of nesting the dimension sex in between employment form and union coverage

Below is an example of “sex” nested within the “employment form” dimension:

example of table output after nesting the dimension sex in between employment form and union coverage

Charts and Graphs

Beyond 20/20 allows for several data visualization options, including pie, column and bar charts. To create a visualization of data, select the Chart tab located at the top left side of the table, next to the Table tab.

example of chart tab

From the Chart tab, you can modify the chart or graph by selecting the specific type you want to display your data.

drop down menu of different chart and graph options

Below is an example of a column chart:

example of bar chart displaying employment form by sex by union coverage

Hiding Dimensions

Unnecessary dimensions may be hidden, or removed, from table view. Before hiding any dimensions, ensure the dimension is set to the total. Start by hovering over the dimension you want to hide until the cross-arrows cursor appears; click and hold the dimension until it becomes a yellow bar. Drag the dimension to the “Other:” bar, located at the top of the table. In this example, “Immigrant” is not relevant and will be removed from the table view.

table example of hiding the dimension immigrant status in the other bar

MISSING AND SUPPRESSED DATA

Specific symbols may appear in table cells to represent missing, suppressed or unreliable data. These symbols are as follows:

A lowercase “x” or capital “F” is used to represent suppressed data. Cell counts are suppressed when showing a small number of specific cases would compromise participant confidentiality, or when the sample is too small to draw accurate conclusions (see Confidence Intervals below).

A single period or dot “.” is used to represent missing data.

A capital E in brackets “(E)” marks cell counts that should be used with caution. These counts are above the suppression threshold, so they are published in the statistical tables, but they are flagged for data quality purposes.

Zeros “0” do not necessarily signify the absence of data in the table cell. Most of the time, zeros reflect cell counts too low for the cutoff of the surveys used. Each survey has a cutoff below which it records values as zero. For example, in a survey which uses 1,500 as a cutoff, if only 1,000 data points were collected, then the value for that collection of data would be shown as a zero. Surveys vary in their cutoff levels (see Cutoffs by Survey below). A survey with a higher cutoff might show more zeros in table cells than surveys with a lower cutoff.
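If you export table cells and process them yourself, the symbols above need to be handled before any arithmetic. The sketch below is one possible way to do that. It assumes cells arrive as text, that thousands are separated by commas, and that “(E)” follows the number in the same cell; none of this is confirmed for any particular export format. The function names are our own.

```python
# Symbol meanings as described in the text above.
SYMBOLS = {
    "x": "suppressed",
    "F": "suppressed",
    ".": "missing",
}


def interpret_cell(raw):
    """Return (value, status) for a raw table cell given as a string."""
    text = raw.strip()
    if text in SYMBOLS:
        return None, SYMBOLS[text]
    status = "ok"
    if text.endswith("(E)"):  # assumed layout: number followed by the flag
        status = "use with caution"
        text = text[: -len("(E)")].strip()
    value = int(text.replace(",", ""))
    if value == 0:
        # A zero may be a count below the survey cutoff, not a true zero.
        status = "zero or below survey cutoff"
    return value, status


def apply_cutoff(count, cutoff):
    """Show counts below the survey cutoff as zero, as the text describes."""
    return count if count >= cutoff else 0


print(interpret_cell("x"))          # (None, 'suppressed')
print(interpret_cell("1,500 (E)"))  # (1500, 'use with caution')
print(apply_cutoff(1_000, 1_500))   # 0
```

Returning a status alongside the value keeps suppressed or missing cells from being silently treated as numbers.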

CONFIDENCE INTERVALS

Most surveys, since they are based on samples and not the entire population, produce data estimates. Statistics Canada decides whether to release data based on its confidence that the estimate produced from the sample accurately reflects the larger population. For example, if the data shows that 70% of the sample is employed, and Statistics Canada is confident that this figure accurately represents the entire Canadian population, then the data will be released.

A confidence interval is a range around an estimate within which the true population value is expected to fall with a stated level of confidence (commonly 95%). It expresses the estimate's margin of error. Any survey that uses a sample instead of the whole population will have some margin of error. The goal of good data representation is to ensure that this margin is as small as possible, so that there is a high level of confidence that the numbers are fairly accurate.
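As a rough illustration of how the margin of error shrinks as the sample grows, the sketch below applies the textbook normal-approximation formula for a proportion to the 70% example above. It assumes simple random sampling, and the sample sizes are hypothetical. Actual surveys generally use more complex sample designs, so published margins will differ from these.

```python
import math


def proportion_confidence_interval(p, n, z=1.96):
    """Approximate confidence interval for a sample proportion.

    Assumes simple random sampling; z=1.96 corresponds to about 95% confidence.
    """
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    margin = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - margin), min(1.0, p + margin)


# 70% employed, as in the example above; the sample sizes are hypothetical.
for n in (100, 1_000, 10_000):
    low, high = proportion_confidence_interval(0.70, n)
    print(f"n={n:>6}: {low:.3f} to {high:.3f}")
```

With a sample of 100 the interval spans roughly 61% to 79%; with larger samples it narrows around 70%.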

i. More on Statistics Canada’s general policy on data quality standards
ii. Statistics Canada Data Quality Guidelines

CUTOFFS BY SURVEY

Statistics Canada surveys vary in their sample sizes and in their rules for suppression. Check the survey documentation for more information, such as the Coefficient of Variation tables included in most microdata file user guides.

Statistics Canada’s policy on release of data categorizes estimates as unqualified (acceptable), qualified (marginal) and not for release (unacceptable).

Unqualified: Coefficient of Variation (CV) of 0.0 to 16.5%
Estimates can be considered for general unrestricted release.

Qualified: CV of 16.6 to 33.3%
Estimates can be considered for general unrestricted release but should be accompanied by a warning cautioning users of the high sampling variability associated with the estimates. Because the Beyond 20/20 format did not allow us to insert such cautions, it is wise to consult the additional CV tables supplied by Statistics Canada to evaluate the accuracy of the estimates.

Not for release: CV of 33.4% or over
Estimates should not be released in any form under any circumstances. In statistical tables, such estimates are deleted (in the case of the GWD, zeros appear in the tables in their place).
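The three release categories above can be expressed as a simple lookup. The sketch below assumes the CV is expressed as a percentage and rounded to one decimal place before being compared with the thresholds, since the thresholds are stated to one decimal. It also uses the standard definition of the CV as the standard error divided by the estimate. The function names are our own.

```python
def coefficient_of_variation(estimate, standard_error):
    """CV as a percentage: standard error divided by the estimate, times 100."""
    if estimate == 0:
        raise ValueError("CV is undefined for an estimate of zero")
    return 100 * standard_error / estimate


def release_category(cv_percent):
    """Classify an estimate by its coefficient of variation (in percent)."""
    if cv_percent < 0:
        raise ValueError("CV cannot be negative")
    cv = round(cv_percent, 1)  # assumption: thresholds apply to one-decimal CVs
    if cv <= 16.5:
        return "unqualified"
    if cv <= 33.3:
        return "qualified"
    return "not for release"


cv = coefficient_of_variation(estimate=200, standard_error=50)  # hypothetical values
print(cv, release_category(cv))  # 25.0 qualified
```

An estimate of 200 with a standard error of 50 has a CV of 25%, so it would be released only with a caution.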

In addition to varying by survey, these thresholds can vary by province. The table below shows the cutoffs for various surveys used for tables in the GWD.

MINIMUM SAMPLE SIZE FOR DATA RELEASE, BY SURVEY

minimum sample size by Canadian provinces and Statistics Canada surveys

USING STATISTICAL DATA

It is essential to be careful in how you use statistical data. First, think about what story you are trying to tell with the data. Numeric data does not stand alone. It needs to be interpreted through a conceptual framework which each researcher provides. What questions are you trying to answer with this data? What is the best way to apply the data to your specific needs?

Second, do not assume that statistical data is truth simply because it is numeric. Be critical of how the data was obtained and what types of questions were asked. You can click on the summary icon in the tables to find out more about how the variables were developed, and/or what questions were asked to obtain the results. You might also want to examine the number of people who answered “do not know” or did not answer the question in order to assess the reliability of the data.

You should also ensure that your method of calculation is correct. For example, if you are trying to calculate a percentage from the table (such as what percentage of all part-timers in the workforce are women), make sure that you are using the correct total.
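The choice of total changes the question being answered. The sketch below uses toy counts, invented purely for illustration and not taken from any survey, to show how the same cell gives two different percentages depending on the denominator. The helper `share` is our own.

```python
# Toy counts invented for illustration; these are not survey figures.
counts = {
    ("part-time", "male"): 30,
    ("part-time", "female"): 70,
    ("full-time", "male"): 170,
    ("full-time", "female"): 130,
}


def share(counts, cell, within):
    """Percentage that `cell` represents of all cells matching `within`.

    `within` is "form" to use the employment-form total as the denominator,
    or "sex" to use the sex total.
    """
    index = {"form": 0, "sex": 1}[within]
    total = sum(n for key, n in counts.items() if key[index] == cell[index])
    return 100 * counts[cell] / total


cell = ("part-time", "female")
print(share(counts, cell, "form"))  # 70.0: women as a share of part-timers
print(share(counts, cell, "sex"))   # 35.0: part-timers as a share of women
```

In this toy table, 70% of part-timers are women, but only 35% of women work part-time. Both figures are correct; they answer different questions.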

For more on using statistical data, refer to Statistics Canada’s Definitions, Data Sources, and Methods section.

*Note: You must apply for access to the statistical tables of the Gender & Work Database. See the apply for statistics access page for eligibility requirements and other details.