
Statistical Analysis using OpenOffice

Preparing the workspace

When opening comma-separated data files, make sure that the data preview window looks correct. You may need
to make sure that the correct separators are selected (Figure 1). Upon opening or creating a spreadsheet of data
you should see something like what is shown in Figure 2.

Figure 1: Import preview, selecting separators

Figure 2: View when data are first loaded

One extremely useful feature of a spreadsheet application is the availability of multiple “sheets”. The sheet shown
here is called “Sheet1”. Right click on the sheet tab (bottom left of screen) and rename the first sheet to simply
“Data”. Then create a new sheet called “BasicStats” (Figure 3). Using multiple tabs allows us to separate the
presentation and summary elements from the data. This helps keep us from being distracted by the large table of
data and also keeps the original data safe from accidental modification since we will spend very little time working
on the actual “Data” sheet itself.

Figure 3: Creating a new sheet and saving in native format

Since we intend to make use of some advanced spreadsheet features it is a good idea to make sure that the
document is saved in the native format for our application (Figure 3 — notice that we may need to change the
“File type”).
We can access values on different sheets by referring to the sheet’s name. This allows us to structure our document
logically and consistently, even if we choose to change column names or data values in the future. On the
“BasicStats” sheet we duplicate the column headers by referring to the headers on the first page. Type =Data.D1
into the D1 cell on the “BasicStats” sheet and press enter (Figure 4). If we later decide to change the header
labels on the first page, that change will be duplicated on our “BasicStats” sheet.

Figure 4: Referencing another sheet and the magic behavior of pasting

When you copy and paste a formula in a spreadsheet, the cell references behave somewhat magically: they are
relative, so they shift along with the cell you paste into. On the “BasicStats” sheet select the D1 cell (notice
that the formula in the formula bar still says “=Data.D1”) and copy that cell by pressing Ctrl+C. Then click and
drag so that cells E1 through T1 are selected (still on the “BasicStats” sheet). Paste your formula into these
cells by pressing Ctrl+V (Figure 4). Notice that if you select one of the other cells, the formula has changed to
reference the corresponding cell on the “Data” sheet.
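
For readers who think in code, the pasting behavior can be sketched in a few lines of Python. This is only an illustration of relative references; the helper name column_name is my own and is not part of OpenOffice.

```python
def column_name(index):
    """Convert a 1-based column index to spreadsheet letters (1 -> A, 27 -> AA)."""
    name = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        name = chr(ord("A") + remainder) + name
    return name


# Pasting =Data.D1 into E1 through T1 (columns 5 through 20) shifts the
# referenced column along with the cell that receives the formula.
formulas = {
    f"{column_name(i)}1": f"=Data.{column_name(i)}1" for i in range(5, 21)
}

for cell, formula in formulas.items():
    print(cell, formula)
```

Each pasted cell ends up pointing at the cell with the same column and row on the “Data” sheet.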

Standard Numerical Summaries

We are now able to easily compute all of the basic numerical summaries. We start by labeling each row (just type
the label text into the corresponding cell) and compute the mean of the “not English” column (column D), which
has data in rows 2 through 237, by typing =AVERAGE(Data.D2:D237) (Figure 5). We can similarly use the MAX,
MEDIAN, MIN, PERCENTILE, and STDEV functions to compute the other values (Figure 5).

Figure 5: Computing numerical summaries

We can now easily compute all of the other summaries by simply copying these seven cells and pasting them into
the remaining columns. Furthermore, since the spreadsheet can do standard arithmetic as well, we can compute
some of our own values such as the LOW and HIGH values that arise from the 1.5 IQR rule (Figure 6). Notice
that we do not need to precede the cell number with the sheet name when referring to a cell on the same sheet
we are working on.

IQR = D8 - D6
1.5 IQR = 1.5 * D11
LOW = D6 - D12
HIGH = D8 + D12

Figure 6: Applying the 1.5 IQR Rule
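
The arithmetic behind the 1.5 IQR rule is short enough to sketch in Python. The function name iqr_fences and the quartile values are my own, chosen only for illustration.

```python
def iqr_fences(q1, q3):
    """Return the (LOW, HIGH) cutoffs given the first and third quartiles."""
    iqr = q3 - q1      # IQR = Q3 - Q1
    step = 1.5 * iqr   # 1.5 IQR
    return q1 - step, q3 + step


# Hypothetical quartiles, not taken from the data set.
low, high = iqr_fences(4.0, 8.0)
print(low, high)  # -2.0 14.0
```

Any value below LOW or above HIGH would be flagged as a potential outlier.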

Analyzing a single variable

Calculations

I recommend creating a new sheet for each variable you wish to analyze. As an example, I have created a new
sheet titled “NotEnglish”. The description for column D is “the percentage of people who do not speak English
at home.” Therefore those values are percentages of the total population of each county which is held in column
T. If we wanted to compute the (approximate) number of individuals who do not speak English at home in Ada
county (row 2 of our data) we would need the following computation.

number not english = total population × percent not english (as decimal)
                   = value in column T × value in column D / 100
                   = Data.T2 * Data.D2/100

We place this computation into cell A2 on our “NotEnglish” sheet then copy and paste that formula into rows 3
through 237 (select them all at once before pasting!). This tells us the total number of individuals (per county) who
live in a household that does not speak English. We could then SUM (add) those values to approximate the total
number of individuals in the entire country who live in a household that does not speak English (Figure 7) [1].

Figure 7: Counting the number of individuals who do not speak English at home
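
The same per-county computation and sum can be sketched in Python. The populations and percentages below are invented for illustration and are not the real county data.

```python
# Hypothetical (total population, percent not English) pairs, one per county.
counties = [(1000, 10.0), (2500, 4.0), (400, 25.0)]

# Equivalent of Data.T2 * Data.D2/100, applied to every row.
counts = [population * percent / 100 for population, percent in counties]

# Equivalent of SUM over the resulting column.
total = sum(counts)
print(counts, total)
```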
[1] If a cell ever shows “###” then the column is not wide enough to display the value. If this happens, simply make the column
wider by dragging the dividing line in the header row.

Histogram

Creating a histogram in a spreadsheet first requires constructing a frequency table. Looking at the five-number
summary for the “not english” column helps us decide how to break up our bins for the histogram. Since our
minimum value is 2.7 and our maximum value is 78.4 we will choose to start at zero and take steps of 5. This
should (potentially) give us 17 bars in our histogram. We start by typing in some headers and the bin values
we chose (Figure 8; ignore the highlighted “frequency” column for now).

Figure 8: Frequency data for non-English households

Now that we have decided how the data should be grouped, we need to tell the spreadsheet to actually compute
the frequencies. Highlight the set of cells where you wish the frequencies to be placed (in this case we are putting
them in cells D8 through D24). When we type in the FREQUENCY command we need to tell it what data are
of interest as well as what bins to use. Here we want the “not english” data and the bins that we just typed,
so our command will be =FREQUENCY(Data.D2:D237; C8:C24). Now, instead of pressing Enter as usual,
press Ctrl+Shift+Enter all at once [2]. This should fill in the frequencies as shown in Figure 8. Notice that
FREQUENCY treats each bin value as an upper limit rather than truncating down: a value is counted in the first
bin that is greater than or equal to it (our minimum, 2.7, was counted in the 5 bin rather than the 0 bin).
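
To see how the counting works, here is a minimal Python sketch of the rule as I understand it: each bin value is an inclusive upper limit, and there is one extra slot for anything above the last bin. The function name frequency and all sample values except the 2.7 minimum and 78.4 maximum are my own.

```python
from bisect import bisect_left


def frequency(data, bins):
    """Count values per bin; bins must be sorted in ascending order."""
    counts = [0] * (len(bins) + 1)  # extra slot for values above the last bin
    for x in data:
        # Index of the first bin value that is greater than or equal to x.
        counts[bisect_left(bins, x)] += 1
    return counts


bins = list(range(0, 85, 5))     # 0, 5, 10, ..., 80 (17 bins)
sample = [2.7, 5.0, 5.1, 78.4]   # mostly hypothetical values
counts = frequency(sample, bins)
print(counts)
```

Here 2.7 and 5.0 both land in the 5 bin, 5.1 lands in the 10 bin, and 78.4 lands in the 80 bin.
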
To create the actual chart, highlight the two columns (just the numbers) we just created and select “Insert” →
“Chart”. In the wizard dialog that pops up make sure that the “First column as label” box is checked. Select
“Next”, make sure that a normal column-type bar chart is selected (select “Next” when appropriate). Make sure
that your chart has an appropriate title and disable the legend. This will create a histogram which is ready to be
copied into a document as part of a report. If your title font needs to be adjusted (for instance, your title is
too long and you need to make the font smaller), right click on the title [3] and select either “Object Properties”
or “Title” → “Main Title”.
[2] Even though one says “all at once” you may hold down the Ctrl and Shift keys and then press enter.
[3] You may need to first double-click the chart to change into chart editing mode.

Figure 9: Completed histogram

Boxplot

Begin by arranging the five-number summary in the order shown in Figure 10 [4]. Select the data and labels (6 rows
and 2 columns here, though it is possible to plot more than one boxplot simultaneously by adding more columns).
In the menus select “Insert” → “Chart”. Create a “Stock Chart” of type 2 with “Data Series” in “Rows”, and
“Grid lines” disabled for both the X and Y axis. Disable the legend and adjust the title. Once finished
you should have a boxplot that is missing the median line (Figure 10). Unfortunately, I know of no way to insert
the line for the median so we must do without that.

Figure 10: Completed boxplot

Two-Variable Statistics

Fortunately for us, spreadsheets are a bit more friendly when it comes to two-variable statistics. We can compute
the correlation between a pair of variables using the CORREL function. Create a new sheet with the name “above 65
vs men never”. We will be comparing columns N (percent above age 65) and O (percent men never married). In our
new sheet we compute the correlation by typing =CORREL(Data.N2:N237;Data.O2:O237) and find a weak
negative association (r = −0.25).
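
For reference, the correlation coefficient can be sketched by hand in Python. This is the standard Pearson formula, which I am assuming is what CORREL computes. The sample lists are invented to show perfect positive and negative association.

```python
from math import sqrt


def correl(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data, not columns N and O from the data set.
r_up = correl([1, 2, 3], [2, 4, 6])    # points rise along a line
r_down = correl([1, 2, 3], [3, 2, 1])  # points fall along a line
print(r_up, r_down)
```

Values of r near 1 or −1 indicate a strong linear association, while values near 0 indicate a weak one.
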
[4] I included my boxplot on my “NotEnglish” sheet just below my histogram (I scrolled down a page) so that all of my “not english”
analysis can be found on the same sheet.

Pressing on, we can draw a scatterplot by switching to the “Data” sheet, highlighting columns N and O (simply
by selecting the N and O headers), and then clicking “Insert” → “Chart”. On the first page of the chart-making
wizard we can tell OpenOffice to place the chart into our “above 65 vs men never” sheet. Create an “XY Chart”
with “Symbols Only”. As usual, give the chart a meaningful title and disable the legend [5].
We may also add a regression line by selecting “Insert” → “Statistics” and then choosing “Linear regression” [6].
You can add a subtitle by clicking “Insert” → “Title”.

Figure 11: A complete scatterplot
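
The regression line drawn on the chart can also be sketched in Python. I am assuming the chart uses the ordinary least-squares line; the function name and sample points are my own.

```python
def linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points that lie exactly on y = 2x + 1.
slope, intercept = linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0
```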

[5] My version of OpenOffice asks to sort the X axis values. I do not see that selecting “yes” or “no” makes any difference so feel free
to ignore that message.
[6] Again, you may need to first double-click the chart to change into chart editing mode.
