Cleaning our way to a monster dataset

In February 2013 the newly assembled OERRH team completed the humongous task of creating a bank of survey questions, which would be one of the main research instruments for collecting data around the project’s eleven hypotheses. Bear one thing in mind: at the time, each of us was working with a different collaboration (OpenStax, Saylor Academy, Flipped Learning Network, OpenLearn, TESS-India, CCCOER, etc.). Initially, each collaboration was allocated a different hypothesis, which also meant a different pick of questions from the survey bank and a different version of the survey. I’ll give you a couple of examples: our collaboration with the Flipped Learning Network originally focused on teachers’ reflective practices, so flipped educators never answered questions on the financial savings of using OER; students using OpenLearn were not asked about the impact of OER on teaching practices; informal learners did not have questions that related to formal studies, and so on. In addition, collaborations had a stake in the research and input into the design of their survey: questions were discussed further, tweaked, piloted and tweaked again ahead of launching. All in all, we put together 18 different questionnaires. The idea was always to merge all the data into one massive file (what I called the MONSTER) that would allow us to undertake comparative analysis. What follows is the official record of how I laboriously coded, recoded, corrected, deleted and cursed (a bit) my way through the OERRHub surveys in order to have a squeaky clean dataset.

SurveyMonkey and SPSS don’t talk to each other that well

Every researcher knows that there are errors and inaccuracies that need to be ironed out before you commit yourself to analysing quantitative data. We are all human, right? On this occasion, for the first complication that came my way, I’m gonna blame the software: when you export data from SurveyMonkey as an SPSS file, your variable labels and values get mixed up. Let me explain: say you want to find out about OER repositories, so you create a list in SurveyMonkey and ask respondents to tick options from it to answer the question ‘Which OER repositories or educational sites have you used?’. If you expect the items on the list to appear as variable labels in SPSS, they won’t. Instead, the software will repeat your question in the Label box and put the name of the repository in the Values box with a value of 1.

As it happens, the wonderful OER researcher Leigh-Anne Perryman had a solution in her bottomless bag of tricks: the question design in SurveyMonkey had to be amended so that future respondents could tick either ‘yes’ or ‘no’ for each of the repositories on the list. To sort out the damage in any data already collected, I had to manually enter the name of the repository in the Label box and give the variable the values 1=yes and 2=no. Tedious but easy to fix.

Editing the survey questions to include a yes/no answer also served to remedy another software mishap: SurveyMonkey does not differentiate a blank answer from a ‘no’ answer when you download results as an SPSS file. On this occasion, the required fix wasn’t quick. I closely inspected the data case by case: if the respondent did not choose any of the options in a particular question, I treated every option in that question as a ‘missing’ value; if the respondent ticked at least one option, the blank answers were recoded as ‘no’.
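For anyone who would rather script this than do it by hand, the rule can be sketched in a few lines of plain Python. This is only an illustration, not the SPSS procedure I followed: the function name `recode_tickbox_question` and the use of `None` for a blank are assumptions, and it assumes that any tick at all in a question means the remaining blanks are ‘no’. The 1=yes and 2=no coding is the one described above.

```python
YES, NO, MISSING = 1, 2, None

def recode_tickbox_question(answers):
    """Recode one respondent's answers to one tick-box question.

    `answers` maps option name -> 1 if ticked, None if left blank.
    (Hypothetical structure, chosen for illustration.)
    """
    if all(value is None for value in answers.values()):
        # Nothing ticked at all: 'no' cannot be told apart from 'skipped',
        # so every option in the question is treated as missing.
        return {option: MISSING for option in answers}
    # At least one tick: the remaining blanks are read as genuine 'no' answers.
    return {option: YES if value == YES else NO
            for option, value in answers.items()}
```

A respondent who ticked only the first of two options would come out as yes for that one and no for the other, while a respondent who ticked nothing keeps missing values throughout.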

Another curious instance of having to recode data was spotted by Beck as the two of us marvelled over having responses from a total of 180 different countries in the world: I can’t recall whether this was a default list in SurveyMonkey but for some reason Great Britain and the United Kingdom were given as separate choices. Obviously, these had to be combined into one.
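Combining duplicate labels like these is a simple lookup. A minimal sketch, assuming ‘United Kingdom’ is the label to keep (`COUNTRY_ALIASES` and `canonical_country` are hypothetical names):

```python
# Hypothetical mapping from duplicate country labels to the label we keep.
COUNTRY_ALIASES = {"Great Britain": "United Kingdom"}

def canonical_country(label):
    """Return the label to keep for a country; other labels pass through unchanged."""
    return COUNTRY_ALIASES.get(label, label)

responses = ["Great Britain", "India", "United Kingdom"]
cleaned = [canonical_country(country) for country in responses]
```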

Correcting human errors

I put my hand up. The OERRH surveys aren’t exactly short and sweet. As a result, and this is my own take on the matter, the data suffered. In some cases, respondents provided the demographic information but did not answer anything else; they were deleted from the final dataset. The same fate met those who selected every option in a question even though the options were mutually exclusive. I find it hard to believe that someone is studying in school and getting a degree while doing a postgrad at the same time, don’t you?
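Both deletion rules are easy to express as a filter. The sketch below is an illustration under assumptions rather than a record of what I ran: `should_drop`, the field names in the usage note, and the convention that a blank is `None` and a tick is 1 are all hypothetical.

```python
def should_drop(response, demographic_fields, exclusive_fields):
    """Return True if a response should be removed from the dataset.

    `response` maps field name -> answer (None when blank, 1 when a
    tick-box option was selected).
    """
    substantive = [value for field, value in response.items()
                   if field not in demographic_fields]
    if all(value is None for value in substantive):
        return True  # demographics only, nothing else answered
    if exclusive_fields and all(response.get(f) == 1 for f in exclusive_fields):
        return True  # every mutually exclusive option was ticked
    return False
```

For example, with `exclusive_fields = ["in_school", "undergraduate", "postgraduate"]`, a respondent who ticked all three would be dropped, while one who ticked only `"undergraduate"` and answered other questions would be kept.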

I’ve decided that for some respondents it must have been easier to provide an answer in the comments box than to read through all the available options; what other explanation can you find for a teacher who answers the question ‘What subject do you teach?’ by writing ‘Engineering’ in the ‘Other’ field instead of ticking it from the 17 items at their disposal? Duly noted and corrected.

In other cases, respondents would leave ‘MOOCs’ unticked when asked what type of OER they use, but then add in an open comment that they studied with Coursera or edX. These had to be corrected as well.

Although written in English, the OERRHub surveys were distributed world-wide: it is difficult to anticipate where people might find the language a barrier, but here is an example: we used the word ‘unwaged’ to inquire about employment status; several respondents left the option unmarked, but indicated “Unemployed” or “No job” in the comments field. Again, these cases were corrected accordingly.
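A first pass at this kind of correction can be automated, with the caveat that simple phrase matching will misfire on comments such as ‘not unemployed’, so the flagged cases still deserve a human look. A rough sketch, where `UNWAGED_PHRASES` and `recode_unwaged` are hypothetical names and 1 stands for a ticked option:

```python
# Hypothetical phrases suggesting the respondent meant the 'unwaged' option.
UNWAGED_PHRASES = ("unemployed", "no job")

def recode_unwaged(ticked, comment):
    """Return 1 if the comment implies 'unwaged'; otherwise keep the original value.

    `ticked` is 1 if the option was selected and None if it was left blank.
    `comment` is the free-text answer, or None if there was none.
    """
    text = (comment or "").strip().lower()
    if any(phrase in text for phrase in UNWAGED_PHRASES):
        return 1
    return ticked
```

The same pattern would work for the MOOC example, with a phrase list along the lines of `("coursera", "edx")`.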

Merging data

Cleaning data is always painstaking work, especially when you are handling thousands of cases, but let’s face it, it is also mostly uncomplicated. What could have been attenuated, if not avoided, was the trouble I found myself in when I had to merge the data from the eighteen OERRHub surveys. As days went by, the monster dataset grew fatter and fatter, but my love for my colleagues (and myself) grew thinner and thinner. Why? It is true that each of the individual surveys had to be customised for its collaboration, but we researchers were a tad undisciplined: there were unnecessary changes to the order in which options were presented, items were added and removed, and wording was altered without consultation. All this made data merging more time-consuming, cumbersome and fiddly than it should have been.
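To give a flavour of what the merge involves, here is a minimal sketch in plain Python. It is not the procedure I used in SPSS: `merge_surveys`, the `source_survey` column and the column names in the usage note are all assumptions made for illustration. The idea is to rename each survey’s columns to a shared name, stack the rows, and leave questions a survey never asked as missing rather than ‘no’.

```python
def merge_surveys(surveys, rename_maps):
    """Stack rows from several surveys into one list with a shared set of columns.

    `surveys` maps survey name -> list of row dicts.
    `rename_maps` maps survey name -> {local column name: shared column name}.
    """
    merged = []
    for name, rows in surveys.items():
        renames = rename_maps.get(name, {})
        for row in rows:
            new_row = {renames.get(col, col): value for col, value in row.items()}
            new_row["source_survey"] = name  # record where each case came from
            merged.append(new_row)
    # Questions a survey never asked become None (missing), not 'no'.
    all_columns = sorted({col for row in merged for col in row})
    return [{col: row.get(col) for col in all_columns} for row in merged]
```

Given one survey with a `subject_taught` column and another with `subject` and `saves_money`, a rename map of `{"subject_taught": "subject"}` for the first survey yields rows that share the columns `subject`, `saves_money` and `source_survey`, with `saves_money` set to `None` for the survey that never asked it. The rename maps are exactly the part that agreeing on wording and option order up front would have kept short.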

All is well that ends well though. We have a clean dataset that comprises 6,390 responses and is already producing very interesting results. Here is one of the lessons learnt: if you are dealing with multiple researchers and multiple datasets, nominate a data master: one to rule them all and bind them, although not in the darkness. Darkness is bad, open is good.
