Lab42 Research

5 Best Practices for Cleaning Survey Data

If you’ve used online sample recently, you know that it can be a mixed bag of quality. 

On one hand, there are the respondents who really care about sharing their opinions and take their time to answer thoughtfully. 

On the other hand, you have the respondents who leave poor-quality responses by speeding, straightlining, and giving logically inconsistent answers. (Have you ever met an 18-year-old with a PhD?

Us neither.)

And beyond the poor-quality respondents who are human, we’ve seen an increasing number of generative AI bots in online sample.

These responses are even harder to identify. 

While making sure you have the correct audience for your study is important, it's just as necessary to thoroughly review responses and incorporate updated data quality checks and traps to detect respondents who may not be fully engaged with the questions or, frankly, may not be human. 

Below are the top 5 best practices for data cleaning. Some of these practices require manual quality checks, while others can be automated checks or traps set up prior to survey launch.

1. Red Herring Questions / Options

Including red herring options is a great practice to have in place to catch respondents who are speeding through the survey and most likely not reading the questions.

Red herrings can be options added to grid or matrix questions to catch respondents who are selecting at random. For example, suppose you were asking a grid question about respondents’ health and eating habits.

Within that grid of statements, add an item that says, “Select 4. Somewhat agree.” Respondents who are choosing options at random will glance right over this trap, and failing it results in a survey termination.

This can also work for single-select and multiple-select questions: ask respondents questions that should be easy to answer, or even ask them directly whether they are paying attention.

 You’d be surprised how many people we catch and terminate from our surveys for not paying attention and simply clicking through these grids.
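If your sample sizes make manual review tedious, this kind of trap is also easy to check programmatically once the data is exported. Below is a minimal sketch in Python with pandas; the export file, the trap column name, and the expected answer are all hypothetical and would depend on how your survey platform labels the grid item.

```python
import pandas as pd

# Hypothetical export: one row per respondent, one column per question.
df = pd.read_csv("survey_export.csv")

# The red-herring grid item instructed respondents to "Select 4. Somewhat agree".
# Any other answer suggests the respondent was clicking through without reading.
TRAP_COLUMN = "q12_trap_item"   # hypothetical column name
EXPECTED_ANSWER = 4             # the value respondents were told to select

df["failed_red_herring"] = df[TRAP_COLUMN] != EXPECTED_ANSWER

flagged = df[df["failed_red_herring"]]
print(f"{len(flagged)} of {len(df)} respondents failed the red-herring check")
```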


2. Straight Lines and Select All That Apply

A straightliner is what we call a respondent who selects the same response for every item in a grid question. They may select only “5. Strongly agree,” assuming that is what the researcher would like to hear.

However, it is possible that a respondent genuinely does strongly agree with every statement.

A good way to catch respondents in these situations is to include statements that work as opposites: “It’s important to me to get outside and exercise at least once a day” versus “Exercise and going outdoors is not important to my everyday life.” These statements are opposite enough that an attentive respondent should not answer “5. Strongly agree” to both.
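As a rough illustration, here is one way straightlining and contradictory opposite-statement answers could be flagged programmatically, assuming each grid statement is exported as its own numeric column. The column names and scale values below are assumptions, not any particular platform’s format.

```python
import pandas as pd

df = pd.read_csv("survey_export.csv")

# Hypothetical grid: ten agreement statements exported as numeric columns q5_1 ... q5_10.
grid_cols = [f"q5_{i}" for i in range(1, 11)]

# A respondent who used exactly one distinct value across the whole grid
# (e.g., "5. Strongly agree" for every statement) is a straightlining candidate.
df["straightliner"] = df[grid_cols].nunique(axis=1) == 1

# Opposite-statement pair (hypothetical): q5_3 = "exercising daily is important",
# q5_7 = "exercise is not important". Agreeing strongly with both is contradictory.
df["contradiction"] = (df["q5_3"] >= 4) & (df["q5_7"] >= 4)

print(df[["straightliner", "contradiction"]].sum())
```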

“Select all that apply” questions, or multiple-select questions, are an easy way for a respondent to try to qualify for a study they do not actually fall within the scope of.

Try including opposite terms, or situations that respondents would be highly unlikely to relate to across every option. For example, “How does this ad make you feel?” could have options such as ‘happy, sad, angry, excited, etc.’.

If a respondent selects all of these options, it can be assumed they are either not paying attention or guessing at which option will allow them to qualify or move along in the survey.

Other examples: “Which of the following illnesses have you had in the past 3 months? – COVID-19, pneumonia, influenza A, strep throat” or “Which of the following have you done in the past 3 months? – Taken a trip abroad, piloted a helicopter, seen the Grand Canyon, gone skydiving.”
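If the platform exports each multiple-select option as its own 0/1 column, a “selected everything” flag is a one-liner. The sketch below uses hypothetical column names based on the travel example above.

```python
import pandas as pd

df = pd.read_csv("survey_export.csv")

# Hypothetical multi-select: each option exported as its own 0/1 column.
activity_cols = [
    "q8_trip_abroad",
    "q8_piloted_helicopter",   # low-incidence option acting as a soft trap
    "q8_saw_grand_canyon",
    "q8_went_skydiving",
]

# Checking every single option, including the improbable one, is a quality flag.
df["selected_everything"] = df[activity_cols].sum(axis=1) == len(activity_cols)

print(f"{df['selected_everything'].sum()} respondents selected every option")
```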

3. Time Taken

Completion time is a quick and easy way to catch respondents who are speeding through the survey and not fully reading questions or thinking about their responses. 

Before launching your survey, time yourself taking the survey and use that time as a reference when reviewing respondent completion time. Give respondents leeway as some may read and comprehend much faster. 

However, if the survey takes the survey writer 15 minutes to complete but a respondent finishes it in 3-5 minutes, it is safe to assume they rushed, and their responses may not be very high quality.
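Assuming your platform exports a completion-time field, the speeding check can be automated with either a fixed benchmark (such as the survey writer’s own test time) or a data-driven one (a fraction of the median). The field name and thresholds below are illustrative, not prescriptive.

```python
import pandas as pd

df = pd.read_csv("survey_export.csv")   # assumes a duration_seconds column

# Benchmark from timing yourself through the survey (15 minutes in the example above).
WRITER_TIME_SECONDS = 15 * 60

# Flag anyone finishing in under roughly a third of the benchmark
# (e.g., 3-5 minutes on a 15-minute survey), leaving leeway for fast readers.
df["speeder"] = df["duration_seconds"] < WRITER_TIME_SECONDS / 3

# A median-based cutoff is a common alternative if you prefer a data-driven benchmark.
df["speeder_vs_median"] = df["duration_seconds"] < 0.4 * df["duration_seconds"].median()

print(df[["speeder", "speeder_vs_median"]].sum())
```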

4. Open-ended Questions

We always recommend adding 2-4 open-ended questions for data quality purposes. 

Stay away from questions that could result in yes/no answers. 

Ask respondents to elaborate on their selection and to be as specific as possible. 

Towards the end of the survey, you can also ask the respondent to identify the topic of the survey. It is a simple question, but the answers will indicate whether respondents were actually paying attention.

Reviewing open-ended data allows you to catch AI bots repeating answers or rambling nonsense unrelated to the questions. It also allows you to catch respondents who are not paying attention, rushing through with simple answers, or just typing strings of random letters as their response (“Cool,” “it’s fine,” “asdfghjkl;”).

To discourage short answers, add a minimum character count on your survey platform if possible to encourage respondents to leave more than one-word responses.
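Manual review of open-ends is still the gold standard, but a few crude heuristics can help prioritize which verbatims to read first. The sketch below flags very short answers, low-vowel keyboard mashing, and verbatims repeated word-for-word across respondents; the column name and thresholds are assumptions you would tune to your own data.

```python
import pandas as pd

df = pd.read_csv("survey_export.csv")   # assumes an open-end column named q20_open

oe = df["q20_open"].fillna("").str.strip()

# Very short or one-word answers ("Cool", "it's fine").
df["oe_too_short"] = oe.str.len() < 15

# Crude keyboard-mash heuristic: longer answers with almost no vowels ("asdfghjkl;").
vowel_count = oe.str.count(r"[aeiouAEIOU]")
df["oe_gibberish"] = (oe.str.len() >= 8) & (vowel_count / oe.str.len().clip(lower=1) < 0.2)

# Identical longer verbatims across different respondents can indicate bots or copy-paste.
df["oe_duplicate"] = oe.duplicated(keep=False) & (oe.str.len() > 20)

print(df[["oe_too_short", "oe_gibberish", "oe_duplicate"]].sum())
```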

5. Improbably Similar Demographics

Reviewing similar demographics is a newer approach our team has started to take that allows us to catch respondents who may be taking the survey more than once. (Bonus tip: first check for IP duplicates to remove any repeat respondents). 

Our team will add a zip code question, which helps us look for duplicates. Of course, some respondents could be from the same zip code and not be duplicates.

However, it is highly unlikely that multiple respondents from the same zip code are also white, non-Hispanic, female, age 21-28, divorced, with 3 kids, 2 dogs, 1 cat, and a $70k household income.

Once these duplicates are flagged, if you are still unsure whether they might be different respondents, you can compare their open-ends and selections throughout the survey to look for more similarities that confirm a duplicate response.
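Both checks (IP duplicates and improbably similar demographic combinations) are straightforward to automate once the demographic fields are exported as columns. The field names below are illustrative; treat the output as a review list, not an automatic removal list.

```python
import pandas as pd

df = pd.read_csv("survey_export.csv")

# Bonus tip from above: start with straight IP duplicates (assumes an ip_address field).
df["ip_duplicate"] = df["ip_address"].duplicated(keep=False)

# Then flag improbably similar demographic combinations (column names are illustrative).
demo_cols = [
    "zip_code", "ethnicity", "gender", "age_bracket",
    "marital_status", "num_children", "household_income",
]
df["demo_duplicate"] = df.duplicated(subset=demo_cols, keep=False)

# Treat this as a review list: compare open-ends and selections before removing anyone.
suspects = df[df["ip_duplicate"] | df["demo_duplicate"]]
print(f"{len(suspects)} potential duplicate respondents to review")
```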

Offering some of the highest data quality is something Lab42 prides itself on. Too many sample providers lack intelligent and adaptive tools to identify and stop poor quality respondents from tainting your research results.

We’ve seen an increasing number of generative AI bots making their way into research studies, and if you don’t have the proper protocols set up, you could be left with results that, at best, don’t make sense and, at worst, are just plain wrong.