Friday, May 20, 2011

Experimental Data




This week I first collected two more Wikipedia article subjects and wrote questions for them. Pre-made questions from online teacher resource sites did not overlap well enough with subjects whose Wikipedia pages are concise, so I made the questions myself again.

Three articles used in trials:

Gravity Well
Tardigrade (Water Bear)
Tercio

Some articles had sections removed so that all three were roughly the same length, about 3-4 short sections each.

I conducted 4 trials this week on Amazon's Mechanical Turk, one for each combination of the two variables (a sketch of the four treatment cells follows the list):
  1. quizzes within the article or not
  2. random assignment of topic or choice of topic
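
For reference, a minimal sketch of the four treatment cells in Python. The condition names are mine, not from the actual task code, and I am assuming uniform random assignment of each subject to a cell:

    import itertools
    import random

    # The two binary variables give four treatment combinations.
    QUIZ = ("quiz_within", "no_quiz_within")
    TOPIC = ("random_topic", "chosen_topic")
    TREATMENTS = list(itertools.product(QUIZ, TOPIC))

    def assign_treatment():
        # Uniformly random assignment of a subject to one of the four cells.
        return random.choice(TREATMENTS)

    print(assign_treatment())  # e.g. ('quiz_within', 'random_topic')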

The results were actually the opposite of the pilot experiment, which is discouraging, but I hope there will still be something useful in the data once I analyze it further. The graph below shows the percentage of total correct questions for each of the treatments.
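
As a sketch of how that percentage is computed, assuming a per-subject results table with columns treatment, num_correct, and num_questions (the file and column names are my guesses, not the actual data layout):

    import pandas as pd

    df = pd.read_csv("results.csv")  # hypothetical export of the trial data
    grouped = df.groupby("treatment")[["num_correct", "num_questions"]].sum()
    grouped["pct_correct"] = 100 * grouped["num_correct"] / grouped["num_questions"]
    print(grouped["pct_correct"].round(1))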

The figure above shows the average time spent reading the article (and, for those treatments, answering the questions within it), the average time spent on the final quiz, and the average total time. It is interesting that the quizzes-within treatments show a higher total time even though they perform worse on the final quiz. The quizzes-within trials also spend more of their time reading the article than taking the final quiz, which may provide some insight into why they scored worse than the trials with no quizzes within.
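
A similar sketch for the timing breakdown, again assuming invented column names (read_time includes the within-article quiz time where applicable; times in seconds):

    import pandas as pd

    df = pd.read_csv("results.csv")  # same hypothetical export as above
    df["total_time"] = df["read_time"] + df["final_quiz_time"]
    times = df.groupby("treatment")[["read_time", "final_quiz_time", "total_time"]].mean()
    print(times)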

8 comments:

  1. Hey Jon,

    Don't be discouraged that the results from this week are different from your pilot! The results on time are really interesting, and as you increase the number of subjects and implement the stuff we talked about in class to filter out junk Turkers, I think your results will be less confusing.

    I'm curious why you give the option of choosing the article versus not choosing it. What is your hypothesis there? Also, I think your preliminary graphs provide a lot of information, but it would be nice to compare just the quiz vs. no-quiz conditions. For example, I'd love to see a graph of correct answers in the quiz vs. no-quiz conditions. From trying to eyeball it, it seems like your no-quiz condition is barely ahead with these 4 subjects.

    Last, I don't know if you're interested, but since you were wondering about time spent on the article vs. time spent on the quiz and how that could affect quiz performance, it could also be interesting to see how the embedded quizzes affect confidence. Does your current study design measure how confident people are in their quiz answers? Just a thought; it could be interesting to see!

    -Melissa

  2. First off...a "water bear" is about a thousand times less badass than I imagined from its name...

    Also, I agree with Melissa; I don't think the results are indicative of a whole lot at this point, especially without rigorous hypothesis testing and error bars on all of that. I noticed that the people in the quizzed condition spent more time overall reading the article (and answering the preliminary quizzes), and a lot less time on the final quiz. As you mentioned, these people scored worse overall. Perhaps the reason is that people on AMT are sick of answering quiz questions by the end of the experiment, so they rush through the final quiz and don't spend enough time to do a good job. Even if their comprehension of the reading material was higher, they might not be displaying it on the final quiz if they are rushing.

    Another possible explanation is that in the quizzed condition, people reading the article focus only on what those quizzes are asking about, so they learn those details well but skim over the rest. I'm not sure. I guess it's not terribly helpful right now to come up with explanations for phenomena that might not even be occurring when it's all said and done, but these are just some things that pop out at me.

  3. Yikes!

    What was the size of the test population?
    The only insight that I see from your data is between no-quiz and quiz in figure 2. Otherwise, between choice and random, either the difference is so small or it is so sparsely distributed that you will need 10x the amount of data to extract a conclusion with any precision.
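
    A rough sketch of that sizing intuition, using statsmodels' power calculator (the effect sizes below are generic benchmarks, not estimates from this experiment's data):

        from statsmodels.stats.power import TTestIndPower

        # Subjects needed per condition for 80% power at alpha = 0.05,
        # for large, medium, and small standardized effect sizes.
        analysis = TTestIndPower()
        for d in (0.8, 0.5, 0.2):
            n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.8)
            print("d = %.1f: ~%d subjects per condition" % (d, round(n)))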

    Ravi: water bears are about a thousand times MORE badass than I imagined. Those guys can reproduce in near-absolute-zero temperatures and go 10 years without water.

    These plots need some error bars! Right now it looks like total time spent is basically the same between groups, but that the quiz group spends much less time on the final quiz. Which is a pretty cool result, if it holds up. I wonder if it's because they're more confident about their answers, or, as seems more likely, there's a fixed amount of time people are willing (allowed? What's the time limit?) to spend on the task, after which, if they haven't finished the quiz, they say, "screw it," and just fill out the quiz quickly.
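
    Something like this would add the error bars, assuming a per-subject table with treatment and final_quiz_time columns (the names are made up):

        import pandas as pd
        import matplotlib.pyplot as plt

        df = pd.read_csv("results.csv")  # hypothetical data export
        stats = df.groupby("treatment")["final_quiz_time"].agg(["mean", "sem"])
        plt.bar(range(len(stats)), stats["mean"], yerr=stats["sem"], capsize=4)
        plt.xticks(range(len(stats)), stats.index, rotation=20)
        plt.ylabel("Mean time on final quiz (s)")
        plt.tight_layout()
        plt.show()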

    This data is interesting. There is a clear correlation between placing the quiz and the amount of time people take to complete the final quiz (if you place the quiz, people take less time to complete the final quiz). However, there also seems to be a loose correlation between the time people spent on the final quiz and the number of correct answers they got.

    Does that mean people are taking the final quiz less seriously and, because of that, getting fewer correct answers?

    About choice and random, it is baffling that people are getting more correct answers when they are given random topics. What was the size of your study?

    I can't think of anything other than "skill difference" that might cause this.

    First of all, yeah, error bars for sure (although you presumably already got the message on that one :). Second, I agree that you need some way to make sure people have a relatively equal level of engagement in both conditions, only to the point where it is comparable, not to the point of making the results identical, obviously, since the test cases are bound to produce different results. The data as they stand are pretty confusing: I'm not sure yet what the overall picture or story coming from them is. Maybe all of this gets solved with higher data volume, which would be interesting as well. It would also be important to plot user behavior in terms of time taken on more granular parts of the test, to see whether there are systematic biases in the final results, or whether the data you have really come from honest attempts to do the study in its entirety.
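
    For that granular view, a sketch that plots each subject's time across the stages of the task (the stage column names are assumed, not from the actual data):

        import pandas as pd
        import matplotlib.pyplot as plt

        df = pd.read_csv("results.csv")  # hypothetical data export
        stages = ["read_time", "within_quiz_time", "final_quiz_time"]
        for _, row in df.iterrows():
            # One line per subject; NaN stages (e.g. no embedded quiz) leave gaps.
            plt.plot(stages, row[stages].values, alpha=0.5)
        plt.ylabel("Time (s)")
        plt.title("Per-subject time on each stage of the task")
        plt.show()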

    I am also surprised to see that random allotment of topic still results in a higher percentage of correct answers. Since your data is coming from Mechanical Turk, I think you need to analyze a lot more HITs to make sense of the situation. I think it boils down to how motivated/interested/competent the Turkers are. From your data it looks like the ones taking random topics are doing better, maybe because they are really into it!

    As for people taking topics of their choice, maybe they want to finish the task fast, thinking they already know the topic well, and end up doing badly because of that notion of existing expertise, the idea that a familiar topic can be finished quickly. However, that still doesn't explain the long time spent reading. :|

    Run a few more experiments; maybe you will find a stronger trend. I myself am curious to see what is going on!

  8. It appears that blogger ate my previous comment, so I will post it again (to the best of my memory) in case it is still helpful.

    I agree with Melissa that it would be useful to ask people how confident they are in their quiz answers. While you may not be able to do this with Turkers, you could try running mini-experiments on people you know to see if there are any noticeable patterns.

    For example, it is possible that people in the quiz group are not trying as hard on the final quiz because they are just fatigued. It is promising that the quiz group definitely appears to be reading more carefully, but maybe they spend so much less time on the final quiz because they are tired of answering questions and thus don't try as hard by the time they reach the end.

    Alternatively, there is the possibility that the no-quiz group spends more time on the final quiz because they are cheating by looking up the answers externally. It would be difficult to prove this, but if you tested people you trust on the no-quiz variation and found that their reading times are (statistically) significantly lower than those of the Turkers, then perhaps there is reason for suspicion.
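
    A minimal sketch of that comparison, as a Welch's t-test on reading times (the numbers below are placeholders, not real measurements):

        from scipy import stats

        # Placeholder reading times in seconds; substitute the real measurements.
        turker_read_times = [95, 120, 88, 60]
        trusted_read_times = [210, 185, 240, 200]

        t, p = stats.ttest_ind(turker_read_times, trusted_read_times, equal_var=False)
        print("Welch's t = %.2f, p = %.3f" % (t, p))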
