I ran a small pilot experiment this week testing just two conditions. The experiment compared knowledge retention for users who are given quizzes while reading an article vs. users who get no quizzes while reading. Knowledge retention was measured with a quiz at the end, using different questions from any that appeared in the middle.
I started my work by learning PHP and mocking up some simple quiz pages with it. I created a 5-page workflow like this: instructions -> reading wiki page -> quiz questions page -> survey questions page -> completion page. Users can only move through the pages in a linear manner, and the time of each new page visit is recorded. User responses to quiz questions are also recorded and written out, along with the user's session ID, to a file on the server. PHP sessions are used to keep track of the user throughout the process.
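For anyone curious, the flow can be sketched roughly as below. This is Python rather than the PHP the pages were written in, and the class name, field names, and JSON log format are all made up for illustration; only the page order and the idea of logging timestamps and answers against a session ID come from the description above.

```python
import json
import time
import uuid

# Page order taken from the workflow above; the labels themselves are illustrative.
PAGES = ["instructions", "reading", "quiz", "survey", "completion"]


class LinearSession:
    """Tracks one participant moving forward through PAGES, one page at a time."""

    def __init__(self, clock=time.time):
        self.session_id = uuid.uuid4().hex  # stand-in for a PHP session ID
        self.clock = clock
        self.position = 0
        self.visits = [(PAGES[0], clock())]  # (page, timestamp) per page visit
        self.answers = {}

    def advance(self):
        """Move to the next page only; there is no way to go back or skip."""
        if self.position == len(PAGES) - 1:
            raise ValueError("already on the completion page")
        self.position += 1
        page = PAGES[self.position]
        self.visits.append((page, self.clock()))
        return page

    def record_answer(self, question_id, answer):
        """Store the participant's answer under a question identifier."""
        self.answers[question_id] = answer

    def to_log_line(self):
        """Serialize the session as one JSON line (a hypothetical file format)."""
        return json.dumps(
            {"session": self.session_id, "visits": self.visits, "answers": self.answers}
        )


session = LinearSession()
session.advance()  # instructions -> reading
session.advance()  # reading -> quiz
session.record_answer("q1", "b")
print(session.to_log_line())
```

Keeping the page order in a single list means the "linear only" rule lives in one place: the only way to change pages is to step to the next index.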
I deployed the two versions (with and without the internal quiz) on Mechanical Turk, getting 10 responses for each version.
The hypothesis is that users will perform better in the final quiz if they have the treatment with quizzes interspersed in their article. The null hypothesis would be that both treatments produce the same end scores.
Assuming a desired statistical power of 0.8 (a good guess based on Wikipedia reading...) and a Cohen's d of ~0.2 (I expect the mean test scores to differ by about 0.2 standard deviations, and expect the std. deviations to remain the same between the two conditions), I should be looking at a sample size of about 400 per condition to reliably detect the effect at the conventional 0.05 significance level. Since this is just a pilot trial and samples cost money on Mechanical Turk, I opted to use a smaller sample size this time.
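Here is a rough check of that sample-size figure, using the normal approximation to a two-sided, two-sample t-test: n per group = 2 * ((z_(1 - alpha/2) + z_power) / d)^2. The 0.05 significance level is an assumption on my part (it is the usual default), and the function name is just illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate participants needed in EACH group for a two-sided, two-sample test.

    Uses the normal approximation, so it can come out slightly lower than an
    exact calculation based on the t distribution.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


n = sample_size_per_group(0.2)  # Cohen's d of 0.2, power 0.8, alpha 0.05
print(n)  # 393
```

That is just under 400 participants per condition, so roughly 800 in total across both versions, which is why the pilot used a much smaller sample.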
I think I want to use a two-sample t-test to statistically analyze my results, since it compares the means of two groups whose values roughly follow a Gaussian distribution, which should be the case for test scores (with enough questions). A chi-squared test would not fit as well, because it is meant for counts in categories, and the outcome here is a numeric score rather than a simple two-way answer.
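As a sketch of what that analysis could look like, here is the pooled-variance (equal standard deviation) t statistic computed by hand. The scores are made-up placeholders, not my pilot data, and the function name is illustrative. Turning t into a p-value needs the t distribution, for example from a stats table or a library such as SciPy.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Return (t, degrees_of_freedom) for a two-sample t-test with pooled variance."""
    n_a, n_b = len(group_a), len(group_b)
    dof = n_a + n_b - 2
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / dof
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (mean(group_a) - mean(group_b)) / standard_error
    return t, dof


# Made-up placeholder scores (NOT real results), each out of 10 questions.
with_quiz = [7, 8, 6, 9, 7, 8, 5, 7, 8, 6]
without_quiz = [6, 7, 5, 8, 6, 7, 5, 6, 7, 6]

t_stat, dof = pooled_t_statistic(with_quiz, without_quiz)
print(f"t = {t_stat:.3f}, df = {dof}")
```

A positive t here means the first group scored higher on average; whether that difference is significant depends on comparing t against the t distribution with the reported degrees of freedom.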
More on pilot results later...