Friday, May 20, 2011

Experimental Data

This week I first collected 2 more Wikipedia article subjects and created questions for them. Pre-made questions from teacher resource sites did not overlap enough with subjects whose Wikipedia pages are concise, so I made the questions myself again.

Three articles used in trials:

Gravity Well
Tardigrade (Water Bear)
Tercio

Some articles had certain sections removed so that all three were roughly the same length, about 3-4 short sections each.

I conducted 4 trials this week on Amazon's Mechanical Turk, one for each combination of the 2 variables (a sketch of the assignment logic follows the list):
  1. quizzes within the article or no quizzes within
  2. random assignment of topic or free choice of topic
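
Not the real code, but a minimal sketch of how the four treatment cells could be assigned when a worker first arrives; the session keys, topic strings, and coin-flip scheme here are my shorthand for illustration, not necessarily what runs on the server.

    <?php
    // Hypothetical assignment of the 2x2 treatment cells on first arrival.
    session_start();

    if (!isset($_SESSION['quizzes_within'])) {
        // Flip two independent coins: quizzes-within yes/no,
        // and topic randomly assigned vs. chosen by the worker.
        $_SESSION['quizzes_within'] = (bool) rand(0, 1);
        $_SESSION['topic_assigned'] = (bool) rand(0, 1);

        if ($_SESSION['topic_assigned']) {
            // Random assignment: pick one of the three articles.
            $topics = array('Gravity Well', 'Tardigrade', 'Tercio');
            $_SESSION['topic'] = $topics[array_rand($topics)];
        }
        // Otherwise the instructions page shows a topic-choice form.
    }
    ?>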

The results actually came out the opposite of the pilot experiment, which is discouraging, but I hope there is still something useful in the data once I analyze it further. The graph below shows the percentage of total correct questions for each of the treatments.

The figure above shows the distribution of the average time spent reading the article (and, for those treatments, answering the questions within it), the average time spent on the final quiz, and the average total time. It is interesting that the quizzes-within treatments show a higher total time even though they perform worse on the final quiz. Also, the quizzes-within trials spend more of their time reading the article than taking the final quiz, which may provide some insight into why they scored worse than the trials with no quizzes within.

Friday, May 13, 2011

List of things to do

Here is a list of things that must be done, in the order I intend to complete them, with the date by which each must be done:

Migrate code to my own server to avoid using Stanford web space (fix timing issue) - 5/14
Find 2 more quiz/subject combinations from Wikipedia pages to add to my testing trials - 5/16
Convert those quiz/subject pages into web pages in my workflow to test users with - 5/16
Add in a new page with a pre-quiz for a new treatment of trials for each subject - 5/17
Start running a larger experiment with 20-30 users per treatment on AMT - 5/18
Process the results for the trials and present to the class - 5/20
Depending on whether there is a large variation in the time taken to complete the reading and quizzes in each treatment, I may add another set of treatment pages using visibly prominent JavaScript timers with a small set amount of time on the reading portion, the quiz portion, or both. The timer would be purely aesthetic, but I would be testing whether the artificial deadline affects completion time and correctness (a sketch of the timer idea follows this list). Time to add the timer - 5/24
Run another set of AMT trials with the timer present - 5/25
Process the results of the new set of trials and test the effect of the visible timer - 5/27
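
On the timer idea from the list above: a rough sketch of what the purely visual countdown could look like, as a PHP page fragment with inline JavaScript. The $seconds value, element id, and form name are placeholders, and the server would still record the real timestamps.

    <?php // hypothetical timer fragment, included on a reading or quiz page
    $seconds = 180; // e.g. three minutes allotted for this portion
    ?>
    <p>Time remaining: <span id="timer"><?php echo $seconds; ?></span> s</p>
    <script type="text/javascript">
    // Purely visual countdown; the deadline is not enforced server-side.
    var remaining = <?php echo $seconds; ?>;
    var handle = setInterval(function () {
        remaining -= 1;
        document.getElementById('timer').innerHTML = remaining;
        if (remaining <= 0) {
            clearInterval(handle);
            document.quizform.submit(); // assumed form name: auto-advance at zero
        }
    }, 1000);
    </script>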

Tuesday, May 10, 2011

Update

Insights gained from the last class discussion:
1. Multiple different wiki pages can be used - search for premade quizzes on subjects on teaching resource websites.
2. Mechanical Turk can impose restrictions on who is allowed to complete your request, and you can pay workers based on their quiz responses to give them an incentive to answer correctly and put some thought into it rather than just clicking through.
3. Measuring timing accurately is difficult but important, both to judge whether participants are actually performing the task and to compare across treatments.

Unfortunately, I have been sick for the past week and more, so not much progress has been made. I would like to outline here my plans for the next week or so.

Changes I need to implement:
1. Modify the PHP to keep track of sessions accurately so that the completion time of each step of the process is recorded correctly (a sketch of one approach follows this list).
2. Add 2-3 more Wikipedia pages on different subjects to the single page I have now. Possibly write a script to automate converting a wiki page into a quiz page. Match the pages with quizzes found online for known subjects so I don't make up the quiz questions myself.
3. Add another treatment with a pre-quiz given instead of or in addition to the quiz in the middle of the article.
4. Possibly add a JavaScript timer to the pages that visually imposes a time limit for reading and completing quizzes, and use the time allotted for reading and quiz completion as a variable in the trials.
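
For item 1, a minimal sketch of per-step time tracking with PHP sessions, assuming one shared include at the top of every page; the session keys and log path are made up for illustration:

    <?php
    // Hypothetical timing include; session keys and log path are assumptions.
    session_start();

    $step = basename($_SERVER['PHP_SELF']); // e.g. "quiz.php"
    if (!isset($_SESSION['times'])) {
        $_SESSION['times'] = array();
    }
    if (!isset($_SESSION['times'][$step])) {
        $_SESSION['times'][$step] = microtime(true); // first arrival at this step
    }

    // On the completion page, append one line per user to the results file.
    if ($step == 'done.php') {
        $line = session_id() . "\t" . json_encode($_SESSION['times']) . "\n";
        file_put_contents('/var/www/data/times.log', $line, FILE_APPEND);
    }
    ?>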

Friday, April 29, 2011

Pilot Experiment

I ran a small pilot experiment this week testing just two conditions. The experiment tested the knowledge retention of users who are given quizzes while reading an article vs. no quizzes while reading. Retention was tested with a quiz at the end whose questions differed from any given in the middle.

I started my work by learning PHP and mocking up some simple quiz pages with it. I created a 5-page workflow: instructions -> reading wiki page -> quiz questions page -> survey questions page -> completion page. Users can only move through it in a linear manner (a sketch of the gating follows below), and the time of each new page visit is recorded. User responses to the quiz questions are also recorded and written out, along with the user's session ID, to a file on the server. PHP sessions keep track of the user throughout the process.
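
A minimal sketch of how that linear gating could work, assuming a shared include and hypothetical page names (this is the idea, not the exact code I wrote):

    <?php
    // Hypothetical gating include; page names and the step counter are my
    // illustration of the idea, not the actual code on the server.
    session_start();

    $order = array('instructions.php', 'article.php', 'quiz.php',
                   'survey.php', 'done.php');
    $here  = array_search(basename($_SERVER['PHP_SELF']), $order);

    if (!isset($_SESSION['step'])) {
        $_SESSION['step'] = 0; // index of the furthest page unlocked so far
    }
    if ($here === false || $here > $_SESSION['step']) {
        // Trying to jump ahead (or an unknown page): bounce the user back.
        header('Location: ' . $order[$_SESSION['step']]);
        exit;
    }

    // After a page handles its form submission, it unlocks the next page:
    // $_SESSION['step'] = max($_SESSION['step'], $here + 1);
    ?>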

I deployed the two versions (with and without the internal quiz) on Mechanical Turk, getting 10 responses per condition.
The hypothesis is that users will perform better on the final quiz if they receive the treatment with quizzes interspersed in the article. The null hypothesis is that both treatments produce the same final scores.

Assuming a desired statistical power of 0.8 (a good guess based on Wikipedia reading...) and a Cohen's d of ~0.2 (I expect the mean test performance to differ by about this much, and the standard deviations to stay the same between the two conditions), I should be looking at a sample size of roughly 400 per condition for statistical significance. Since this is just a pilot trial and samples cost money on Mechanical Turk, I opted for a smaller sample size this time.
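
For the record, the standard two-sample calculation behind that number, assuming a two-sided alpha of 0.05 (which I did not state above):

    n \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2}{d^2}
      = \frac{2\,(1.96 + 0.84)^2}{0.2^2} \approx 392 \text{ per condition}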

I think I want to use a t-test to statistically analyze my results, since it compares the means of two groups that follow a Gaussian distribution, which should be the case for test scores (with enough questions). A chi-squared test would not fit because the outcome is a continuous score rather than counts of categorical answers.
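
Concretely, with the equal-variance assumption above, the statistic would be the pooled two-sample t, with group mean scores \bar{x}_i, sample variances s_i^2, and group sizes n_i:

    t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}},
    \qquad s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}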

More on pilot results later...

Buddy Post


My Buddy is Ravi, and his blog is located here:  http://cs303blog.blogspot.com/

The information you've gathered on the psychological reasons behind procrastination was really interesting to me. I can personally relate to the finding that more choices or more difficult tasks induce the most procrastination. I really like that your hypothesis is clean, concise, and easily testable, qualities my own hypothesis is sorely missing right now.

Some feedback on your experimental design as listed: I see that each time a user goes to a greenlisted site, you measure whether it has latency added and whether the user closes the tab or switches to another tab before it loads, and then record the length of the user's visit to that tab. I see some potential areas of concern in the tab switching and in the visit length. The length of a user's visit to any website, including a procrastination website, can depend heavily on factors completely outside your study, such as the user's mood, time of day, how much time they have, how many other things they need to do, outside interruptions / stimuli, and even how much new content the site has since they last checked. So I don't know how comparable visit-length recordings would be between the two conditions.

As for measuring whether the latency prompted users to switch tabs while waiting for the load, I feel this could depend heavily on user preference. Personally, if a site I am trying to visit takes more than a few seconds to load, I automatically go to another tab and do another short task, then come back in a few seconds or minutes to check that it loaded and use it. This tab switching doesn't mean I won't go to that site, just that I visit a different site (often a different procrastination site, even) while waiting for it to load. This could even result in more procrastination... if the user opts to open two sites because one is too slow.
Also, what will you do about sites like Facebook, which auto-update periodically if they are left open? Many people I know browse Facebook using the feed, leave it open all the time, and just tab back to check anything new. Will this type of browsing behavior still be recorded in your study? Or will you not allow users to keep tabs to multiple procrastination sites open at the same time, as a control for this?

One possible modification to your experiment to get around some of these issues: use the rate at which users visit procrastination sites (how many they go to in a given time period), give different users different rates of added latency, and see whether the latency rate correlates with the rate of visiting procrastination sites. This still requires measuring a per-person baseline of how often they visit these sites normally. Even though procrastination varies greatly week to week, measuring long enough and collecting a lot of data might average out the weekly variations.

Your project is looking great and I am excited to see the results. Hopefully you can cure the procrastinator in us all!