
Cash Incentives, Peer Tutoring, and Parental Involvement: A Study of Three Educational Inputs in a Randomized Field Experiment in China

Tao Li (PKU), Li Han (HKUST), Scott Rozelle (Stanford), Linxiu Zhang (CCAP)

October 21, 2010

Abstract: We explore the relative effectiveness of three educational inputs (a cash incentive for grades, incentivized peer tutoring, and parental communication) in a randomized field experiment conducted with a set of under-performing primary school students in China. We find that the cash incentive alone had no impact on actual learning. The incentivized peer tutoring intervention raised standardized test scores by 0.14 standard deviations. An integrated strategy involving all three factors was most effective, with an impact of 0.20 standard deviations. This is partly because peer tutoring and parental communication work as complementary inputs in the underlying production function for learning.
Tao Li (corresponding author), Peking University, Mail: HSBC Business School, Peking University Campus, University Town, Nanshan District, Shenzhen, 518055 China. Tel: 86-755-26032485, Fax: 86-755-26035344. Email: litao@post.harvard.edu.

1 Introduction

Broadly speaking, three main factors affect learning: teaching quality, parents, and student motivation. Understanding how student achievement is affected by these factors, and how they interact with one another, is important to research in education. One program currently in vogue relies on the notion that student motivation alone can improve academic performance: the paying-for-grades program. Given the popularity of the concept in the last decade, during which many such programs were implemented around the globe, there is a pressing need for systematic evaluation.¹ Paying-for-grades programs adopt an unconventional approach to education that is appealing in that it bypasses the complexities of schools and families and goes directly to the children. Such a strategy of sidestepping traditional educational institutions appears promising because there is little evidence that spending more money on traditional educational inputs enhances learning (Hanushek, 1995; Glewwe and Kremer, 2006). However, relying solely on student motivation to remedy education also seems logically unrealistic; in this light, a fruitful research agenda might evaluate cash incentives relative to other factors of learning, and seek a more integrated and effective intervention strategy.

This paper reports the design and results of a randomized field experiment that encompasses elements from each of the three factors of learning. About 850 under-performing migrant children in Beijing were offered cash incentives to improve their academic performance in the fall semester of 2009. In addition to this basic paying-for-grades intervention, 50 percent of the students in this sample were randomly selected to receive an incentivized peer tutoring intervention, while another 50 percent were randomly selected to
¹ Examples include the Girls' Scholarship Program in Kenya (Kremer, Miguel, and Thornton, 2009), High School Matriculation Awards in Israel (Angrist and Lavy, 2009), the Educational Maintenance Allowance in the UK (Middleton et al., 2005), Cal-Learn (Mauldon et al., 2000), the Monthly Grade Stipend (Spencer, Noll, and Cassidy, 2005), the Advanced Placement Incentive Program (Jackson, 2010), etc.

receive an additional intervention in which parents were informed of their child's participation in the other intervention (henceforth, parental communication). Randomization into these two interventions was independent. This compound design (a basic intervention plus a cross-cutting design) aims to rigorously evaluate the effectiveness of cash incentives relative to the other two interventions, and with respect to the control students (see Table 1).

Our results show that cash incentives alone are not an effective way of enhancing learning. If we add parental communication to the cash incentive intervention, the impact is still not significant. However, students who received the peer tutoring intervention saw their test scores improve by about 0.14 sd on average. This effect appears to be even higher for some subgroups, including academically weak students, female students, and students in higher grades. Both tutoring and formal teaching serve the same function: helping students to understand course material. Paying-for-grades does not aid learning directly; it only provides students with new incentives to earn good grades on their own. Our results imply that new incentives alone are not enough; bypassing the educators may be the key reason why the paying-for-grades intervention did not work. Only when we provided extra learning assistance in the form of tutoring was there a significant effect on learning. Moreover, there is clear evidence that the tutors did not suffer academically.

We also find that tutoring and parental communication acted as complementary inputs in the production function for learning. The marginal effect of tutoring (and of parental communication) is enhanced in the presence of the other factor. Because parental communication involves little cost, it is in effect a free input, suggesting that improving tutoring (or teaching) may have spillover effects on other interventions. Our results show that a more integrated approach involving students, teachers/tutors, and parents is most effective. In our project it had an effect of about 0.20 sd.

The power of paying-for-grades programs appears to lie in their emphasis on student motivation. When designing our tutoring intervention, we borrow this idea.

Instead of strengthening teaching, which is often a difficult task, we use cash incentives to motivate peers to provide quality tutoring. Tutors in our experiments were eligible to compete for a prize in a tournament. The chance of winning a prize and its size were both tied to the test score improvements of the tutees (each of whom was paired with one of the tutors). Extensive research in educational psychology has demonstrated peer tutoring to be an effective teaching strategy (see a review in Topping, 2005), although evaluations have mainly been conducted in industrialized countries. Besides being conducted in a developing country, our peer tutoring intervention differs from the traditional format in two respects. First, we use cash incentive contracts for the tutors. Material rewards have rarely been used in the educational psychology literature, which relies instead on teachers implementing tutoring as part of their job. To our knowledge, incentivized payments have never been used, certainly not on such a large scale.² Second, we emphasize student incentives instead of carefully structured tutoring protocols or supervised tutoring sessions. This shift in focus is important as it paves the way for our program to be easily implemented in poor countries, where traditional peer tutoring programs may fall victim to weak teacher incentives and/or poor teaching skills.

It is rare for economists to evaluate several factors of learning simultaneously in rigorous impact evaluation studies in primary or secondary schools.³ Most papers only study cash incentives that seek to motivate students themselves (paying-for-grades programs). Evidence from such programs is mixed (see a review in Slavin, 2009). The main results of the largest paying-for-grades experiment to date, performed in 261 American public schools, also suggest that cash incentives alone cannot easily improve performance (Fryer,
² We have another working paper on a slightly different incentivized peer tutoring program based on a much smaller sample size.
³ Analyzing an experiment carried out in a Canadian university, Angrist, Lang, and Oreopoulos (2009) find that combining cash incentives with traditional peer advising services is more effective in improving learning than offering only one of them, and that offering only the peer advising service is not effective.

2010). The Advanced Placement Incentive Program (APIP) in Texas provided cash incentives to both teachers and students for each passing score earned on an Advanced Placement (AP) exam, similar to the way in which we offered rewards to both tutor and tutee. In a matched evaluation, Jackson (2010) found that the adoption of APIP was associated with a 13% increase in students scoring above 1100 on the SAT. The setup of the APIP, however, did not allow researchers to distinguish whether the impact was due to the response of students or of teachers.

The rest of the paper is organized as follows. Sections 2 and 3 describe the program and evaluation design. Sections 4 and 5 report the main program effects. Section 6 conducts robustness checks. Section 7 concludes.

2 The Fall Challenge Program

Our field experiment, known as the Fall Challenge Program to students and teachers, was implemented by our trained project managers in a set of randomly chosen migrant schools around Beijing in the fall semester of 2009. Despite being located in one of China's richest cities, these schools specifically serve poor migrant families and more closely resemble schools in underdeveloped rural areas (Lai et al., 2010). Improving education quality in such schools has important implications for China's 150 million migrant workers now, and undoubtedly even more in the future as the migrant population continues to grow.

2.1 Intervention Design

For simplicity, we refer to the half of the intervention students (N = 856 in total) who did not receive the tutoring intervention as participating in the paying-for-grades program, or simply the pay program. We refer to the other half of the intervention students as participating in the peer tutoring program, or simply the tutor program. The treated students in the pay and tutor programs are called payees and tutees, respectively.

Each program class hosted exactly one program and had about 10 treated students. Let us first introduce the pay program. Payees in each pay program class were offered a paying-for-grades contract. We promised to pay a reward of 100 RMB (about 13 US dollars, or about one-third to one-quarter of their tuition for a semester) to the student who achieved the greatest increase in test scores between the baseline test (taken in September 2009) and the evaluation test (taken in January 2010). The second- and third-place finishers were promised 50 RMB each. In total we offered 200 RMB for the 10 treatment students in each pay-program class, to be split among three winners. This level of compensation is equal to about 20 to 25 percent of a teacher's monthly compensation in migrant schools. We also promised a public ceremony and official certificates for the winners.

Our paying-for-grades intervention was run as a tournament. This differs from other cash incentive programs, which typically reward all participating students for the improvement they have made in absolute terms or on the condition that they reach a certain target (i.e., a linear contract). Even though we used standardized tests for both the baseline and the evaluation tests, it is technically difficult to design the tests in such a way that the difference between the two test scores correctly measures improvements resulting from our semester-long intervention, not to mention maintaining a uniform standard for students from different grades who would have to take different tests. The use of a tournament system simplified the test design process and allowed us to offer more generous rewards while still remaining within our budget. Moreover, because we implemented the tests and did all the grading ourselves, an absolute reward system would have made it impossible to convince the teachers and students ex ante that we would not increase the difficulty level of the tests or implement a stricter grading policy to save ourselves costs. Use of a tournament system eliminates this potential for grading bias.

The 10 treated students in each peer tutoring program class (tutees) were offered the same tournament contract as payees. In addition, each tutee was

assigned one of the top students from his or her class to serve as a tutor. To encourage peer tutoring, we promised to award the tutor a cash prize of the same amount as his or her tutee's. As a result, our budget for a tutoring class was 400 RMB instead of 200 RMB. Under a cross-cutting design (see Table 1), half of the tutor-tutee pairs, randomly drawn, and half of the payees, also randomly drawn, were additionally assigned to receive a telephone communication intervention directed at their parents. The purpose of the intervention was to re-explain the nature of the intervention in which the student was participating, and to deepen the parents' impression of the program.
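As a concrete illustration of the payout rule, the following is a minimal sketch of the tournament just described; the function name, student identifiers, and input format are hypothetical, not from the study's materials.

```python
# A minimal sketch of the tournament payout rule: the largest test-score gain
# wins 100 RMB, and the second- and third-largest gains win 50 RMB each.
def tournament_prizes(gains: dict[str, float]) -> dict[str, int]:
    """gains maps a student id to (evaluation score - baseline score)."""
    ranked = sorted(gains, key=gains.get, reverse=True)
    prizes = {ranked[0]: 100}            # first prize
    for student in ranked[1:3]:          # two runner-up prizes
        prizes[student] = 50
    return prizes

# Example with four hypothetical payees in one class:
print(tournament_prizes({"s1": 12.0, "s2": 7.5, "s3": 9.0, "s4": 3.0}))
# -> {'s1': 100, 's3': 50, 's2': 50}
```

In tutoring classes the same amounts were promised to the tutors, with each tutor's prize tied to his or her tutee's gain.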

2.2 Random Assignment

Random assignment in our program was executed at the school level, at the class level within the schools of our sample, and at the individual level within the treatment classes (conditional on test scores). The detailed school, class, and individual assignment process is illustrated in Figure 1 (for a typical class of 40 students).

From a (relatively complete) list of 340 migrant schools in Beijing we randomly chose 23 schools to participate in our study. Our study focused on students in grades 3 through 6. Each grade typically has one to three classes. All students belonged to one and only one class. In migrant schools there is typically no switching of classes during the semester. According to our data, no student attended one class during the baseline and another class during the evaluation survey. Within each school, we randomly picked 4 to 6 classes. We made sure that each grade in each school had at least one class included in the study. In no grade did we pick more than two classes from the same school. In total we enrolled 126 classes into our study. Every student in these classes participated in a baseline test (pre-test) and survey, and an evaluation test (post-test).

Based upon the results of the baseline test and survey, we divided the

schools (and classes/students) into those that participated in the paying-for-grades program and those that participated in the peer tutoring program. Specifically, we randomly chose 12 schools to host the paying-for-grades program. The other 11 schools hosted the peer tutoring program. We designed a random assignment algorithm to make sure that the pre-test math and reading scores, gender, and grades of the students, together with non-missing responses to nine other survey questions, were balanced across these two groups of schools. In almost all schools, we randomly selected four classes to implement the hosted program; the remaining one or two classes served as control classes.⁴ We ended up with 44 peer-tutoring classes from 11 schools, and 47 paying-for-grades classes from the other 12 schools. There were 35 control classes from across all 23 schools. We used the same random assignment algorithm to make sure that the same pre-treatment variables were balanced among students in the three types of classes (see Table 3).⁵

Our study included two types of control students: those in the control classes, and those in intervention classes who were not chosen to be active participants in the intervention. While nobody in the pure control classes received any type of treatment, the individual-level assignment in the intervention classes was more complex. According to our experimental design, we treated only half of the 20 poorest-performing students (that is, ten in total), leaving the other half (that is, the other ten) as a second group of control students. To implement this strategy, we initially ranked all students on the basis of their combined scores from the standardized math and reading tests taken during the baseline. For a typical class of 40 students, we then divided the class into quartiles: the top 10 students in quartile 1; the next 10 in quartile 2; and the poorest performing students (ranks 21 to 40) in the bottom two quartiles. The students in the bottom two quartiles were combined
⁴ In a few schools with only four classes, all classes implemented the hosted program.
⁵ Detailed randomness checks at the school and class level are omitted to save space. A rough check is provided in Table 3.

into one group of twenty (henceforth, the bottom half).⁶ While all students in the bottom half were eligible for treatment, not all of them were treated. In half of the intervention classes, we picked students in the bottom half with odd-numbered rankings (21, 23, . . . , 39) to participate in the program. The even-numbered students in the bottom half were left untouched and acted as control students. In the other half of the intervention classes, we selected the even-numbered students in the bottom half to act as treatment students instead, and left the odd-numbered students untouched (to function as control students). One illustration of assigning students to the treatment and control groups is given in Table 2.

The above individual assignment was the same for both peer tutoring and paying-for-grades classes, with one significant difference. The top 10 students in the peer tutoring classes became peer tutors, while the top 10 students in the paying-for-grades classes did not participate actively in the intervention. In the peer tutoring classes, each tutee was matched with one tutor from the same class. The matching was random (see evidence in Section 6.3).
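The odd/even assignment rule above is simple enough to state in code. The following is a minimal sketch under stated assumptions: a pandas DataFrame with hypothetical columns class_id and baseline_score, and a set treat_even of class ids in which the even-ranked half is treated.

```python
import pandas as pd

def assign_within_class(df: pd.DataFrame, treat_even: set) -> pd.DataFrame:
    """Stylized version of the within-class assignment (top 10 / bottom 20)."""
    df = df.copy()
    # Rank students within each class by combined baseline score (1 = best).
    df["rank"] = (df.groupby("class_id")["baseline_score"]
                    .rank(ascending=False, method="first").astype(int))
    class_size = df.groupby("class_id")["rank"].transform("max")
    df["role"] = "middle"
    df.loc[df["rank"] <= 10, "role"] = "top"       # tutors in tutor classes
    bottom = df["rank"] > class_size - 20          # the 20 poorest performers
    even = df["rank"] % 2 == 0
    treated = bottom & (even == df["class_id"].isin(treat_even))
    df.loc[bottom, "role"] = "within_class_control"
    df.loc[treated, "role"] = "treated"
    return df
```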

2.3 Program Implementation

In late August 2009, according to the methodology discussed above, we chose 23 schools and 126 classes from these schools to participate in our study. In early September, our enumerators implemented a standardized test in these 126 classes at the beginning of the semester. This baseline test had two parts: math and reading. All classes in the same grade used the same test; different grades used different tests. All tests were designed by a national educational expert from Beijing Normal University. All test questions were multiple-choice. The exam was designed to look like the exams that students in Chinese schools are accustomed to taking. We also conducted a baseline survey before the test which asked students to answer basic background questions about themselves and their families. Each test room had one teacher
⁶ Because the actual class size could be slightly above or below 40, what we actually did was divide students into three groups: the top 10, the bottom 20, and the middle students.

and one enumerator who acted as exam managers, and one or two additional enumerators who walked around the test rooms as monitors.

After randomly assigning peer tutoring, paying-for-grades, and control classes and students based on the baseline test (as described above), our enumerators returned to the 23 schools in the middle of September and implemented the interventions. Our enumerators summoned the chosen students (payees, tutees, and tutors) in each program class to the headmaster's office and announced the program to them in the presence of their teachers and headmasters. The purpose of announcing the program in such a formal setting was two-fold. First, we wanted to contact only the treated students in the intervention classes and to avoid contact with the other students. Second, we wanted to make our offer sound credible. The intervention was described as a competitive scholarship program conducted by a renowned government research institute and aimed at boosting the academic performance of tutees and payees. We told the members of the assembled groups (according to a predetermined, standardized text that was read by the leader of each enumeration team) that because of funding limitations we could only select some students to participate. They were encouraged to challenge themselves to improve as much as possible by the next test, scheduled at the end of the semester. Enumerators avoided labeling the students as underperforming students needing particular help. Specifically, the tutees and payees did not know that they were from the bottom half of the class. In peer tutoring classes, tutors and their tutees were required to sit on the same bench so that they could interact frequently. Our enumerators ensured that the teachers made the necessary seat assignment changes according to a list we provided.

At the end of the meeting we also gave students an official letter (with an official stamp from our research institute) describing our motivation and the key points of the program. The explanation fit on one piece of paper. Students were required to ask their parents to sign and return the letters if they wanted their children to participate in our programs. Students who did not want to participate themselves were allowed to exit. Fewer than five of


the parents and students (out of about 850 treatment students) declined our offer.

The parental communication intervention consisted of two parts. In the middle of November, our enumerator team called all of the parents of the treatment students who were assigned to the communication group. The enumerators read a pre-determined standard message to the parents. The content of the message essentially repeated what was in the official letter. In addition, as the second part of our parental communication intervention, since almost all of the parents used mobile phones, we sent them a short text message two weeks before the evaluation test reminding them of the approaching test for the rewards.

Our enumerators returned to the schools to implement a standardized evaluation test in early January 2010. The tests were designed to be similar to the baseline tests, taking into account that students had received an additional semester's worth of education. The tests were graded shortly after the evaluation survey, and awards and certificates were distributed to the students in official ceremonies held in each school. In total, 27,000 yuan (about 4,000 U.S. dollars) was distributed.

2.4 Data Description

The data for our entire sample are described in Table 3, by class type. The variables Pre-test and Post-test represent the standardized scores on the baseline and evaluation tests, respectively.⁷ The Male variable is a binary gender dummy. There are slightly more males in the sample (about 55%), an imbalance almost always found in schools attended by students from rural areas in China. The Grade variable takes the value of 3, 4, 5, or 6, and serves as a proxy for age and educational attainment. Male, Pre-test, and Grade are the so-called pre-treatment variables in our analysis. In part
⁷ Standardization (for math and reading separately) was done within each grade because different grades used different tests. The standardized math and reading scores were then averaged to produce Pre-test and Post-test.


they were chosen because they had the least missing information.⁸ Table 3 shows that there is little difference between the control and intervention classes in terms of the three pre-treatment variables. The t-test results confirm this observation (details not shown here). Table 3 also shows that students are distributed evenly across the two types of intervention classes. Specifically, 22.1% of the students in tutor classes and 23.1% of the students in pay classes were assigned to be tutees and payees, respectively. These numbers are quite reasonable: the average intervention class size is 43, and ten treated students are about 23 percent of the total class. Among the treated students (i.e., those with Treatment = 1), about half (those with Treatment Basic = 1) were not assigned to receive the parental communication intervention, while the other half (those with Treatment Basic + Call = 1) received this intervention.
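As footnote 7 describes, scores were standardized within grade (because different grades took different tests) and the two subject z-scores were averaged. A minimal sketch of that construction, assuming a DataFrame with hypothetical columns grade, math_raw, and reading_raw:

```python
import pandas as pd

def standardize_scores(scores: pd.DataFrame) -> pd.DataFrame:
    out = scores.copy()
    for subject in ["math_raw", "reading_raw"]:
        g = out.groupby("grade")[subject]
        # z-score within grade, since different grades took different tests
        out[subject.replace("_raw", "_z")] = (out[subject] - g.transform("mean")) / g.transform("std")
    # Pre-test/Post-test is the average of the two subject z-scores.
    out["test_z"] = out[["math_z", "reading_z"]].mean(axis=1)
    return out
```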

3 Evaluation Design

In this section we describe the evaluation strategy used to analyze the impact of paying-for-grades and peer tutoring on standardized test scores. Because of the nature of our compound intervention scheme, it is possible to estimate many different combinations of intervention effects; however, since we are most interested in the overall effectiveness of our pay and tutor programs, in this section and Section 4 we focus on estimating the total (or overall) program effects. In Section 5, we decompose the total effect into some of its basic components, including the pure effect of paying-for-grades, the pure effect of parental communication, and the pure effect of peer tutoring.
⁸ Our baseline survey asked the children to fill out much more background information, but many children had trouble understanding and filling out the survey questions by themselves.


3.1 Evaluation Sample and Randomness Checks

Because of the design of our interventions, our sample contains two independent groups of control students, allowing us to assess the impact of paying-for-grades and peer tutoring using two different empirical approaches. The within-class controls are students from the bottom half of intervention classes who were not chosen to be active participants in the interventions. The across-class controls are students from the bottom half of classes in which neither intervention was implemented. Because we have two separate control groups, the number of students in an analysis depends on whether we use within-class controls or across-class controls. For example, when using the within-class design for the pay intervention, the sample includes all of the students (about 20 per class) in the bottom half of the 47 pay intervention classes. In total the sample size for this analysis is 888, with 443 treated students (the payees) and 445 within-class controls.⁹ The across-class design for the pay intervention effect includes the same set of payees and 690 across-class controls. The 690 across-class controls roughly correspond to the 20 poorest-performing students in the bottom half of the 35 pure control classes. The within-class and across-class designs for the tutor program can be defined in a similar way.

Table 4 shows the descriptive statistics and randomization checks for both the within-class design (panel A) and the across-class design (panel B) of the paying-for-grades and peer tutoring interventions. The sample sizes for each sub-experiment are also reported. In both designs, the differences between the pre-treatment variables of the control and intervention groups are quite small. None are significant at the 5% level.¹⁰ We also performed two-sample
⁹ In theory, 47 × 20 = 940. The discrepancy between 888 and 940 is largely due to the fact that in some smaller classes we only invited 8 or 9 students to participate in our program. Another factor is missing information.
¹⁰ The numbers of treated students differ slightly between the within-class design and the across-class design due to minor technical differences in definition.


Kolmogorov-Smirnov tests of the equality of distribution functions for all the pairs of variables reported in Table 4. The combined K-S p-values are well above 0.15 for almost all pairs (details not reported).

One notable feature of our intervention is that it is likely to be cost-effective because of the use of individual-based randomization in the within-school design of the program. Almost all previous studies randomized only at the school level, including one of the largest paying-for-grades experiments to date (reported in Fryer, 2010). That study included 261 American public schools across four school districts: 123 intervention schools and 138 control schools. Although the total number of students participating in the study was large (around 38,000), the number of true experimental units was relatively small since the randomization was carried out at the school level. Moreover, because nearly all students in the intervention schools participated in the study, the total cost of the experiment, including $6.3 million distributed to students as prizes and other administrative expenses, was quite large. Our design allowed us to conduct the study with fewer than 6,000 students, of which around 850 were treated, and two independent groups of students served as controls. Because of this, the total cost of our program was only about $4,000 for all the prizes (exclusive of administrative costs).¹¹
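A minimal sketch of one such two-sample Kolmogorov-Smirnov balance check, using scipy; the placeholder data below are illustrative, not the study's:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
control = rng.normal(-0.60, 0.8, size=690)   # placeholder pre-test scores
treated = rng.normal(-0.66, 0.8, size=407)

stat, pvalue = ks_2samp(control, treated)
print(f"K-S statistic = {stat:.3f}, p-value = {pvalue:.3f}")
# Balanced groups should yield p-values well above conventional thresholds.
```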

3.2 Empirical Strategy

The main empirical strategy is captured by the following OLS regression, which can be applied to both the within-class and across-class designs:
¹¹ According to Duflo, Glennerster, and Kremer (2008), the purpose of using school-based randomization is often political, as providing extra resources to some students in a school but not to others can generate resentment or discomfort. This was not a big concern for us because our program was implemented as a pilot project (shidian), which is quite common in China during economic reforms. Another concern is that spillovers from treatment to comparison groups in the same school can bias the estimation of treatment effects. This does not apply to our case, as untreated students in our program classes were not eligible for prizes.


$$\text{Post-test}_i = \alpha + \beta\,\text{Treatment}_i + \gamma X_i + \varepsilon_i \qquad (1)$$

where Post-test and Treatment are the dependent and key explanatory variables described above and in Table 3. The matrix $X$ includes our three control variables, Pre-test, Male, and Grade (also described above), as well as a set of class dummies. Standard errors, throughout the paper, are clustered at the class level. In the general model, $\beta$ is the total program effect, the main coefficient of interest we examine in Sections 4 and 5. If we use the sample of schools in which we implemented the paying-for-grades intervention, $\beta = \beta_1$, where $\beta_1$ is an estimate of the program's overall (or total) impact on payees (cash incentive + calling half of the parents). If we use the sample of schools in which we implemented the peer tutoring intervention, $\beta = \beta_2$, where $\beta_2$ is an estimate of the program's overall (or total) impact on tutees (cash incentive + peer tutoring + calling half of the parents).

Because each treated student has a within-class control with a nearly identical pre-test score, we can also estimate the overall impact using a matched-pair approach. In this case the match is based on the same-class pre-test ranking (i.e., we match the 40th-ranked treatment student in an intervention class with the 39th-ranked control student; then we match the 38th-ranked treatment student with the 37th-ranked control student, and so on).

Besides using standardized test scores directly as the outcome variable, as a check of the robustness of our empirical approach we also use the post-test within-class ranking percentile as an alternative outcome variable. For example, for a class of 40 students, the bottom student (the 40th-ranked student in terms of the post-test score) has a percentile score of 0, the 39th student has a percentile score of 0.025, the 38th student has a percentile score of 0.05, and so on. Using this outcome variable we can evaluate changes in the relative ranking within a class due to our interventions.
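A minimal sketch of how Equation (1) might be estimated with class-clustered standard errors, using statsmodels on synthetic data; all variable names are hypothetical, and the class dummies absorb the grade effect because grade is constant within a class:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 800
df = pd.DataFrame({
    "class_id": rng.integers(0, 40, n),
    "treatment": rng.integers(0, 2, n),
    "pre_test": rng.normal(size=n),
    "male": rng.integers(0, 2, n),
})
df["post_test"] = 0.14 * df["treatment"] + 0.6 * df["pre_test"] + rng.normal(size=n)

model = smf.ols(
    "post_test ~ treatment + pre_test + male + C(class_id)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["class_id"]})
print(model.params["treatment"], model.bse["treatment"])  # beta and its SE
```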


4 Total Program Effects

4.1 Impact on Tutees/Payees Using Within-Class Design

Graphical evidence of the program effects on payees/tutees using the within-class design is provided in Figure 2. The distributions of the pre-test scores for the control and intervention groups are quite similar in the case of both the peer tutoring intervention (Panel A) and the paying-for-grades intervention (Panel B). In the peer tutoring intervention, the post-test score distribution of the tutees systematically shifts to the right of that of the control students (Panel C). The difference is about 0.10 to 0.15. However, in the case of the paying-for-grades intervention, there is little difference between the distributions of the post-test scores of the control and intervention groups (Panel D). In other words, the graphical evidence indicates a positive overall effect of peer tutoring on standardized test scores, but no effect of paying-for-grades.

Using a regression similar to that specified in Equation 1, but excluding the control variables, it can be seen that the total treatment effect of peer tutoring is about 0.128 sd for the tutees (Table 5, panel A, column 1). After adding the three control variables (Pre-test, Male, Grade), the measured effect of peer tutoring on the treatment students rises slightly to 0.137 sd (panel A, column 2). Both effects are statistically significant at the 5% level. Table 5 also shows the estimates of the effect of peer tutoring on tutees' post-test scores using matched-pair analysis (panel B, column 1). This method of analysis yields a slightly stronger measured impact of peer tutoring, at 0.14 sd. This result is also significant at the 5% level.

In contrast, the estimated effect of the paying-for-grades intervention is not significantly different from zero. In particular, whether we use simple OLS (panel A, column 3), the full regression model from Equation 1 (panel A, column 4), or matching (panel B, column 2), the effect of the paying-for-grades intervention on payees is close to zero and statistically insignificant at the 10% level.

Using the post-test within-class ranking percentile as the outcome variable yields results similar to those in which we use post-test scores as the outcome variable (results not reported for brevity). Analyses using different econometric models all yield the same result: the within-class ranking percentile rises in peer tutoring intervention classes for the tutees (the coefficients on the treatment variable range from 0.067 to 0.070, and are all significant at the 1% level). For an average class of 43 students, the size of this effect implies that a typical tutee improved her academic performance enough over the course of the study period to overtake nearly 3 students on the post-treatment test. However, in the case of the paying-for-grades intervention classes, there is no measured effect of the intervention on payees using the within-class ranking percentile (consistent with the results of the analysis using post-test scores as the outcome variable).

4.2 Impact on Tutees/Payees Using Across-Class Design

The estimated impact on everyone in the program classes using the across-class design is reported in Table 6. This subsection reports the estimated impact on tutees/payees (results in column 1), corroborating our results from the within-class design. Section 4.3 will discuss the estimated impact on the other students in the program classes (results in columns 2-4), impacts which could not be estimated using our within-class design because such students do not have within-class controls.

The regression approach, based on the specification in Equation 1, indicates that the impact of peer tutoring on the post-test scores of tutees is 0.126 sd (Table 6, panel A, column 1). Even though this analysis uses a completely different group of control students, the magnitude of the estimated effect is nearly identical to that found using the within-class design, and both are statistically significant at the 10% level. This is also the case using the alternative outcome variable of within-class post-test ranking percentile,


which yields an estimated effect of 0.042 (Table 6, panel B, column 1). Although slightly lower than the estimated effect found using the within-class analysis (0.07), the coefficients in both cases are statistically significant at the 1% level.

The estimates of the effect of paying-for-grades using across-class controls are consistent with those found using within-class controls (Table 6, panels A/B, column 1). The estimated effect of paying-for-grades on the post-test scores of payees is close to zero and not significant at the 10% level. The same is true when using within-class ranking percentiles as the outcome variable.

4.3 Effects on Other Students in the Program Classes

Beyond the robust positive effect on tutees (and the insignificant effect on payees), we are also concerned with the effect of the interventions on the other students in the intervention classes, who were neither tutees nor payees. The presence or absence of such effects is important to understand for several reasons. First, in the case of peer tutoring classes, we are interested in whether the grades of the tutors were adversely affected. Second, in both types of classes, we are interested in whether there are spillover effects on students in program classes who were not active participants in the intervention.

Within each intervention class, there are four types of students, described in Table 2: (1) tutees/payees (i.e., students in the bottom half of the class who were in the treatment group); (2) within-class controls (i.e., students in the bottom half of the class who were not in the treatment group); (3) quartile 2 students, who did not participate actively in any part of either intervention; and (4) quartile 1 students, who served as tutors in tutoring classes and who did not participate actively in the intervention in paying-for-grades classes. We compare these four types of students from the intervention classes to the same types of students from the pure control classes. Using a regression based on Equation 1, the estimated treatment effects are reported in Table 6. Because we are interested in the effects on the four different types of students, we use both post-test scores (reported in panel A) and post-test

within-class ranking percentile (reported in panel B) as outcome variables. To illustrate the analysis of the intervention effects on the four different student types, we compare quartile 2 students in the paying-for-grades classes to quartile 2 students in the pure control classes. The estimated effect using post-test scores as the outcome variable is 0.114 (Table 6, panel A, column 3). The estimated effect using post-test within-class ranking percentile as the outcome variable is -0.016 (Table 6, panel B, column 3). Importantly, neither coefficient is significant at the 10% level.

In general, our results suggest that students in the intervention classes do not have lower post-test scores compared with the same type of students in the pure control classes. Almost all of the estimated coefficients are positive, though none of them are significant at the 10% level. Importantly, the tutors did not suffer academically ($\beta = 0.081$; Table 6, panel A, column 4). Using post-test within-class ranking percentile as the outcome variable yields similar results (panel B, column 4). Most of the estimated coefficients are close to 0 and not statistically significant at the 10% level.

If we focus on the impacts on the post-test within-class ranking percentiles of the different types of students in tutoring classes (panel B, row 1), we find an interesting, though intuitive, pattern. While the tutees improved their ranking percentile by 0.042 (panel B, column 1, row 1), the ranking percentiles of the within-class control students dropped by 0.025 (panel B, column 2, row 1). Both effects are statistically significant at the 10% level. The differences imply that tutees, on average, moved up the ranking ladder relative to their within-class control students. In relative terms, the gains of the tutees came at the loss of the within-class control students. In absolute terms, however, within-class control students in peer tutoring classes did not become worse off, as the coefficient on their post-test score is close to zero and not statistically significant (panel A, column 2, row 1).

In sum, the improvement in standardized test scores for tutees is an efficiency improvement. The standardized scores of untreated students in intervention classrooms were unaffected. Importantly, the tutors' scores did


not suffer after interacting with tutees. Tutees experienced an improvement in their relative ranking in the class at the expense of untreated students in the bottom half, who experienced a decline in their relative ranking, but the relative rankings of students in the top two quartiles did not change. These neutral results lend further strength to the positive and significant program effect estimates for tutees.

4.4 Heterogeneous Effects

If we run the regression of Equation 1 for different subgroups of students (by pre-test scores, by gender, and by grade), we find that there may exist some heterogeneous program effects of the peer tutoring intervention, but not of the paying-for-grades intervention (Table 7). In the case of the peer tutoring intervention, it appears as if the effect is higher for girls, for students in higher grades (i.e., 5 and 6), and for students with below-median pre-test scores, although the differences are not consistently significant across the within-class and across-class designs. Among these heterogeneous effects, the hypothesis that tutees in grades 5 and 6 benefit more from our program has the strongest support across different specifications and when using the different control groups. The size of the coefficients for these groups suggests that if we had focused our programmatic attention on students in grades 5 and 6, we would have boosted our overall program effect from about 0.14 to 0.21, a 50% increase.

5 Decomposing the Total Program Effects

5.1 Empirical Strategy

Our compound intervention design randomly divides treated students into four groups, each receiving one particular combination of the three interventions, from which we can potentially estimate four basic treatment effects, namely $\beta_{11}$, $\beta_{12}$, $\beta_{21}$, and $\beta_{22}$, as shown in Table 1.

Because every treated student received a cash incentive contract, the payees who did not also participate in the parental communication intervention can serve as a baseline. Additional treatment effects may be calculated as marginal effects on top of the baseline effect:

1. pure effect of paying-for-grades: $\beta_{11}$
2. pure (i.e., marginal) effect of parental communication: $\beta_{12} - \beta_{11}$ (conditional on the pay program) or $\beta_{22} - \beta_{21}$ (conditional on the tutor program)
3. pure (i.e., marginal) effect of peer tutoring: $\beta_{21} - \beta_{11}$ (conditional on no phone call) or $\beta_{22} - \beta_{12}$ (conditional on the phone call)

Of course, we are also interested in the overall effectiveness of our pay program, $\beta_1 = (\beta_{11} + \beta_{12})/2$, and the overall effectiveness of our tutor program, $\beta_2 = (\beta_{21} + \beta_{22})/2$. These two effects (the total program effects) average across the rows in Table 1, which is valid because assignment to the parental communication intervention is random.

Our estimations of the total program effects on post-test scores show that $\beta_1 = 0$ for the pay program. Because the total effect $\beta_1 = (\beta_{11} + \beta_{12})/2$, in theory we can infer that $\beta_{11} = \beta_{12} = 0$.¹² So we infer from the estimation that the pure effect of paying-for-grades, $\beta_{11}$, and the marginal effect of parental communication, $\beta_{12} - \beta_{11}$ (conditional on no tutoring), are both zero.

Because our estimation of the total program effect on post-test scores for the peer tutoring intervention is positive, with $\beta_2 \approx 0.14$, we cannot directly infer estimates for $\beta_{21}$ and $\beta_{22}$ from the relationship $\beta_2 = (\beta_{21} + \beta_{22})/2$. Therefore we do not know how much of the total program impact is due to peer tutoring, and how much is due to parental communication. We can, however, estimate $\beta_{11}$ and $\beta_{12}$ (or $\beta_{21}$ and $\beta_{22}$) simultaneously in one OLS regression by using either a within-class or across-class design for the pay program (or for the tutor program):
¹² In theory, both $\beta_{11}$ and $\beta_{12}$ should be non-negative. We observe no evidence that this is not true. The same is assumed for $\beta_{21}$ and $\beta_{22}$.


$$\text{Post-test}_i = \alpha + \beta_{j1}\,\text{TreatBasic}_i + \beta_{j2}\,\text{TreatBasicCall}_i + \gamma X_i + \varepsilon_i \qquad (2)$$

where $j = 1$ for pay program samples and $j = 2$ for tutor program samples. TreatBasic and TreatBasicCall abbreviate the variables Treatment Basic and Treatment Basic + Call, respectively. The other variables are specified as in Equation 1.

Only about 70% of the students provided valid phone numbers for their parents in the pre-test survey. Students with missing phone numbers are like never-takers in an encouragement design. In a typical encouragement design the analysts do not know who the never-takers are, and appropriate statistical procedures need to be designed to handle the problem. In our case the identity of the never-takers is quite obvious, so we simply drop them to improve the estimation efficiency of our model.¹³ Table 3 shows that students with missing phone numbers are distributed randomly across the three types of classes. The same holds within program classes (results omitted). We also checked the randomness of our program assignments after dropping students with missing phone numbers (Table 8).
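A minimal sketch of Equation (2) in the same style, again on synthetic data with hypothetical names; the sample is restricted to students with known phone numbers, as in the text:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 600
df = pd.DataFrame({
    "class_id": rng.integers(0, 30, n),
    "pre_test": rng.normal(size=n),
    "male": rng.integers(0, 2, n),
    "has_phone": rng.random(n) < 0.7,   # ~70% provided valid numbers
})
arm = rng.integers(0, 3, n)             # 0 control, 1 basic, 2 basic + call
df["treat_basic"] = (arm == 1).astype(int)
df["treat_basic_call"] = (arm == 2).astype(int)
df["post_test"] = (0.11 * df["treat_basic"] + 0.20 * df["treat_basic_call"]
                   + 0.6 * df["pre_test"] + rng.normal(size=n))

sub = df[df["has_phone"]]               # drop the known never-takers
model = smf.ols(
    "post_test ~ treat_basic + treat_basic_call + pre_test + male + C(class_id)",
    data=sub,
).fit(cov_type="cluster", cov_kwds={"groups": sub["class_id"]})
print(model.params[["treat_basic", "treat_basic_call"]])  # beta_j1, beta_j2
```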

5.2 Decomposition Results

We report our estimation results in Table 9. Let us first focus on the results from the within-class analysis (columns 1 and 3). We find that $\beta_{21} = 0.110$ (significant at the 10% level) and $\beta_{22} = 0.197$ (significant at the 5% level). These results imply that the peer tutoring intervention without the parental communication intervention has an impact of 0.110. The addition of the parental communication
¹³ Strictly speaking, if we do not drop those with missing phone numbers, our estimated program effects for $\beta_{21}$ and $\beta_{22}$ are intention-to-treat (ITT) effects (i.e., the impact of being assigned to receive a parental phone call). If we drop these students, the estimated program effects are the treatment effects of actually receiving a parental phone call, which tend to be much higher than the ITT effects. In a slight abuse of notation we still call them $\beta_{21}$ and $\beta_{22}$.


intervention in the tutoring program improved test scores by a further 0.197 − 0.110 = 0.087. If we average $\beta_{21}$ and $\beta_{22}$, we can infer a new estimate of about 0.153 for the overall effect of the peer tutoring program ($\beta_2$) based on the relationship $\beta_2 = (\beta_{21} + \beta_{22})/2$. This number is slightly larger than the directly estimated total effect $\beta_2 \approx 0.14$ from the previous section. The discrepancy is due to the missing phone numbers for 30% of our sample. If we assume that those with missing phone numbers also have a treatment effect of 0.110, we find that (0.197 × 50% + 0.110 × 50%) × 70% + 0.110 × 30% = 0.140, which is exactly equal to the directly estimated total tutoring effect from the previous section. In other words, if there were no missing phone numbers, the total effect of the tutor program would be 0.153 instead of 0.140. Consistent with our prior results, the directly estimated $\beta_{11}$ and $\beta_{12}$ are not significantly different from zero at the 10% level.

The results from the across-class design (columns 2 and 4) are consistent with the results from the within-class design. The estimated $\beta_{21}$ is 0.12, but is only marginally significant. This may be due to the reduced sample size caused by dropping students with missing phone numbers. The estimated $\beta_{22}$ is 0.24, and is still significant at the 5% level. The directly estimated $\beta_{11}$ and $\beta_{12}$ are not significantly different from 0 at the 10% level.

Summarizing our results:

1. pure effect of the paying-for-grades intervention: $\beta_{11} = 0$
2. pure effect of the parental communication intervention: $\beta_{12} - \beta_{11} = 0$ (conditional on participating in the paying-for-grades intervention) or $\beta_{22} - \beta_{21} = 0.087$ (conditional on participating in the peer tutoring intervention)
3. pure effect of peer tutoring: $\beta_{21} - \beta_{11} = 0.110$ (conditional on no parental communication) or $\beta_{22} - \beta_{12} = 0.197$ (conditional on parental communication)
4. total effect of the paying-for-grades program: $\beta_1 = (\beta_{11} + \beta_{12})/2 = 0$

5. total effect of the peer tutoring program: $\beta_2 = (\beta_{21} + \beta_{22})/2 = 0.153$ (assuming no missing phone numbers) or 0.140 (with 30% missing phone numbers)
6. Most importantly, the integrated strategy (using all three interventions) can boost the total program effect to $\beta_{22} = 0.197$.

Peer tutoring and parental communication work as complementary inputs in the underlying production function for learning. The marginal effect of peer tutoring is higher in the presence of parental communication, and the marginal effect of parental communication is higher in the presence of peer tutoring. These results suggest that increasing one input may have spillover effects on the other.
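The reconciliation arithmetic behind item 5 can be checked directly; the point estimates below are the Table 9 within-class values quoted above:

```python
b21, b22 = 0.110, 0.197
print((b21 + b22) / 2)                    # 0.1535, i.e. ~0.153 with no missing numbers
print(0.7 * (b21 + b22) / 2 + 0.3 * b21)  # 0.14045, i.e. ~0.140 with 30% missing
```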

6 Robustness Checks on Rules, Cheating, and Bench-mates

We have shown that our results are robust across two independent sets of control groups, with the different estimates corroborating each other. Here we perform some additional robustness checks with respect to several aspects of our intervention.

6.1 Rule Knowledge and Subjective Evaluation

In order for our programs to have made the difference we claim, one key condition is that students knew the rules of the tournaments. The responses to a random end-of-program survey (N = 273) show that the majority of the intervention students knew the prize structure well. Only 7 out of 60 tutors (12%), 8 out of 60 tutees (13%), and 12 out of 48 payees (25%) responded that they did not know it well. On the other hand, the majority of the non-participating students did not know the prize structure well, with 57 out of 106 non-participating students (54%) providing a negative response.

In addition, students' subjective evaluation of the interventions is a helpful metric by which to judge the plausibility of our findings. We found that students with better knowledge of the prize structure were far more likely to agree that the program made a positive impact. What's more, students who were more deeply involved in the intervention (with tutors and tutees considered to be more involved than payees) were more likely to agree that the program had a positive impact. The percentage of students in each group who agreed increases with their involvement: non-participating students (38%), payees (64%), tutees (70%), and tutors (83%). This ordering is compatible with our estimation results: paying-for-grades students were more negative about the intervention, while tutors and tutees were more positive. Common sense suggests that tutors who help others will tend to believe that they have made a bigger positive impact compared with those who received help.

6.2 Cheating on Exams

Our peer tutoring intervention offered both incentives (the pair-based scholarship) and opportunities (sitting on the same bench) for tutors and their tutees to cheat on our post-test exams; however, since the exam was monitored by our own enumerators, cheating should not have been a serious problem. What's more, the impact of the randomly assigned parental communication intervention also suggests that something deeper than cheating on exams is driving our results. Nevertheless, we want to show some direct evidence that our effect was not due to collusion between the tutors and tutees. We do not have to worry about other types of cheating, because we did not observe a treatment effect for the paying-for-grades experiment, which also offered incentives to cheat (but not the opportunity).

For each tutee ($i$) and her tutor, we calculated the percentage of the questions for which they came up with identical answers on their math (or reading) test ($x_i$). Next, we defined a group of counterfactual tutors

comprised of the other students who took the same test and got almost the same score as the tutor concerned. Typically there were about 50-200 such students for each tutee. Similarly, we can calculate the percentage of identical answers between the tutee and each counterfactual tutor. If $x_i$ is larger than the 95th percentile of these calculated percentages, we conclude that the tutee and her tutor cheated, because they came up with too many identical answers relative to the group of counterfactual tutors.

Our algorithm shows that about 31 out of 413 tutees (7.5%) can thus be considered as colluding with their tutors on the math exam. The corresponding number for the reading exam is 14 tutees, or 3.4%. Because both percentages are very close to the theoretical level of 5% (the probability of a type I error), cheating does not appear to be a serious problem in the intervention.
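A minimal sketch of the collusion check just described; answers is a hypothetical mapping from student ids to arrays of multiple-choice answers, and the counterfactual tutors are assumed to be pre-selected among students with nearly the same score on the same test:

```python
import numpy as np

def identical_share(a: np.ndarray, b: np.ndarray) -> float:
    """Share of questions on which two students gave identical answers."""
    return float(np.mean(a == b))

def flag_collusion(tutee, tutor, counterfactual_tutors, answers) -> bool:
    x_i = identical_share(answers[tutee], answers[tutor])
    shares = [identical_share(answers[tutee], answers[c])
              for c in counterfactual_tutors]   # typically 50-200 comparisons
    # Flag if the real pair matches more often than the 95th percentile
    # of the counterfactual pairs.
    return x_i > np.percentile(shares, 95)
```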

6.3 Pure Peer Effects

In the economics of education literature, an under-performing student may benefit from the proximity of a high-performing peer, simply by living in the same dorm or studying in the same classroom. Since our tutees were assigned to sit next to a high-performing tutor, one alternative explanation of our results is that they are entirely due to pure peer effects (i.e., a bench-mate effect), making the additional incentives or interventions unnecessary. The impact of the parental communication intervention is an important piece of evidence against this hypothesis: if peer effects were driving our results, then the results should not be affected by the parental communication intervention. More direct evidence against the peer-effects hypothesis can be produced by estimating the bench-mate effect.

First, we check that tutor-tutee matching is random. Even though all tutees were assigned a high-performing tutor, the tutors have quite diverse pre-test scores (mean = 0.88, sd = 0.27). The correlation between the within-class pre-test rankings of tutees and tutors is only 0.0018. If we regress the tutee ranking on the matched tutor ranking, the t-statistic for the

coefficient is only 0.03. If we use the pre-test score instead, the corresponding t-statistic is only -0.66. Second, following the standard specification in the peer effects literature, we regress the post-test score of each tutee on his or her randomly assigned tutor's pre-test score. The estimated bench-mate effect (for a sample of about 400 tutees) is only 0.025, with a t-statistic of 0.11. So we do not have strong evidence of a bench-mate peer effect. Our weak peer-effects estimate is consistent with previous studies summarized in Kremer and Levy (2008).
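A minimal sketch of the two checks above, run on synthetic data with hypothetical column names (one row per tutor-tutee pair):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
pairs = pd.DataFrame({
    "tutee_rank": rng.permutation(400),           # within-class pre-test ranks
    "tutor_rank": rng.permutation(400),
    "tutor_pre": rng.normal(0.88, 0.27, 400),     # tutors' pre-test scores
})
pairs["tutee_post"] = rng.normal(size=400)        # placeholder outcome

# 1. Random matching: tutee and tutor rankings should be nearly uncorrelated.
print(pairs["tutee_rank"].corr(pairs["tutor_rank"]))

# 2. Bench-mate effect: regress the tutee's post-test on the tutor's pre-test.
fit = smf.ols("tutee_post ~ tutor_pre", data=pairs).fit()
print(fit.params["tutor_pre"], fit.tvalues["tutor_pre"])
```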

7 Conclusion

After a full decade of implementing cash incentive programs around the world, we have arrived at a point from which we can evaluate whether bypassing schools and parents is a successful strategy, and design better strategies based on field experience. Our interventions testing cash incentives, incentivized peer tutoring, and parental communication suggest that implementing pure cash incentive programs that bypass other learning factors is a shortsighted policy that is not effective at enhancing learning.

Tutoring appears to be a central factor affecting learning in our study. The intervention including both a cash incentive and parental communication but not tutoring did not affect learning, while any intervention that included incentivized peer tutoring was always effective at enhancing actual learning. Our results suggest that improving weak teaching is central to remedying education, consistent with the empirical evidence summarized in Glewwe and Kremer (2006). Incentivized peer tutoring represents an effective new strategy combining cash incentives and peer tutoring. Future interventions may also want to exploit the beneficial interactions between parental communication and peer tutoring, and use integrated intervention approaches involving students, tutors (or teachers), and parents.

The study design is not perfect. The effects reported were short-term impacts; the long-term effect, if one exists, is unclear, although in

another working paper we reported some encouraging long-run impacts of a similarly designed incentivized peer tutoring intervention after the incentive was removed. That working paper also shows that peer tutoring without an incentive contract is probably less effective. The design of our current intervention also does not allow us to study the potential interaction between the cash incentives for grades and the tutoring/parental communication interventions. It is possible that the cash-incentives-for-grades intervention works like the parental communication intervention, not affecting learning directly, but making tutoring more effective. It was not possible for us to explore all these potential interactions in one experiment. Future research in this direction may enhance our understanding of cash incentives in education.

References
[1] Angrist, Joshua, Daniel Lang, and Philip Oreopoulos. 2009. "Incentives and Services for College Achievement: Evidence from a Randomized Trial." American Economic Journal: Applied Economics 1(1): 136-163.
[2] Angrist, Joshua, and Victor Lavy. 2009. "The Effects of High Stakes High School Achievement Awards: Evidence from a Randomized Trial." American Economic Review 99(4): 1384-1414.
[3] Duflo, Esther, Rachel Glennerster, and Michael Kremer. 2008. "Using Randomization in Development Economics Research: A Toolkit." Handbook of Development Economics. Elsevier.
[4] Fryer, Roland G., Jr. 2010. "Financial Incentives and Student Achievement: Evidence from Randomized Trials." NBER Working Paper 15898, National Bureau of Economic Research.


[5] Glewwe, Paul, and Michael Kremer. 2006. "Schools, Teachers, and Education Outcomes in Developing Countries." Handbook of the Economics of Education. Elsevier.
[6] Hanushek, Eric. 1995. "Interpreting Recent Research on Schooling in Developing Countries." World Bank Research Observer 10(2): 227-246.
[7] Jackson, C. Kirabo. 2010. "A Little Now for a Lot Later: A Look at a Texas Advanced Placement Incentive Program." Journal of Human Resources 45(3).
[8] Kremer, Michael, and Dan Levy. 2008. "Peer Effects and Alcohol Use among College Students." Journal of Economic Perspectives 22(3): 189-206.
[9] Kremer, Michael, Edward Miguel, and Rebecca Thornton. 2009. "Incentives to Learn." Review of Economics and Statistics 91(3): 437-456.
[10] Lai, Fang, Chengfang Liu, Linxiu Zhang, Renfu Luo, Xiaochen Ma, and Scott Rozelle. 2010. "Benchmarking Standardized Test Scores in Migrant and Poor Rural Schools in China." Working Paper, Center for Chinese Agricultural Policy, Chinese Academy of Sciences, Beijing.
[11] Middleton, S., K. Perren, S. Maguire, J. Rennison, E. Battistin, C. Emmerson, and E. Fitzsimons. 2005. "Evaluation of Education Maintenance Allowance Pilots: Young People Aged 16 to 19 Years, Final Report of the Quantitative Evaluation." DfES Report No. RR678. London: DfES.
[12] Mauldon, J., J. Malvin, J. Stiles, N. Nicosia, and E. Seto. 2000. "The Impact of California's Cal-Learn Demonstration Project, Final Report." UC Data Archive and Technical Assistance, UC Data Reports: Paper CLFE.

29

[13] Slavin, Robert. E. 2009. Can nancial Incentives enhance educational outcomes? Evidence from international experiments. Baltimore, MD: Johns Hopkins University, Center for Data-Driven Reform in Education [14] Spencer, M.B., Noll, E., & Cassidy, E. 2005. Monetary incentives in support of academic achievement: Results of a randomized eld trial involving high-achievement, low-resource, ethnically diverse urban adolescenes. Evaluation Review, 29, 199-222. [15] Topping, Keith J., 2005. Trends in Peer Learning, Educational Psychology, 25(6): 631-645

30

Table 1: The compound intervention structure (paying-for-grades for everyone + a cross-cutting design of peer tutoring and parental communication) for the treated students parental communication assignment tutoring no yes Total no pay grades pay grades + parent (N= 225, 11 ) (N= 223, 12 ) (N=448, 1 ) yes pay grades + tutor pay grades + tutor + parent (N=207, 21 ) (N= 201, 22 ) (N=408, 2 ) Total 432 424 856
Note: Our compound design divides treated students into 4 groups. Compared with relevant control groups, we can estimate four dierent types of treatment eects: 11 represents students who received only the paying-for-grades intervention; 12 represents students who received both the paying-for-grades and the parental communication interventions; 21 represents students who received the paying-for-grades and the peer tutoring interventions; and 22 represents students who received all three interventions.

31

Table 2: Illustration of within-class program assignment to treatment and control groups in a typical class of 40 students Program Class Type Rank 1 2 ... 10 11 12 ... 20 21 22 23 ... 39 40 Tutor tutor tutor ... tutor ... tutees control tutee tutees control ... tutees control tutee Pay ... ... payees control payee payees control ... payees control payee Pure Control ... ... control control control ... control control

for both for both for both for both for both

Note: In half of the intervention classes, we picked students in the bottom half with odd-numbered rankings (21, 23, . . . , 39) to participate in the program. The even-numbered students in the bottom half were left untouched and act as control students. In the other half of intervention classes (shown here), we selected the even-numbered students in the bottom half to act as treatment students instead, and left the odd-numbered students untouched (and function as control students). The bottom half students in the pure control classes serve as extra across-class control students for both payees and tutees.

32

Table 3: Data summary (by program class types) Pre-test (standardized baseline test score) Post-test (standardized evaluation test score) Male (1/0) (1 if male) Grade (grades in 3, 4, 5, 6) Treatment (1/0) (1 for payee/tutee) Treatment Basic (1/0) (1 for payee/tutee without phone call) Treatment Basic+Call (1/0) (1 for payee/tutee with phone call) N of schools N of classes N of students Tutor -0.00310 (0.861) 0.0378 (0.937) 0.572 (0.495) 4.453 (1.124) 0.221 (0.415) 0.112 (0.315) 0.109 (0.311) 11 44 1850 Pay -0.00379 (0.828) 0.0324 (0.924) 0.557 (0.497) 4.391 (1.100) 0.231 (0.422) 0.116 (0.320) 0.115 (0.320) 12 47 1933 Control 0.00960 (0.818) -0.0285 (0.955) 0.548 (0.498) 4.339 (1.110) 0 (0) 0 (0) 0 (0) 23 35 1486

33

Table 4: Data and randomness checks of the evaluation samples for the tutees/payees and their control groups in two alternative evaluation designs Peer Tutoring (1) (2) (3) Control Treat Di. A: within-class design pre-test Male Grade N B: across-class design pre-test Male Grade N -.601 .57 4.43 690 -.665 .622 4.49 407 .0638 -.601 -.0521* .57 -.0581 4.43 690 -.671 .604 4.45 440 .0699 -.0341 -.0131 -.643 .61 4.49 405 -.672 .62 4.49 405 .0288 -.0099 .0074 -.684 .569 4.46 445 -.668 .606 4.46 443 -.0161 -.0378 .0059 Paying-For-Grades (4) (5) (6) Control Treat Di.

Note: Di refers to the dierence between control and treatment groups. To save space we only report those t-tests which produce a signicant dierence between these two groups, with * p < 0.1, ** p < 0.05, *** p < 0.01

34

Table 5: Regression and matching estimations of eects of peer tutoring and paying-for-grades using within-class control students (dependent var: posttest score) Peer Tutoring A: regression Treatment pre-test Male Grade Class Dummies R-squared N B: matching Treatment Exact matches N 0.141** (0.059) 95% 810 -0.017 (0.056) 94% 888 no 0.005 810 0.128** (0.054) 0.137** (0.056) 0.336*** (0.056) -0.081 (0.062) -0.081*** (0.016) yes 0.201 810 -0.014 (0.067) -0.021 (0.067) 0.395*** (0.050) -0.042 (0.062) -0.293*** (0.006) yes 0.217 887 Paying-For-Grades

no 0.000 888

Note: Our samples are the bottom students in the within-class design (bottom treated students (i.e. tutees and payees) and their within-class controls. Exact matching on pretest within-class ranking (i.e. 40th student matched with 39th student, 38th student matched with 37th student, etc). Standard errors adjusted for intragroup correlation at the class level. * p < 0.1, ** p < 0.05, *** p < 0.01

35

Table 6: Regression estimations of treatment eects for all students in program classes using across-class control students (dependent var: post-test score/percentile) (1) A: dep. var. = post-test score Treatment (tutor) R-squared N Treatment (pay) R-squared N B: dep. var. = post-test percentile Treatment (tutor) R-squared N Treatment (pay) R-squared N 0.042*** (0.014) 0.089 1097 -0.007 (0.018) 0.111 1129 -0.025* (0.014) 0.116 1149 -0.010 (0.017) 0.110 1158 -0.005 (0.014) 0.027 1013 -0.016 (0.014) 0.023 1030 0.002 (0.019) 0.023 767 0.042** (0.018) 0.049 792 0.126* (0.069) 0.095 1097 -0.006 (0.082) 0.108 1129 -0.014 (0.071) 0.138 1149 0.003 (0.072) 0.103 1158 0.094 (0.167) 0.033 1013 0.114 (0.168) 0.039 1030 0.081 (0.079) 0.028 767 0.119 (0.087) 0.031 792 (2) (3) (4)

Note: Column 1, 2, 3, and 4 estimates the impact of the program on the tutees/payees (treated students), tutees/payees within-class controls, quartile 2, and quartile 1 students in the program classes respectively. Control groups are bottom-half, bottom-half, quartile 2, and quartile 1 students in the control classes respectively. Control variables in OLS regressions include Male, Grade, pre-test, and class dummies. Standard errors adjusted for intragroup correlation at the class level. * p < 0.1, ** p < 0.05, *** p < 0.01

36

Table 7: Regression estimations of treatment eects by groups (dependent var: post-test score) Gender (1) (2) Male Female A: within-class design Treatment (tutor) R-squared N Treatment (pay) R-squared N B: across-class design Treatment (tutor) R-squared N Treatment (pay) R-squared N 0.092 (0.084) 0.089 646 -0.034 (0.097) 0.100 658 0.172* (0.101) 0.096 451 0.032 (0.092) 0.104 471 0.039 (0.104) 0.108 556 -0.005 (0.115) 0.087 578 0.206** (0.092) 0.095 541 -0.017 (0.114) 0.138 551 0.141 (0.106) 0.053 552 -0.024 (0.117) 0.087 575 0.092 (0.080) 0.054 545 0.016 (0.095) 0.027 554 0.086 (0.083) 0.239 498 -0.099 (0.088) 0.290 521 0.201 (0.126) 0.248 312 0.087 (0.098) 0.230 366 0.070 (0.081) 0.233 409 -0.071 (0.103) 0.136 454 0.213** (0.076) 0.181 401 0.034 (0.085) 0.278 433 0.230** (0.100) 0.153 405 -0.012 (0.078) 0.315 443 0.041 (0.063) 0.212 405 -0.007 (0.099) 0.183 444 Grade (3) (4) 3-4 5-6 Pre-Test (5) (6) 50% 50%

Note: Control variables in OLS regressions include Male, Grade, pre-test, and class dummies. Standard errors adjusted for intragroup correlation at the class level. * p < 0.1, ** p < 0.05, *** p < 0.01

37

Table 8: Randomization checks of the phone call assignment Tutee No Call pre-test Male Grade N -.608 .587 4.68 143 Call -.586 .624 4.68 133 Di -.0223 -.0366 .00163 No Call -.604 .572 4.71 152 Payee Call -.578 .517 4.68 143 Di -.0253 .0549 .0322

Note: Our samples are treated students (tutees and payees) with non-missing phone numbers. * p < 0.1, ** p < 0.05, *** p < 0.01

38

Table 9: Regression estimations of eects of basic intervention and additional parental communication (dependent var: post-test score) Peer Tutoring (1) (2) with class across class Treatment Basic Treatment Basic+Call pre-test Male Grade Class Dummy R-squared N 0.110* (0.065) 0.197** (0.091) 0.330*** (0.064) 0.043 (0.079) -0.229*** (0.028) yes 0.208 553 0.121 (0.083) 0.244** (0.104) 0.345*** (0.062) -0.054 (0.063) -0.075* (0.038) yes 0.098 715 Paying-For-Grades (3) (4) within class across class -0.107 (0.091) 0.009 (0.080) 0.457*** (0.069) -0.092 (0.082) -0.289*** (0.029) yes 0.249 585 -0.072 (0.099) 0.052 (0.097) 0.432*** (0.064) -0.140** (0.057) -0.077** (0.036) yes 0.131 728

Note: Our samples are restricted to the bottom half students with non-missing phone numbers in the within-class design (column 1 and 3) and across-class design (column 2 and 4). Standard errors adjusted for intragroup correlation at the class level. * p < 0.1, ** p < 0.05, *** p < 0.01

39

Figure 1: Program Design Chart 340 schools 23 schools

11 schools 40 44 tutor classes each class (e.g. N=40) 18 control classes bottom 20 students in each class as extra controls bottom 20 students 10 tutees (even #) 10 tutees controls (odd #) 17 control classes bottom 20 students in each class as extra controls

12 schools

47 pay classes each class (e.g. N=40)

bottom 20 students 10 payees (even #) 10 payees controls (odd #)

other 20 students

top 10 students 10 tutors

middle 10 students

Figure 2: Graphical evidence of eect of peer tutoring and paying-for-grades on standardized test scores using pre-test and post-test distributions

A: Peer Tutoring
.6 .6

B: PayingforGrades

Density .2 .4

2 0 PreTest Score

Density .2 .4

2 0 PreTest Score

C: Peer Tutoring
.8 Density .4 .6 Density .4 .6 .8

D: PayingforGrades

.2

1 0 PostTest Score

.2

1 0 PostTest Score

Note: The solid lines refer to the treatment groups. The dashed lines refer to the within-class comparison groups. (sample: bottom half students in the program classes)

41

You might also like