Introduction to Analyzing Your Data

Analyzing assessment data allows you to interpret your student outcomes: Do you understand your population and the effects of your program? Meaningful data interpretation can be guided by questions such as the following:
The rest of this discussion will help you understand how to approach data analysis and interpretation so that you can answer these kinds of questions. To do so, we need to lay some groundwork in basic statistical concepts and in the kinds of assessment scores a program will typically use.

What About Statistics?


What statistics do you need to know? Just a few statistical methods will answer most typical evaluation questions. These methods fall into the category of descriptive statistics and include:
Most evaluations will not require sophisticated statistical methods such as correlation, chi-square, analysis of variance or regression. Thus, we will not describe these approaches in this basic Toolkit. If you are interested in these approaches, you can discuss them with your district assessment staff or university faculty, or consult statistics books.
Means (same thing as averages): the mean is the arithmetic average. You can easily compute it by adding all the values and dividing by the number of cases.

Compare means does just what its name says: it takes two or more sets of test scores on the same test and compares the means (averages). The sets of scores can be:
Note: In order to use this statistical procedure, students must be coded for the comparisons you want to make. For example, if you want to compare ELLs with RFEPs, you have to set up your database with a variable, such as language background, that distinguishes students accordingly (e.g., ELL vs. RFEP vs. EO). This is why you need to think ahead about coding your data, so that you can analyze it in the ways that are useful to you. (See examples in Section 4 of the Toolkit.)

Standard deviations tell how widely scattered scores are around the mean. The bigger the standard deviation, the greater the difference among the scores. For example, if you tested a group of students on a reading test and then averaged (or computed the mean of) seven students' scores (35, 40, 45, 50, 55, 60 and 65), the mean would be 50. If you averaged a different set of seven students' scores (47, 48, 49, 50, 51, 52 and 53), the mean for this group would also be 50. But there is obviously much more difference (or deviation) among the scores in the first set than in the second: the first set contains much lower and much higher scores, while the scores in the second set are grouped closer together. That difference is captured in the standard deviation.

Range refers to the highest and lowest scores in a set of scores. In the example above, the first data set had a range of 35 to 65. The second set of scores had a much smaller range, 47 to 53. Remember that both sets of scores had the same average, 50. The range shows how much territory the scores covered.

Frequency counts are merely the number of scores or people in a category. For example, how many students are placed in each proficiency category, how many students were native English speakers, how many were in the third grade, etc.

Crosstabs tell how many students or scores are found in the intersection of two categories.
For example, how many third graders were native Spanish speakers, how many students who scored at the intermediate proficiency level in 2005 also scored as intermediate in 2006, etc. [Note: The database would have separate columns for each variable you were interested in crossing or intersecting. In this example, that would be language groups, grade levels, 2005 proficiency levels, and 2006 proficiency levels. See Section 4, Example 1 for information on how to do this coding.]

Most of these statistical terms appear in the Glossary, so if you come across them later, you can look them up there.

HINTS. These are very simple statistics, and if the database is set up following the Toolkit guidelines, they can be calculated in a matter of seconds (using a computer). The trick is to know when each of those statistics is appropriate. To know which statistics are appropriate, it's important to know something about test scores, so you may want to review Section 5 from time to time.

Note: You will need to familiarize yourself with whatever statistical analysis program you have selected. The terminology should be fairly consistent across different software programs, but the actual steps to conduct the analyses may vary a little. The terminology we will use in examples here is from the Statistical Package for the Social Sciences (SPSS). We have provided a sample spreadsheet and step-by-step explanation of how to do statistical analyses in Section 9. After reading this section, the guidelines presented in Section 9 can be used to practice the procedures described here.
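To make the mean, standard deviation, and range concrete, here is a minimal Python sketch (Python is not one of the programs discussed in this Toolkit; it is used only for illustration). It uses the two sets of seven reading scores from the standard deviation example above. The group codes "A" and "B" are hypothetical labels that stand in for a coded grouping variable such as language background. The sketch uses the sample standard deviation (dividing by N - 1), which is the version many statistical packages report; that choice is an assumption, not something the Toolkit specifies.

```python
import statistics

# Each record pairs a group code with a reading score. The scores are the two
# example sets from the text; the codes "A" and "B" are hypothetical labels.
records = [
    ("A", 35), ("A", 40), ("A", 45), ("A", 50), ("A", 55), ("A", 60), ("A", 65),
    ("B", 47), ("B", 48), ("B", 49), ("B", 50), ("B", 51), ("B", 52), ("B", 53),
]


def summarize(records):
    """Return N, mean, sample standard deviation, and range for each group."""
    scores_by_group = {}
    for group, score in records:
        scores_by_group.setdefault(group, []).append(score)
    return {
        group: {
            "n": len(scores),
            "mean": statistics.mean(scores),
            "sd": statistics.stdev(scores),  # sample SD (divides by N - 1)
            "low": min(scores),
            "high": max(scores),
        }
        for group, scores in scores_by_group.items()
    }


summary = summarize(records)
for group, stats in summary.items():
    print(
        f"Group {group}: N={stats['n']}, mean={stats['mean']}, "
        f"SD={stats['sd']:.2f}, range={stats['low']} to {stats['high']}"
    )
```

Both groups have the same mean of 50, but the first group's standard deviation comes out roughly five times larger than the second's, which is the difference in spread the text describes.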
TIPS: How Different Kinds of Scores Can and Can't Be Analyzed

Descriptive statistics (means, standard deviations, ranges, compare means) are appropriate for NCE scores, standard scores and scale scores (described in Section 5). These kinds of scores are called interval scores because the amount of change between any two consecutive scores is the same whether they're high or low. They are like a thermometer in that regard: a one-degree change from 41 to 42 is the same amount of change as from 65 to 66.

Some descriptive statistics (means, standard deviations, compare means) are not appropriate for percentiles, grade equivalents, or any kind of categorical scores, including stanines and performance levels (described in Section 5). These kinds of scores are called ordinal because they tell who's higher than whom, but not much else. The amount of change for a student moving from Level 2 to Level 3 isn't necessarily the same as the amount of change for a student moving from Level 3 to Level 4.

Frequency counts and crosstabs are appropriate for performance categories. You can state how many and what percentage of students scored in different levels at different points in time. Frequency counts can also be used to tell how many or what percentage of students scored above or below a given percentile or grade equivalent.

Do not attempt to average percentiles and grade equivalents. If you want to obtain an average percentile, first average the NCE scores, then consult the NCE/Percentile conversion table (described in Section 5 and available in the Appendix) for the corresponding percentile rank.

Do not attempt to average performance categories or levels. A student is in level 1 or level 2 or level 3, etc. It doesn't make any sense to say that, on average, students scored at level 2.3. Just count the number and percentage of students who scored at each level. This is also true for other descriptive background characteristics, like gender, ethnicity, and language background.
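The advice to average NCE scores rather than percentiles can be illustrated with a short Python sketch. Instead of the conversion table in the Appendix, it uses the standard definition of the NCE scale (a normal distribution with mean 50 and standard deviation 21.06). The formula is swapped in here only so the example is self-contained, and its results may differ slightly from a published table because of rounding. The function names are hypothetical, and the table remains the reference for reporting.

```python
from statistics import NormalDist, mean

NCE_SD = 21.06  # standard deviation of the NCE scale (standard definition)
_STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def percentile_to_nce(percentile):
    """Convert a percentile rank (between 0 and 100, exclusive) to an NCE."""
    z = _STANDARD_NORMAL.inv_cdf(percentile / 100)
    return 50 + NCE_SD * z


def nce_to_percentile(nce):
    """Convert an NCE score back to a percentile rank."""
    z = (nce - 50) / NCE_SD
    return 100 * _STANDARD_NORMAL.cdf(z)


def average_percentile(percentiles):
    """Average on the NCE scale, then convert the average to a percentile."""
    nce_scores = [percentile_to_nce(p) for p in percentiles]
    return nce_to_percentile(mean(nce_scores))


# Hypothetical example: one student at the 1st percentile, one at the 50th.
example = [1, 50]
print("Naive average of percentiles:", mean(example))
print("Average via NCE scores:", round(average_percentile(example), 1))
```

In this made-up example, averaging the percentiles directly gives 25.5, while averaging on the NCE scale gives a percentile of roughly 12. The gap shows why the two approaches are not interchangeable.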
Matching Statistical Analyses to Evaluation Questions

Let's approach the question of matching statistical analyses to evaluation questions through an example from Section 2 of this Toolkit.

Question 1: What kind of progress have students in the different groups made in their oral and written proficiency in each language?

This question requires the following kinds of data to be kept in the database.
Let's use oral proficiency in English and Spanish as an example. (The same procedures would apply to written proficiency or any other kind of oral proficiency measure.) Let's assume your project uses the SOLOM (described in Section 3), which yields scores on a scale of 5 to 25.

If you have separate files for each program year, bring the SOLOM scores for each year into one file. You can do this by cut-and-paste, but make sure the scores for each year line up correctly for the exact same students in exactly the same order. This is where recording the student ID numbers becomes very helpful. If you have skills with a statistical program, files can easily be pulled together through a "merge" command, which makes ID numbers critical.

Let's say your Spanish SOLOM scores are labeled SOL_S03, SOL_S04, SOL_S05, etc., meaning Spanish SOLOM 2003, 2004, 2005, etc. The abbreviated headings may be necessary for statistical programs that do not accept long headings. Now if you have a single column for the language groups (L1 in chart below), coded S for native Spanish speaker and E for native English speaker (better yet, 1 for one of them and 2 for the other), you can calculate the annual averages very quickly.

If you are interested in the long-range outcomes for students who have been served in the program the longest, for example, the current fifth graders, you should select Grade 5 for the analysis and exclude the other grade levels. You can do this in a couple of ways. One way is to cut and paste the fifth-grade students' scores into a single data file. Another is to use the "select cases" feature in your statistical program.

Select "compare means" for your analysis. Then select each year's Spanish SOLOM scores as the dependent variable. Then select the coded language groups as the independent variable. Click on OK, and you will have your averages and standard deviations for each year and for both language groups. (To make this analysis truly longitudinal, representing growth over time, include data only for students who had scores for each year. Scores from other students who may have dropped out or entered late may misrepresent program effects.) (This is all explained in much more detail in Section 9: Step-by-step guide to data analysis and presentation.)

The statistical output from this compare means analysis might look like this:

Table 1. Average SOLOM Scores for Students Across Academic Years 2001 to 2006

Selecting the compare means analysis gave separate information for the fifth-grade English speakers (E) and Spanish speakers (S) regarding their Spanish proficiency development across years 2001 to 2006. You will see that for each year, the output shows the mean (average) for each group, the number of students whose scores were used (N) and the standard deviation of the scores. The relatively small values of the standard deviations show there was not a lot of difference in scores within each group.

The output also provides the "Total," the annual average for the combined groups of English and Spanish speakers. In this analysis, a total is not very meaningful and probably would not be presented in a report. However, note that the standard deviation is larger for the total group than for the separate groups. That's expected because the Spanish speakers had generally higher scores on the Spanish SOLOM, and the English speakers typically had lower scores, so when you combine the two groups, there is a greater spread of scores.

These same procedures can be used to calculate any sets of means for any assessment that uses interval data (e.g., NCE scores and scale scores). For example, they can be used to calculate the average NCE scores for a standardized reading test, or to compare English and Spanish speakers (or any other groups, such as SES or language proficiency groups) in the third (or any other) grade level.

(Statistical output may mean a lot to the person who runs the analysis, but less to other readers. Section 7 will show how the output of these analyses can be clearly presented in tables and graphs, making them easier for most audiences to understand. If you want to see this data presented in a chart, see Figure 1 in Section 7.)

What if you are using scores that are performance categories or levels? Those cannot be averaged.
The best way to analyze those scores is by frequency counts, that is, how many students in each language group placed at each level each year? Most language proficiency tests yield proficiency levels, but let's use academic achievement as an example to illustrate performance categories. Under NCLB accountability, schools are required to report percentages of students who meet the standard of academic proficiency in such subjects as reading, language arts and mathematics. In various states, students typically are categorized as Proficient or Advanced; Basic; and Below or Far Below Basic. Let's examine the evaluation questions:

Question 1a: Do English learners in our program show improvement in their performance in English language arts?

Question 1b: How do English learners in our program compare with English learners in other programs in our district in English language arts?

Analyzing the data to answer these questions makes the exact same assumptions about your database that the previous example did. The only difference is that this time, you have separate columns of scores in English language arts for two consecutive years, and those scores are in the form of performance categories or levels. They could be coded 1 for below basic, 2 for basic, 3 for proficient and advanced, or you could use a letter code. (There's an advantage to the numeric code that we'll mention later. For more information on coding and data entry, see Section 4.)

The easiest way to derive the answer is through the crosstabs analysis, which is considered a descriptive statistic. Let's say you want to compare your third-grade ELL students in 2004 with the rest of the district's ELL students that year. Select the third-grade students for that year. Locate the Descriptives function in your statistics program. Then select Crosstabs. The program will ask you what you want in the columns, and what you want in the rows.
It doesn't matter which you choose for rows and which for columns: select the language code (for ELL, EO, etc.) for one, let's say columns, then the Language Arts performance categories for the other, let's say rows. Click OK, and voila! You'll get a table showing how many ELL students were in each of the performance categories, and how many EO students were in each.

Wait a minute. Our evaluation question didn't ask about comparing ELL and EO. However, given the setup of the database, this was the easiest way to get the information you wanted. You can just cut and paste, or re-enter, the ELL information into a table in your word processing program and ignore the EO information. (In fact, though, you'll want to save this output and use the EO information in a separate analysis.)

Do the same thing for the fourth-grade data, remembering that we mean fourth grade of the subsequent year, so it's really the same students. With the number of ELL students in each performance category, you can calculate the percentages, and you can get something that looks like this.

Table 2. Percentage of ELL Third and Fourth Graders in Each Performance Category
This analysis answers the first part of the evaluation question. ELL students in the program do show improvement in English language arts as measured by the state test because, from third to fourth grade, the percentages in the proficient/advanced category increased dramatically, and the percentages in the below basic category decreased dramatically. (To see this table in chart form, see Figure 2 in Section 7.)
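The crosstabs-and-percentages steps described above can be sketched in Python as follows. This is only an illustration of the logic, not the SPSS procedure itself. The student records are made up, and the function name and the numeric level codes (1 = below basic, 2 = basic, 3 = proficient/advanced) are assumptions that follow the coding suggested earlier.

```python
from collections import Counter

# Hypothetical records for one grade and year: (language code, performance level).
# Levels: 1 = below basic, 2 = basic, 3 = proficient/advanced.
records = [
    ("ELL", 1), ("ELL", 1), ("ELL", 2), ("ELL", 2), ("ELL", 3),
    ("EO", 1), ("EO", 2), ("EO", 3), ("EO", 3), ("EO", 3),
]

# The crosstab: how many students fall at each intersection of
# language code and performance level.
crosstab = Counter(records)


def level_percentages(crosstab, group, levels=(1, 2, 3)):
    """Return the percentage of one group's students at each level."""
    total = sum(crosstab[(group, level)] for level in levels)
    if total == 0:
        return {level: 0.0 for level in levels}
    return {level: 100 * crosstab[(group, level)] / total for level in levels}


ell_percentages = level_percentages(crosstab, "ELL")
for level, percent in ell_percentages.items():
    count = crosstab[("ELL", level)]
    print(f"ELL, level {level}: {count} students ({percent:.0f}%)")
```

Running the same function on the following year's records for the same students would give the second column of percentages needed for a third-to-fourth-grade comparison.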
For the second part of the question (how do English learners in our program compare with English learners in other programs in our district in English language arts?), the exact same information in the above table can be used. It just needs to be placed side by side with the district data. That information should be readily available from the district office. (In California, you can consult the California Department of Education website to gather schoolwide, districtwide, and statewide data for various subgroups of students.) If you focus on the more recent data, that is, the fourth-grade data, you might produce a table that looks something like this:

Table 3. Comparison of Percentage of ELL Fourth Graders in TWI vs. District
