SGREL Scoring for Argumentative Writing, With Examples
Argumentative Writing Scoring Assistant: A Semantic Grammar Rubric Expression Language (SGREL) Approach
SGREL Review
Harry A. Layman - January 27, 2017
Background: Value of Argumentative Writing
Solution Criteria
Solution Design
Some Examples
Literary Analysis question from ASAP competition
SGREL Syntax: identifiers, operators, expressions, functions, words &
tokens. TREs and Scoring Formulae; raw score to final score translation.
Argumentative Writing: Reading Comprehension, Information Synthesis
for Conclusion / Proposition Supported by Evidence
Graduate Medical Education: Analyze diagnostic radiology report to
auto-score real time simulations at scale
Background: Argumentative Writing
Value of Argumentative Writing in Assessment
Performance tasks and other forms of assessment that provide evidence of knowledge
and skill are increasingly valued for both assessment and learning purposes
Push-back on MCQ-based standardized testing grows
Authentic assessment is a key to maximizing class time for teaching and learning the curriculum, not test prep
Background: Argumentative Writing
Argumentative Writing is integral to efforts to teach Critical Thinking and related initiatives (performance tasks, problem-based learning), often linked to 21st century skills. The focus is specifically on reasoning, evidence and claims that support and reflect reading comprehension, information synthesis, reasoning skills and an analytic, data-based perspective.
Initiatives that focus on critical thinking and argumentative writing can easily generate large volumes
of writing that require feedback and / or grading of some kind.
Initiatives with a specific focus on argumentative writing include ThinkCERCA (https://thinkcerca.com), which provides a library of content and a sophisticated user experience to guide students through the task of creating cogent argumentation based on a model organized around Claims, Evidence, Reasoning, Counterarguments and use of Audience-appropriate language.
Background: AES in Recent Years
Automated Essay Scoring Has Made Significant Improvements
Since 2010, more and better forms of AI have been applied to auto-scoring of writing, reliably sorting essays into the same categories (e.g., good, better, best) as human graders would
Generally this requires expert graded samples to build models
Generally this works well with writing tasks of well defined scope and structure; scores for more
open-ended or literary writing are more difficult to predict
Scoring writing where even modest rigor is required in making an argument about a specific text is not working as well as holistic scoring for writing quality
Attempts to link text features to aspects of writing quality that experts use to judge writing remain largely aspirational, and marketing language often masks computational linguistics techniques that do not match how writing is actually evaluated by experts
Predicting human scoring from text features offers little opportunity for specific
feedback
Slow progress is being made to identify specific discourse units and to classify their role
Approximate guesses about the number, kind and distribution of discourse units as a basis for feedback on writing organization, the development of ideas or employment of a variety of sentence structures, etc., cannot point to specific areas needing improvement or areas of high performance
Computational linguistic analyses of overlap in word meanings, part-of-speech pairs and vocabulary use do not provide a reliable basis for comments on appropriate use of language, insightful word use, semantic cohesion or effective argumentation
Examples are very helpful here; see for example www.cogwrite.com/docs/6rubrics.pdf
Background: Inter-Rater Reliability (IRR)
Inter-Rater Reliability (IRR) is more problematic when more dimensions, and more
score points, are used in rubrics
Human-human rater agreement rates continue to limit the potential of computer-based
scoring that models these human scoring outcomes
Acceptable levels of human IRR include agreement rates at or near correlations of 0.7 for both Pearson's r and quadratic weighted kappa, essentially the threshold at which signal exceeds noise in the relationship between scores assigned by different raters
Major test programs that had moved to 1 plus 1 (one human, one machine) scoring for essays with holistic, 1 to 6 scoring scales have retreated from automated scoring with the advent of multi-dimensional scoring models
Three (highly correlated) sub-scores with scales of 1 to 4 rather than a single holistic
score of 1 to 6 have made modeling scoring behavior from text more problematic
Models that predict the best overall score often are not the best models for
predicting sub-scores
Incremental measurement gains from such essays have been sufficiently modest that, at least in some high profile cases, the essay scoring aspects of general assessment programs have been made optional
Metrics used in standardized testing for scoring quality, and inter-rater agreement,
remain opaque. High level descriptive statistics do not present convincing evidence of
good performance across the full range of student abilities and student subgroups
Background: Inter-Rater Reliability (IRR)
Measurement Quality: What level of IRR is sufficient, particularly without
supporting evidence or detail?
If a zero score means un-scorable, does the IRR for the item on the left below become
less acceptable when only scored items are considered as on the right?
Challenges to Use of Argumentative Writing
For teachers: time consuming, resource intensive
Deliberate practice with significant long form writing is
constrained by availability of teaching / grading resources
Even classroom assessment largely eschews constructed response in favor of multiple choice questions, for reasons of economics, measurement efficiency and cost (class time and grading time)
Programs for online writing with feedback abound. Do they work?
For assessment at scale: human scoring is expensive; attempts to
score with more nuance and detail will further challenge inter-rater
reliability and likely increase grading time
For assessment at scale: automated scoring that relies on mimicking human scoring outcomes is susceptible to gaming and coaching, has credibility issues due to non-construct-relevant scoring approaches, and, so far, provides limited feedback and defensibility on specific results
Challenges in Automated Scoring of CR Items
Leading vendors and practitioners continue to work on providing
automated solutions to CR items, including various kinds of writing
Standardized Testing for K12 has tried and failed to harness
automated scoring at scale for important, sophisticated assessment
Most scoring solutions are focused on existing item types, rubrics
and delivery systems
Even as some progress has been made, tests are evolving and scoring is becoming more demanding (e.g., placement, granular accountability, cohort progress, etc., all in one test!)
Item production remains a non-scalable, guild-level enterprise.
Work on new approaches (item families, auto-generation)
remains mostly in the research community.
Predicting item difficulty, ability to differentiate, and (for CR)
scoring reliability require costly field trials and further constrain
item production
Hypothesis
Human scoring could be improved with more detailed, item-specific rubrics
Itemized and specific guidance on features and concepts to score,
their relative value and specific characteristics, would result in more
consistent judgements by scorers
Automated support that visually associates response elements to
specific rubric requirements could turbo-charge human scoring
activity while enhancing quality
Automated scoring for argumentation, critical thinking and problem
solving would gain acceptance more readily if
Scoring judgments were not overly reliant on word choice and
grammar
Automated scoring was transparent and self-documenting, providing
detailed support with each micro-decision that contributes to a
score
Supplemental information could be provided along with scores
identifying aspects of a response that contributed to a lower score,
and aspects of a response that were missing that could have
contributed to improvement
Solution Criteria: Add Rigor and Reliability
Automated scoring solutions for CR items that mimic human
scores, using complex text features, computational linguistics
and machine learning, are not progressing toward defensible, self-
documenting scoring
Solution Design: Detailed Rubrics and an
Intelligent Scoring Assistant
Item specific rubrics, expressed comprehensively and with rigor
Each item's rubric includes a collection of Target Response Elements (TREs), each of a TRE type that may be specific to particular question types, and a scoring formula that specifies the rules for combining the results of applying the TREs to a response, aggregating the results by TRE type and group, and combining those results into an overall score, which might then be scaled for use in a specific assessment.
The enabling semantic grammar for rubric expression is a work in progress.
Today I will provide some highlights / snapshots of how the rubrics might look for three different cases
A Scoring Engine will apply the TREs to the Item Responses using AI-related tools and
techniques (NLP, WordNet)
I have been prototyping aspects of this processing and am ready to begin detail
design / construction
A Scoring Assistant will combine the results of the Scoring Engine's rubric / response processing into an overall item score and a detailed map that shows specifically how the combination of the rubric and the response yielded the resulting score
Every scoring decision is linked to a specific rubric element and response element
Automated application of rubric scoring rules can provide a first-order result
Human graders can validate the application of the rubric to the response and adjust as necessary
Scores are self-documenting and transparent, and provide students with instructionally relevant feedback
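As a design sketch of that score map (class and field names here are mine, not a specified API): every scoring decision records which rubric element (TRE) fired and which response sentence triggered it, so a human grader can review and validate each micro-decision.

```python
# Sketch of a self-documenting score map: each decision links a rubric
# element (TRE) to the response sentence that matched it, with its points
# and a flag a human grader can set after validating the match.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ScoringDecision:
    tre_id: str           # rubric element, e.g. "#PC-TRE1"
    sentence_index: int   # which response sentence matched
    points: int
    validated: bool = False  # set True once a human grader confirms it

@dataclass
class ScoreMap:
    decisions: List[ScoringDecision] = field(default_factory=list)

    def record(self, tre_id, sentence_index, points):
        self.decisions.append(ScoringDecision(tre_id, sentence_index, points))

    def raw_score(self):
        return sum(d.points for d in self.decisions)
```

The point of the structure is that the raw score is never computed without retaining the trail of individual decisions behind it.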
Example 1: Literary Analysis Question
This item was taken from the 2012 ASAP Contest run on Kaggle.com
Example 1: Literary Analysis Question
As there are 1700+ responses to this item available on the Kaggle
web site, each with two human scores, I have selected this item
and data for re-use.
I have crafted a new rubric to judge the answer, expressed in my
purpose-built semantic grammar rubric expression language.1
I will re-score many / all of the item responses using trained scorers
and the new rubric.
I will implement a scoring engine that can apply the new rubric to
the student responses and generate scores, and score these items
with it.
I will also score the items with pure machine learning, and then compare the performance of the different approaches with each other and with the human scores.
1) The rubric is still being refined. It should be understood throughout this discussion that I am continuing to refine and iterate on
every aspect of the solution as I gain experience and insight into what works.
Example 1: Literary Analysis Question
Original Rubric Guidelines
Adjudication Rules
If Reader-1 Score and Reader-2 Score are exact or adjacent, adjudication by a third reader is not required.
Example 1: Literary Analysis Question
The Argumentative Writing Rubric using SGREL Syntax for this item, Winter Hibiscus, will have
the following components: TRE definition, formula score definition and score scaling
definition.
There will be a set of PC type TREs that reflect the possible correct (to some degree) answers to the central question of why this concluding paragraph was used.
I currently have four candidate expressions to score the main proposition(s) in responses that
will count as wholly or partially correct, and which assign 8, 6, 4 and 2 points, respectively,
with a maximum of 8 points (representing half the value of the item) in the event more than
one proposition is found.
Five specific reasoning elements will be allowed for credit on this item, each worth one point, with a maximum of 4 points allowed.
Example 1: Literary Analysis Question
EV TREs: The third TRE type scored for this item will be for the use of evidence.
10 TREs of type EV will be worth 1 point each, with a maximum of 4 points allowed for all EV type TREs.
In this way half the point score is for the proposition, and a quarter of the score is for each of
the reasoning and evidence factors. There is some overlap between citing evidence and
connecting evidence to reasoning, and as I work through this process, the definition of the
types of TREs and their specific constitution could change.
Note that TREs can be grouped with group-level scoring rules (caps and overrides), and these groups may (in other cases) include TREs of the same type or a mix of TRE types.
Some examples of these TREs will be presented in a moment, but the SGREL rubric has two more components besides the TREs: the scoring formula and the final scaling.
Example 1: Literary Analysis Question
Scoring Formula for Winter Hibiscus will be as follows:
Total raw score = PC score + RR score + EV score (recognizing the maximum point values for each component, and that each matched TRE has a point value defined as part of the TRE itself).
Example 1: Literary Analysis Question
Sample Target Response Element Definitions for the proposition / conclusion:
The PC TREs for this item will reflect these four targeted response elements:
The final paragraph signals the significance, and the teenager's recognition, of the over-arching metaphor the story communicates: that of the adaptation of the winter hibiscus to its environment, and the struggle required by the immigrant teenage girl to adapt to her new environment. (8)
The final paragraph signals Saeng's determination to adapt and succeed (6)
Adaptation is a matter of both struggle and accommodation, and the adapter is changed in the process, becoming stronger and yet different. (4)
Life is about change and how to respond to change. A life is made of a series of choices a person must make as they grow: what to hold on to, what to treasure, what to value and how to adapt. (2)
#PC-TRE1 ::= [hibiscus | flower | *plant ] + [girl | *author | Saeng | *speaker |
daughter ] + [analogy | parallel | similar | *metaphor | between ] = 8
The notation can define words and *concepts (a word plus its synsets); additional operators / modifiers are under consideration / development / trial. TRE matching currently maps expressions to sentences (augmented via reference resolution); word order within TREs and response sentence elements is presently ignored.
(The notation will be explained at this point and is detailed a bit in the following 3 slides.)
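As an illustration of how such a TRE might be applied, the sketch below does order-insensitive matching of #PC-TRE1 against a single sentence. The SYNSETS table is a hypothetical stand-in for WordNet lookups, and reference resolution is omitted:

```python
# Sketch of order-insensitive TRE matching against one response sentence.
# A TRE is a list of alternative groups plus a point value; every group
# must be satisfied by at least one sentence token for the TRE to match.

# Hypothetical stand-in for WordNet synset expansion of *concept terms;
# the real engine would consult WordNet via NLP tooling.
SYNSETS = {
    "plant": {"plant", "shrub", "bush"},
    "author": {"author", "writer", "narrator"},
    "speaker": {"speaker", "narrator"},
    "metaphor": {"metaphor", "symbol", "symbolism"},
}

def expand(term):
    """A *concept term expands to its synset; a plain word matches literally."""
    if term.startswith("*"):
        return SYNSETS.get(term[1:], {term[1:]})
    return {term.lower()}

def tre_matches(groups, sentence):
    """True if every alternative group has at least one hit among the tokens."""
    tokens = set(sentence.lower().replace(",", " ").replace(".", " ").split())
    for group in groups:
        hits = set()
        for alt in group:
            hits |= expand(alt)
        if not hits & tokens:
            return False
    return True

# PC-TRE1 from the slide: three alternative groups, 8 points on a match.
PC_TRE1 = ([["hibiscus", "flower", "*plant"],
            ["girl", "*author", "Saeng", "*speaker", "daughter"],
            ["analogy", "parallel", "similar", "*metaphor", "between"]], 8)

def score_tre(tre, sentence):
    groups, points = tre
    return points if tre_matches(groups, sentence) else 0
```

A sentence like "There is a parallel between the hibiscus and the girl" satisfies all three groups and earns the 8 points; a sentence that names Saeng and the hibiscus but draws no comparison does not.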
SGREL Syntax: Identifiers, Operators, Words & Tokens
Sample TRE Definitions (continued)
SGREL Syntax: Identifiers, Operators, Words & Tokens
max(expr1, expr2): a function that returns the greater of the two expressions when evaluated
Example 1: Literary Analysis Question
Sample Scoring Report
Sample essay 4_8865:
The author concludes the story with that passage to show the importance of the inspiration Saeng gets from
the hibiscus plant. Saeng, throughout the story was comforted by the hibiscus plant because it reminded her
of home. The plant during the winter metaphorically explains: @CAPS1 attitude towards her new country
and her driving test; the hibiscus plant in the winter is not as beautiful in the bitter cold, but it adapts and
survives, and returns to its beautiful state in the spring. Saeng is bitter about her new country and driving test,
but is adapting, and will be inspired by the beautiful state of the hibiscus in the spring to try her test again. In
conclusion, the author ended the story in that way to stress importance in the relationship between Saeng
and the hibiscus plant.
Additional scoring detail: Additional points could have been awarded for additional
evidence as follows: a, b, c etc.
Example 2: Critical Thinking Challenge
This is a new item for which I am collecting new response data.
Example 2: Critical Thinking Challenge
Seebeck Item, Question 1:
Based on the Economist article, define the Seebeck effect and describe the role that
graphene could play in creating a workable Seebeck-effect-powered electrical
generation capability.
Scoring:
TRE1-EV-SBDFN (max 6): 6 points - property drives electrons hot to cold, creating current.
TRE3-EV-ROG (max 6):
A: 3 points - stretches from room temperature
E: 3 points - converts up to 5% heat to electricity
Part 1: Score 6 of 6.
Example 3: Diagnostic Radiology Exam
Background
Significant study and investment across a range of institutions (the National Academy of Sciences, the Institute of Medicine, the National Research Council) and a broad range of agencies and programs responsible directly and indirectly for funding and guiding Graduate Medical Education have aligned their efforts to invigorate and upgrade Graduate Medical Education via a number of new initiatives, with a focus on leveraging advanced technology for better, more effective instruction and educational outcomes in the face of accelerating leaps in medical knowledge, diagnostic tools and interventional capabilities.
Educational Context
The American College of Radiology, a leading education not-for-profit in the medical imaging and medical
informatics areas, has developed a learning and assessment platform with advanced capabilities to support
state-of-the-art diagnostic radiology education and assessment. This platform fully integrates sophisticated, clinical-quality 3D image viewing (DICOM image sets) and a rich medical case data model. One application of this technology has been the development of a "ready for call" real-time simulation assessment that puts a radiology resident through a simulated, 8-hour rotation during which they will be served a broad range of challenging cases (with imaging and case data) that test a comprehensive range of skills with different imaging modalities, medical specialties, anatomical systems and medical pathologies (including cases where no abnormalities exist).
One goal of my focus on rigorous rubrics was to enable scoring metrics typical of diagnostic medical settings, where scoring is often based on a collection of indicators (observations, interpretations, descriptions and recommended actions) that can vary widely in importance, and may have override factors (where a potentially fatal miss is of utmost concern) or where different combinations of findings reflect varying gradations of skill, understanding and mastery.
Example 3: Diagnostic Radiology Exam
Standard Diagnostic Assessment Item Format
A medical case is presented to the student with a robust set of data and including appropriate medical
imaging that was ordered based on symptoms identified in the case.
A typical assessment item would present the complete set of case data and notes as well as one or more sets of medical images (which could include MRI, CT, PET-CT, X-ray, Doppler Ultrasound, etc.) in a platform that supports a full 3D viewer with controls and capabilities familiar to the student.
The assessment item task is almost universally something like "Review the case data and imaging and write up your findings."
Example 3: Diagnostic Radiology Exam
This is a typical item from a hypothetical diagnostic imaging exam.
This item requires the student to analyze and write up findings from examining a set of [MRI, PET-CT, CT with Contrast, etc.] image sets appropriate to the case data provided, or even, perhaps, in the absence of relevant case data.
In English: add the values of 1a/1b/1c if found and use this sum; if it is zero, use 1d if found. Subtract 1 from the score if any factor in 1E is identified, but use a score of zero if the result is negative.
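That rule can be expressed directly as executable logic. The sketch below uses hypothetical point values (the POINTS table is mine, not from an actual rubric) to illustrate the fallback and override behavior:

```python
# Sketch of the override logic above: sum 1a/1b/1c if found; if that sum
# is zero, fall back to 1d; subtract 1 if any 1E factor was identified,
# flooring the result at zero.

POINTS = {"1a": 3, "1b": 2, "1c": 1, "1d": 1}  # hypothetical values

def item_score(found):
    """found: set of indicator labels detected in the write-up."""
    score = sum(POINTS[k] for k in ("1a", "1b", "1c") if k in found)
    if score == 0 and "1d" in found:
        score = POINTS["1d"]
    if any(k.startswith("1E") for k in found):
        score -= 1
    return max(score, 0)
```

So a write-up containing only the 1d fallback plus a 1E penalty factor nets zero, while 1a and 1b together score their combined value.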
Example 3: Diagnostic Radiology Exam
This is a typical item from a hypothetical diagnostic imaging exam.
This task is to analyze and write up findings from examining a set of [MRI, PET-CT,
CT with Contrast, etc.] image sets appropriate to the case data provided.
This diagnosis expects some important observations, interpretations and one
key finding that, if missing, is a fatal miss.
A two-part rubric (observation and interpretation) might look like this (with placeholder data):
#TRE1A-EV-AAA ::= [observation + using + proper + terms | or + alternates] = 8
#TRE1B-EV-AAA ::= [[related + other + observation] | alternate + terminology] = 2
#TRE1 = [#TRE1A-EV-AAA + #TRE1B-EV-AAA]
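Group-level aggregation of such TREs, with the caps and overrides mentioned earlier, might be sketched as follows (the cap value in the usage note is a placeholder):

```python
# Sketch of a TRE group: member TRE scores are summed and an optional
# group-level cap is applied, mirroring the grouping
# #TRE1 = [#TRE1A-EV-AAA + #TRE1B-EV-AAA] above.

def group_score(member_scores, cap=None):
    """Sum member TRE scores, applying a group-level cap if one is set."""
    total = sum(member_scores)
    return total if cap is None else min(total, cap)
```

For the two-part rubric above, group_score([8, 2]) yields 10; a hypothetical group cap of 8 would limit it to 8.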
Status & Next Steps
Refine SGREL syntax as I encode rubrics for three sample
problems
Questions and Discussion
Harry Layman
(949) 945-3373
Harry.Layman@CogWrite.com
www.CogWrite.com