SGREL Scoring for Argumentative Writing, With Examples
Argumentative Writing Scoring Assistant: A Semantic Grammar Rubric Expression Language (SGREL) Approach
SGREL Review
Harry A. Layman - January 27, 2017
Background: Value of Argumentative Writing
Solution Criteria
Solution Design
Some Examples
Literary Analysis question from ASAP competition
SGREL Syntax: identifiers, operators, expressions, functions, words &
tokens. TREs and Scoring Formulae; raw score to final score translation.
Argumentative Writing: Reading Comprehension, Information Synthesis
for Conclusion / Proposition Supported by Evidence
Graduate Medical Education: Analyze diagnostic radiology report to
auto-score real time simulations at scale
Background: Argumentative Writing
Value of Argumentative Writing in Assessment
Performance tasks and other forms of assessment that provide evidence of knowledge
and skill are increasingly valued for both assessment and learning purposes
Push-back on MCQ-based standardized testing grows
Authentic assessment is a key to maximizing class time for teaching and learning the curriculum, not test prep
Background: Argumentative Writing
Argumentative Writing is integral to efforts to teach Critical Thinking and related initiatives (performance tasks, problem-based learning), often linked to 21st century skills. The focus is specifically on reasoning, evidence and claims that support and reflect reading comprehension, information synthesis, reasoning skills and an analytic, data-based perspective.
Initiatives that focus on critical thinking and argumentative writing can easily generate large volumes
of writing that require feedback and / or grading of some kind.
Initiatives with a specific focus on argumentative writing include ThinkCERCA (https://thinkcerca.com), which provides a library of content and a sophisticated user experience to guide students through the task of creating cogent argumentation based on a model organized around Claims, Evidence, Reasoning, Counterarguments and use of Audience-appropriate language.
Background: AES in Recent Years
Automated Essay Scoring Has Made Significant Improvements
Since 2010, more and better forms of AI have been applied to auto-scoring of writing, reliably sorting essays into the same categories (e.g., good, better, best) as human graders would
Generally this requires expert graded samples to build models
Generally this works well with writing tasks of well defined scope and structure; scores for more
open-ended or literary writing are more difficult to predict
Scoring writing where even modest rigor is required in making an argument about a specific text is not working as well as holistic scoring for writing quality
Attempts to link text features to aspects of writing quality that experts use to judge writing remain largely aspirational, and marketing language often masks computational linguistics techniques that do not match how writing is actually evaluated by experts
Predicting human scoring from text features offers little opportunity for specific
feedback
Slow progress is being made to identify specific discourse units and to classify their role
Approximate guesses about the number, kind and distribution of discourse units as a basis for feedback on writing organization, the development of ideas or employment of a variety of sentence structures, etc., cannot point to specific areas needing improvement or areas of high performance
Computational linguistic analyses of overlap in word meanings, part-of-speech pairs and vocabulary use do not provide a reliable basis for comments on appropriate use of language, insightful word use, semantic cohesion or effective argumentation
Examples are very helpful here; see for example www.cogwrite.com/docs/6rubrics.pdf
Background: Inter-Rater Reliability (IRR)
Inter-Rater Reliability (IRR) is more problematic when more dimensions, and more
score points, are used in rubrics
Human-human rater agreement rates continue to limit the potential of computer-based
scoring that models these human scoring outcomes
Acceptable levels of human IRR include agreement rates at or near correlations of 0.7 for both Pearson's r and quadratic weighted kappa, essentially the threshold at which signal exceeds noise in the relationship between scores assigned by different raters
Major test programs that had moved to 1 plus 1 (one human, one machine) scoring for essays with holistic, 1 to 6 scoring scales have retreated from automated scoring with the advent of multi-dimensional scoring models
Three (highly correlated) sub-scores with scales of 1 to 4 rather than a single holistic
score of 1 to 6 have made modeling scoring behavior from text more problematic
Models that predict the best overall score often are not the best models for
predicting sub-scores
Incremental measurement gains from such essays have been sufficiently modest that, at least in some high profile cases, the essay scoring aspects of general assessment programs have been made optional
Metrics used in standardized testing for scoring quality, and inter-rater agreement,
remain opaque. High level descriptive statistics do not present convincing evidence of
good performance across the full range of student abilities and student subgroups
Background: Inter-Rater Reliability (IRR)
Measurement Quality: What level of IRR is sufficient, particularly without
supporting evidence or detail?
If a zero score means un-scorable, does the IRR for the item on the left below become
less acceptable when only scored items are considered as on the right?
Challenges to Use of Argumentative Writing
For teachers: time consuming, resource intensive
Deliberate practice with significant long form writing is
constrained by availability of teaching / grading resources
Even classroom assessment largely eschews constructed response in favor of multiple choice questions, for reasons of economics, measurement efficiency and cost (class time and grading time)
Programs for online writing with feedback abound. Do they work?
For assessment at scale: human scoring is expensive; attempts to
score with more nuance and detail will further challenge inter-rater
reliability and likely increase grading time
For assessment at scale: automated scoring that relies on mimicking human scoring outcomes is susceptible to gaming and coaching, has credibility issues due to non-construct-relevant scoring approaches, and, so far, provides limited feedback and defensibility on specific results
Challenges in Automated Scoring of CR Items
Leading vendors and practitioners continue to work on providing
automated solutions to CR items, including various kinds of writing
Standardized Testing for K12 has tried and failed to harness
automated scoring at scale for important, sophisticated assessment
Most scoring solutions are focused on existing item types, rubrics
and delivery systems
Even as some progress has been made, tests are evolving and scoring is becoming more demanding (e.g., placement, granular accountability, cohort progress, etc., all in one test!)
Item production remains a non-scalable, guild-level enterprise.
Work on new approaches (item families, auto-generation)
remains mostly in the research community.
Predicting item difficulty, ability to differentiate, and (for CR)
scoring reliability require costly field trials and further constrain
item production
Hypothesis
Human scoring could be improved with more detailed, item-specific rubrics
Itemized and specific guidance on features and concepts to score,
their relative value and specific characteristics, would result in more
consistent judgements by scorers
Automated support that visually associates response elements to
specific rubric requirements could turbo-charge human scoring
activity while enhancing quality
Automated scoring for argumentation, critical thinking and problem
solving would gain acceptance more readily if
Scoring judgments were not overly reliant on word choice and
grammar
Automated scoring was transparent and self-documenting, providing
detailed support with each micro-decision that contributes to a
score
Supplemental information could be provided along with scores
identifying aspects of a response that contributed to a lower score,
and aspects of a response that were missing that could have
contributed to improvement
Solution Criteria: Add Rigor and Reliability
Automated scoring solutions for CR items that mimic human
scores, using complex text features, computational linguistics
and machine learning, are not progressing toward defensible, self-
documenting scoring
Solution Design: Detailed Rubrics and an
Intelligent Scoring Assistant
Item specific rubrics, expressed comprehensively and with rigor
Each item's rubric includes a collection of Target Response Elements (TREs), each of a TRE type that may be specific to particular question types, and a scoring formula that specifies the rules for combining the results of applying the TREs to a response, aggregating the results by TRE type and group, and combining those results into an overall score, which might then be scaled for use in a specific assessment.
The enabling semantic grammar for rubric expression is a work in progress.
Today I will provide some highlights / snapshots of how the rubrics might look for three different cases
A Scoring Engine will apply the TREs to the Item Responses using AI-related tools and
techniques (NLP, WordNet)
I have been prototyping aspects of this processing and am ready to begin detail
design / construction
A Scoring Assistant will combine the results of the Scoring Engine's rubric / response processing into an overall item score and a detailed map that shows specifically how the combination of the rubric and the response yielded the resulting score
Every scoring decision is linked to a specific rubric element and response element
Automated application of rubric scoring rules can provide a first-order result
Human graders can validate the application of the rubric to the response and adjust as necessary
Scores are self-documenting and transparent, and provide students with instructionally relevant feedback
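As a design sketch of that score map (class and field names here are mine, not a specified API): every scoring decision records which rubric element (TRE) fired and which response sentence triggered it, so a human grader can review and validate each micro-decision.

```python
# Sketch of a self-documenting score map: each decision links a rubric
# element (TRE) to the response sentence that matched it, with its points
# and a flag a human grader can set after validating the match.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ScoringDecision:
    tre_id: str           # rubric element, e.g. "#PC-TRE1"
    sentence_index: int   # which response sentence matched
    points: int
    validated: bool = False  # set True once a human grader confirms it

@dataclass
class ScoreMap:
    decisions: List[ScoringDecision] = field(default_factory=list)

    def record(self, tre_id, sentence_index, points):
        self.decisions.append(ScoringDecision(tre_id, sentence_index, points))

    def raw_score(self):
        return sum(d.points for d in self.decisions)
```

The point of the structure is that the raw score is never computed without retaining the trail of individual decisions behind it.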
Example 1: Literary Analysis Question
This item was taken from the 2012 ASAP Contest run on Kaggle.com
Example 1: Literary Analysis Question
As there are 1700+ responses to this item available on the Kaggle
web site, each with two human scores, I have selected this item
and data for re-use.
I have crafted a new rubric to judge the answer, expressed in my
purpose-built semantic grammar rubric expression language.1
I will re-score many / all of the item responses using trained scorers
and the new rubric.
I will implement a scoring engine that can apply the new rubric to
the student responses and generate scores, and score these items
with it.
I will also score the items with pure machine learning, and then compare the performance of the different approaches with each other and with the human scores.
1) The rubric is still being refined. It should be understood throughout this discussion that I am continuing to refine and iterate on
every aspect of the solution as I gain experience and insight into what works.
Example 1: Literary Analysis Question
Original Rubric Guidelines
Adjudication Rules
If Reader-1 Score and Reader-2 Score are exact or adjacent, adjudication by a third reader is not required.
Example 1: Literary Analysis Question
The Argumentative Writing Rubric using SGREL Syntax for this item, Winter Hibiscus, will have
the following components: TRE definition, formula score definition and score scaling
definition.
There will be a set of PC type TREs that reflect the possible correct (to some degree) answers to the central question of why this concluding paragraph was used.
I currently have four candidate expressions to score the main proposition(s) in responses that
will count as wholly or partially correct, and which assign 8, 6, 4 and 2 points, respectively,
with a maximum of 8 points (representing half the value of the item) in the event more than
one proposition is found.
Five specific reasoning elements will be allowed for credit on this item, each worth one point, with a maximum of 4 points allowed.
Example 1: Literary Analysis Question
EV TREs: The third TRE type scored for this item will be for the use of evidence.
10 TREs of type EV will be worth 1 point each, with a maximum of 4 points allowed for all EV type TREs.
In this way half the point score is for the proposition, and a quarter of the score is for each of
the reasoning and evidence factors. There is some overlap between citing evidence and
connecting evidence to reasoning, and as I work through this process, the definition of the
types of TREs and their specific constitution could change.
Note that TREs can be grouped with group-level scoring rules (caps and overrides), and these groups may (in other cases) include TREs of the same type or a mix of TRE types.
Some examples of these TREs will be presented in a moment, but the SGREL rubric has two more components besides the TREs: the scoring formula and the final scaling.
Example 1: Literary Analysis Question
Scoring Formula for Winter Hibiscus will be as follows:
Total raw score = PC score + RR score + EV score (recognizing the maximum point values for each component, and that each matched TRE has a point value defined as part of the TRE itself).
Example 1: Literary Analysis Question
Sample Target Response Element Definitions for the proposition / conclusion:
The PC TREs for this item will reflect these four targeted response elements:
The final paragraph signals the significance, and the teenager's recognition, of the over-arching metaphor the story communicates: that of the adaptation of the winter hibiscus to its environment, and the struggle required by the immigrant teenage girl to adapt to her new environment. (8)
The final paragraph signals Saeng's determination to adapt and succeed (6)
Adaptation is a matter of both struggle and accommodation, and the adapter is changed in the process, becoming stronger and yet different. (4)
Life is about change and how to respond to change. A life is made of a series of choices a person must make as they grow: what to hold on to, what to treasure, what to value and how to adapt. (2)
#PC-TRE1 ::= [hibiscus | flower | *plant ] + [girl | *author | Saeng | *speaker |
daughter ] + [analogy | parallel | similar | *metaphor | between ] = 8
The notation can define words and *concepts (a word plus its synsets); additional operators / modifiers are under consideration / development / trial. TRE matching currently maps expressions to sentences (augmented via reference resolution); word order within TREs and response sentence elements is presently ignored.
(The notation will be explained at this point and is detailed a bit in the following 3 slides.)
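As an illustration of how such a TRE might be applied, the sketch below does order-insensitive matching of #PC-TRE1 against a single sentence. The SYNSETS table is a hypothetical stand-in for WordNet lookups, and reference resolution is omitted:

```python
# Sketch of order-insensitive TRE matching against one response sentence.
# A TRE is a list of alternative groups plus a point value; every group
# must be satisfied by at least one sentence token for the TRE to match.

# Hypothetical stand-in for WordNet synset expansion of *concept terms;
# the real engine would consult WordNet via NLP tooling.
SYNSETS = {
    "plant": {"plant", "shrub", "bush"},
    "author": {"author", "writer", "narrator"},
    "speaker": {"speaker", "narrator"},
    "metaphor": {"metaphor", "symbol", "symbolism"},
}

def expand(term):
    """A *concept term expands to its synset; a plain word matches literally."""
    if term.startswith("*"):
        return SYNSETS.get(term[1:], {term[1:]})
    return {term.lower()}

def tre_matches(groups, sentence):
    """True if every alternative group has at least one hit among the tokens."""
    tokens = set(sentence.lower().replace(",", " ").replace(".", " ").split())
    for group in groups:
        hits = set()
        for alt in group:
            hits |= expand(alt)
        if not hits & tokens:
            return False
    return True

# PC-TRE1 from the slide: three alternative groups, 8 points on a match.
PC_TRE1 = ([["hibiscus", "flower", "*plant"],
            ["girl", "*author", "Saeng", "*speaker", "daughter"],
            ["analogy", "parallel", "similar", "*metaphor", "between"]], 8)

def score_tre(tre, sentence):
    groups, points = tre
    return points if tre_matches(groups, sentence) else 0
```

A sentence like "There is a parallel between the hibiscus and the girl" satisfies all three groups and earns the 8 points; a sentence that names Saeng and the hibiscus but draws no comparison does not.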
SGREL Syntax: Identifiers, Operators, Words & Tokens
Sample TRE Definitions (continued)
SGREL Syntax: Identifiers, Operators, Words & Tokens
max(expr1, expr2): a function that returns the greater of the two expressions when evaluated
Example 1: Literary Analysis Question
Sample Scoring Report
Sample essay 4_8865:
The author concludes the story with that passage to show the importance of the inspiration Saeng gets from
the hibiscus plant. Saeng, throughout the story was comforted by the hibiscus plant because it reminded her
of home. The plant during the winter metaphorically explains: @CAPS1 attitude towards her new country
and her driving test; the hibiscus plant in the winter is not as beautiful in the bitter cold, but it adapts and
survives, and returns to its beautiful state in the spring. Saeng is bitter about her new country and driving test,
but is adapting, and will be inspired by the beautiful state of the hibiscus in the spring to try her test again. In
conclusion, the author ended the story in that way to stress importance in the relationship between Saeng
and the hibiscus plant.
Additional scoring detail: Additional points could have been awarded for additional
evidence as follows: a, b, c etc.
Example 2: Critical Thinking Challenge
This is a new item for which I am collecting new response data.
Example 2: Critical Thinking Challenge
Seebeck Item, Question 1:
Based on the Economist article, define the Seebeck effect and describe the role that
graphene could play in creating a workable Seebeck-effect-powered electrical
generation capability.
Scoring:
TRE1-EV-SBDFN (max 6): 6 points - property drives electrons hot to cold, creating current.
TRE3-EV-ROG (max 6):
A: 3 points - stretches from room temperature
E: 3 points - converts up to 5% heat to electricity
Part 1: Score 6 of 6.
Example 3: Diagnostic Radiology Exam
Background
Significant study and investment across a range of institutions (the National Academy of Sciences, the Institute of Medicine, the National Research Council) and a broad range of agencies and programs responsible directly and indirectly for funding and guiding Graduate Medical Education have aligned their efforts to invigorate and upgrade Graduate Medical Education via a number of new initiatives, with a focus on leveraging advanced technology for better, more effective instruction and educational outcomes in the face of accelerating leaps in medical knowledge, diagnostic tools and interventional capabilities.
Educational Context
The American College of Radiology, a leading education not-for-profit in the medical imaging and medical
informatics areas, has developed a learning and assessment platform with advanced capabilities to support
state-of-the-art diagnostic radiology education and assessment. This platform fully integrates sophisticated, clinical-quality 3D image viewing (DICOM image sets) and a rich medical case data model. One application of this technology has been the development of a "ready for call" real-time simulation assessment that puts a radiology resident through a simulated, 8-hour rotation during which they will be served a broad range of challenging cases (with imaging and case data) that test a comprehensive range of skills with different imaging modalities, medical specialties, anatomical systems and medical pathologies (including cases where no abnormalities exist).
One goal of my focus on rigorous rubrics was to enable scoring metrics typical of diagnostic medical settings, where scoring is often based on a collection of indicators (observations, interpretations, descriptions and recommended actions) that can vary widely in importance, and may have override factors (where a potentially fatal miss is of utmost concern) or where different combinations of findings reflect varying gradations of skill, understanding and mastery.
Example 3: Diagnostic Radiology Exam
Standard Diagnostic Assessment Item Format
A medical case is presented to the student with a robust set of data and including appropriate medical
imaging that was ordered based on symptoms identified in the case.
A typical assessment item would present the complete set of case data and notes as well as one or more sets of medical images (which could include MRI, CT, PET-CT, X-ray, Doppler Ultrasound, etc.) in a platform that supports a full 3D viewer with controls and capabilities familiar to the student.
The assessment item task is almost universally something like "Review the case data and imaging and write up your findings."
Example 3: Diagnostic Radiology Exam
This is a typical item from a hypothetical diagnostic imaging exam.
This item requires the student to analyze and write up findings from examining a set of [MRI, PET-CT, CT with Contrast, etc.] image sets appropriate to the case data provided, or even, perhaps, in the absence of relevant case data.
In English: add the values of 1a/1b/1c if found and use this sum; if it is zero, use 1d if found. Subtract 1 from the score if any factor in 1E is identified, but use a score of zero if the result is negative.
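That rule can be expressed directly as executable logic. The sketch below uses hypothetical point values (the POINTS table is mine, not from an actual rubric) to illustrate the fallback and override behavior:

```python
# Sketch of the override logic above: sum 1a/1b/1c if found; if that sum
# is zero, fall back to 1d; subtract 1 if any 1E factor was identified,
# flooring the result at zero.

POINTS = {"1a": 3, "1b": 2, "1c": 1, "1d": 1}  # hypothetical values

def item_score(found):
    """found: set of indicator labels detected in the write-up."""
    score = sum(POINTS[k] for k in ("1a", "1b", "1c") if k in found)
    if score == 0 and "1d" in found:
        score = POINTS["1d"]
    if any(k.startswith("1E") for k in found):
        score -= 1
    return max(score, 0)
```

So a write-up containing only the 1d fallback plus a 1E penalty factor nets zero, while 1a and 1b together score their combined value.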
Example 3: Diagnostic Radiology Exam
This is a typical item from a hypothetical diagnostic imaging exam.
This task is to analyze and write up findings from examining a set of [MRI, PET-CT,
CT with Contrast, etc.] image sets appropriate to the case data provided.
This diagnosis expects some important observations, interpretations and one
key finding that, if missing, is a fatal miss.
A two-part rubric (observation and interpretation) might look like this (with placeholder data):
#TRE1A-EV-AAA ::= [observation + using + proper + terms | or + alternates] = 8
#TRE1B-EV-AAA ::= [[related + other + observation] | alternate + terminology] = 2
#TRE1 = [#TRE1A-EV-AAA + #TRE1B-EV-AAA]
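Group-level aggregation of such TREs, with the caps and overrides mentioned earlier, might be sketched as follows (the cap value in the usage note is a placeholder):

```python
# Sketch of a TRE group: member TRE scores are summed and an optional
# group-level cap is applied, mirroring the grouping
# #TRE1 = [#TRE1A-EV-AAA + #TRE1B-EV-AAA] above.

def group_score(member_scores, cap=None):
    """Sum member TRE scores, applying a group-level cap if one is set."""
    total = sum(member_scores)
    return total if cap is None else min(total, cap)
```

For the two-part rubric above, group_score([8, 2]) yields 10; a hypothetical group cap of 8 would limit it to 8.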
Status & Next Steps
Refine SGREL syntax as I encode rubrics for three sample
problems
Questions and Discussion
Harry Layman
(949) 945-3373
Harry.Layman@CogWrite.com
www.CogWrite.com