HotpotQA数据集备注
文件
TRAIN: hotpot_train_v1.1.json http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_train_v1.1.json
DEV: hotpot_dev_distractor_v1.json http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_dev_distractor_v1.json
PRED(sample): sample_dev_pred.json http://curtis.ml.cmu.edu/datasets/hotpot/sample_dev_pred.json
Evaluation Script: https://raw.githubusercontent.com/hotpotqa/hotpot/master/hotpot_evaluate_v1.py
json结构
总体结构如下
{
_id : 5abd94525542992ac4f382d2,
answer: 'YG Entertainment',
question: '2014 S/S is the debut album of a South Korean boy group that was formed by who?' ,
supporting_facts: [['2014 S/S', 0], ['Winner (band)', 0]] ,
context: [['List of awards and nominations received by Shinee',[' The group was formed by S.M. Entertainment in 2008 and...', 'South Korean boy group Shinee have received several awards...',...],[...,[...]],...]
type: hard ,
level: bridge
}
- _id 每个问题的唯一hash编号,
str
类型 - answer 问题答案,
str
- question 问题,
str
- supporting_facts 支持段落,
list
。形式:[ [‘para_name_1’ , sentence_index ], [‘para_name_2’ , sentence_index ] , … ] Nested Element: [str
,int
] 段落名,句子编号 - context 上下文,
list
。 形式: [ [ ‘$ParaName_1$’,[‘$sentence_1$’,’$sentence_2$’,…,’$sentence_t$’]],[‘$ParaName_2$’,[…]],….,[‘$ParaName_n$’,[…]] ] - type 难度类型, easy,medium,hard 三选一。
str
- level 问题种类 , bridge或comparison二选一。
str
数据集大小(单位:问题)
- train: 90447
- dev: 7405
Context长度分布
注:使用官方评测中的normalize_answer()正则后的文本
min | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | max | |
---|---|---|---|---|---|---|---|---|---|---|---|
TRAIN | 24 | 524 | 604 | 662 | 714 | 764 | 816 | 874 | 947 | 1059 | 2511 |
DEV | 39 | 533 | 613 | 673 | 725 | 772 | 826 | 884 | 954 | 1069 | 2337 |
一些特殊长度的等比分点:
512 | 1024 | 1500 | 2000 | |
---|---|---|---|---|
TRAIN | 8.874% | 87.419% | 99.355% | 99.910% |
DEV | 8.062% | 86.685% | 99.399% | 99.919% |
特殊问题
标准答案应回答‘yes’或’no’的问题数量。
- Train: 5483 / 90477 6.060%
- Dev: 458 / 7405 6.185%
标准答案应回答‘noanswer’的问题数量。
- Train: 0 / 90477 0%
- Dev: 0 / 7405 0%
评价函数
EM Score
Exact Matching的个数,如果normalized后的答案与normalized后的gold答案一致,EM值加一。
F1 Score
num same 预测文本与gold文本的相同的单词数
precision
$precision = NumSame / prediction$
recalll
$recall = NumSame / gold$
Span F1
计算span的F1值,公式相对简单
Ground Truth | False | |
---|---|---|
Predicted True | TP | FP |
Not Predicted | FP | FN |
Joint F1
将F1与SpanF1结合,即precision与recall分别两两相乘。然后再计算JointF1,就是将$JointPrecision$与$JointRecall$合并计算。