Zhenghao Liu (刘正皓)

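Personal Information

Associate Professor

Name (pinyin): liuzhenghao

Date of birth: 1994-10-22

Email:

Start date: 2021-07-12

Affiliation: Dept. of Computer Science and Technology

Position: Associate Professor

Education: Doctoral graduate

Office: Room B233, Information Science Building (信息学馆), Hunnan Campus

Degree: Doctor of Engineering

Employment status: Active

Primary appointment: Visiting Researcher, Natural Language Processing Lab, Tsinghua University

Other appointments: Deputy Director (temporary assignment), Planning and Finance Office, Northeastern University

Alma mater: Tsinghua University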


Research Projects

Projects Led / Participated In

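1. Representation Learning and Applications for Large-Scale Complex Information Networks, January 2018 - December 2021

National Natural Science Foundation of China General Program (participant, completed), 630,000 CNY

As a key participant, built knowledge-enhanced information retrieval and dialogue generation systems that model both the unstructured and the structured information in knowledge graphs to improve model performance. The related work was accepted at ACL 2018 and ACL 2020 and has received more than 270 stars on GitHub.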

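2. Key Foundational AI Technologies for Chinese Language Teaching and Dissemination, July 2020 - June 2023

Science and Technology Commission of Shanghai Municipality project (participant, completed), 5,000,000 CNY

As a key participant, developed Chinese and English grammatical error correction for second-language learners and implemented quality estimation of the correction results, which improved correction performance.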

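3. Multi-Segment Semantic Representation Fusion for Rich-Text Document Retrieval, January 2023 - December 2025

National Natural Science Foundation of China Young Scientists Fund (principal investigator, ongoing), 300,000 CNY

This project plans to study: (1) methods for constructing segment-level semantic representations of rich-text documents; (2) methods for enhancing those segment-level representations; (3) methods for fusing multiple segment-level representations. It aims to strengthen the semantic representation of structured text and to address semantic alignment and semantic fusion between structured and unstructured text, providing solutions for using structured information such as knowledge graphs and tables.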

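4. Pre-training Methods for Information Retrieval Language Models Based on Text Semantic Matching, January 2022 - December 2022

Beijing Academy of Artificial Intelligence (BAAI) WuDao project (principal investigator, completed), 500,000 CNY

For text semantic representation learning, this project designed a language-centric multi-modal semantic representation learning method and a text representation method based on prompt tuning of large models. It reached state-of-the-art results at the time on multi-modal understanding datasets such as WebQA.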

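5. Key Technologies for Semantic Retrieval and Answer Generation in Open-Domain Precise Question Answering, January 2021 - December 2023

China Postdoctoral Science Foundation General Program (principal investigator, ongoing), 120,000 CNY

This project plans to study the following aspects of open-domain question answering: (1) the precision of text retrieval; (2) adapting text retrieval to the reader; (3) answer quality estimation for multi-passage question answering. It further aims to provide solutions for mitigating the mismatch between internal and external knowledge in large language models and for improving the factual consistency of answers generated by large pre-trained language models.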

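6. Open-Domain Question Answering Based on Text Semantic Matching, January 2022 - December 2023

University Basic Research Funds project (principal investigator, ongoing), 170,000 CNY

This project studies precise text semantic matching for open-domain question answering. It investigates question generation in few-shot settings and denoising methods for noisy data, in order to ease the neural network training bottleneck caused by data scarcity in vertical domains.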

Publications [Google Scholar]

* indicates equal contribution.

# indicates corresponding author.
2023
- Zhenghao Liu, Chenyan Xiong, Yuanhuiyi Lv, Zhiyuan Liu, Ge Yu. Universal Multi-Modal Retrieval: Learning A Unified Representation Space for Vision Language Retrieval. The Eleventh International Conference on Learning Representations (ICLR 2023). [pdf][codes].
- Zhenghao Liu*#, Sen Mei, Chenyan Xiong, Xiaohua Li, Shi Yu, Zhiyuan Liu, Yu Gu, Ge Yu. Text Matching Improves Sequential Recommendation by Reducing Popularity Biases. The 32nd ACM International Conference on Information and Knowledge Management (CIKM 2023). [pdf][codes].
- Shi Yu, Zhenghao Liu, Chenyan Xiong, Zhiyuan Liu. OpenMatch-v2: An All-in-one Multi-Modality PLM-based Information Retrieval Toolkit. The 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023). [pdf][codes].
- Xinze Li, Zhenghao Liu#, Chenyan Xiong, Shi Yu, Yu Gu, Zhiyuan Liu, Ge Yu. Structure-Aware Language Model Pretraining Improves Dense Retrieval on Structured Data. Findings of the Association for Computational Linguistics: ACL 2023 (ACL 2023). [pdf][codes].
- Ruining Chong, Cunliang Kong, Liu Wu, Zhenghao Liu, Ziye Jin, Liner Yang, Yange Fan, Hanghang Fan, Erhong Yang. Leveraging Prefix Transfer for Multi-Intent Text Revision. The 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023). [pdf].
2022
- Zhenghao Liu, Han Zhang, Chenyan Xiong, Zhiyuan Liu, Yu Gu, Xiaohua Li. Dimension Reduction for Efficient Dense Retrieval via Conditional Autoencoder. The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022). [pdf][codes].
- Xiaomeng Hu, Shi Yu, Chenyan Xiong, Zhenghao Liu#, Zhiyuan Liu, Ge Yu. P3 Ranker: Mitigating the Gaps between Pre-training and Ranking Fine-tuning with Prompt-based Learning and Pre-finetuning. The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022). [pdf][codes].
2021
- Zhenghao Liu, Xiaoyuan Yi, Maosong Sun, Liner Yang, Tat-Seng Chua. Neural Quality Estimation with Multiple Hypotheses for Grammatical Error Correction. The 2021 Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT 2021). [pdf][codes].
- Zhenghao Liu*, Kaitao Zhang*, Chenyan Xiong, Zhiyuan Liu, Maosong Sun. OpenMatch: An Open Source Library for Neu-IR Research. The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021). [pdf][codes].
- Shi Yu*, Zhenghao Liu*, Chenyan Xiong, Tao Feng, Zhiyuan Liu. Few-Shot Conversational Dense Retrieval. The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021). [pdf][codes].
- Yizhi Li*, Zhenghao Liu*, Chenyan Xiong, Zhiyuan Liu. More Robust Dense Retrieval with Contrastive Dual Learning. The 2021 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2021). [pdf][codes].
- Si Sun*, Zhenghao Liu*, Chenyan Xiong, Zhiyuan Liu, Jie Bao. Capturing Global Informativeness in Open Domain Keyphrase Extraction. The CCF Conference on Natural Language Processing and Chinese Computing (NLPCC 2021). [pdf][codes].
- Si Sun, Yingzhuo Qian, Zhenghao Liu, Chenyan Xiong, Kaitao Zhang, Jie Bao, Zhiyuan Liu, Paul Bennett. Few-Shot Text Ranking with Meta Adapted Synthetic Weak Supervision. The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021). [pdf][codes].
- Huiyuan Xie, Zhenghao Liu, Chenyan Xiong, Zhiyuan Liu, Ann Copestake. TIAGE: A Benchmark for Topic-Shift Aware Dialog Modeling. Findings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021). [pdf][codes].
2020
- Zhenghao Liu, Chenyan Xiong, Maosong Sun, Zhiyuan Liu. Fine-grained Fact Verification with Kernel Graph Attention Network. The 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020). [pdf][codes].
- Zhenghao Liu, Chenyan Xiong, Zhuyun Dai, Si Sun, Maosong Sun, Zhiyuan Liu. Adapting Open Domain Fact Extraction and Verification to COVID-FACT through In-Domain Language Modeling. Findings of the Association for Computational Linguistics: EMNLP 2020 (EMNLP 2020). [pdf][codes].
- Houyu Zhang*, Zhenghao Liu*, Chenyan Xiong, Zhiyuan Liu. Grounded Conversation Generation as Guided Traverses in Commonsense Knowledge Graphs. The 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020). [pdf][codes].
- Chenyan Xiong*, Zhenghao Liu*, Si Sun*, Zhuyun Dai*, Kaitao Zhang*, Shi Yu*, Zhiyuan Liu, Hoifung Poon, Jianfeng Gao, Paul Bennett. CMT in TREC-COVID Round 2: Mitigating the Generalization Gaps from Web to Special Domain Search. [pdf][codes].
- Xiaoyuan Yi, Zhenghao Liu, Wenhao Li, Maosong Sun. Text Style Transfer via Learning Style Instance Supported Latent Space. The 29th International Joint Conference on Artificial Intelligence (IJCAI 2020). [pdf].
- Kaitao Zhang, Chenyan Xiong, Zhenghao Liu, Zhiyuan Liu. Selective Weak Supervision for Neural Information Retrieval. The Web Conference 2020 (WebConf 2020). [pdf][codes].
- Deming Ye, Yankai Lin, Jiaju Du, Zhenghao Liu, Peng Li, Maosong Sun, Zhiyuan Liu. Coreferential Reasoning Learning for Language Representation. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020). [pdf][codes].
2019
- Zhenghao Liu, Chenyan Xiong, Maosong Sun, Zhiyuan Liu. Explore Entity Embedding Effectiveness in Entity Retrieval. The 18th China National Conference on Computational Linguistics (CCL 2019). [pdf][codes].
- Yifan Qiao, Chenyan Xiong, Zhenghao Liu, Zhiyuan Liu. Understanding the Behaviors of BERT in Ranking. arXiv preprint arXiv:1904.07531. [pdf].
- Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, Maosong Sun. DocRED: A Large-Scale Document-Level Relation Extraction Dataset. The 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019). [pdf][codes].
2018
- Zhenghao Liu, Chenyan Xiong, Maosong Sun, Zhiyuan Liu. Entity-Duet Neural Ranking: Understanding the Role of Knowledge Graph Semantics in Neural Information Retrieval. The 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018). [pdf][codes].
2017
- Liner Yang, Maosong Sun, Jiacheng Zhang, Zhenghao Liu, Huanbo Luan, Yang Liu. Neural Parse Combination. Journal of Computer Science and Technology, 2017. [pdf].

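Research Areas

Research Directions

Research focuses mainly on natural language processing and information retrieval, including but not limited to the following directions:

1. Information retrieval: few-shot neural information retrieval methods, dense vector retrieval for multi-modal data, and efficient index modeling for massive data (in collaboration with Carnegie Mellon University and Tsinghua University).

2. Knowledge-enhanced large language models: tool intelligence for large language models and vector modeling of external knowledge (in collaboration with Tsinghua University).

3. Recommender systems based on large models: multi-modal and sequential recommendation built on item content and large models (in collaboration with Alibaba).

4. Open-domain question answering, fact verification, and legal intelligence: aligning large models with human feedback for objective facts and the legal domain, and controlled generation for large models (in collaboration with Tsinghua University and Microsoft Research Asia).

5. Large language models for education: training large language models for primary and secondary education on textbook data (in collaboration with Tsinghua University and Beijing Language and Culture University).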

Open-Source Projects

1. OpenMatch, an open-source information retrieval platform and its applications (URL: https://github.com/OpenMatch). It collects the group's open-source research.

2. Selected other open-source projects (data as of August 15, 2023)

Project	URL	Stars	Forks
EntityDUET	https://github.com/thunlp/EntityDuetNeuralRanking	152	20
BERT KPE	https://github.com/thunlp/BERT-KPE	424	78
KernelGAT	https://github.com/thunlp/KernelGAT	159	34
OpenMatch v1.0	https://github.com/thunlp/OpenMatch	442	46
ConceptFlow	https://github.com/thunlp/ConceptFlow	118	19

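Selected Project Results

1. Neural Information Retrieval Models for Few-Shot Learning

Background: In recent years, neural information retrieval (Neu-IR) has shown strong results across many domains as an advanced retrieval approach. Its effectiveness, however, often depends on large-scale in-domain relevance training signals. In real retrieval scenarios such as the legal and biomedical domains, annotating question-document relevance is usually very expensive, so existing Neu-IR models often face the problem of missing annotations (hole rate). A domain-adaptive learning method that transfers Neu-IR models from annotation-rich domains to few-shot retrieval domains is therefore important.

Results: The related work ranked first in the automatic (no human intervention) group in Round 2 of TREC-COVID, the COVID-19 information retrieval challenge run by the U.S. National Institute of Standards and Technology (NIST), and was adopted by Microsoft in its biomedical information retrieval system (URL: https://biomedsearch.microsoft.com/en-us/). For details, see the technical blog (URL: https://blogs.microsoft.com/ai-for-business/biomedical-search/). The related work was accepted at ACL 2021, WebConf 2020, and SIGIR 2021. The model overview is shown in the figure below.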

[Figure: model overview (图片 1.png)]

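2. Neural Information Retrieval Models for Multi-Modal Data

Background: Multi-modal data such as images, tables, knowledge graphs, and structured lists play a vital role in how humans understand the world. Retrieval methods that fuse multi-modal data can retrieve and integrate data from different modalities as external knowledge, compensating for the limits of any single modality and improving the coverage and semantic richness of search results. Traditional retrieval models typically target single-modal or cross-modal retrieval and return documents from only one modality. With the rise of multi-modal pre-trained models such as Flamingo and GPT-4, however, single-modality data no longer meets users' information needs, which poses greater challenges for information retrieval. Given a user question, this project takes language-centric multi-modal representations as its foundation and fine-tunes the retrieval model to encode multi-modal data into a unified vector space. It aims at end-to-end modeling of modality selection for retrieved documents, single-modal retrieval, cross-modal retrieval, and multi-modal fusion, so that it can return a candidate set of external knowledge, made up of multi-modal documents, that satisfies the user's information need.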
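The idea of a unified vector space can be illustrated with a minimal sketch. It assumes that documents of every modality have already been encoded into vectors of the same dimension; the hand-written vectors, the document ids, and the `retrieve` helper are hypothetical stand-ins for illustration and are not the project's actual encoder or code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, candidates, k=2):
    """Rank candidates of any modality by similarity to the query vector."""
    scored = [
        (cosine(query_vec, c["vec"]), c["id"], c["modality"]) for c in candidates
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Hypothetical embeddings standing in for the output of a shared encoder.
candidates = [
    {"id": "doc-text", "modality": "text", "vec": [0.9, 0.1, 0.0]},
    {"id": "doc-image", "modality": "image", "vec": [0.8, 0.2, 0.1]},
    {"id": "doc-table", "modality": "table", "vec": [0.0, 0.1, 0.9]},
]
query_vec = [1.0, 0.0, 0.0]

top = retrieve(query_vec, candidates, k=2)
for score, doc_id, modality in top:
    print(f"{doc_id} ({modality}): {score:.3f}")
```

Because all candidates share one space, a single ranking can mix modalities, so modality selection falls out of the similarity scores rather than requiring a separate routing step. A real system would replace the hand-written vectors with encoder outputs and the linear scan with an approximate nearest-neighbor index.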

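Results: The related work achieved good retrieval accuracy on the multi-modal retrieval dataset WebQA, the code retrieval dataset CodeSearchNet, and the product retrieval dataset ESCI. The related work was accepted at ICLR 2023 and ACL 2023. The model overview is shown in the figures below.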

[Figure: model overview (图片 3.png)]

[Figure: model overview (图片 2.png)]

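3. Large Language Models for Smart Education

Background: As ChatGPT drew society-wide attention and large language models appeared one after another, general-domain natural language processing tasks have seen great success, which has drawn wide attention from the international Chinese education community. Practitioners in the field have begun discussing whether large models can provide language suited to a learner's level or give detailed answers to a learner's questions, and thus to some extent assist or even act as learning partners and language teachers.

Results:

a. Released Taoli (桃李) 1.0 jointly with Tsinghua University and Beijing Language and Culture University (https://mp.weixin.qq.com/s/NZpY8y6hBnFvcfTwLqYvDQ).

b. Released the Zhiyuan Index (智源指数) jointly with Beijing Language and Culture University (https://mp.weixin.qq.com/s/5TTx73F-QiJ-RVBszBi8sQ).

c. Co-organized, with Tsinghua University and Beijing Language and Culture University, the Chinese learner grammatical error correction evaluations (CLTC-2022 and CLTC-2023) at the 21st and 22nd China National Conference on Computational Linguistics.

d. Reached advanced performance on grammatical error correction, quality estimation for grammatical error correction, and grammatical error detection. The related papers were accepted at NAACL 2021 and ACL 2023. The related research is shown in the figure below.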

[Figure: related research (图片 4.png)]
