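Personal Information
Associate Professor
Name (Pinyin): liuzhenghao
Email:
Start date: 2021-07-12
Position: Associate Professor
Office: Room B233, Information Science Building, Hunnan Campus.
Primary appointment: Visiting Researcher, Natural Language Processing Lab, Tsinghua University
Other appointment: Deputy Director (temporary post), Planning and Finance Office, Northeastern University
Alma mater: Tsinghua University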
-
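Projects as principal investigator:
1. Research on the Construction and Collaborative Optimization of Retrieval-Augmented Agents Based on Large Language Models, August 2025 - December 2029
General Program of the National Natural Science Foundation of China (ongoing), CNY 510,000
2. 2025-QGS-KY-16-FCY-035, December 2024 - June 2026
Undisclosed project (ongoing), CNY 1.4 million
3. Research and Implementation of a Knowledge Memory, Understanding, and Generation Framework Based on Large Language Models, September 2025 - September 2026
ModelBest university-industry collaboration project (ongoing), CNY 500,000
4. Multi-Segment Semantic Representation Fusion for Rich-Text Document Retrieval, January 2023 - December 2025
Young Scientists Fund of the National Natural Science Foundation of China (completed), CNY 300,000
5. Pretraining Methods for Information Retrieval Language Models Based on Text Semantic Matching, January 2022 - December 2022
Beijing Academy of Artificial Intelligence WuDao project (completed), CNY 500,000
6. Key Technologies for Semantic Retrieval and Answer Generation in Open-Domain Precise Question Answering, January 2021 - December 2023
General Program of the China Postdoctoral Science Foundation (completed), CNY 120,000
7. Key Technologies for Language Model Representation Learning and Vector Retrieval over Multimodal Documents, December 2023 - November 2025
Liaoning Province General Fund project (completed), CNY 80,000
8. End-to-End Large-Model-Based Layout Analysis and Element Extraction for Long Documents, March 2025 - March 2026
Alibaba AIR project (completed), CNY 300,000
9. Research on a Reasoning Enhancement Framework for Autonomous Knowledge Learning in Large Models, December 2024 - December 2025
CCF-Zhipu Large Model Fund project (completed), CNY 100,000
10. Development of a Knowledge-Graph-Based Automatic Process Planning Module, December 2023 - December 2024
Shenyang Institute of Automation collaboration project (completed), CNY 100,000
11. Research on Open-Domain Question Answering Based on Text Semantic Matching, January 2022 - December 2023
University Fundamental Research Funds project (completed), CNY 170,000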
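Projects as participant:
1. Representation Learning and Applications for Large-Scale Complex Information Networks, January 2018 - December 2021
General Program of the National Natural Science Foundation of China (completed), CNY 630,000
2. Research on Key Fundamental Artificial Intelligence Technologies for Chinese Language Teaching and Dissemination, July 2020 - June 2023
Science and Technology Commission of Shanghai Municipality project (completed), CNY 5 million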
-
Google Scholar (* indicates equal contribution; # indicates corresponding author)
Mingyan Wu, Zhenghao Liu#, Yukun Yan, Xinze Li, Shi Yu, Zheni Zeng, Yu Gu, Ge Yu. RankCoT: Refining Knowledge for Retrieval-Augmented Generation through Ranking Chain-of-Thoughts. ACL 2025. [pdf][codes].
Shuliang Liu, Xinze Li, Zhenghao Liu#, Yukun Yan, Cheng Yang, Zheni Zeng, Zhiyuan Liu, Maosong Sun, Ge Yu. Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models. ACL 2025: Findings. [pdf][codes].
Kunlun Zhu, Yifan Luo, Dingling Xu, Yukun Yan, Zhenghao Liu, Shi Yu, Ruobing Wang, Shuo Wang, Yishan Li, Nan Zhang, Xu Han, Zhiyuan Liu, Maosong Sun. RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework. ACL 2025. [pdf][codes].
Qiushi Xiong, Zhipeng Xu, Zhenghao Liu#, Mengjia Wang, Zulong Chen, Yue Sun, Yu Gu, Xiaohua Li, Ge Yu. Enhancing the Patent Matching Capability of Large Language Models via the Memory Graph. SIGIR 2025. [pdf][codes].
Pengcheng Huang, Zhenghao Liu#, Yukun Yan, Haiyan Zhao, Xiaoyuan Yi, Hao Chen, Zhiyuan Liu, Maosong Sun, Tong Xiao, Ge Yu, Chenyan Xiong. ParamMute: Suppressing Knowledge-Critical FFNs for Faithful Retrieval-Augmented Generation. NeurIPS 2025. [pdf][codes].
Xiaoang Xu, Shuo Wang, Xu Han, Zhenghao Liu, Huijia Wu, Peipei Li, Zhiyuan Liu, Maosong Sun, Zhaofeng He. A*-Thought: Efficient Reasoning via Bidirectional Compression for Low-Resource Settings. NeurIPS 2025. [pdf][codes].
Xinze Li*, Sen Mei*, Zhenghao Liu#, Yukun Yan#, Shuo Wang, Shi Yu, Zheni Zeng, Hao Chen, Ge Yu, Zhiyuan Liu, Maosong Sun, Chenyan Xiong. RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards. ICLR 2025. [pdf][codes].
Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, Maosong Sun. VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents. ICLR 2025. [pdf][codes].
Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, Maosong Sun. Advancing LLM Reasoning Generalists with Preference Trees. ICLR 2025. [pdf][codes].
Zhenghao Liu, Xingsheng Zhu, Tianshuo Zhou, Xinyi Zhang, Xiaoyuan Yi, Yukun Yan, Ge Yu, Maosong Sun. Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts. MM 2025. [pdf][codes].
Sijia Yao*, Pengcheng Huang*, Zhenghao Liu#, Yu Gu, Yukun Yan, Shi Yu, Ge Yu. ExpandR: Teaching Dense Retrievers Beyond Queries with LLM Guidance. EMNLP 2025. [pdf][codes].
Hao Chen*, Yukun Yan*, Sen Mei, Wanxiang Che#, Zhenghao Liu#, Qi Shi, Xinze Li, Yuchun Fan, Pengcheng Huang, Qiushi Xiong, Zhiyuan Liu, Maosong Sun. ClueAnchor: Clue-Anchored Knowledge Reasoning Exploration and Optimization for Retrieval-Augmented Generation. EMNLP 2025: Findings. [pdf][codes].
Zhensheng Jin*, Xinze Li*, Yifan Ji, Chunyi Peng, Zhenghao Liu#, Qi Shi, Yukun Yan, Shuo Wang, Furong Peng, Ge Yu. ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization. EMNLP 2025: Findings. [pdf][codes].
Zheni Zeng, Jiayi Chen, Huimin Chen, Yukun Yan, Yuxuan Chen, Zhenghao Liu, Zhiyuan Liu, Maosong Sun. PersLLM: A Personified Training Approach for Large Language Models. EMNLP 2025: Findings. [pdf][codes].
Ruobing Wang, Qingfei Zhao, Yukun Yan, Daren Zha, Yuxuan Chen, Shi Yu, Zhenghao Liu, Yixuan Wang, Shuo Wang, Xu Han, Zhiyuan Liu, Maosong Sun. DeepNote: Note-Centric Deep Retrieval-Augmented Generation. EMNLP 2025: Findings. [pdf][codes].
Zheni Zeng*, Yuxuan Chen*, Shi Yu, Ruobing Wang, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, Maosong Sun. KBAlign: Efficient Self Adaptation on Specific Textual Knowledge Bases. EMNLP 2025: Findings. [pdf][codes].
Weiqing Yang, Hanbin Wang, Zhenghao Liu#, Xinze Li, Yukun Yan, Shuo Wang, Yu Gu, Minghe Yu, Zhiyuan Liu, Ge Yu. COAST: Enhancing the Code Debugging Ability of LLMs through Communicative Agent Based Data Synthesis. NAACL 2025: Findings. [pdf][codes].
Mingxing Shao, Tiancheng Zhang, Minghe Yu, Zhenghao Liu, Yifang Yin, Hengyu Liu, Ge Yu. Leveraging Student Profiles and the Mamba Framework to Enhance Knowledge Tracing. ECML-PKDD 2025. [pdf][codes].
Yifan Ji*, Zhipeng Xu*, Zhenghao Liu#, Yukun Yan, Shi Yu, Yishan Li, Zhiyuan Liu, Yu Gu, Ge Yu, Maosong Sun. Learning Refined Document Representations for Dense Retrieval via Deliberate Thinking. SIGIR-AP 2025. [pdf][codes].
Buqiang Xu*, Xin Dai*, Zhenghao Liu#, Huiyuan Xie, Xiaoyuan Yi, Shuo Wang, Yukun Yan, Liner Yang, Yu Gu, Ge Yu. Learning Refined Document Representations for Dense Retrieval via Deliberate Thinking. ADMA 2025. [pdf][codes].
Zhipeng Xu, Zhenghao Liu#, Yukun Yan, Zhiyuan Liu, Ge Yu, Chenyan Xiong. Cleaner Pretraining Corpus Curation with Neural Web Scraping. ACL 2024. [pdf][codes].
Tianshuo Zhou, Sen Mei, Xinze Li, Zhenghao Liu#, Chenyan Xiong, Zhiyuan Liu, Yu Gu, Ge Yu. MARVEL: Unlocking the Multi-Modal Capability of Dense Retrieval via Visual Module Plugin. ACL 2024. [pdf][codes].
Hanbin Wang, Zhenghao Liu#, Shuo Wang, Ganqu Cui, Ning Ding, Zhiyuan Liu, Ge Yu. INTERVENOR: Prompting the Coding Ability of Large Language Models with the Interactive Chain of Repair. ACL 2024: Findings. [pdf][codes].
Haoyu Wang, Shuo Wang, Yukun Yan, Xujia Wang, Zhiyu Yang, Yuzhuang Xu, Zhenghao Liu, Liner Yang, Ning Ding, Xu Han, Zhiyuan Liu, Maosong Sun. UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset. ACL 2024. [pdf][codes].
Zhiyu Yang, Zihan Zhou, Shuo Wang, Xin Cong, Xu Han, Yukun Yan, Zhenghao Liu, Zhixing Tan, Pengyuan Liu, Dong Yu, Zhiyuan Liu, Xiaodong Shi, Maosong Sun. MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization. ACL 2024: Findings. [pdf][codes].
Zhenghao Liu, Zulong Chen*, Moufeng Zhang*, Shaoyang Duan, Hong Wen, Liangyue Li, Nan Li, Yu Gu, Ge Yu. Modeling User Viewing Flow Using Large Language Models for Article Recommendation. WebConf 2024. [pdf].
Na Guo, Yaqi Wang, Wenli Sun, Yu Gu, Jianzhong Qi, Zhenghao Liu, Xiufeng Xia, Ge Yu. Chameleon: Towards Update-Efficient Learned Indexing for Locally Skewed Data. ICDE 2024. [pdf].
Cheng Gao, Chaojun Xiao, Zhenghao Liu, Huimin Chen, Zhiyuan Liu, Maosong Sun. Enhancing Legal Case Retrieval via Scaling High-quality Synthetic Query-Candidate Pairs. EMNLP 2024. [pdf][codes].
Shi Yu, Chenghao Fan, Chenyan Xiong, David Jin, Zhiyuan Liu, Zhenghao Liu#. Fusion-in-T5: Unifying Variant Signals for Simple and Effective Document Ranking with Attention Fusion. COLING 2024. [pdf][codes].
Ruining Chong, Luming Lu, Liner Yang, Jinran Nie, Zhenghao Liu, Shuo Wang, Shuhan Zhou, Yaoxin Li, Erhong Yang. MCTS: A Multi-Reference Chinese Text Simplification Dataset. COLING 2024. [pdf][codes].
Cheng Qian, Chenyan Xiong, Zhenghao Liu, Zhiyuan Liu. Toolink: Linking Toolkit Creation and Using through Chain-of-Solving on Open-Source Model. NAACL 2024. [pdf][codes].
Xinze Li, Hanbin Wang, Zhenghao Liu#, Shi Yu, Shuo Wang, Yukun Yan, Yukai Fu, Yu Gu, Ge Yu. Building A Coding Assistant via the Retrieval-Augmented Language Model. ACM Transactions on Information Systems (TOIS). [pdf][code].
Yumeng Song, Yu Gu, Tianyi Li, Jianzhong Qi, Zhenghao Liu, Christian S Jensen, Ge Yu. CHGNN: A Semi-Supervised Contrastive Hypergraph Learning Network. IEEE Transactions on Knowledge and Data Engineering (TKDE). [pdf][code].
Yuqing Lan, Zhenghao Liu#, Yu Gu#, Xiaoyuan Yi, Xiaohua Li, Liner Yang, Ge Yu. Multi-Evidence based Fact Verification via A Confidential Graph Neural Network. IEEE Transactions on Big Data (TBD). [pdf][code].
Zhenghao Liu, Chenyan Xiong, Yuanhuiyi Lv, Zhiyuan Liu, Ge Yu. Universal Multi-Modal Retrieval: Learning A Unified Representation Space for Vision Language Retrieval. ICLR 2023. [pdf][codes].
Zhenghao Liu*#, Sen Mei*, Chenyan Xiong, Xiaohua Li, Shi Yu, Zhiyuan Liu, Yu Gu, Ge Yu. Text Matching Improves Sequential Recommendation by Reducing Popularity Biases. CIKM 2023. [pdf][codes].
Shi Yu, Zhenghao Liu, Chenyan Xiong, Zhiyuan Liu. OpenMatch-v2: An All-in-one Multi-Modality PLM-based Information Retrieval Toolkit. SIGIR 2023. [pdf][codes].
Xinze Li, Zhenghao Liu#, Chenyan Xiong, Shi Yu, Yu Gu, Zhiyuan Liu, Ge Yu. Structure-Aware Language Model Pretraining Improves Dense Retrieval on Structured Data. ACL 2023: Findings. [pdf][codes].
Ruining Chong, Cunliang Kong, Liu Wu, Zhenghao Liu, Ziye Jin, Liner Yang, Yange Fan, Hanghang Fan, Erhong Yang. Leveraging Prefix Transfer for Multi-Intent Text Revision. ACL 2023. [pdf].
Zhenghao Liu, Han Zhang, Chenyan Xiong, Zhiyuan Liu, Yu Gu, Xiaohua Li. Dimension Reduction for Efficient Dense Retrieval via Conditional Autoencoder. EMNLP 2022. [pdf][codes].
Xiaomeng Hu, Shi Yu, Chenyan Xiong, Zhenghao Liu#, Zhiyuan Liu, Ge Yu. P3 Ranker: Mitigating the Gaps between Pre-training and Ranking Fine-tuning with Prompt-based Learning and Pre-finetuning. SIGIR 2022. [pdf][codes].
Zhenghao Liu, Xiaoyuan Yi, Maosong Sun, Liner Yang, Tat-Seng Chua. Neural Quality Estimation with Multiple Hypotheses for Grammatical Error Correction. NAACL 2021. [pdf][codes].
Zhenghao Liu*, Kaitao Zhang*, Chenyan Xiong, Zhiyuan Liu, Maosong Sun. OpenMatch: An Open Source Library for Neu-IR Research. SIGIR 2021. [pdf][codes].
Shi Yu*, Zhenghao Liu*, Chenyan Xiong, Tao Feng, Zhiyuan Liu. Few-Shot Conversational Dense Retrieval. SIGIR 2021. [pdf][codes].
Yizhi Li*, Zhenghao Liu*, Chenyan Xiong, Zhiyuan Liu. More Robust Dense Retrieval with Contrastive Dual Learning. The 2021 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2021). [pdf][codes].
Si Sun*, Zhenghao Liu*, Chenyan Xiong, Zhiyuan Liu and Jie Bao. Capturing Global Informativeness in Open Domain Keyphrase Extraction. The CCF Conference on Natural Language Processing and Chinese Computing (NLPCC 2021). [pdf][codes].
Si Sun, Yingzhuo Qian, Zhenghao Liu, Chenyan Xiong, Kaitao Zhang, Jie Bao, Zhiyuan Liu, Paul Bennett. Few-Shot Text Ranking with Meta Adapted Synthetic Weak Supervision. ACL 2021. [pdf][codes].
Huiyuan Xie, Zhenghao Liu, Chenyan Xiong, Zhiyuan Liu and Ann Copestake. TIAGE: A Benchmark for Topic-Shift Aware Dialog Modeling. EMNLP 2021: Findings. [pdf][codes].
Zhenghao Liu, Chenyan Xiong, Maosong Sun, Zhiyuan Liu. Fine-grained Fact Verification with Kernel Graph Attention Network. ACL 2020. [pdf][codes].
Zhenghao Liu, Chenyan Xiong, Zhuyun Dai, Si Sun, Maosong Sun, Zhiyuan Liu. Adapting Open Domain Fact Extraction and Verification to COVID-FACT through In-Domain Language Modeling. EMNLP 2020: Findings. [pdf][codes].
Houyu Zhang*, Zhenghao Liu*, Chenyan Xiong, Zhiyuan Liu. Grounded Conversation Generation as Guided Traverses in Commonsense Knowledge Graphs. ACL 2020. [pdf][codes].
Chenyan Xiong*, Zhenghao Liu*, Si Sun*, Zhuyun Dai*, Kaitao Zhang*, Shi Yu*, Zhiyuan Liu, Hoifung Poon, Jianfeng Gao, Paul Bennett. CMT in TREC-COVID Round 2: Mitigating the Generalization Gaps from Web to Special Domain Search. [pdf][codes].
Xiaoyuan Yi, Zhenghao Liu, Wenhao Li, Maosong Sun. Text Style Transfer via Learning Style Instance Supported Latent Space. IJCAI 2020. [pdf].
Kaitao Zhang, Chenyan Xiong, Zhenghao Liu, Zhiyuan Liu. Selective Weak Supervision for Neural Information Retrieval. WebConf 2020. [pdf][codes].
Deming Ye, Yankai Lin, Jiaju Du, Zhenghao Liu, Peng Li, Maosong Sun, Zhiyuan Liu. Coreferential Reasoning Learning for Language Representation. EMNLP 2020. [pdf][codes].
Zhenghao Liu, Chenyan Xiong, Maosong Sun, Zhiyuan Liu. Explore Entity Embedding Effectiveness in Entity Retrieval. Proceedings of Chinese National Conference on Computational Linguistics (CCL 2019). [pdf][codes].
Yifan Qiao, Chenyan Xiong, Zhenghao Liu, Zhiyuan Liu. Understanding the Behaviors of BERT in Ranking. [pdf].
Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, Maosong Sun. DocRED: A Large-Scale Document-Level Relation Extraction Dataset. ACL 2019. [pdf][codes].
Zhenghao Liu, Chenyan Xiong, Maosong Sun, Zhiyuan Liu. Entity-Duet Neural Ranking: Understanding the Role of Knowledge Graph Semantics in Neural Information Retrieval. ACL 2018. [pdf][codes].
Liner Yang, Maosong Sun, Jiacheng Zhang, Zhenghao Liu, Huanbo Luan, Yang Liu. Neural Parse Combination. Journal of Computer Science and Technology (JCST). [pdf].
-
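Research Directions
Research focuses mainly on natural language processing and information retrieval, including but not limited to the following directions: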
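1) A technology stack for autonomous knowledge acquisition by large models. Centered on large models, this work studies relevance data synthesis, integrates rich external resources such as web pages, knowledge bases, and multimodal information, and explores relevance representation learning for multi-source information fusion. It yields a family of methods covering vector retrieval and reranking over multimodal corpora, together with large-model-driven use of retrieved knowledge and retriever training, which supports efficient retrieval and knowledge acquisition over large-scale corpora. Building on these results, the team ranked first in Round 2 of TREC-COVID, the COVID-19 document-level retrieval evaluation organized by the U.S. National Institute of Standards and Technology, and the work has been applied by Microsoft in its online commercial search system (https://blogs.microsoft.com/ai-for-business/biomedical-search/).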
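2) Large model enhancement guided by multimodal knowledge. This work systematically studies multimodal data augmentation mechanisms. To address conflicts between external knowledge and a model's internal knowledge in retrieval-augmented generation, it proposes a multimodal-knowledge-guided enhancement method that uses a hybrid knowledge base scheduling mechanism to build a “1+1>2” fusion paradigm for multimodal retrieval augmentation. It also leverages the OCR capability of multimodal large models to propose the first retrieval-augmented generation scheme based purely on visual understanding, markedly improving intelligent parsing in complex document scenarios. The results have been cited and applied by Google, Samsung, Adobe, and other organizations. Yann LeCun (Turing Award laureate and member of the U.S. National Academy of Sciences) and Philip S. Yu (ACM/IEEE Fellow) have evaluated and compared against this work as representative research at top international conferences. Through university-industry collaboration, the research has improved recommendation quality in Alibaba's intelligent document recommendation scenario (online A/B tests show a 2.4% improvement) and has been deployed on the online platform of Alibaba's ATA intelligent article recommendation system. In addition, the multimodal-knowledge-guided enhancement technique supports training of the Qwen-3.5 multimodal base model through data synthesis, improving its document understanding and intelligent processing performance by 3%.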
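3) A collaborative optimization framework for retrieval-augmented agents. To address data scarcity and model optimization difficulties in vertical domains, this work builds an integrated sandbox for retrieval-augmented generation training and implements a data synthesis and evaluation framework based on large language models. A differentiable data reward mechanism aligns data supply and demand among retrieval-augmented agents, markedly improving the conversion of retrieved knowledge into generation capability. Building on these results, the team co-developed the retrieval-augmented generation components of the on-device large language model MiniCPM; the related models have been downloaded more than 380,000 times on HuggingFace. The results are all released through the open-source tool UltraRAG, the first retrieval-augmented generation framework based on the Model Context Protocol, which has received more than 5.5k stars. The technology has also been transferred to “清小搭”, Tsinghua University's large-model-driven AI growth assistant for students, substantially broadening the scenarios and scope of knowledge-enhanced large model applications.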

