Pinzhen Chen
I am a postdoctoral researcher in the School of Informatics, University of Edinburgh, where I am also a member of the machine translation group, EdinburghNLP, and the Institute for Language, Cognition and Computation. I also go by Patrick or 陈品桢.
I work on the High Performance Language Technologies project, specifically on the translation system pipeline and multilingual large language models (LLMs). I occasionally contribute to MaLA and UTTER on LLM multilinguality and evaluation. I am also a senior NLP engineer at Aveni, making LLMs for financial services.
Last updated on 11 Nov 2024.
[Semantic Scholar | Google Scholar | GitHub | Hugging Face | LinkedIn]
Experience
- 2024-present, University of Edinburgh, Research Associate
- 2024-present, Aveni.ai, Senior NLP Engineer
- 2020-2024, University of Edinburgh, PhD supervised by Kenneth Heafield and Barry Haddow
- 2023, Microsoft Research Asia, Research Visit
- 2022, Huawei Noah's Ark Lab, Research Scientist Intern
- 2019, University of Edinburgh, Research Assistant
- 2015-2019, University of Edinburgh, BEng Artificial Intelligence and Software Engineering. Awarded first
class honours and a Class Medal for attaining the top performance in the degree
- 2018, Goldman Sachs, Technology Analyst Intern
Services
- Action Editor/Area Chair
  - Association for Computational Linguistics Rolling Review (ARR): 2024
- Program Committee/Reviewer
  - International Conference on Learning Representations (ICLR): 2025
  - Conference on Neural Information Processing Systems (NeurIPS): 2024
  - ACM Computing Surveys: 2024
  - Information Processing and Management: 2024
  - Conference on Language Modeling (COLM): 2024
  - European Conference on Artificial Intelligence (ECAI): 2024
  - Association for Computational Linguistics Rolling Review (ARR): 2021, 2023, 2024
  - Joint Conference on Lexical and Computational Semantics (*SEM): 2022, 2023, 2024
  - Financial Support for Third Parties from the Horizon Europe project Unified Transcription and Translation for Extended Reality (UTTER FSTP): 2023
  - Workshop on Instruction Tuning and Instruction Following: 2023
  - Conference on Empirical Methods in Natural Language Processing (EMNLP): 2023
  - Conference on Machine Translation (WMT): 2021, 2022
  - International Workshop on Semantic Evaluation (SemEval): 2022
- Teaching Assistant at the University of Edinburgh
  - Machine Learning Practical: mentor and marker, 2020-21, 2021-22, 2022-23
  - Introductory Applied Machine Learning: marker, 2020-21, 2021-22
  - Informatics Research Proposal: tutor, 2020-21
  - System Design Project: mentor, 2018-19
- Supervision
  - Dayyán O'Brien. 2024. Research Intern. Multilingual data processing, as part of the EMMA-500 model's effort.
  - Zhanghao Hu, Yijun Yang, and Junjie Xu. 2023. Machine Learning Practical project on efficient question answering, shortlisted for a best project prize donated by IBM UK and published at LREC-COLING 2024.
Research
Multilingualism
-
Shaoxiong Ji, Zihao Li, Indraneil Paul, Jaakko Paavola, Peiqin Lin, Pinzhen Chen, Dayyán O'Brien, Hengyu Luo, Hinrich Schütze, Jörg Tiedemann, and Barry Haddow.
EMMA-500: Enhancing massively multilingual adaptation of large language models. 2024.
arXiv preprint.
-
Pinzhen Chen, Simon Yu, Zhicheng Guo, and Barry Haddow.
Is it good data for multilingual instruction tuning or just bad multilingual evaluation for large language models? 2024.
In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.
-
Nikolay Arefyev, Mikko Aulamo, Pinzhen Chen, Ona de Gibert, Barry Haddow, Jindřich Helcl, Bhavitvya Malik, Gema Ramírez-Sánchez, Pavel Stepachev, Jörg Tiedemann, Dušan Variš, and Jaume Zaragoza.
HPLT's first release of data and models. 2024.
In Proceedings of the 25th Annual Conference of the European Association for Machine Translation.
-
Nikolay Bogoychev, Pinzhen Chen, Barry Haddow, and Alexandra Birch.
The ups and downs of large language model inference with vocabulary trimming by language heuristics. 2024.
In Proceedings of the Fifth Workshop on Insights from Negative Results in NLP.
-
Shaoxiong Ji and Pinzhen Chen.
How many languages make good multilingual instruction tuning? A case study on BLOOM. 2024.
arXiv preprint.
-
Pinzhen Chen*, Shaoxiong Ji*, Nikolay Bogoychev, Andrey Kutuzov, Barry Haddow, and Kenneth Heafield.
Monolingual or multilingual instruction tuning: Which makes a better Alpaca. 2024.
In Findings of the Association for Computational Linguistics: EACL 2024.
-
Ashok Urlana*, Pinzhen Chen*, Zheng Zhao, Shay B. Cohen, Manish Shrivastava, and Barry Haddow.
PMIndiaSum: Multilingual and cross-lingual headline summarization for languages in India. 2023.
In Findings of the Association for Computational Linguistics: EMNLP 2023.
Translation
-
Dawei Zhu, Pinzhen Chen, Miaoran Zhang, Barry Haddow, Xiaoyu Shen, and Dietrich Klakow.
Fine-tuning large language models to translate: Will a touch of noisy data in misaligned languages suffice? 2024.
In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.
-
Vilém Zouhar*, Pinzhen Chen*, Tsz Kin Lam, Nikita Moghe, and Barry Haddow.
Pitfalls and outlooks in using COMET. 2024.
In Proceedings of the Ninth Conference on Machine Translation.
-
Vivek Iyer, Bhavitvya Malik*, Pavel Stepachev*, Pinzhen Chen, Barry Haddow, and Alexandra Birch.
Quality or quantity? On data scale and diversity in adapting large language models for low-resource translation. 2024.
In Proceedings of the Ninth Conference on Machine Translation.
-
Pinzhen Chen, Zhicheng Guo, Barry Haddow, and Kenneth Heafield.
Iterative translation refinement with large language models. 2024.
In Proceedings of the 25th Annual Conference of the European Association for Machine Translation.
-
Nikolay Bogoychev* and Pinzhen Chen*.
Terminology-aware translation with constrained decoding and large language model prompting. 2023.
In Proceedings of the Eighth Conference on Machine Translation.
-
Vivek Iyer, Pinzhen Chen, and Alexandra Birch.
Towards effective disambiguation for machine translation with large language models. 2023.
In Proceedings of the Eighth Conference on Machine Translation.
-
Nikolay Bogoychev* and Pinzhen Chen*.
The highs and lows of simple lexical domain adaptation approaches for neural machine translation. 2021.
In Proceedings of the Second Workshop on Insights from Negative Results in NLP.
-
Pinzhen Chen*, Nikolay Bogoychev*, Kenneth Heafield, and Faheem Kirefu.
Parallel sentence mining by constrained decoding. 2020.
In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
-
Marta Bañón, Pinzhen Chen, Barry Haddow, Kenneth Heafield, Hieu Hoang, Miquel Esplà-Gomis, Mikel L. Forcada, Amir Kamran, Faheem Kirefu, Philipp Koehn, Sergio Ortiz Rojas, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Elsa Sarrías, Marek Strelec, Brian Thompson, William Waites, Dion Wiggins, and Jaume Zaragoza.
ParaCrawl: Web-scale acquisition of parallel corpora. 2020.
In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
Other topics
-
Hanxu Hu*, Simon Yu*, Pinzhen Chen*, and Edoardo M. Ponti.
Fine-tuning large language models with sequential instructions. 2024.
arXiv preprint.
-
Yuanchao Li, Pinzhen Chen, Peter Bell, and Catherine Lai.
Crossmodal ASR error correction with discrete speech units. 2024.
Accepted to 2024 IEEE Spoken Language Technology Workshop.
-
Zhanghao Hu*, Yijun Yang*, Junjie Xu*, Yifu Qiu, and Pinzhen Chen.
EEE-QA: Exploring effective and efficient question-answer representations. 2024.
In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation.
-
Zeyu Zhao, Pinzhen Chen, and Peter Bell.
Regarding topology and adaptability in differentiable WFST-based E2E ASR. 2024.
In 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops.
-
Pinzhen Chen and Gerasimos Lampouras.
Exploring data augmentation for code generation tasks. 2023.
In Findings of the Association for Computational Linguistics: EACL 2023.
-
Pinzhen Chen and Zheng Zhao.
A unified model for reverse dictionary and definition modelling. 2022.
In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing.
-
Pinzhen Chen and Kenneth Heafield.
Approaching neural Chinese word segmentation as a low-resource machine translation task. 2022.
In Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation.
Personal
I enjoy travelling, cooking, and photography. I sometimes play badminton and basketball, as well as board and card games. Thanks for reading this far. Here is the reward for reinforcement: photos of my cat Luckie.