https://www.epo.org/en/searching-for-patents/helpful-resources/patent-knowledge-news/technology-intelligence-platform-3

Technology Intelligence Platform

Harmonisation of applicant names

In the first three articles in our Technology Intelligence Platform (TIP) series, we explored technological fields and their evolution, then shifted our focus to time series forecasting for patent filings. By leveraging PATSTAT data and combining it with the TIP’s data processing and visualisation capabilities, the accompanying notebooks provided powerful insights into the evolution of technical fields and demonstrated how TIP can be used to anticipate trends in innovation.

In this article, we explore a critical aspect of patent documentation: handling inconsistencies in applicant names. Patent databases store applications with the corresponding applicants and their addresses exactly as they appear in the original filing. Consequently, variations in the representation of applicant names and addresses are common. These discrepancies arise from issues such as differences in word order, capitalisation preferences, inclusion or omission of accents, variations in legal entity designations and typographical errors.
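Several of these sources of variation can be neutralised with simple string normalisation. The following is a minimal sketch using only the Python standard library, not the code from the notebook. The function name `normalise_name`, the short `LEGAL_FORMS` list and the sample names are assumptions made for illustration. The sketch handles accents, capitalisation, punctuation, legal entity designations and word order. It does not handle typographical errors, which call for approximate matching.

```python
import re
import unicodedata

# Hypothetical, deliberately short list of legal entity designations.
# A real list would be far longer and vary by country.
LEGAL_FORMS = {
    "inc", "incorporated", "corp", "corporation", "ltd", "limited",
    "gmbh", "ag", "sa", "co", "company", "llc", "plc",
}


def normalise_name(name: str) -> str:
    """Return a comparison key for an applicant name (illustrative only)."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Ignore case, and join dotted abbreviations such as "S.A." into "sa".
    lowered = no_accents.casefold().replace(".", "")
    # Turn any remaining punctuation into spaces.
    cleaned = re.sub(r"[^a-z0-9]+", " ", lowered)
    # Drop legal entity designations.
    tokens = [tok for tok in cleaned.split() if tok not in LEGAL_FORMS]
    # Sort the tokens so that word order no longer matters.
    return " ".join(sorted(tokens))


# Made-up applicant names that differ in accents, case, word order and legal form.
variants = ["Exâmple Widgets Ltd.", "WIDGETS EXAMPLE LIMITED", "example widgets"]
keys = {normalise_name(v) for v in variants}
print(keys)
```

All three made-up variants map to the same key, so an exact comparison on the key treats them as one applicant. Sorting the tokens is a blunt choice: it also merges names that really do differ only in word order, so the resulting groups are candidates for review rather than confirmed matches.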

Such inconsistencies pose challenges when retrieving all applications filed by a specific company. Since computers treat these variations as distinct strings, a query for one spelling will not return applications filed under another. The Organisation for Economic Co-operation and Development (OECD) recognises applicant name harmonisation as vital for studying innovation because it enables researchers, analysts and policymakers to track patenting activities more accurately.

Existing harmonisation efforts

PATSTAT already includes mechanisms to harmonise applicant names. The PATSTAT Standardised Name (PSN), developed by the University of Leuven, and the Harmonised Applicant Name (HAN), provided by the OECD, apply standardisation processes to applicant names. 

Practical approach to cleaning applicant names

We have developed a notebook that builds an applicant name harmonisation algorithm from the ground up. Starting from a typical data retrieval query in PATSTAT Global, we apply a set of techniques to group applicant names into clusters of potential duplicates. By exploring the notebook, you can learn how to use standard lists of abbreviation variants for legal entities, Python libraries for deduplicating records, and other tools to create your own harmonisation algorithm.
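To give a flavour of what clustering names into potential duplicates can look like, here is a minimal sketch. It is not the notebook's algorithm, and it uses the standard library's `difflib` as a stand-in for a dedicated deduplication library. The function name `cluster_names`, the threshold of 0.85 and the sample names are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def cluster_names(names, threshold=0.85):
    """Greedily group names whose similarity to a cluster's first member
    is at least `threshold` (illustrative only)."""
    clusters = []  # each cluster is a list; its first element is the representative
    for name in names:
        for cluster in clusters:
            similarity = SequenceMatcher(
                None, name.casefold(), cluster[0].casefold()
            ).ratio()
            if similarity >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster was similar enough: start a new one.
            clusters.append([name])
    return clusters


# Made-up names: the second one contains a typographical error.
names = ["Example Widgets", "Exampel Widgets", "Sample Gadgets"]
clusters = cluster_names(names)
for cluster in clusters:
    print(cluster)
```

This greedy approach compares each name only with the first member of each cluster, so the result depends on input order and on the chosen threshold. It also compares every name with every cluster, which becomes slow on large datasets. Dedicated deduplication libraries typically address this by comparing only names that share a blocking key.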

Figure 1: Visualisation of the total number of names before cleaning and after cleaning with each of the three techniques: the one shown in the notebook, the PSN standardisation and the HAN harmonisation.

This notebook does not aim to provide a definitive or fully optimised solution. Instead, it serves as a practical guide, illustrating techniques and evaluating their impact by measuring dataset size reduction after clustering. The results are also compared with existing PSN and HAN methods to assess their relative effectiveness. 
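The size reduction measure can be sketched as a count of distinct names before and after harmonisation. The example below is an illustration under stated assumptions, not the notebook's code. The column names `raw`, `method_a` and `method_b`, the function names and all values are hypothetical, and the figures it prints say nothing about how PSN or HAN actually perform.

```python
def distinct_count(rows, column):
    """Number of distinct values found in `column` across `rows`."""
    return len({row[column] for row in rows})


def reduction_report(rows, raw_column, harmonised_columns):
    """Compare the distinct raw names with each harmonised column.

    Returns the raw count and, per harmonised column, a tuple of
    (distinct count, percentage reduction relative to the raw count).
    """
    before = distinct_count(rows, raw_column)
    if before == 0:
        raise ValueError("no names to compare")
    report = {}
    for column in harmonised_columns:
        after = distinct_count(rows, column)
        report[column] = (after, round(100 * (before - after) / before, 1))
    return before, report


# Hypothetical rows: one raw name and the output of two made-up methods.
rows = [
    {"raw": "Example Widgets Ltd.", "method_a": "EXAMPLE WIDGETS", "method_b": "EXAMPLE WIDGETS"},
    {"raw": "EXAMPLE WIDGETS LIMITED", "method_a": "EXAMPLE WIDGETS", "method_b": "EXAMPLE WIDGETS"},
    {"raw": "Exampel Widgets Ltd", "method_a": "EXAMPEL WIDGETS", "method_b": "EXAMPLE WIDGETS"},
    {"raw": "Sample Gadgets GmbH", "method_a": "SAMPLE GADGETS", "method_b": "SAMPLE GADGETS"},
]

before, report = reduction_report(rows, "raw", ["method_a", "method_b"])
print(before, report)
```

A larger reduction is not automatically better. A method that merges names too aggressively also shrinks the dataset, so the measure is best read alongside a manual check of a sample of clusters.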

We encourage you to clone the notebook and customise it to suit your needs. 


Keywords: data processing, visualisation, patent data analysis, harmonisation, PATSTAT