IPLodB Aligning Data Sources
We wish to acknowledge the following data sources and their providers:
- The European Patent Office Linked open EP data (EP LOD) dataset covers patent data, including inventor names, applicant names, timestamps (dates), relevant numbers, titles in English, technical classification (IPC/CPC) and information on examiners’ citations. The dataset uses Uniform Resource Identifiers (URIs) to identify patent applications, publications and other resources present in patent data. Data is updated weekly and is available through an API, a SPARQL endpoint and a bulk download.
- Springer Nature SciGraph (SN LOD) provides corroborating LOD focusing on scientific publications: articles, chapters, books, journals, people, grants, etc. SciGraph collates information from across the research landscape, for example funders, research projects, conferences, affiliations and publications. It provides a bulk download.
- The Global Research Identifier Database (GRID) uses Uniform Resource Identifiers (URIs), hence it is another linked open database. GRID includes information on almost 100,000 organizations, of which about 30% are companies, 20% are higher education institutions (HEI), about 10% are nonprofits and 10% are hospitals. The database includes several variables, such as the address, type and URL of the organization. It provides a bulk download.
- Crossref contains over 120 million records and is one of the major sources of scholarly data for publishers, authors, librarians, funders, and researchers. The metadata set consists of 13 content types, including not only traditional types, such as journals and conference papers, but also data sets, reports, preprints, peer reviews, and grants. The metadata is available through a number of APIs, including a REST API and OAI-PMH. It is not limited to basic publication metadata: it can also include, for example, abstracts and links to full text, funding and license information, citation links, and information about corrections, updates and retractions.
- GeoNames is a freely available geographical database covering all countries and containing over eleven million place names that are available for download. It provides URIs for its data and also has a data dump available.
- ISO-3166-Countries-with-Regional-Codes is a list of country abbreviations, with a data dump available in .csv format in its GitHub repository.
- The World Gender-Name Dictionary (WGND) compiles information on gender attribution for first names from 13 different sources (national public institutions and previous gender studies, with some limited manual checking). Combined, this dataset covers over 173 different countries and includes 6.2 million names for 182 different countries, for use in disambiguating the names of PCT inventors. A data dump is also available.
- Two further sources deserve mention: the Geocoding of worldwide patent data and PatCit. The Geocoding of worldwide patent data is a dataset of priority patent applications filed across the globe, allocated by inventor and applicant location. For more, see De Rassenfosse, G., Kozak, J., & Seliger, F. (2019): Geocoding of worldwide patent data. Scientific Data, 6(1), 1-15. PatCit extracts and structures front-page patent citations. The data has aided our disambiguation efforts.
- For several attributes related to patent application records, we also extracted data from PATSTAT Global, the EPO's core data product containing data on patents.
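
As a rough illustration of what aligning two of the dumps listed above can look like, the sketch below joins GeoNames-style rows to the ISO-3166 country list on the two-letter country code. It is a minimal example under stated assumptions, not the pipeline used in this project: the inline rows are illustrative stand-ins for the real files, the function names `load_iso_regions` and `attach_regions` are hypothetical, and it assumes the GeoNames dump layout (tab-separated, no header, country code in the ninth column) and the `alpha-2`, `region` and `sub-region` column names of the ISO-3166 CSV.

```python
import csv
import io

# Illustrative stand-in for the ISO-3166 CSV (only a few of its columns).
ISO_CSV = """name,alpha-2,alpha-3,region,sub-region
Germany,DE,DEU,Europe,Western Europe
Japan,JP,JPN,Asia,Eastern Asia
"""

# Illustrative stand-in for GeoNames dump rows: tab-separated, no header.
# Assumed column order: geonameid, name, asciiname, alternatenames,
# latitude, longitude, feature class, feature code, country code, ...
GEONAMES_TSV = (
    "1\tBerlin\tBerlin\t\t52.52\t13.41\tP\tPPLC\tDE\n"
    "2\tTokyo\tTokyo\t\t35.69\t139.69\tP\tPPLC\tJP\n"
    "3\tNowhere\tNowhere\t\t0.0\t0.0\tP\tPPL\tXX\n"
)

COUNTRY_CODE_COLUMN = 8  # zero-based index of the country code field


def load_iso_regions(csv_text):
    """Index the ISO-3166 rows by their two-letter (alpha-2) code."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["alpha-2"]: row for row in reader}


def attach_regions(geonames_text, iso_by_alpha2):
    """Add region information to each place; unmatched codes get None."""
    reader = csv.reader(
        io.StringIO(geonames_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    places = []
    for fields in reader:
        if not fields:  # skip blank lines
            continue
        iso = iso_by_alpha2.get(fields[COUNTRY_CODE_COLUMN])
        places.append(
            {
                "geonameid": fields[0],
                "name": fields[1],
                "country_code": fields[COUNTRY_CODE_COLUMN],
                "region": iso["region"] if iso else None,
                "sub_region": iso["sub-region"] if iso else None,
            }
        )
    return places


places = attach_regions(GEONAMES_TSV, load_iso_regions(ISO_CSV))
for place in places:
    print(place["name"], place["country_code"], place["region"])
```

Keeping unmatched rows with `None` rather than dropping them makes it easy to see which country codes fail to align between the two sources.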