My work involving big data and new measures focuses on developing an integrated, consistent, and validated tool set to support theory-driven research.

Patent Citations

(Kuhn, Younge, and Marco 2016, working paper)

This project applies novel patent-to-patent similarity data (described below) to examine the information content of patent citations. Existing measures of innovation often rely on patent citations to indicate intellectual lineage and impact. We show that the data generating process for patent citations has changed substantially since citation-based measures were validated a decade ago.

Today, far more citations are created per patent, and the mean technological similarity between citing and cited patents has fallen significantly. These changes suggest that the use of patent citations for scholarship needs to be re-validated. We propose a basic correction and show that methods for sub-setting and/or weighting informative citations can substantially improve the predictive power of patent citation measures.
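
To make the idea of sub-setting and weighting concrete, the following is a minimal sketch and not the correction proposed in the paper. It assumes a hypothetical dictionary `similarity` keyed by (citing, cited) patent pairs and an arbitrary illustrative `threshold` of 0.2. For each cited patent it computes a raw citation count, a count restricted to citations above the threshold, and a similarity-weighted count.

```python
from collections import defaultdict


def citation_counts(citations, similarity, threshold=0.2):
    """Compare three forward-citation measures for each cited patent.

    citations  -- iterable of (citing_id, cited_id) pairs
    similarity -- dict mapping (citing_id, cited_id) to a score in [0, 1]
                  (hypothetical input; missing pairs are treated as 0.0)
    threshold  -- illustrative cutoff for an "informative" citation
    """
    raw = defaultdict(int)         # every citation counts as 1
    subset = defaultdict(int)      # only citations at or above the threshold
    weighted = defaultdict(float)  # each citation counts by its similarity

    for citing, cited in citations:
        score = similarity.get((citing, cited), 0.0)
        raw[cited] += 1
        if score >= threshold:
            subset[cited] += 1
        weighted[cited] += score

    return dict(raw), dict(subset), dict(weighted)
```

Under these assumptions, a patent cited by many dissimilar patents keeps a high raw count but receives a low subset or weighted count.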

This project is joint work with Kenneth Younge, who holds the Chair of Technology and Innovation Strategy at EPFL, and Alan Marco, the Chief Economist of the United States Patent and Trademark Office. I presented this paper at the Searle Center Conference on Innovation Economics and the Academy of Management Annual Meeting. It was also presented at the Munich Summer Institute. We plan to make the data available for public use.

Patent-to-Patent Similarity

(Younge and Kuhn 2015, working paper, data access)

Concepts of technological space, distance, and relatedness are central to the study of invention and innovation. Empirical studies of technological space generally rely on manual classification of patents by the patent office or the linking of patents through prior art citations. In this project, we demonstrate that these approaches are simply too coarse or too biased for many of the comparisons or groupings required for academic research. We introduce a new, continuous measure of technological similarity based on a vector space model.

We apply the model to calculate the pairwise similarity for more than 14 trillion pairs of patents. We validate the measure and demonstrate that it can provide greater accuracy, specificity, and generality than existing approaches for many common research questions. Moreover, we illustrate how a pairwise similarity comparison of every pair of patents in the USPTO patent space can open new avenues of research in economics, management, and public policy.
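
As a rough illustration of what a vector space comparison involves, the sketch below represents each patent text as a sparse TF-IDF vector and compares two vectors by cosine similarity. This is a generic textbook formulation under stated assumptions, not necessarily the weighting or preprocessing used in the working paper. The whitespace tokenizer, the smoothed IDF formula, and the toy `abstracts` are all illustrative choices.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed inverse document frequency (one common variant).
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy texts standing in for patent documents (made up for illustration).
abstracts = [
    "wireless antenna signal receiver",
    "antenna signal amplifier circuit",
    "chemical compound polymer synthesis",
]
vecs = tfidf_vectors(abstracts)
```

Because the score is continuous, any two documents can be compared directly, for example `cosine_similarity(vecs[0], vecs[1])`, without reference to classification codes or citations.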

This project is also joint work with Kenneth Younge. I have presented this paper at the USPTO Visiting Speaker Series, the Academy of Management Annual Meeting, and SKEMA Business School (Sophia Antipolis). It was also presented at the DRUID Academy Conference 2016.

Other Patent Data

In addition to the projects described above, I have developed several patent data sets that I apply across different projects.

  • Patent rejections: Most patent applications are initially rejected at the patent office. The patent applicant typically responds by narrowing the claims to a scope acceptable to the patent examiner. The patent examiner rejects a claim based on one or more prior art references. Patent citations represent the set of references that the examiner considered, but existing data sets do not identify the specific citations used to reject the patent and shape the claims. I identified these references by using tens of thousands of computers in the cloud to apply optical character recognition (OCR) algorithms to more than 50 million pages of patent office correspondence. I use this rejection data in several papers including (Kuhn and Younge 2016, working paper) and (Thompson and Kuhn 2016, working paper).
  • Patent assignments: When a patent is sold, the buyer records the transfer at the patent office as an assignment document. However, these documents identify the buyer and seller by a self-provided name rather than a unique identifier. I disambiguate these records to uniquely identify the buyers and sellers in every assignment, including transfers made both before and after a patent has issued. I use this patent assignment data to investigate the role of patents in the market for ideas.
  • Bibliographic patent data: I parse and integrate data from a variety of sources to create a unified database of bibliographic patent data. This database includes dates, identifiers, citations, maintenance fee payments, examination records, priority relationships, and many other fields. I apply this data across many of my empirical projects.
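
The patent assignments item above turns on mapping self-provided names to unique parties. The sketch below shows only a simple first step, name normalization, and should not be read as the disambiguation procedure actually used. The `SUFFIXES` list and both function names are hypothetical, and a realistic pipeline would need additional evidence such as addresses or fuzzy matching.

```python
import re
from collections import defaultdict

# Illustrative list of corporate suffixes to ignore (assumption, not exhaustive).
SUFFIXES = {
    "INC", "INCORPORATED", "CORP", "CORPORATION",
    "CO", "COMPANY", "LLC", "LTD", "LIMITED",
}


def normalize_name(name):
    """Uppercase, replace punctuation with spaces, and drop corporate suffixes."""
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", name.upper())
    tokens = [t for t in cleaned.split() if t not in SUFFIXES]
    return " ".join(tokens)


def group_names(names):
    """Group raw name strings that share the same normalized form."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_name(name)].append(name)
    return dict(groups)
```

With this sketch, strings such as "Acme Widgets, Inc." and "ACME WIDGETS INCORPORATED" fall into the same group, while distinct parties with similar names would still require further checks.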