These datasets are provided to the public subject to the Creative Commons Attribution-NonCommercial-NoDerivatives license. No co-authorship is required to use the data in academic research; please just cite the supporting article.

Patent Citation Similarity Dataset

From: Patent Citations Reexamined, by Jeffrey Kuhn, Kenneth Younge, Alan Marco

Many studies of innovation rely on patent citations to measure intellectual lineage and impact. To create this dataset, we use a vector space model (VSM) of patent similarity to compute the technological similarity between each pair of citing and cited patents. The model analyzes the full text of each document to position it as a vector in a space of more than 700,000 dimensions, and then calculates the angular distance between the two vectors. The dataset includes similarity values for all citations made by patents issued between 1976 and 2017 to issued patents or published patent applications.
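
The angular comparison described above can be illustrated with a minimal Python sketch. It assumes plain whitespace tokenization and raw term counts; the term weighting and vocabulary actually used to build the dataset are not specified here, and the function names are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a document as a sparse vector of raw term counts.

    Assumption: lowercase whitespace tokens, no term weighting.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document shares nothing with any other
    return dot / (norm_a * norm_b)


def angular_distance(a, b):
    """Angle in radians between two vectors (0 means same direction)."""
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.acos(cos)


citing = term_vector("a rotor blade assembly for a wind turbine")
cited = term_vector("a wind turbine with a pitch controlled rotor blade")
print(cosine_similarity(citing, cited), angular_distance(citing, cited))
```

With non-negative term counts, the cosine lies between 0 (no shared terms) and 1 (identical direction), so the angle lies between 0 and π/2.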

Download (819 MB)

Patent Scope and Examiner Toughness Dataset

From: How to Measure and Draw Causal Inferences with Patent Scope, by Jeffrey Kuhn, Neil Thompson

This dataset includes an easy-to-use measure of patent scope that is grounded both in patent law and in the practices of patent attorneys. Our measure counts the number of words in a patent’s first claim. The longer the first claim, the less scope a patent has. This is because a longer claim has more details, and all those details must be met for another invention to be infringing. Hence, the more details there are in the claim, the greater the opportunities for others to invent around it. We validate our measure by showing both that patent attorneys’ subjective assessments of scope agree with our estimates, and that the behavior of patenters is consistent with it. To facilitate drawing causal inferences with our measure, we show how it can be used to create an instrumental variable, patent examiner Scope Toughness, which we also validate.
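
As an illustration of the word-count measure, here is a minimal Python sketch. It assumes the first claim is available as a single string, that words are whitespace-separated tokens, and that a leading claim number such as "1." is not counted; the exact tokenization behind the dataset may differ, and the function name is hypothetical.

```python
import re


def first_claim_word_count(claim_text):
    """Count the words in a patent's first claim.

    Assumptions: a leading claim number ("1." or "1)") is dropped,
    and a word is any run of non-whitespace characters.
    """
    body = re.sub(r"^\s*1\s*[.)]\s*", "", claim_text)
    return len(re.findall(r"\S+", body))


broad = "1. A widget comprising a handle."
narrow = "1. A widget comprising a handle and a blade attached to the handle."

# A longer first claim adds limitations, so it implies narrower scope.
print(first_claim_word_count(broad), first_claim_word_count(narrow))
```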

Download (33 MB)

Patent Citation Timing and Source Dataset

From: Patent Citations Reexamined, by Jeffrey Kuhn, Kenneth Younge, Alan Marco

Innovation studies frequently distinguish between patent citations submitted by the patent examiner and those submitted by the patent applicant. However, publicly available citation data is often misleading, for instance by attributing a patent citation to the patent examiner when it was in fact first submitted by the patent applicant. This dataset uses internal USPTO data to identify the date on which each citation was first submitted as well as the party (examiner or applicant) who first submitted it. The dataset includes observations for citations made by patents issued 2001-2014, although some left truncation is evident due to limitations in internal data availability at the USPTO.

Download (292 MB)

Patent Families Dataset

From: Patent-to-Patent Similarity: A Vector Space Model, by Kenneth Younge, Jeffrey Kuhn

Patent applicants frequently file groups of patent applications linked together by priority claims. These priority claims create families of patent applications that share features such as inventors, priority dates, and technical descriptions. By analyzing these linkages, each patent can be assigned a family identifier that it shares with other patents in the same family. This dataset includes two levels of family identifiers (clone for near copies, and extended for more attenuated linkages) for each patent issued 2005-2014.
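
One way to picture how linkages turn into family identifiers is to group patents into connected components of the priority-claim graph. The Python sketch below is only an illustration under that assumption; it does not reproduce the rules that separate clone families from extended families in the dataset, and the function name and identifiers are hypothetical.

```python
def assign_family_ids(patents, priority_links):
    """Give every patent the ID of the family it is linked into.

    Assumption: a family is a connected component of the graph whose
    edges are priority claims. The family ID is the smallest patent
    identifier in the component.
    """
    parent = {p: p for p in patents}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in priority_links:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller identifier as the root of the merged family.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return {p: find(p) for p in parent}


families = assign_family_ids(
    patents=["P1", "P2", "P3", "P4"],
    priority_links=[("P2", "P1"), ("P3", "P2")],  # P4 has no links
)
print(families)
```

Patents that claim priority to one another, directly or through a chain, end up with the same identifier, while an unlinked patent forms a family of its own.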

Download (18 MB)