Methods for Rice Genome Landscaping
In this landscape, we have determined the extent to which rice genome nucleotide sequences are recited in the claims of both issued U.S. patents and published U.S. patent applications. Our process entailed a number of informatics steps, which are outlined below.
In summary, we compiled a database of patent nucleotide sequences that are recited in the claims of granted U.S. patents and U.S. patent applications, and compared these sequences to the published rice genome using MEGABLAST. We determined which sequences were highly homologous to sequences in the rice genome and mapped these sequences to the corresponding location on the rice chromosome. We only included sequence matches that yielded a BLAST E value of 1e-200 or less, which is highly statistically significant. The results of our analysis are shown in the subsequent pages of this landscape.
It is important to note that the patent and patent application sequences used in this analysis were selected without reference to which genome they are from. For example, if a maize sequence is nearly identical to a rice sequence, then it will be included in our list of results. We have chosen to include such sequences because the inherent similarity of plant genomes makes it possible for patents that claim one species to dominate another. For example, it is possible that a patent claiming a maize sequence can result in exclusionary treatment of the corresponding rice sequence. Chapter 3 discusses this concept in more detail.
1. Compilation of a searchable rice genome database
We started with the most recent rice genome sequences from the TIGR Rice Genome Annotation web site and then used the formatdb program from NCBI to convert the data to a searchable BLAST database.
2. Compilation of sequence databases for granted patents and patent applications
Applications
For patent applications, we acquired the bulk sequence listings from the Publication Site for Issued and Published Sequences (PSIPS) web site. This web site provides sequence listings for U.S. patents and patent applications whose sequence listings are longer than 300 pages. We also acquired the non-bulk sequence listings (fewer than 300 pages in length), which the USPTO publishes as XML documents. For each of the listing types (bulk and non-bulk), there was a separate file for nucleotides and for amino acids. Data for U.S. applications have been available since 2001, when patent applications began to be published.
The bulk and non-bulk sequence listings were then converted to a common data format (FASTA) and combined to create one database for nucleotide sequences, and one database for amino acid sequences. Additionally, each of these combined databases was converted to a searchable BLAST database for use with CAMBIA's patent sequence search tool.
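The conversion to a common FASTA format can be sketched as follows. This is a minimal illustration, not the tool actually used for the landscape: the function name `to_fasta`, the identifier scheme, and the 60-character line width are all assumptions, and the parsing of the original bulk and XML listings is left out.

```python
from typing import Iterable, Tuple


def to_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Render (identifier, sequence) pairs as FASTA text.

    The identifier scheme (for example document number plus SEQ ID number)
    is a hypothetical convention chosen for this sketch.
    """
    lines = []
    for identifier, sequence in records:
        # Remove whitespace and line breaks left over from the source listing.
        seq = "".join(sequence.split()).upper()
        lines.append(f">{identifier}")
        # Wrap the sequence at a fixed width, as is conventional for FASTA.
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    print(to_fasta([("US20010000001_SEQ1", "acgt acgt\nacgt")], width=8), end="")
```

Once both listing types are reduced to (identifier, sequence) pairs, the nucleotide and amino acid records can be written to their own FASTA files and formatted as BLAST databases.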
Granted (Issued) Patents
For granted U.S. patents, we had a data source that was not available for the applications: GenBank at NCBI has a searchable patent database of sequences disclosed in granted patents. To create our granted patents sequence database, we started by acquiring the U.S. patent sequences from GenBank. This required removing all sequences that originated from non-U.S. patents.
We then acquired the sequence listings from the bulk and non-bulk patents in the manner described above in the Applications section. The data from all three sources (GenBank, bulk, and non-bulk) were converted to a common format. We then carried out a filtering step that removed any sequences duplicated between the data provided by GenBank and the data provided by the USPTO (bulk and non-bulk).
The identical process was carried out for nucleotide sequences and amino acid sequences; however, our analysis is currently confined to nucleotide sequences. As with the patent applications, each of these combined databases was converted to a searchable BLAST database for use with CAMBIA's patent sequence search tool.
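A duplicate filter of this kind might look like the sketch below. The text does not say how duplicates were recognized, so the rule used here (same patent number and same sequence after normalizing case and whitespace) is an assumption, as are the function name and the record layout.

```python
from typing import Iterable, List, Tuple

# (patent_number, source, sequence); "source" is e.g. "genbank", "bulk", "non-bulk".
Record = Tuple[str, str, str]


def remove_duplicates(records: Iterable[Record]) -> List[Record]:
    """Keep the first occurrence of each (patent number, sequence) pair.

    Assumption: two records are duplicates when they come from the same
    patent and their sequences are identical after normalization.
    """
    seen = set()
    unique = []
    for patent_number, source, sequence in records:
        normalized = "".join(sequence.split()).upper()
        key = (patent_number, normalized)
        if key in seen:
            continue  # already supplied by an earlier source
        seen.add(key)
        unique.append((patent_number, source, normalized))
    return unique


if __name__ == "__main__":
    merged = remove_duplicates([
        ("US6000000", "genbank", "ACGTACGT"),
        ("US6000000", "bulk", "acgt acgt"),   # same sequence, other source
        ("US6000001", "non-bulk", "ACGTACGT"),
    ])
    print(len(merged))
```

Keeping the first occurrence means the order in which the three sources are concatenated decides which copy survives.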
3. Identification of sequences that are recited in the claims of granted patents and patent applications
A key feature of our analysis is parsing out the sequences that were recited in the claims of patents and patent applications, rather than just disclosed in the specification. The goal is to ultimately identify sequences that are claimed in patents and applications, but normally a review of the claim language by a human being is required to determine whether sequences that are mentioned in claims are actually claimed. To this end, we created four databases that contain only the sequences that are mentioned in the claims of patents and patent applications. These four databases correspond to nucleotide sequences in applications, amino acid sequences in applications, nucleotide sequences in granted patents, and amino acid sequences in granted patents.
We compiled a list of common phrases that are used to identify sequence listings in claims. This step was tricky, as there are many different phrases that patent applicants use to designate sequence listings in claims. Using these phrases, we created a list of sequence ID numbers that are designated in patent claims. Four new databases were then created that contain only the sequences mentioned in the claims of patents and patent applications.
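As an illustration of the phrase-matching idea, the sketch below uses one regular expression to pull sequence ID numbers out of claim text. The pattern covers only a few spellings ("SEQ ID NO: 1", "SEQ. ID. NOS. 3-5", "Sequence ID No 7 and 9") and is an assumption for this example; it is not the phrase list that was actually compiled, and its output would still need human review.

```python
import re
from typing import Set

# A few common ways of writing "SEQ ID NO", followed by a list of numbers.
SEQ_REF = re.compile(
    r"SEQ(?:UENCE)?\.?\s*ID(?:ENTIFICATION)?\.?\s*(?:NO|NUMBER)S?\.?\s*:?\s*"
    r"(\d+(?:(?:\s*(?:,|-|to|and|or))+\s*\d+)*)",
    re.IGNORECASE,
)
# Inside the number list: either a range ("3-5", "3 to 5") or a single number.
NUMBER_OR_RANGE = re.compile(r"(\d+)\s*(?:-|to)\s*(\d+)|(\d+)", re.IGNORECASE)


def claimed_sequence_ids(claim_text: str) -> Set[int]:
    """Return the set of sequence ID numbers mentioned in the claim text."""
    ids: Set[int] = set()
    for reference in SEQ_REF.finditer(claim_text):
        for item in NUMBER_OR_RANGE.finditer(reference.group(1)):
            if item.group(3) is not None:
                ids.add(int(item.group(3)))
            else:
                low, high = sorted((int(item.group(1)), int(item.group(2))))
                ids.update(range(low, high + 1))
    return ids


if __name__ == "__main__":
    claim = ("An isolated nucleic acid comprising SEQ ID NO: 1, "
             "SEQ. ID. NOS. 3-5, or Sequence ID No 7 and 9.")
    print(sorted(claimed_sequence_ids(claim)))
```

A pattern like this will also pick up numbers that merely follow a sequence reference (for example a fragment length after "and"), which is one reason a review of the claim language is still needed.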
4. MEGABLAST search of rice genome database using the sequences mentioned in claims as input
After compiling a collection of sequences that are mentioned in the claims of patents and applications, we used those sequences to query the rice genome database (see step 1) with MEGABLAST, in order to identify sequences that have significant homology to sequences in the rice genome. The criterion for a match in the database was a BLAST E value less than 1e-200. We performed this analysis only with the nucleotide sequences mentioned in the claims of patents and patent applications.
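The filtering of search results can be sketched as below. The sketch assumes the hits were saved in the standard 12-column tab-delimited BLAST layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E value, bit score); the text does not say which output format was used. The function name is hypothetical, and the cutoff is applied as "1e-200 or less", following the summary. Keeping only the top-scoring hit per query mirrors the rule used for the plots in step 5.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    query: str      # patent sequence identifier
    subject: str    # rice chromosome or contig identifier
    s_start: int    # start of the match on the genome
    s_end: int      # end of the match on the genome
    evalue: float
    bitscore: float


def best_significant_hits(lines: Iterable[str],
                          max_evalue: float = 1e-200) -> Dict[str, Hit]:
    """Return the highest-scoring hit per query among hits passing the cutoff."""
    best: Dict[str, Hit] = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        f = line.rstrip("\n").split("\t")
        hit = Hit(f[0], f[1], int(f[8]), int(f[9]), float(f[10]), float(f[11]))
        if hit.evalue > max_evalue:
            continue
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return best


if __name__ == "__main__":
    rows = [
        "q1\tChr1\t99.5\t800\t4\t0\t1\t800\t1000\t1799\t0.0\t1450",
        "q1\tChr5\t98.0\t700\t14\t0\t1\t700\t500\t1199\t1e-201\t1200",
        "q2\tChr2\t95.0\t120\t6\t0\t1\t120\t300\t419\t1e-50\t180",
    ]
    for query, hit in best_significant_hits(rows).items():
        print(query, hit.subject, hit.s_start, hit.s_end)
```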
5. Plotting the results of the analysis
The sequences with significant homology to sequences in the rice genome were plotted three different ways.
- Sequence Count. For these plots, the Oryza sativa genome was divided into 300 kbp segments. For patent applications that claim Oryza sativa gene sequences, we plotted the number of sequences that match at least a 150 bp fragment of each 300 kbp genome segment. For each sequence, only the highest-scoring genome match was counted.
- Patent Count. As with the Sequence Count plots, the Oryza sativa genome was divided into 300 kbp segments. For patent applications that claim Oryza sativa gene sequences, we plotted the number of patent applications that claim a 150 base pair or longer fragment of each 300 kbp genome segment. For each sequence, only the highest-scoring genome match was counted.
- Percent Genome Coverage. For these plots, we plotted the percentage of each genome segment that was recited in the claims of patent applications. Unlike the previous two plot types, there was no requirement that the fragments matching the rice genome sequences be of a minimum length (e.g., 150 base pairs). However, the minimum size for a positive match using MEGABLAST is 26 base pairs, so matches shorter than 26 base pairs were not included in the analysis. In addition, if multiple sequences covered the same portion of the genome, that portion of the genome was counted as being covered only once.
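The three plot types can be computed roughly as in the sketch below. It assumes 1-based inclusive genome coordinates, one best hit per sequence, and, as a simplification, assigns each hit to the 300 kbp segment containing its leftmost base for the two count plots. The function names and record layout are hypothetical, and hits are assumed to lie within the chromosome length.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

BIN_SIZE = 300_000   # 300 kbp genome segments
MIN_MATCH = 150      # minimum match length for the two count plots


def segment_index(position: int, bin_size: int = BIN_SIZE) -> int:
    """Zero-based segment index of a 1-based genome position."""
    return (position - 1) // bin_size


def segment_counts(hits: Iterable[Tuple[str, str, int, int]],
                   bin_size: int = BIN_SIZE, min_match: int = MIN_MATCH):
    """hits: (document_id, chromosome, start, end), one best hit per sequence.

    Returns (sequence_count, patent_count), both keyed by (chromosome, segment).
    """
    sequences: Dict[Tuple[str, int], int] = defaultdict(int)
    documents: Dict[Tuple[str, int], set] = defaultdict(set)
    for document_id, chromosome, start, end in hits:
        low, high = sorted((start, end))   # minus-strand hits have start > end
        if high - low + 1 < min_match:
            continue
        key = (chromosome, segment_index(low, bin_size))
        sequences[key] += 1
        documents[key].add(document_id)
    return dict(sequences), {k: len(v) for k, v in documents.items()}


def percent_coverage(intervals: Iterable[Tuple[int, int]], chrom_length: int,
                     bin_size: int = BIN_SIZE) -> List[float]:
    """Percent of each segment of one chromosome covered by at least one hit."""
    merged: List[List[int]] = []
    for low, high in sorted(tuple(sorted(i)) for i in intervals):
        if merged and low <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], high)   # overlap counts once
        else:
            merged.append([low, high])
    n_bins = -(-chrom_length // bin_size)              # ceiling division
    covered = [0] * n_bins
    for low, high in merged:
        for b in range(segment_index(low, bin_size),
                       segment_index(high, bin_size) + 1):
            seg_low = b * bin_size + 1
            seg_high = min((b + 1) * bin_size, chrom_length)
            covered[b] += max(0, min(high, seg_high) - max(low, seg_low) + 1)
    sizes = [min((b + 1) * bin_size, chrom_length) - b * bin_size
             for b in range(n_bins)]
    return [100.0 * c / s for c, s in zip(covered, sizes)]


if __name__ == "__main__":
    print(percent_coverage([(1, 50), (26, 75), (90, 110)], 250, bin_size=100))
```

Merging overlapping intervals before measuring them is what ensures that a region covered by several sequences is counted only once.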
The end result of our analysis was a dataset of patents containing rice sequences in their claims, linked to the specific map location of those sequences on the rice genome. The matches between sequences and genomic regions were made using highly specific search criteria, resulting in "exact" matching of sequences recited in the claims to their corresponding genomic sequence. No attempt was made to identify homologous sequences within the genome, which may also be claimed by the patent document. Hence the maps obtained are likely to under-represent the actual extent of sequence claims in rice from patents claiming rice sequences.


