Working with Large Datasets in Thomson Data Analyzer

Overview:

When you work with large datasets in Thomson Data Analyzer (TDA), you may see an error message warning that you are running out of RAM. The following guidelines may help you free up system memory and continue working. Which guidelines apply depends on your analytical needs and where you are in the workflow. Strategies discussed in this document include:

  • Use a 64-bit operating system and install the maximum amount of RAM supported by your computer.
  • Close other programs that are not essential to your analysis.
  • Import a small number of fields at first; use “Import More Fields” to add other fields later.

Use a 64-bit OS and Install the Maximum RAM

Thomson Data Analyzer is a 32-bit application, and is subject to the per-process memory usage limits of the operating system. These limits exist regardless of how much physical memory the computer has installed.

If you are using Thomson Data Analyzer on a 32-bit version of Windows, the maximum amount of memory that TDA can use is 2 gigabytes. On a 64-bit Windows system, Thomson Data Analyzer can use up to 3 GB.
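The limits above can be expressed as a small check. The following is only a minimal sketch: the function names are hypothetical, the 2 GB and 3 GB figures are simply the values quoted in this document, and the 64-bit detection is a heuristic based on the machine name reported by Python's standard `platform` module, not anything provided by TDA.

```python
import platform

# Per-process limits as quoted in this document (in gigabytes).
LIMIT_GB_32BIT_WINDOWS = 2
LIMIT_GB_64BIT_WINDOWS = 3


def os_is_64bit(machine=None):
    """Heuristic: machine names such as 'AMD64', 'x86_64' or 'ARM64' end in '64'."""
    name = platform.machine() if machine is None else machine
    return name.endswith("64")


def tda_memory_limit_gb(is_64bit_os):
    """Return the limit this document quotes for a 32-bit application."""
    return LIMIT_GB_64BIT_WINDOWS if is_64bit_os else LIMIT_GB_32BIT_WINDOWS


if __name__ == "__main__":
    is_64 = os_is_64bit()
    print("64-bit OS detected:", is_64)
    print("Approximate TDA memory ceiling (GB):", tda_memory_limit_gb(is_64))
```

Whatever the script reports, the ceiling applies to TDA itself; installing more RAM than that still helps because the operating system and other programs no longer compete with TDA for physical memory.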

Close Non-essential Programs and *.vpt Files

If you have other applications running that are not essential to your workflow, close them to make more system memory available for TDA to use. If you have more than one Thomson Data Analyzer data file (*.vpt) open, close all open data files except the one in which you are currently working.

Maintain a Dataset with as Few Fields as Possible

When you maintain a dataset with only the essential fields, you also keep the size (in MB) of the *.vpt file on disk as small as possible. This is especially important when you import raw data files. It is advisable to import only the “Title” field at first, so that you do not run out of memory before you save the *.vpt file to disk. Once you have saved your dataset as a *.vpt file, exit and restart TDA (to free up as much memory as possible) and open your saved dataset. You can then use “Import More Fields” (from TDA’s “Fields” menu) to add the other fields you need.

The following lists can guide you in selecting a minimum set of fields:

Minimum field set for the TDA Cleanup Macro:

  1. Derwent Accession Number
  2. Inventors
  3. Patent Assignee Codes
  4. Patent Assignees
  5. Patent Assignees (long)

New fields added by the Cleanup Macro:

  1. Inventors (Cleaned)
  2. Patent Assignees (Cleaned - No Individuals)
  3. Patent Assignees (Cleaned)

Additional fields needed by the TDA Reporting Macros [defaults]:

  1. Country [Priority Countries]
  2. Year [Priority Years (earliest)]
  3. Technology [Manual Codes]

Use discretion when choosing which fields to add. Whenever possible, avoid importing fields with long text (e.g. Patent Claims or Abstracts).

Fields with a very large number of items will also consume a lot of system resources. Examples of such fields include:

  • Fields with “NLP” Words or Phrases.
  • “Cited References” fields and fields derived from Cited References (e.g. “Cited Authors” or “Cited Journals”).
  • Authors, Inventors, Full Organization Names, or fields with uncontrolled vocabulary terms (see note below).

Note: Delete existing large fields that are not in use, but only if they can be readily imported again using “Import More Fields.” Take care not to delete “Cleaned” fields or fields that have Groups you want to keep. “Cleaned” fields cannot be readily re-imported with “Import More Fields”[1], but the originating field on which the cleaning was done can usually be safely deleted.
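To judge which fields are likely to be expensive before importing or deleting them, it can help to count distinct items per field outside TDA. The sketch below is an illustration only: it assumes the records have already been parsed into Python dictionaries mapping a field name to a list of item strings, the function name `profile_fields` is hypothetical, and the sample data is made up. It does not use any TDA API.

```python
def profile_fields(records):
    """Return {field: (distinct_item_count, total_characters)}.

    Each record is a dict mapping a field name to a list of item strings.
    """
    distinct = {}
    chars = {}
    for record in records:
        for field, items in record.items():
            distinct.setdefault(field, set()).update(items)
            chars[field] = chars.get(field, 0) + sum(len(item) for item in items)
    return {field: (len(distinct[field]), chars[field]) for field in distinct}


# Made-up sample records, for illustration only.
sample = [
    {"Inventors": ["Smith J", "Lee K"], "Priority Years": ["2009"]},
    {"Inventors": ["Lee K", "Garcia M"], "Priority Years": ["2009"]},
    {"Inventors": ["Chen W"], "Priority Years": ["2010"]},
]

# List fields from most to fewest distinct items.
for field, (n_items, n_chars) in sorted(
    profile_fields(sample).items(), key=lambda kv: kv[1][0], reverse=True
):
    print(f"{field}: {n_items} distinct items, {n_chars} characters")
```

Fields near the top of such a listing, or with a large character total, are the ones most worth leaving out or deleting if they can be re-imported later.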

Fields that include a lot of items also tend to have long-tailed record frequency distributions. That is, the vast majority of the terms occur in only one or two records. When this is the case, consider creating a group of all terms that occur in at least N records. You can then use “Create Field using Group Items” to make a new field with far fewer items, and delete the originating, much larger field.
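The idea behind the "at least N records" group can be sketched outside TDA as follows. This is a minimal illustration, not TDA's own implementation: it assumes each record is represented as a list of term strings, the function names are hypothetical, and the sample terms are made up.

```python
from collections import Counter


def record_frequencies(records):
    """Count, for each term, the number of records in which it appears."""
    counts = Counter()
    for terms in records:
        counts.update(set(terms))  # set(): count a term once per record
    return counts


def terms_in_at_least(records, n):
    """Return the set of terms that occur in at least n records."""
    freqs = record_frequencies(records)
    return {term for term, count in freqs.items() if count >= n}


def reduce_field(records, keep):
    """Build the smaller field: each record keeps only the grouped terms."""
    return [[t for t in terms if t in keep] for terms in records]


# Made-up sample field, one list of terms per record.
sample = [
    ["laser", "optics", "waveguide"],
    ["laser", "optics"],
    ["laser", "polymer"],
    ["optics", "sensor"],
]

group = terms_in_at_least(sample, 2)
smaller = reduce_field(sample, group)
print(sorted(group))
print(smaller)
```

In this sample, three of the five distinct terms appear in only one record, so a threshold of N = 2 already removes most of the items while keeping the terms that carry the bulk of the record coverage.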



[1] If you saved your List Cleanup work as a thesaurus, you can re-import the original field, and run your saved thesaurus on that field.