Opportunities to use enormous data sets in research are increasing, but getting started can be daunting, and pitfalls abound, said panelists at the 2021 American Academy of Otolaryngology–Head and Neck Surgery Annual Meeting. Experts discussed the main features, benefits, and drawbacks of the largest databases, from pediatric and insurance claims databases to nationally representative surveys and the academy’s own registry, while emphasizing that using them correctly is not a simple matter but is a vital one.
“There are data that are relevant to our healthcare and health outcomes all around us and in many different sources,” said Derek Lam, MD, MPH, moderator of the panel and associate professor of otolaryngology–head and neck surgery at Oregon Health and Science University in Portland. “Once the data are there, then they have to be accessed, cleaned, and processed. The descriptive analytics involved are often very complicated.”
Differences in Databases
These large databases have strong statistical power and generalizability, and usually precision in their data collection, but care is needed in interpreting the data, and there can be problems with accuracy in administrative data collection, Dr. Lam added.
The panelists reviewed several databases:
- IBM MarketScan Commercial Claims and Encounters Database: This includes insurance claim data for about 50 million privately insured Americans each year. The data include inpatient, outpatient, emergency department, and pharmaceutical claims in a database that was created in 1996 and set up so that individuals’ data can be linked across insurers. But it’s expensive—five years of data costs about $25,000, and an experienced programmer is needed to analyze the data in what can be a time-consuming process, Dr. Lam said.
- American College of Surgeons’ National Surgical Quality Improvement Program (NSQIP): This database uses a nationally validated, outcomes-based approach to measure the quality of surgical care. Its institutional data, which include clinical data from the medical record and are not administrative, are collected by a trained, certified data collector. Patients can be followed over time, and procedure-specific and long-term outcomes can be assessed. Its semi-annual data report allows an institution’s benchmarks to be compared with those of 102 other institutions in a risk-adjusted way. One example of a study using NSQIP data is a look at safety and postoperative adverse events in pediatric otologic surgery, Dr. Lam said.
- Kids Inpatient Database: As its name suggests, this is a database of pediatric inpatient care that is publicly available and includes 2 to 3 million discharges a year, said Nikhila P. Raol, MD, MPH, assistant professor of otolaryngology–head and neck surgery at Emory University School of Medicine in Atlanta. Because it is based on hospital encounters, the database can’t be used to track patients over time. To get a sample that is nationally representative, the results must be weighted, Dr. Raol said. A convenient feature is that users can query the database with a research question and find out whether the database can answer that question. “You say, ‘Do I have the number of patients that I need to answer this question?’ or ‘Do these complications occur frequently enough? Is this condition captured enough?’” she said. Dr. Raol has used it to examine whether the cost of tonsillectomy differed depending on where the surgery was done.
- Pediatric Health Information System: This includes inpatient, ambulatory surgery, observation, and emergency encounters from 52 children’s hospitals. The data from 1999 are longitudinal, meaning that patients’ variables were repeatedly observed over periods of time, so outcomes and utilization can be looked at over time and patients can be tracked across their encounters. This database, Dr. Raol said, would be appropriate for research questions about how often a disease occurs in a population, how frequently a procedure is performed in a given population, or how often a certain comorbidity is present among hospitalized children with a specific diagnosis. She cautioned that any care that occurs outside the children’s hospital system won’t be captured, so some episodes of care will be missed for individual patients, even though this is a longitudinal database.
- National Hospital Ambulatory Medical Care Surveys: Data in this set are collected in annual installments and are captured as provider–patient visits. The database includes office-based and hospital-based visits, but not administrative data, electronic data, radiology information, anesthesia, or other kinds of data.
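The weighting step that Dr. Raol described for the Kids Inpatient Database can be illustrated with a small sketch. It is not the database's own procedure: the `Discharge` record, its `weight` field, and the sample values are hypothetical, and the sketch assumes only that each sampled discharge carries a weight equal to the number of national discharges it stands for.

```python
from dataclasses import dataclass


@dataclass
class Discharge:
    """One sampled hospital discharge (hypothetical record layout)."""
    had_complication: bool
    weight: float  # assumed: number of national discharges this record represents


def unweighted_rate(records):
    """Share of sampled discharges with a complication, ignoring weights."""
    if not records:
        raise ValueError("no records")
    return sum(r.had_complication for r in records) / len(records)


def weighted_rate(records):
    """Estimated national complication rate, using the discharge weights."""
    total = sum(r.weight for r in records)
    if total == 0:
        raise ValueError("weights sum to zero")
    events = sum(r.weight for r in records if r.had_complication)
    return events / total


# Made-up sample: records with a complication happen to carry smaller weights.
sample = [
    Discharge(True, 1.5),
    Discharge(False, 1.5),
    Discharge(False, 4.0),
    Discharge(True, 1.0),
]

print(unweighted_rate(sample))  # rate within the sample itself
print(weighted_rate(sample))    # rate after applying the weights
```

In this made-up sample the two rates differ, which is why an analysis that skips the weights describes only the sampled records and not the national population.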
Advantages and Pitfalls
“What’s particularly compelling about these national databases is that they’re designed so that the results can be representative of how the whole United States utilizes and provides ambulatory care,” said Jennifer J. Shin, MD, SM, associate professor of otolaryngology–head and neck surgery at Harvard Medical School in Boston.
The AAO-HNS Reg-ent database continues to grow, with more than 19 million office visits recorded in 2020, up from about 3 million in 2016. In addition, there is a good spread in the ages of the patients for whom data are collected, Dr. Shin said. “Even in the age groups that are the least populated, there’s still a pretty good amount of patients, and this is really helpful,” she said. “In Reg-ent, the population is nicely filled out. It looks to be a really good resource from that respect.”
David O. Francis, MD, associate professor of surgery at the University of Wisconsin, Madison, cautioned that pitfalls abound when using big data. Hundreds of paper submissions are never even sent out for peer review because of flaws in their technique or because they try to answer a research question that can’t be addressed with the dataset in use.
“Most of us, when we consider these database projects, especially when we read them, just think, ‘Hey these people put a bunch of data into the computer and it spit out facts,’” he said. “But to do it properly is much more complicated because the data are imperfect. All data collected are imperfect. Data are collected for different purposes.” There’s a risk of spreading false information if data use isn’t properly considered.
When setting out on a big data research project, said Dr. Francis, researchers should:
- Make sure their project is hypothesis driven rather than a data-mining search for relationships. With such large amounts of data, statistically significant associations are plentiful, but statistical significance doesn’t mean an association is real or relevant.
- Seek institutional review board approval and comply with data use agreements.
- Do the “homework” of understanding the peculiarities of the database and make sure to use appropriate variables and methodologies; administrative data and clinical data are not the same, for instance.
- Clearly define inclusion and exclusion criteria.
- Identify potential confounders and use risk adjustment to minimize bias.
- Account for updates and changes to variables over time.
- Identify and address competing risks.
- Determine how to handle missing data.
- Have a clear take-home message.
“This should all be thought about before you even start the study,” he said. “What are you trying to say, what are you trying to study? As you move on, you need to think about what your message is and how your research advances current knowledge.”
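Dr. Francis's first point, that statistically significant associations are plentiful in very large data sets, can be illustrated with a short sketch. It is an illustration rather than a method described by the panel: it assumes a standard two-sided, two-proportion z-test, and the function name and all counts are made up. The same 0.2-percentage-point difference in complication rates is tested in a small sample and in a very large one.

```python
import math


def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (difference, z, p_value) using the pooled standard error.
    """
    p_a = events_a / n_a
    p_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; proportions are all 0 or all 1")
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a - p_b, z, p_value


# Made-up counts: 10.0% vs. 10.2% complication rates in both scenarios.
small = two_proportion_z_test(100, 1_000, 102, 1_000)
large = two_proportion_z_test(500_000, 5_000_000, 510_000, 5_000_000)

for label, (diff, z, p) in (("small", small), ("large", large)):
    print(f"{label}: difference={diff:+.3f}, z={z:.2f}, p={p:.3g}")
```

The difference is identical in both runs, but only the large sample yields a p-value below conventional thresholds. Whether a 0.2-point difference matters clinically is a separate question, which is the case for starting from a hypothesis rather than from whatever the data mining turns up.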
Thomas R. Collins is a freelance medical writer based in Florida.