Sabbir M. Rashid is a Ph.D. candidate at Rensselaer Polytechnic Institute working with Professor Deborah McGuinness on research related to data annotation and harmonization, ontology engineering, knowledge representation, and various forms of reasoning. Prior to attending RPI, Sabbir completed a double major at Worcester Polytechnic Institute, where he received B.S. degrees in both Physics and Electrical & Computer Engineering. Much of his graduate work at RPI has involved the research discussed in this tutorial. His current work includes the application of deductive and abductive inference techniques over linked health data, for example in the context of chronic diseases such as diabetes.
2021 Workshops and Tutorials: Annotating Tabular Data using Semantic Data Dictionaries
When publishing data sets, data providers commonly include text descriptions of each column in the form of data dictionaries. While these documents help an end user interpret the meaning of a column in a data set, existing data dictionaries typically are not machine-readable and do not follow a common specification standard. We introduce the Semantic Data Dictionary, a specification that formalizes the assignment of a semantic representation of data, enabling standardization and harmonization across diverse data sets. Representing data in this form promotes improved discovery, interoperability, reuse, traceability, and reproducibility. We present the associated research and describe how the Semantic Data Dictionary can help address existing limitations in the related literature. We discuss our approach, present an example by annotating portions of the publicly available National Health and Nutrition Examination Survey data set, and present modeling challenges. We also describe the use of this approach in sponsored research, including our work on a large National Institutes of Health (NIH)-funded exposure and health data portal and in the RPI-IBM collaborative Health Empowerment by Analytics, Learning, and Semantics project. This work has been evaluated in comparison with traditional data dictionaries, mapping languages, and data integration tools.