Domain-specific text dictionaries for text analytics


Journal article


A. Villanes, C. G. Healey
International Journal of Data Science and Analytics, vol. 15, 2023, pp. 105-118


APA
Villanes, A., & Healey, C. G. (2023). Domain-specific text dictionaries for text analytics. International Journal of Data Science and Analytics, 15, 105–118.


Chicago/Turabian
Villanes, A., and C. G. Healey. “Domain-Specific Text Dictionaries for Text Analytics.” International Journal of Data Science and Analytics 15 (2023): 105–118.


MLA
Villanes, A., and C. G. Healey. “Domain-Specific Text Dictionaries for Text Analytics.” International Journal of Data Science and Analytics, vol. 15, 2023, pp. 105–18.


BibTeX

@article{a2023a,
  title = {Domain-specific text dictionaries for text analytics},
  year = {2023},
  journal = {International Journal of Data Science and Analytics},
  pages = {105-118},
  volume = {15},
  author = {Villanes, A. and Healey, C. G.}
}

Abstract

We investigate the use of sentiment dictionaries to estimate sentiment for large document collections. Our goal in this paper is a semiautomatic method for extending a general sentiment dictionary for a specific target domain in a way that minimizes manual effort. General sentiment dictionaries may not contain terms important to the target domain or may score terms in ways that are inappropriate for the target domain. We combine statistical term identification and term evaluation using Amazon Mechanical Turk to extend the EmoLex sentiment dictionary to a domain-specific study of dengue fever. The same approach can be applied to any term-based sentiment dictionary or target domain. We explain how terms are identified for inclusion or re-evaluation and how Mechanical Turk generates scores for the identified terms. Examples are provided that compare EmoLex sentiment estimates before and after it is extended. We conclude by describing how our sentiment estimates can be integrated into an epidemiology surveillance system that includes sentiment visualization, and by discussing the strengths and limitations of our work.
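
To make the idea of extending a term-based dictionary concrete, the following is a minimal illustrative sketch, not the method from the paper. It assumes a toy dictionary mapping terms to scores in [-1, 1] (this is not the EmoLex format), uses a smoothed relative-frequency ratio as a stand-in for statistical term identification, and replaces the Mechanical Turk evaluation step with hand-written scores. All function names, scores, and example texts are hypothetical.

```python
from collections import Counter
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_terms(domain_docs, general_docs, top_n=1):
    """Rank terms by how much more frequent they are in the domain corpus.

    Uses an add-one smoothed log ratio of relative frequencies; this is an
    assumed stand-in for the paper's statistical term identification.
    """
    domain = Counter(t for doc in domain_docs for t in tokenize(doc))
    general = Counter(t for doc in general_docs for t in tokenize(doc))
    d_total = sum(domain.values())
    g_total = sum(general.values())
    scores = {}
    for term, count in domain.items():
        domain_rate = (count + 1) / (d_total + 1)
        general_rate = (general[term] + 1) / (g_total + 1)
        scores[term] = math.log(domain_rate / general_rate)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


def extend_dictionary(dictionary, domain_scores):
    """Return a copy of the dictionary with domain scores added or overriding."""
    extended = dict(dictionary)
    extended.update(domain_scores)
    return extended


def estimate_sentiment(text, dictionary):
    """Average the scores of the tokens found in the dictionary (0.0 if none)."""
    scores = [dictionary[t] for t in tokenize(text) if t in dictionary]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical general-purpose dictionary (scores are made up).
general_dictionary = {"good": 0.75, "bad": -0.75, "positive": 0.9}

# Tiny made-up corpora, used only to exercise candidate_terms.
domain_docs = ["dengue outbreak reported", "dengue cases rising"]
general_docs = ["weather is good", "cases of bad traffic reported"]
top_candidates = candidate_terms(domain_docs, general_docs, top_n=1)

# Hand-written scores standing in for crowd-sourced evaluation:
# "dengue" is a new term, "positive" is re-scored for the medical domain.
domain_scores = {"dengue": -0.75, "positive": -0.5}
extended_dictionary = extend_dictionary(general_dictionary, domain_scores)

text = "The patient tested positive for dengue"
before = estimate_sentiment(text, general_dictionary)
after = estimate_sentiment(text, extended_dictionary)
print(top_candidates, before, after)
```

In this toy example the general dictionary scores "positive" favorably, so the sentence reads as positive before extension and negative afterward. That shift is the kind of domain mismatch the abstract describes.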