TY - GEN
T1 - Prediction of population health indices from social media using kernel-based textual and temporal features
AU - Nguyen, Thin
AU - Nguyen, Duc Thanh
AU - Larsen, Mark E.
AU - O'Dea, Bridianne
AU - Yearwood, John
AU - Phung, Dinh
AU - Venkatesh, Svetha
AU - Christensen, Helen
PY - 2017
Y1 - 2017
N2 - From 1984, the US has annually conducted the Behavioral Risk Factor Surveillance System (BRFSS) surveys to capture either health behaviors, such as drinking or smoking, or health outcomes, including mental, physical, and generic health, of the population. Although this kind of information at a population level, such as US counties, is important for local governments to identify local needs, traditional datasets may take years to collate and to become publicly available. Geocoded social media data can provide an alternative reflection of local health trends. In this work, to predict the percentage of adults in a county reporting “insufficient sleep”, a health behavior, and, at the same time, their health outcomes, novel textual and temporal features are proposed. The proposed textual features are defined at mid-level and can be applied on top of various low-level textual features. They are computed via kernel functions on underlying features and encode the relationships between individual underlying features over a population. To further enrich the predictive ability of the health indices, the textual features are augmented with temporal information. We evaluated the proposed features and compared them with existing features using a dataset collected from the BRFSS. Experimental results show that the combination of kernel-based textual features and temporal information predicts well both the health behavior (with best performance at rho=0.82) and health outcomes (with best performance at rho=0.78), demonstrating the capability of social media data in prediction of population health indices. The results also show that our proposed features gained higher correlation coefficients than did the existing ones, increasing the correlation coefficient by up to 0.16, suggesting the potential of the approach in a wide spectrum of applications on data analytics at population levels.
AB - From 1984, the US has annually conducted the Behavioral Risk Factor Surveillance System (BRFSS) surveys to capture either health behaviors, such as drinking or smoking, or health outcomes, including mental, physical, and generic health, of the population. Although this kind of information at a population level, such as US counties, is important for local governments to identify local needs, traditional datasets may take years to collate and to become publicly available. Geocoded social media data can provide an alternative reflection of local health trends. In this work, to predict the percentage of adults in a county reporting “insufficient sleep”, a health behavior, and, at the same time, their health outcomes, novel textual and temporal features are proposed. The proposed textual features are defined at mid-level and can be applied on top of various low-level textual features. They are computed via kernel functions on underlying features and encode the relationships between individual underlying features over a population. To further enrich the predictive ability of the health indices, the textual features are augmented with temporal information. We evaluated the proposed features and compared them with existing features using a dataset collected from the BRFSS. Experimental results show that the combination of kernel-based textual features and temporal information predicts well both the health behavior (with best performance at rho=0.82) and health outcomes (with best performance at rho=0.78), demonstrating the capability of social media data in prediction of population health indices. The results also show that our proposed features gained higher correlation coefficients than did the existing ones, increasing the correlation coefficient by up to 0.16, suggesting the potential of the approach in a wide spectrum of applications on data analytics at population levels.
KW - Cognitive computing
KW - Feature engineering
KW - Geo-referenced tweets
KW - Kernel-based features
KW - Online texts
KW - Population health indices
KW - Prediction
KW - Temporal information
KW - Textual features
UR - http://www.scopus.com/inward/record.url?scp=85046689901&partnerID=8YFLogxK
U2 - 10.1145/3041021.3054136
DO - 10.1145/3041021.3054136
M3 - Conference contribution
AN - SCOPUS:85046689901
T3 - 26th International World Wide Web Conference 2017, WWW 2017 Companion
SP - 99
EP - 107
BT - 26th International World Wide Web Conference 2017, WWW 2017 Companion
PB - International World Wide Web Conferences Steering Committee
T2 - 26th International World Wide Web Conference, WWW 2017 Companion
Y2 - 3 April 2017 through 7 April 2017
ER -