Crawling the German health web: exploratory study and graph analysis

Background: The internet has become an increasingly important resource for health information. However, with a growing amount of web pages, it is nearly impossible for humans to manually keep track of evolving and continuously changing content in the health domain. To better understand the nature of...

Bibliographic details
Main authors:	Zowalla, Richard (author), Wetter, Thomas (author), Pfeifer, Daniel (author)
Document type:	Article (Journal)
Language:	English
Published:	24.07.2020
In:	Journal of medical internet research Year: 2020, Volume: 22, Issue: 7
ISSN:	1438-8871
DOI:	10.2196/17853
Online access:	Resolving system, full text: https://doi.org/10.2196/17853
Statement of responsibility:	Richard Zowalla, MSc; Thomas Wetter, Dr rer nat, Dipl Math; Daniel Pfeifer, Dr Ing

MARC


LEADER	00000caa a2200000 c 4500
001	1742035108
003	DE-627
005	20220819043235.0
007	cr uuu---uuuuu
008	201204s2020 xx \|\|\|\|\|o 00\| \|\|eng c
024	7		\|a 10.2196/17853 \|2 doi
035			\|a (DE-627)1742035108
035			\|a (DE-599)KXP1742035108
035			\|a (OCoLC)1341382939
040			\|a DE-627 \|b ger \|c DE-627 \|e rda
041			\|a eng
084			\|a 33 \|2 sdnb
100	1		\|a Zowalla, Richard \|d 1990- \|e VerfasserIn \|0 (DE-588)1222745283 \|0 (DE-627)1742033598 \|4 aut
245	1	0	\|a Crawling the german health web \|b exploratory study and graph analysis \|c Richard Zowalla, MSc; Thomas Wetter, Dr rer nat, Dipl Math; Daniel Pfeifer, Dr Ing
264		1	\|c 24.07.2020
300			\|a 22
336			\|a Text \|b txt \|2 rdacontent
337			\|a Computermedien \|b c \|2 rdamedia
338			\|a Online-Ressource \|b cr \|2 rdacarrier
500			\|a Gesehen am 04.12.2020
520			\|a Background: The internet has become an increasingly important resource for health information. However, with a growing amount of web pages, it is nearly impossible for humans to manually keep track of evolving and continuously changing content in the health domain. To better understand the nature of all web-based health information as given in a specific language, it is important to identify (1) information hubs for the health domain, (2) content providers of high prestige, and (3) important topics and trends in the health-related web. In this context, an automatic web crawling approach can provide the necessary data for a computational and statistical analysis to answer (1) to (3). Objective: This study demonstrates the suitability of a focused crawler for the acquisition of the German Health Web (GHW), which includes all health-related web content of the three mostly German-speaking countries Germany, Austria and Switzerland. Based on the gathered data, we provide a preliminary analysis of the GHW's graph structure covering its size, most important content providers and a ratio of public to private stakeholders. In addition, we provide our experiences in building and operating such a highly scalable crawler. Methods: A support vector machine classifier was trained on a large data set acquired from various German content providers to distinguish between health-related and non-health-related web pages. The classifier was evaluated using accuracy, recall and precision on an 80/20 training/test split (TD1) and against a crowd-validated data set (TD2). To implement the crawler, we extended the open-source framework StormCrawler. The actual crawl was conducted for 227 days. The crawler was evaluated by using harvest rate and its recall was estimated using a seed-target approach.
Results: In total, n=22,405 seed URLs with country-code top-level domains .de: 85.36% (19,126/22,405), .at: 6.83% (1530/22,405), .ch: 7.81% (1749/22,405), were collected from Curlie and a previous crawl. The text classifier achieved an accuracy on TD1 of 0.937 (TD2=0.966), a precision on TD1 of 0.934 (TD2=0.954) and a recall on TD1 of 0.944 (TD2=0.989). The crawl yields 13.5 million presumably relevant and 119.5 million nonrelevant web pages. The average harvest rate was 19.76%; recall was 0.821 (4105/5000 targets found). The resulting host-aggregated graph contains 215,372 nodes and 403,175 edges (network diameter=25; average path length=6.466; average degree=1.872; average in-degree=1.892; average out-degree=1.845; modularity=0.723). Among the 25 top-ranked pages for each country (according to PageRank), 40% (30/75) were web sites published by public institutions. 25% (19/75) were published by nonprofit organizations and 35% (26/75) by private organizations or individuals. Conclusions: The results indicate that the presented crawler is a suitable method for acquiring a large fraction of the GHW. As desired, the computed statistical data allows for determining major information hubs and important content providers on the GHW. In the future, the acquired data may be used to assess important topics and trends but also to build health-specific search engines.
650		4	\|a classification
650		4	\|a distributed system
650		4	\|a health information
650		4	\|a information
650		4	\|a internet
650		4	\|a search
650		4	\|a web crawling
700	1		\|a Wetter, Thomas \|d 1953- \|e VerfasserIn \|0 (DE-588)141236124 \|0 (DE-627)703920774 \|0 (DE-576)322863252 \|4 aut
700	1		\|a Pfeifer, Daniel \|e VerfasserIn \|0 (DE-588)1222752948 \|0 (DE-627)1742041809 \|4 aut
773	0	8	\|i Enthalten in \|t Journal of medical internet research \|d Richmond, Va. : Healthcare World, 1999 \|g 22(2020,7) Artikel-Nummer e17853, 22 Seiten \|h Online-Ressource \|w (DE-627)324614136 \|w (DE-600)2028830-X \|w (DE-576)281198233 \|x 1438-8871 \|7 nnas \|a Crawling the german health web exploratory study and graph analysis
773	1	8	\|g volume:22 \|g year:2020 \|g number:7 \|g extent:22 \|a Crawling the german health web exploratory study and graph analysis
856	4	0	\|u https://doi.org/10.2196/17853 \|x Resolving-System \|x Verlag \|3 Volltext
951			\|a AR
992			\|a 20201204
993			\|a Article
994			\|a 2020
998			\|g 141236124 \|a Wetter, Thomas \|m 141236124:Wetter, Thomas \|d 910000 \|d 999701 \|e 910000PW141236124 \|e 999701PW141236124 \|k 0/910000/ \|k 1/910000/999701/ \|p 2
998			\|g 1222745283 \|a Zowalla, Richard \|m 1222745283:Zowalla, Richard \|d 50000 \|e 50000PZ1222745283 \|k 0/50000/ \|p 1 \|x j
999			\|a KXP-PPN1742035108 \|e 381729137X
BIB			\|a Y
SER			\|a journal
JSO			\|a {"recId":"1742035108","language":["eng"],"type":{"bibl":"article-journal","media":"Online-Ressource"},"note":["Gesehen am 04.12.2020"],"title":[{"title_sort":"Crawling the german health web","subtitle":"exploratory study and graph analysis","title":"Crawling the german health web"}],"person":[{"display":"Zowalla, Richard","roleDisplay":"VerfasserIn","role":"aut","family":"Zowalla","given":"Richard"},{"family":"Wetter","given":"Thomas","roleDisplay":"VerfasserIn","display":"Wetter, Thomas","role":"aut"},{"given":"Daniel","family":"Pfeifer","role":"aut","roleDisplay":"VerfasserIn","display":"Pfeifer, Daniel"}],"relHost":[{"physDesc":[{"extent":"Online-Ressource"}],"id":{"issn":["1438-8871"],"zdb":["2028830-X"],"eki":["324614136"]},"origin":[{"publisherPlace":"Richmond, Va.","dateIssuedDisp":"1999-","publisher":"Healthcare World","dateIssuedKey":"1999"}],"language":["eng"],"recId":"324614136","disp":"Crawling the german health web exploratory study and graph analysisJournal of medical internet research","type":{"bibl":"periodical","media":"Online-Ressource"},"titleAlt":[{"title":"JMIR"}],"part":{"extent":"22","volume":"22","text":"22(2020,7) Artikel-Nummer e17853, 22 Seiten","issue":"7","year":"2020"},"pubHistory":["1.1999 -"],"title":[{"title":"Journal of medical internet research","subtitle":"international scientific journal for medical research, information and communication on the internet ; JMIR","title_sort":"Journal of medical internet research"}]}],"physDesc":[{"extent":"22 S."}],"id":{"eki":["1742035108"],"doi":["10.2196/17853"]},"origin":[{"dateIssuedKey":"2020","dateIssuedDisp":"24.07.2020"}],"name":{"displayForm":["Richard Zowalla, MSc; Thomas Wetter, Dr rer nat, Dipl Math; Daniel Pfeifer, Dr Ing"]}}
SRT			\|a ZOWALLARICCRAWLINGTH2407
