Crawling the german health web: exploratory study and graph analysis

Background: The internet has become an increasingly important resource for health information. However, with a growing number of web pages, it is nearly impossible for humans to manually keep track of evolving and continuously changing content in the health domain. To better understand the nature of...

Bibliographic details
Main authors: Zowalla, Richard (author), Wetter, Thomas (author), Pfeifer, Daniel (author)
Document type: Article (Journal)
Language: English
Published: 24.07.2020
In: Journal of medical internet research
Year: 2020, Volume: 22, Issue: 7
ISSN: 1438-8871
DOI: 10.2196/17853
Online access: Resolving system, full text: https://doi.org/10.2196/17853
Statement of responsibility: Richard Zowalla, MSc; Thomas Wetter, Dr rer nat, Dipl Math; Daniel Pfeifer, Dr Ing

MARC

LEADER 00000caa a2200000 c 4500
001 1742035108
003 DE-627
005 20220819043235.0
007 cr uuu---uuuuu
008 201204s2020 xx |||||o 00| ||eng c
024 7 |a 10.2196/17853  |2 doi 
035 |a (DE-627)1742035108 
035 |a (DE-599)KXP1742035108 
035 |a (OCoLC)1341382939 
040 |a DE-627  |b ger  |c DE-627  |e rda 
041 |a eng 
084 |a 33  |2 sdnb 
100 1 |a Zowalla, Richard  |d 1990-  |e VerfasserIn  |0 (DE-588)1222745283  |0 (DE-627)1742033598  |4 aut 
245 1 0 |a Crawling the german health web  |b exploratory study and graph analysis  |c Richard Zowalla, MSc; Thomas Wetter, Dr rer nat, Dipl Math; Daniel Pfeifer, Dr Ing 
264 1 |c 24.07.2020 
300 |a 22 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Gesehen am 04.12.2020 
520 |a Background: The internet has become an increasingly important resource for health information. However, with a growing number of web pages, it is nearly impossible for humans to manually keep track of evolving and continuously changing content in the health domain. To better understand the nature of all web-based health information as given in a specific language, it is important to identify (1) information hubs for the health domain, (2) content providers of high prestige, and (3) important topics and trends in the health-related web. In this context, an automatic web crawling approach can provide the necessary data for a computational and statistical analysis to answer (1) to (3). Objective: This study demonstrates the suitability of a focused crawler for the acquisition of the German Health Web (GHW), which includes all health-related web content of the three mostly German-speaking countries Germany, Austria and Switzerland. Based on the gathered data, we provide a preliminary analysis of the GHW's graph structure covering its size, most important content providers and a ratio of public to private stakeholders. In addition, we provide our experiences in building and operating such a highly scalable crawler. Methods: A support vector machine classifier was trained on a large data set acquired from various German content providers to distinguish between health-related and non-health-related web pages. The classifier was evaluated using accuracy, recall and precision on an 80/20 training/test split (TD1) and against a crowd-validated data set (TD2). To implement the crawler, we extended the open-source framework StormCrawler. The actual crawl was conducted for 227 days. The crawler was evaluated by using harvest rate and its recall was estimated using a seed-target approach. 
Results: In total, n=22,405 seed URLs with country-code top-level domains .de: 85.36% (19,126/22,405), .at: 6.83% (1530/22,405), .ch: 7.81% (1749/22,405), were collected from Curlie and a previous crawl. The text classifier achieved an accuracy on TD1 of 0.937 (TD2=0.966), a precision on TD1 of 0.934 (TD2=0.954) and a recall on TD1 of 0.944 (TD2=0.989). The crawl yields 13.5 million presumably relevant and 119.5 million nonrelevant web pages. The average harvest rate was 19.76%; recall was 0.821 (4105/5000 targets found). The resulting host-aggregated graph contains 215,372 nodes and 403,175 edges (network diameter=25; average path length=6.466; average degree=1.872; average in-degree=1.892; average out-degree=1.845; modularity=0.723). Among the 25 top-ranked pages for each country (according to PageRank), 40% (30/75) were web sites published by public institutions, 25% (19/75) were published by nonprofit organizations and 35% (26/75) by private organizations or individuals. Conclusions: The results indicate that the presented crawler is a suitable method for acquiring a large fraction of the GHW. As desired, the computed statistical data allows for determining major information hubs and important content providers on the GHW. In the future, the acquired data may be used to assess important topics and trends but also to build health-specific search engines. 
650 4 |a classification 
650 4 |a distributed system 
650 4 |a health information 
650 4 |a information 
650 4 |a internet 
650 4 |a search 
650 4 |a web crawling 
700 1 |a Wetter, Thomas  |d 1953-  |e VerfasserIn  |0 (DE-588)141236124  |0 (DE-627)703920774  |0 (DE-576)322863252  |4 aut 
700 1 |a Pfeifer, Daniel  |e VerfasserIn  |0 (DE-588)1222752948  |0 (DE-627)1742041809  |4 aut 
773 0 8 |i Enthalten in  |t Journal of medical internet research  |d Richmond, Va. : Healthcare World, 1999  |g 22(2020,7) Artikel-Nummer e17853, 22 Seiten  |h Online-Ressource  |w (DE-627)324614136  |w (DE-600)2028830-X  |w (DE-576)281198233  |x 1438-8871  |7 nnas  |a Crawling the german health web exploratory study and graph analysis 
773 1 8 |g volume:22  |g year:2020  |g number:7  |g extent:22  |a Crawling the german health web exploratory study and graph analysis 
856 4 0 |u https://doi.org/10.2196/17853  |x Resolving-System  |x Verlag  |3 Volltext 
951 |a AR 
992 |a 20201204 
993 |a Article 
994 |a 2020 
998 |g 141236124  |a Wetter, Thomas  |m 141236124:Wetter, Thomas  |d 910000  |d 999701  |e 910000PW141236124  |e 999701PW141236124  |k 0/910000/  |k 1/910000/999701/  |p 2 
998 |g 1222745283  |a Zowalla, Richard  |m 1222745283:Zowalla, Richard  |d 50000  |e 50000PZ1222745283  |k 0/50000/  |p 1  |x j 
999 |a KXP-PPN1742035108  |e 381729137X 
BIB |a Y 
SER |a journal 
JSO |a {"recId":"1742035108","language":["eng"],"type":{"bibl":"article-journal","media":"Online-Ressource"},"note":["Gesehen am 04.12.2020"],"title":[{"title_sort":"Crawling the german health web","subtitle":"exploratory study and graph analysis","title":"Crawling the german health web"}],"person":[{"display":"Zowalla, Richard","roleDisplay":"VerfasserIn","role":"aut","family":"Zowalla","given":"Richard"},{"family":"Wetter","given":"Thomas","roleDisplay":"VerfasserIn","display":"Wetter, Thomas","role":"aut"},{"given":"Daniel","family":"Pfeifer","role":"aut","roleDisplay":"VerfasserIn","display":"Pfeifer, Daniel"}],"relHost":[{"physDesc":[{"extent":"Online-Ressource"}],"id":{"issn":["1438-8871"],"zdb":["2028830-X"],"eki":["324614136"]},"origin":[{"publisherPlace":"Richmond, Va.","dateIssuedDisp":"1999-","publisher":"Healthcare World","dateIssuedKey":"1999"}],"language":["eng"],"recId":"324614136","disp":"Crawling the german health web exploratory study and graph analysisJournal of medical internet research","type":{"bibl":"periodical","media":"Online-Ressource"},"titleAlt":[{"title":"JMIR"}],"part":{"extent":"22","volume":"22","text":"22(2020,7) Artikel-Nummer e17853, 22 Seiten","issue":"7","year":"2020"},"pubHistory":["1.1999 -"],"title":[{"title":"Journal of medical internet research","subtitle":"international scientific journal for medical research, information and communication on the internet ; JMIR","title_sort":"Journal of medical internet research"}]}],"physDesc":[{"extent":"22 S."}],"id":{"eki":["1742035108"],"doi":["10.2196/17853"]},"origin":[{"dateIssuedKey":"2020","dateIssuedDisp":"24.07.2020"}],"name":{"displayForm":["Richard Zowalla, MSc; Thomas Wetter, Dr rer nat, Dipl Math; Daniel Pfeifer, Dr Ing"]}} 
SRT |a ZOWALLARICCRAWLINGTH2407
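
The ratios quoted in the abstract (field 520) can be cross-checked with a few lines of arithmetic. The following is a minimal Python sketch and is not part of the catalog record or the article. The names SEED_URLS, TOP_RANKED and shares are hypothetical and chosen only for illustration; only the counts themselves are copied from the abstract.

```python
# Counts copied from the abstract in field 520 of this record.
# All variable and function names below are illustrative assumptions.
SEED_URLS = {".de": 19_126, ".at": 1_530, ".ch": 1_749}
TARGETS_FOUND, TARGETS_TOTAL = 4_105, 5_000
TOP_RANKED = {"public": 30, "nonprofit": 19, "private": 26}


def shares(counts):
    """Return each count as a percentage of the total, rounded to 2 places."""
    total = sum(counts.values())
    return {key: round(100 * value / total, 2) for key, value in counts.items()}


seed_total = sum(SEED_URLS.values())                   # number of seed URLs
seed_shares = shares(SEED_URLS)                        # share per top-level domain
recall = round(TARGETS_FOUND / TARGETS_TOTAL, 3)       # seed-target recall
top_shares = shares(TOP_RANKED)                        # publisher type shares

print(seed_total)
print(seed_shares)
print(recall)
print(top_shares)
```

The seed total and the per-domain shares reproduce the figures in the abstract (22,405 URLs; 85.36%, 6.83%, 7.81%), as does the recall of 0.821. For the 75 top-ranked pages, 19/75 and 26/75 come out at about 25.33% and 34.67%, which the abstract reports rounded to 25% and 35%.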