Crawling the german health web: exploratory study and graph analysis
Background: The internet has become an increasingly important resource for health information. However, with a growing amount of web pages, it is nearly impossible for humans to manually keep track of evolving and continuously changing content in the health domain. To better understand the nature of...
| Main authors: | Zowalla, Richard; Wetter, Thomas; Pfeifer, Daniel |
|---|---|
| Document type: | Article (Journal) |
| Language: | English |
| Published: | 24.07.2020 |
| In: | Journal of medical internet research, Year: 2020, Volume: 22, Issue: 7 |
| ISSN: | 1438-8871 |
| DOI: | 10.2196/17853 |
| Online access: | Resolving system, full text: https://doi.org/10.2196/17853 |
| Statement of responsibility: | Richard Zowalla, MSc; Thomas Wetter, Dr rer nat, Dipl Math; Daniel Pfeifer, Dr Ing |
MARC
| LEADER | 00000caa a2200000 c 4500 | ||
|---|---|---|---|
| 001 | 1742035108 | ||
| 003 | DE-627 | ||
| 005 | 20220819043235.0 | ||
| 007 | cr uuu---uuuuu | ||
| 008 | 201204s2020 xx |||||o 00| ||eng c | ||
| 024 | 7 | |a 10.2196/17853 |2 doi | |
| 035 | |a (DE-627)1742035108 | ||
| 035 | |a (DE-599)KXP1742035108 | ||
| 035 | |a (OCoLC)1341382939 | ||
| 040 | |a DE-627 |b ger |c DE-627 |e rda | ||
| 041 | |a eng | ||
| 084 | |a 33 |2 sdnb | ||
| 100 | 1 | |a Zowalla, Richard |d 1990- |e VerfasserIn |0 (DE-588)1222745283 |0 (DE-627)1742033598 |4 aut | |
| 245 | 1 | 0 | |a Crawling the german health web |b exploratory study and graph analysis |c Richard Zowalla, MSc; Thomas Wetter, Dr rer nat, Dipl Math; Daniel Pfeifer, Dr Ing |
| 264 | 1 | |c 24.07.2020 | |
| 300 | |a 22 | ||
| 336 | |a Text |b txt |2 rdacontent | ||
| 337 | |a Computermedien |b c |2 rdamedia | ||
| 338 | |a Online-Ressource |b cr |2 rdacarrier | ||
| 500 | |a Gesehen am 04.12.2020 | ||
| 520 | |a Background: The internet has become an increasingly important resource for health information. However, with a growing amount of web pages, it is nearly impossible for humans to manually keep track of evolving and continuously changing content in the health domain. To better understand the nature of all web-based health information as given in a specific language, it is important to identify (1) information hubs for the health domain, (2) content providers of high prestige, and (3) important topics and trends in the health-related web. In this context, an automatic web crawling approach can provide the necessary data for a computational and statistical analysis to answer (1) to (3). Objective: This study demonstrates the suitability of a focused crawler for the acquisition of the German Health Web (GHW) which includes all health-related web content of the three mostly German-speaking countries Germany, Austria and Switzerland. Based on the gathered data, we provide a preliminary analysis of the GHW's graph structure covering its size, most important content providers and a ratio of public to private stakeholders. In addition, we provide our experiences in building and operating such a highly scalable crawler. Methods: A support vector machine classifier was trained on a large data set acquired from various German content providers to distinguish between health-related and non-health-related web pages. The classifier was evaluated using accuracy, recall and precision on an 80/20 training/test split (TD1) and against a crowd-validated data set (TD2). To implement the crawler, we extended the open-source framework StormCrawler. The actual crawl was conducted for 227 days. The crawler was evaluated by using harvest rate and its recall was estimated using a seed-target approach. Results: In total, n=22,405 seed URLs with country-code top level domains .de: 85.36% (19,126/22,405), .at: 6.83% (1530/22,405), .ch: 7.81% (1749/22,405), were collected from Curlie and a previous crawl. The text classifier achieved an accuracy on TD1 of 0.937 (TD2=0.966), a precision on TD1 of 0.934 (TD2=0.954) and a recall on TD1 of 0.944 (TD2=0.989). The crawl yields 13.5 million presumably relevant and 119.5 million nonrelevant web pages. The average harvest rate was 19.76%; recall was 0.821 (4105/5000 targets found). The resulting host-aggregated graph contains 215,372 nodes and 403,175 edges (network diameter=25; average path length=6.466; average degree=1.872; average in-degree=1.892; average out-degree=1.845; modularity=0.723). Among the 25 top-ranked pages for each country (according to PageRank), 40% (30/75) were web sites published by public institutions. 25% (19/75) were published by nonprofit organizations and 35% (26/75) by private organizations or individuals. Conclusions: The results indicate that the presented crawler is a suitable method for acquiring a large fraction of the GHW. As desired, the computed statistical data allows for determining major information hubs and important content providers on the GHW. In the future, the acquired data may be used to assess important topics and trends but also to build health-specific search engines. | ||
| 650 | 4 | |a classification | |
| 650 | 4 | |a distributed system | |
| 650 | 4 | |a health information | |
| 650 | 4 | |a information | |
| 650 | 4 | |a internet | |
| 650 | 4 | |a search | |
| 650 | 4 | |a web crawling | |
| 700 | 1 | |a Wetter, Thomas |d 1953- |e VerfasserIn |0 (DE-588)141236124 |0 (DE-627)703920774 |0 (DE-576)322863252 |4 aut | |
| 700 | 1 | |a Pfeifer, Daniel |e VerfasserIn |0 (DE-588)1222752948 |0 (DE-627)1742041809 |4 aut | |
| 773 | 0 | 8 | |i Enthalten in |t Journal of medical internet research |d Richmond, Va. : Healthcare World, 1999 |g 22(2020,7) Artikel-Nummer e17853, 22 Seiten |h Online-Ressource |w (DE-627)324614136 |w (DE-600)2028830-X |w (DE-576)281198233 |x 1438-8871 |7 nnas |a Crawling the german health web exploratory study and graph analysis |
| 773 | 1 | 8 | |g volume:22 |g year:2020 |g number:7 |g extent:22 |a Crawling the german health web exploratory study and graph analysis |
| 856 | 4 | 0 | |u https://doi.org/10.2196/17853 |x Resolving-System |x Verlag |3 Volltext |
| 951 | |a AR | ||
| 992 | |a 20201204 | ||
| 993 | |a Article | ||
| 994 | |a 2020 | ||
| 998 | |g 141236124 |a Wetter, Thomas |m 141236124:Wetter, Thomas |d 910000 |d 999701 |e 910000PW141236124 |e 999701PW141236124 |k 0/910000/ |k 1/910000/999701/ |p 2 | ||
| 998 | |g 1222745283 |a Zowalla, Richard |m 1222745283:Zowalla, Richard |d 50000 |e 50000PZ1222745283 |k 0/50000/ |p 1 |x j | ||
| 999 | |a KXP-PPN1742035108 |e 381729137X | ||
| BIB | |a Y | ||
| SER | |a journal | ||
| JSO | |a {"recId":"1742035108","language":["eng"],"type":{"bibl":"article-journal","media":"Online-Ressource"},"note":["Gesehen am 04.12.2020"],"title":[{"title_sort":"Crawling the german health web","subtitle":"exploratory study and graph analysis","title":"Crawling the german health web"}],"person":[{"display":"Zowalla, Richard","roleDisplay":"VerfasserIn","role":"aut","family":"Zowalla","given":"Richard"},{"family":"Wetter","given":"Thomas","roleDisplay":"VerfasserIn","display":"Wetter, Thomas","role":"aut"},{"given":"Daniel","family":"Pfeifer","role":"aut","roleDisplay":"VerfasserIn","display":"Pfeifer, Daniel"}],"relHost":[{"physDesc":[{"extent":"Online-Ressource"}],"id":{"issn":["1438-8871"],"zdb":["2028830-X"],"eki":["324614136"]},"origin":[{"publisherPlace":"Richmond, Va.","dateIssuedDisp":"1999-","publisher":"Healthcare World","dateIssuedKey":"1999"}],"language":["eng"],"recId":"324614136","disp":"Crawling the german health web exploratory study and graph analysisJournal of medical internet research","type":{"bibl":"periodical","media":"Online-Ressource"},"titleAlt":[{"title":"JMIR"}],"part":{"extent":"22","volume":"22","text":"22(2020,7) Artikel-Nummer e17853, 22 Seiten","issue":"7","year":"2020"},"pubHistory":["1.1999 -"],"title":[{"title":"Journal of medical internet research","subtitle":"international scientific journal for medical research, information and communication on the internet ; JMIR","title_sort":"Journal of medical internet research"}]}],"physDesc":[{"extent":"22 S."}],"id":{"eki":["1742035108"],"doi":["10.2196/17853"]},"origin":[{"dateIssuedKey":"2020","dateIssuedDisp":"24.07.2020"}],"name":{"displayForm":["Richard Zowalla, MSc; Thomas Wetter, Dr rer nat, Dipl Math; Daniel Pfeifer, Dr Ing"]}} | ||
| SRT | |a ZOWALLARICCRAWLINGTH2407 | ||