Classifying unstructured data into natural language text and technical information

Software repository data, for example in issue tracking systems, include natural language text and technical information, which includes anything from log files via code snippets to stack traces. However, data mining is often only interested in one of the two types e.g. in natural language text wh...

Bibliographic Details
Main Authors: Merten, Thorsten (Author), Paech, Barbara (Author)
Format: Chapter/Article Conference Paper
Language: English
Published: 2014-05-31
In: Proceedings of the 11th Working Conference on Mining Software Repositories
Year: 2014, Pages: 300-303
DOI: 10.1145/2597073.2597112
Online Access: Resolving system, full text: http://dx.doi.org/10.1145/2597073.2597112
Publisher, full text: https://dl.acm.org/citation.cfm?id=2597112
Author Notes: Thorsten Merten, Bastian Mager, Simone Bürsner, Barbara Paech

MARC

LEADER 00000caa a2200000 c 4500
001 1578057817
003 DE-627
005 20220814203258.0
007 cr uuu---uuuuu
008 180730s2014 xx |||||o 00| ||eng c
024 7 |a 10.1145/2597073.2597112  |2 doi 
035 |a (DE-627)1578057817 
035 |a (DE-576)508057817 
035 |a (DE-599)BSZ508057817 
035 |a (OCoLC)1341014613 
040 |a DE-627  |b ger  |c DE-627  |e rda 
041 |a eng 
084 |a 28  |2 sdnb 
100 1 |a Merten, Thorsten  |e VerfasserIn  |0 (DE-588)1125392533  |0 (DE-627)879923849  |0 (DE-576)483390518  |4 aut 
245 1 0 |a Classifying unstructured data into natural language text and technical information  |c Thorsten Merten, Bastian Mager, Simone Bürsner, Barbara Paech 
264 1 |c 2014-05-31 
300 |a 4 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Gesehen am 30.07.2018 
520 |a Software repository data, for example in issue tracking systems, include natural language text and technical information, which includes anything from log files via code snippets to stack traces. However, data mining is often only interested in one of the two types e.g. in natural language text when looking at text mining. Regardless of which type is being investigated, any techniques used have to deal with noise caused by fragments of the other type i.e. methods interested in natural language have to deal with technical fragments and vice versa. This paper proposes an approach to classify unstructured data, e.g. development documents, into natural language text and technical information using a mixture of text heuristics and agglomerative hierarchical clustering. The approach was evaluated using 225 manually annotated text passages from developer emails and issue tracker data. Using white space tokenization as a basis, the overall precision of the approach is 0.84 and the recall is 0.85. 
650 4 |a heuristics 
650 4 |a hierarchical clustering 
650 4 |a mining software repositories 
650 4 |a preprocessing 
650 4 |a unstructured data 
700 1 |a Paech, Barbara  |d 1959-  |e VerfasserIn  |0 (DE-588)172299799  |0 (DE-627)697208648  |0 (DE-576)133166821  |4 aut 
773 0 8 |i Enthalten in  |a Devanbu, Premkumar  |t Proceedings of the 11th Working Conference on Mining Software Repositories  |d New York, NY : ACM, 2014  |g (2014), Seite 300-303  |h 1 online resource (427 pages)  |w (DE-627)1657499197  |w (DE-576)506477894  |z 9781450328630  |7 nnam 
773 1 8 |g year:2014  |g pages:300-303  |g extent:4  |a Classifying unstructured data into natural language text and technical information 
856 4 0 |u http://dx.doi.org/10.1145/2597073.2597112  |x Resolving-System  |x Verlag  |3 Volltext 
856 4 0 |u https://dl.acm.org/citation.cfm?id=2597112  |x Verlag  |3 Volltext 
951 |a AR 
992 |a 20180730 
993 |a ConferencePaper 
994 |a 2014 
998 |g 172299799  |a Paech, Barbara  |m 172299799:Paech, Barbara  |d 110000  |d 110300  |e 110000PP172299799  |e 110300PP172299799  |k 0/110000/  |k 1/110000/110300/  |p 4  |y j 
999 |a KXP-PPN1578057817  |e 3019728290 
BIB |a Y 
JSO |a {"relHost":[{"title":[{"title":"Proceedings of the 11th Working Conference on Mining Software Repositories","title_sort":"Proceedings of the 11th Working Conference on Mining Software Repositories"}],"disp":"Devanbu, PremkumarProceedings of the 11th Working Conference on Mining Software Repositories","type":{"media":"Online-Ressource","bibl":"book"},"person":[{"display":"Devanbu, Premkumar","roleDisplay":"VerfasserIn","role":"aut","given":"Premkumar","family":"Devanbu"}],"recId":"1657499197","corporate":[{"role":"isb","roleDisplay":"Herausgebendes Organ","display":"Association for Computing Machinery"}],"language":["eng"],"physDesc":[{"extent":"1 online resource (427 pages)"}],"part":{"year":"2014","extent":"4","pages":"300-303","text":"(2014), Seite 300-303"},"origin":[{"dateIssuedDisp":"2014","dateIssuedKey":"2014","publisher":"ACM","publisherPlace":"New York, NY"}],"id":{"eki":["1657499197"],"doi":["10.1145/2597073"],"isbn":["9781450328630"]}}],"id":{"doi":["10.1145/2597073.2597112"],"eki":["1578057817"]},"origin":[{"dateIssuedKey":"2014","dateIssuedDisp":"2014-05-31"}],"note":["Gesehen am 30.07.2018"],"person":[{"family":"Merten","given":"Thorsten","role":"aut","roleDisplay":"VerfasserIn","display":"Merten, Thorsten"},{"roleDisplay":"VerfasserIn","role":"aut","display":"Paech, Barbara","family":"Paech","given":"Barbara"}],"physDesc":[{"extent":"4 S."}],"language":["eng"],"recId":"1578057817","type":{"media":"Online-Ressource","bibl":"chapter"},"name":{"displayForm":["Thorsten Merten, Bastian Mager, Simone Bürsner, Barbara Paech"]},"title":[{"title_sort":"Classifying unstructured data into natural language text and technical information","title":"Classifying unstructured data into natural language text and technical information"}]} 
SRT |a MERTENTHORCLASSIFYIN2014