Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.
GitHub has become one of the most popular online software development websites. We crawled the most popular software repositories on GitHub (those with more than 500 stars), along with their contributors and stargazers. In total, we crawled 10,665 repositories, 176,256 contributors, and 1,170,449 stargazers. One of the most important tasks in network analysis is detecting communities. The GitHub network is heterogeneous: it contains three types of objects (users, repositories, and programming languages) and two kinds of relations between users and repositories (star and contribute). Mining heterogeneous information networks is a new and promising research field in data mining. Many algorithms have been proposed for heterogeneous network clustering. However, most of these methods cluster the heterogeneous network directly. This thesis instead transforms the heterogeneous network into a homogeneous network using different weighting schemes and then clusters the new network. We studied three weighting schemes: the dot product, Jaccard similarity, and cosine similarity between the vector representations of objects. We then cluster the homogeneous network using modularity maximization algorithms, in particular greedy modularity maximization and spectral modularity maximization. Clustering performance is evaluated using the F-measure and the Rand index, based on the programming language used by each software repository. To compare the interaction between the weighting schemes and the clustering algorithms, we applied our methods to the GitHub dataset. We then transformed the whole network into a repository-repository network and further transformed it into a language-repository network. Based on this network, we discovered the relations between languages. Among the 94 programming languages used by the top 10,000 projects, we studied their relations using several clustering methods.
Overall, we find that languages fall into five communities: web and scripting languages (JavaScript, HTML, etc.), system programming languages (C, C++, etc.), OS X and iOS programming languages (Objective-C, Swift, etc.), numerical and statistical languages (Matlab, FORTRAN, Julia, and R), and functional programming languages (Lisp, Scheme, etc.).
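As an illustration only, the three weighting schemes and the Rand index named in the abstract can be sketched in a few lines of Python. This is a minimal sketch, not the thesis implementation: it assumes each repository is represented as a binary vector over users (stored here as a set of user names), and the data, the `stars` dictionary, and all function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy data: repository -> set of users who starred it.
# A set is equivalent to a binary vector over all users.
stars = {
    "repo_a": {"u1", "u2", "u3"},
    "repo_b": {"u2", "u3", "u4"},
    "repo_c": {"u5"},
}

def dot_product(a, b):
    """Dot product of two binary vectors: the number of shared users."""
    return len(a & b)

def jaccard(a, b):
    """Jaccard similarity: shared users divided by all users of either repository."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine(a, b):
    """Cosine similarity of two binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def project(bipartite, weight):
    """Turn a repository-user relation into a weighted repository-repository
    edge list, keeping only pairs with a positive weight."""
    edges = {}
    for r1, r2 in combinations(sorted(bipartite), 2):
        w = weight(bipartite[r1], bipartite[r2])
        if w > 0:
            edges[(r1, r2)] = w
    return edges

def rand_index(labels_true, labels_pred):
    """Fraction of item pairs on which two labelings agree, i.e. both place
    the pair in the same cluster or both place it in different clusters."""
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        return 1.0  # fewer than two items: nothing to disagree on
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

Calling `project(stars, jaccard)` on the toy data yields a single weighted edge between `repo_a` and `repo_b`. The resulting edge list could then be passed to any modularity-based clustering routine, and the clusters compared against the repositories' programming languages with `rand_index`.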
Zhang, Zhongpei, "Crawling and Analyzing Repository in GitHub" (2016). Electronic Theses and Dissertations. 5879.