Dataset Overview

The following table describes the datasets used in this research.

Dataset Name  Description                                                               Source
DBLP          Titles of papers from the SIGMOD, VLDB, and ICSE conferences
Aminer        Titles and abstracts of scientific papers, along with the citation graph  Link
Arxiv         Full text of papers in the form of TeX source code (Downloading)          Link

Data Preprocessing

First, we remove all stop words from the data. The stop word list is the union of the NLTK and sklearn stop word lists. The remaining text is then tokenized and passed through a Porter stemmer (PorterStemmer), which reduces each word to its stem.
All the code is available in the Code section.
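
The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration, not the code from the Code section: the function names `build_stop_words` and `preprocess` are hypothetical, and the lowercase alphanumeric tokenizer is an assumption, since the original tokenization rule is not stated here.

```python
import re


def build_stop_words():
    """Return the union of the NLTK and sklearn English stop word lists.

    Requires nltk (with the 'stopwords' corpus downloaded) and scikit-learn.
    The imports are local so that preprocess() can be used without them.
    """
    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

    return set(stopwords.words("english")) | set(ENGLISH_STOP_WORDS)


def preprocess(text, stop_words, stem):
    """Tokenize text, drop stop words, and stem the remaining tokens.

    stop_words: any container of lowercase words to remove.
    stem: a callable mapping a word to its stem, e.g. PorterStemmer().stem.
    """
    # Assumption: tokens are lowercase runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(token) for token in tokens if token not in stop_words]


# Possible wiring with the libraries named in the text:
#   from nltk.stem import PorterStemmer
#   tokens = preprocess(title, build_stop_words(), PorterStemmer().stem)
```

Passing the stop word list and the stemmer in as arguments keeps the pipeline easy to test and makes it simple to swap in a different stemmer.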

DBLP

This dataset contains papers from three conferences: SIGMOD, VLDB, and ICSE.

DBLP Statistics

Labels  #Papers  Vocabulary Size  Avg Text Length  #Nodes  #Links
SIGMOD  2983     2760             6.2              2983    27739
VLDB    3278     2676             6.4              3278    29767
ICSE    3135     2600             6.6              3135    11254
Total   9396     5092             6.4              9396    68760

Average text length is calculated after tokenizing and stemming the titles. The numbers of nodes and links are calculated from an undirected graph. For the directed graph, the number of nodes drops to 7713 and the total number of links to 34380.
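
As an illustration of how the text columns of the table could be computed, here is a small sketch. The function name `title_stats` is hypothetical, and it assumes each title has already been tokenized and stemmed into a list of strings.

```python
def title_stats(tokenized_titles):
    """Compute paper count, vocabulary size, and average title length.

    tokenized_titles: a list of token lists, one list per paper title,
    already tokenized and stemmed.
    """
    vocabulary = set()
    total_tokens = 0
    for tokens in tokenized_titles:
        vocabulary.update(tokens)
        total_tokens += len(tokens)

    n_papers = len(tokenized_titles)
    return {
        "papers": n_papers,
        "vocabulary_size": len(vocabulary),
        # Average number of tokens per title; 0.0 for an empty dataset.
        "avg_text_length": total_tokens / n_papers if n_papers else 0.0,
    }
```

Because the vocabulary is a set union, the total vocabulary size over all conferences can be smaller than the sum of the per-conference sizes, as it is in the table.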

Format of Dataset

The data format for the citation graph is
PaperID \t PaperID
and the format for the data is
PaperID \t title
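
A minimal sketch of reading the two tab-separated files is shown below. The function names are hypothetical, and the sketch assumes one record per line, UTF-8 encoding, and that whitespace around the tab is not significant.

```python
def _read_pairs(path):
    """Yield (first, second) field pairs from a tab-separated file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            # Split on the first tab only, so titles may contain tabs.
            first, second = line.split("\t", 1)
            yield first.strip(), second.strip()


def load_titles(path):
    """Return a dict mapping PaperID to title."""
    return dict(_read_pairs(path))


def load_citations(path):
    """Return the citation links as a list of (PaperID, PaperID) tuples."""
    return list(_read_pairs(path))


def graph_counts(edges):
    """Return (number of distinct nodes, number of distinct directed links)."""
    nodes = {paper_id for edge in edges for paper_id in edge}
    return len(nodes), len(set(edges))
```

Note that `graph_counts` only counts nodes that appear in at least one link. Whether this matches the counting used for the statistics table is not confirmed here.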

Arxiv Dataset

This dataset contains scientific papers from 1997 to 2018. Each tar file has a name like 1801.01, where 18 is the year (2018), 01 is the month, and the number after the dot is the chunk. Each tar file is around 500 MB, and the files from 2018 total around 190 GB.
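
The naming scheme can be decoded with a few lines of code. The sketch below uses a hypothetical helper, `parse_archive_name`, and assumes that two-digit years of 90 and above belong to the 1900s, which fits the 1997 to 2018 range of this dataset.

```python
def parse_archive_name(name):
    """Split an archive name such as '1801.01' into (year, month, chunk)."""
    yymm, chunk = name.split(".")
    two_digit_year = int(yymm[:2])
    month = int(yymm[2:4])
    # Assumption: years 90-99 are 19xx, everything else is 20xx.
    year = 1900 + two_digit_year if two_digit_year >= 90 else 2000 + two_digit_year
    return year, month, int(chunk)
```

With this helper, the archives from a single year can be selected by filtering on the first element of the returned tuple.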

First, we generated a paperID for each paper. We used the filename as the paperID, since filenames are unique across the whole dataset. Some of the files are PDFs, and we removed them from the dataset. Then we tried to link each paperID with its title so that the paper can be linked with the other datasets.

The extract_tex script first removes the pdf files, then identifies the tar files in the archive directory that contain tex files. These tex files are extracted and moved to a separate folder that contains only tex files. Files that are already tex files but were compressed are moved directly to the tex folder.
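
The steps above can be sketched as follows. This is not the actual extract_tex script, only a rough approximation under several assumptions: every paper is a single file in the archive directory, compressed files end in `.gz`, a compressed file is either a tar archive or a single gzip-compressed tex file, and the output naming (`<paperID>__<file>.tex`) is made up for illustration.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def extract_tex(archive_dir, tex_dir):
    """Collect tex files from archive_dir into tex_dir; return written paths."""
    archive_dir, tex_dir = Path(archive_dir), Path(tex_dir)
    tex_dir.mkdir(parents=True, exist_ok=True)
    written = []

    # Step 1: remove papers that are only available as PDF.
    for pdf in archive_dir.glob("*.pdf"):
        pdf.unlink()

    for path in sorted(archive_dir.iterdir()):
        if not path.is_file():
            continue
        # The filename (without a trailing .gz) serves as the paperID.
        paper_id = path.name[:-3] if path.name.endswith(".gz") else path.name

        if tarfile.is_tarfile(str(path)):
            # Step 2: tar archive (possibly compressed); keep only .tex members.
            with tarfile.open(str(path)) as tar:
                for member in tar.getmembers():
                    if member.isfile() and member.name.endswith(".tex"):
                        # Use only the base name to avoid writing outside tex_dir.
                        out = tex_dir / f"{paper_id}__{Path(member.name).name}"
                        with tar.extractfile(member) as src, open(out, "wb") as dst:
                            shutil.copyfileobj(src, dst)
                        written.append(out)
        elif path.suffix == ".gz":
            # Step 3: a single compressed tex file; decompress it directly.
            out = tex_dir / f"{paper_id}.tex"
            with gzip.open(path, "rb") as src, open(out, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(out)

    return written
```

Reading members with `extractfile` instead of unpacking the whole archive avoids writing figures and other non-tex files to disk.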