| Dataset | Description | Link |
|---|---|---|
| DBLP | Contains titles of papers from the SIGMOD, ICSE, and DBLP conferences. | |
| Aminer | Contains titles and abstracts of scientific papers along with the citation graph. | Link |
| Arxiv | Contains the full text of papers in the form of TeX source code (downloading in progress). | Link |
First, we remove all stop words from the data; the stop-word list merges the NLTK and scikit-learn stop-word lists. The text is then tokenized and passed through NLTK's PorterStemmer, which reduces each word to its stem.
All the code is available in the Code section
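The preprocessing steps above can be sketched as follows. This is an illustrative sketch, not the actual project code: for brevity it uses only scikit-learn's built-in English stop-word list, with the NLTK merge shown in a comment, and the tokenizer regex is an assumption.

```python
# Illustrative sketch of the preprocessing pipeline described above.
# The actual pipeline merges in NLTK's stop words as well, e.g.
#   from nltk.corpus import stopwords
#   stop_words |= set(stopwords.words("english"))
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

stemmer = PorterStemmer()
stop_words = set(ENGLISH_STOP_WORDS)

def preprocess(text):
    """Lowercase, tokenize, drop stop words, and Porter-stem a title."""
    tokens = re.findall(r"[a-z]+", text.lower())  # simple word tokenizer (assumed)
    return [stemmer.stem(t) for t in tokens if t not in stop_words]
```

For example, `preprocess("The papers are stemmed")` drops the stop words and stems the rest.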
| Labels | #Papers | Vocabulary Size | Avg Text Length | #Nodes | #Links |
|---|---|---|---|---|---|
The average text length is calculated after tokenizing and stemming the title. The number of nodes and links is calculated from an undirected graph. In the case of the directed graph, the number of nodes is reduced to 7713 and the total number of links to 34380.
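Directed and undirected counts can differ because, for example, a pair of reciprocal citations collapses into a single undirected link. A minimal, hypothetical helper for such counts (not the code used to produce the table):

```python
def graph_stats(edges):
    """Count nodes, directed links, and undirected links for an edge list.

    `edges` is an iterable of (source, target) paper-ID pairs; this is an
    illustrative helper, not the actual dataset code.
    """
    directed = set(edges)                          # deduplicate directed links
    undirected = {frozenset(e) for e in directed}  # (a, b) and (b, a) merge
    nodes = {n for e in directed for n in e}
    return len(nodes), len(directed), len(undirected)
```

Here `graph_stats([(1, 2), (2, 1), (2, 3)])` reports three nodes and three directed links but only two undirected links.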
This dataset contains scientific papers from the time frame 1997 to 2018. Each tar file is named like 1801.01, meaning year 2018, month 01, with the number after the dot denoting the chunk. Each tar file is around 500 MB, and the files from 2018 alone total around 190 GB.
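The naming convention can be parsed as below. Note the two-digit-year pivot at 97 is an assumption based on the stated 1997 to 2018 time frame, not a documented arXiv rule:

```python
import re

def parse_tar_name(name):
    """Split an arXiv tar name like '1801.01' into (year, month, chunk).

    The pivot at 97 for two-digit years is an assumption derived from the
    dataset's 1997-2018 range.
    """
    m = re.fullmatch(r"(\d{2})(\d{2})\.(\d+)", name)
    if m is None:
        raise ValueError(f"unexpected tar name: {name}")
    yy, mm, chunk = m.groups()
    year = 1900 + int(yy) if int(yy) >= 97 else 2000 + int(yy)
    return year, int(mm), int(chunk)
```

For instance, `parse_tar_name("1801.01")` yields `(2018, 1, 1)`.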
First, we generated a paperID for each paper, using the filename as the ID since filenames are unique across the whole dataset. Some of the files are PDFs, and we removed those from the dataset. We then linked each paperID to the paper's title so that it can be joined with the other datasets.
The extract_tex script first removes the PDF files, then identifies the tar files in the archive directory that contain a tex file. These tex files are extracted and moved to a separate folder containing only tex files. Files that are already tex files but were compressed are decompressed and moved directly to the tex folder.
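The extract_tex steps can be sketched as follows; the directory layout, function name, and file-type handling here are assumptions, not the authors' actual script:

```python
# Hypothetical sketch of the extract_tex logic described above.
import gzip
import shutil
import tarfile
from pathlib import Path

ARCHIVE_DIR = Path("archive")  # assumed location of the downloaded files
TEX_DIR = Path("tex")          # assumed destination for extracted .tex sources
TEX_DIR.mkdir(exist_ok=True)

def extract_tex(archive_dir=ARCHIVE_DIR, tex_dir=TEX_DIR):
    for path in archive_dir.iterdir():
        if path.suffix == ".pdf":
            # Drop PDF-only submissions from the dataset.
            path.unlink()
        elif tarfile.is_tarfile(path):
            # Tar archives: pull out only the .tex members.
            with tarfile.open(path) as tar:
                for member in tar.getmembers():
                    if member.name.endswith(".tex"):
                        tar.extract(member, tex_dir)
        elif path.suffix == ".gz":
            # A single compressed tex file: decompress straight into tex_dir.
            with gzip.open(path, "rb") as src, \
                    open(tex_dir / path.stem, "wb") as dst:
                shutil.copyfileobj(src, dst)
```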