This repository contains the algorithm for computing the Depth, Uniqueness & Interdisciplinarity metrics for the Microsoft Academic Graph (MAG).
The computed metrics (26 GB) are stored at `/scratch/aal544/AndriyMetrics/AndriyMetrics.csv`. The table includes the metrics for 152M papers. `bka3` has root permissions on the directory/file. Expect a full re-run of the code to take 10-12 hours.
You will need the following files from MAG for the calculation:

- `PaperFieldsOfStudy.txt` (46G)
- `Papers.txt` (71G)
- `FieldsOfStudy.txt` (59M)
- `FieldOfStudyChildren.txt` (18M)
We are dealing with the following metrics:
Uniqueness: the metric is a tuple `(new_field_2combinations, field_count)`. The `field_count` is how many fields this paper has in total. The `new_field_2combinations` is a counter of how many unique 2-pairs of fields this paper introduces compared to all the papers published up to the paper's publication year. This way, the first occurrence of the combination `field1+field2` increments the `new_field_2combinations` value. If a paper has 5 fields `{f1,f2,f3,f4,f5}`, it is the first time `f5` is introduced, and no pair of the other 4 fields had ever been published together, the value of `new_field_2combinations` would be nCr(5,2) = 10. In another case, if all field pairs except for `field2` and `field5` had appeared together before, the value of `new_field_2combinations` would be 1, because the paper introduces only 1 new unseen combination.
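To make the counting concrete, here is a minimal, hypothetical sketch (not the repository code) of how `new_field_2combinations` could be computed over a chronologically sorted stream of papers; the `new_pair_counts` helper and its input format are assumptions for illustration:

```python
from itertools import combinations

def new_pair_counts(papers):
    """Return {pid: (new_field_2combinations, field_count)}.

    `papers` is an iterable of (pid, pub_year, field_set) tuples.
    Papers are scanned in chronological order; ties within a year are
    processed in input order here, whereas the repository code updates
    the seen-pair counts once per year.
    """
    seen_pairs = set()  # every field pair published so far
    metrics = {}
    for pid, _year, fields in sorted(papers, key=lambda p: p[1]):
        pairs = {frozenset(c) for c in combinations(fields, 2)}
        metrics[pid] = (len(pairs - seen_pairs), len(fields))
        seen_pairs |= pairs
    return metrics

papers = [
    (1, 1900, {"f1", "f2", "f3", "f4"}),        # all 6 pairs are new
    (2, 1901, {"f1", "f2", "f3", "f4", "f5"}),  # only the 4 pairs with f5 are new
]
print(new_pair_counts(papers))  # {1: (6, 4), 2: (4, 5)}
```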
P.S. We cannot interpret the counter as “innovation” or “novelty”. This is because one may argue that a paper p can be innovative or novel compared to the papers that preceded it, even if all of them had the same field vector as p (e.g., if p is novel in the way it solved a particular problem, rather than being novel in the topics it studies). In contrast, interpreting the counter as “uniqueness” is harder to argue against.
Interdisciplinarity: we measure interdisciplinarity in a way similar to lexicographic ordering (see the sketch after the list below).
In MAG there are 6 levels of fields: 19 parent fields (Math, Physics, Biology, etc.), 100+ first-level children (AI, Astronomy, ML, etc.), and so on. For each paper we can compute the vector of field counts per level, `v = [l0, l1, l2, l3, l4, l5]`.
- The most interdisciplinary papers are those whose 1st value in the vector is greatest.
  - Out of those, the most interdisciplinary are those whose 2nd value is greatest.
  - Out of those, the most interdisciplinary are those whose 3rd value is greatest.
  - And so on…
- Then come the papers whose 2nd value in the vector is greatest.
  - Out of those, the most interdisciplinary are those whose 3rd value is greatest.
  - Out of those, the most interdisciplinary are those whose 4th value is greatest.
  - And so on…
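Since Python compares lists element by element, sorting the `v` vectors in descending order implements exactly this ranking; a minimal sketch with made-up vectors:

```python
# Hypothetical level-count vectors v = [l0, l1, l2, l3, l4, l5] for three papers.
papers = {
    "A": [2, 2, 1, 0, 0, 0],
    "B": [3, 3, 2, 1, 0, 0],
    "C": [2, 3, 0, 0, 0, 0],
}
# List comparison in Python is lexicographic, so reverse=True ranks
# the most interdisciplinary paper first.
ranked = sorted(papers, key=papers.get, reverse=True)
print(ranked)  # ['B', 'C', 'A']
```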
Depth: this is the index of the last non-zero value in the per-paper level-count vector `v = [l0, l1, l2, l3, l4, l5]`. Since the field levels in MAG are hierarchical, the lower a field sits in the hierarchy, the more specific it is.
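A one-liner suffices for Depth; this sketch assumes `v` has at least one non-zero entry (every paper keeps at least a top-level field after parent propagation):

```python
def depth(v):
    # Index of the last non-zero entry of the level-count vector.
    return max(i for i, count in enumerate(v) if count > 0)

print(depth([2, 2, 1, 0, 0, 0]))  # 2, matching the sample table below
```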
The code for computing the metrics is available here in the repository. It is well commented and segmented.
Synopsis:
- Global Vars
  - `ENV` can be set to `test` or `HPC` for local and production execution.
  - `FIELD_CONFIDENCE` is the threshold for the MAG certainty of a field per paper. It is >50% by default.
- Paths
  - Set the path to the parent folder of MAG (this is where the data will be saved).
  - Set the name of the MAG folder.
- Get the Paper-Field Associations
  - Group papers and paper fields by PID.
  - Drop all fields that are below the certainty threshold.
- Get the Paper Publication Years
  - Merge each paper with its publication year.
- Extend Fields with the Parent Fields
  - Run BFS upward from each paper's fields to collect all parent fields (see the sketch after this list). For instance, if Eigen Decomposition is a field of the paper, we add its parent Linear Algebra and its parent Math to the paper's fields.
- Count Fields per Level
  - Calculate the `v` vector of field counts per level per paper.
- Calculate Uniqueness
  - Run a linear scan and update the tuple counts every year.
  - For every paper's field set, find all 2-combinations of the fields and keep track of which ones appear for the first time.
- Get Depth and Interdisciplinarity
  - Convert the `v` vector into a scaled value, and standardize the distribution to keep it in bounds (the Interdisciplinarity score).
  - Save the index of the last non-zero value of the `v` vector (the Depth).
- Save the Metrics
  - Save the file to `path`.
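As referenced in the “Extend Fields with the Parent Fields” step, here is a minimal, hypothetical sketch of the upward BFS; the `child_to_parents` map is an assumption about how the parent edges from `FieldOfStudyChildren.txt` could be indexed, not the repository's actual data structure:

```python
from collections import deque

def extend_with_parents(paper_fields, child_to_parents):
    """BFS upward from each field of a paper, collecting all ancestors.

    A field can have several parents in MAG, hence the BFS with a
    visited set rather than a simple chain walk.
    """
    extended = set(paper_fields)
    queue = deque(paper_fields)
    while queue:
        field = queue.popleft()
        for parent in child_to_parents.get(field, ()):
            if parent not in extended:
                extended.add(parent)
                queue.append(parent)
    return extended

# The example from the synopsis: Eigen Decomposition -> Linear Algebra -> Math.
child_to_parents = {
    "EigenDecomposition": {"LinearAlgebra"},
    "LinearAlgebra": {"Math"},
}
print(extend_with_parents({"EigenDecomposition"}, child_to_parents))
# {'EigenDecomposition', 'LinearAlgebra', 'Math'} (set order may vary)
```

Sample rows of the resulting output table: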
| PID | PaperFields | PubYear | LevelCounts | New_Tuples | Field_Count | Depth | Interdisciplinarity |
|---|---|---|---|---|---|---|---|
| 3483532 | {199539241, 190136086, 111472728, 17744445, 138885662} | 1825 | [2, 2, 1, 0, 0, 0] | 10 | 5 | 2 | -0.6337096715235878 |
| 152588939 | {71924100, 141071460, 2780401607, 86803240, 151730666, 127313418, 105702510, 2780193326, 2779777117} | 1884 | [3, 3, 2, 1, 0, 0] | 36 | 9 | 3 | 0.05073388992405839 |
| 134480136 | {111472728, 2780349523, 138885662} | 1893 | [1, 1, 1, 0, 0, 0] | 2 | 3 | 2 | -1.3179426349523269 |
| 76015792 | {71924100, 2778536324, 141071460, 86803240, 2778722699, 105702510} | 1904 | [2, 2, 2, 0, 0, 0] | 9 | 6 | 2 | -0.6335027045050068 |
| 118077477 | {54355233, 24107716, 185592680, 86803240, 55493867} | 1906 | [2, 2, 1, 0, 0, 0] | 10 | 5 | 2 | -0.6337096715235878 |
| 173670722 | {2780550144, 50522688, 199539241, 162324750, 17744445} | 1914 | [2, 2, 1, 0, 0, 0] | 9 | 5 | 2 | -0.6337096715235878 |
| 114636826 | {185592680, 178790620, 2777517455} | 1918 | [1, 1, 1, 0, 0, 0] | 3 | 3 | 2 | -1.3179426349523269 |
| 58810875 | {2524010, 2781425163, 33923547} | 1919 | [1, 1, 1, 0, 0, 0] | 3 | 3 | 2 | -1.3179426349523269 |
If you intend to use the `PaperFields` set, pass `{'PaperFields': literal_eval}` as a converter, but note that it is slow. Otherwise, just read the column in as a string.

```python
import pandas as pd
from ast import literal_eval

df = pd.read_csv(path + filename, usecols=['PID', 'PaperFields', 'PubYear'],
                 converters={'PaperFields': literal_eval})
```