gcm.arrow_strength providing different ranking #1130

Closed
ankur-tutlani opened this issue Jan 8, 2024 · 6 comments
Labels: question, stale

ankur-tutlani commented Jan 8, 2024
I am using the arrow_strength function to identify the top nodes contributing to the variation in the target node (Growth).

import pandas as pd
from dowhy import gcm

arrow_strengths = gcm.arrow_strength(scm, target_node='Growth', num_samples_conditional=5000,
                                     difference_estimation_func=gcm.divergence.estimate_kl_divergence_continuous_knn)
arrow_strength_pd = pd.DataFrame(list(arrow_strengths.items()), columns=['edge', 'importance'])
arrow_strength_pd = arrow_strength_pd.sort_values('importance', ascending=False)

There are ~40 nodes. After sorting, I get different rankings across runs: a node ranked 10th in one iteration can move to 30th in another iteration with the same causal graph and data, or vice versa. Is this behavior expected? Does it depend on the causal graph structure?

Version information:

  • DoWhy version [e.g. 0.11.1]
bloebp (Member) commented Jan 8, 2024

The arrow strength estimation involves sampling, which leads to variation between runs. You can reduce this by adjusting some parameters, such as tolerance (setting it to a smaller number).
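For example, a minimal sketch reusing the scm from above (tolerance is an existing parameter of gcm.arrow_strength; the value below is just an illustration):

# Lowering tolerance makes the internal sampling loop run longer before it
# stops, trading runtime for less run-to-run variation.
arrow_strengths = gcm.arrow_strength(scm, target_node='Growth',
                                     num_samples_conditional=5000,
                                     tolerance=0.001,  # smaller value -> more stable estimates
                                     difference_estimation_func=gcm.divergence.estimate_kl_divergence_continuous_knn)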

Generally, if the rankings change that much between runs, the connections are likely either all roughly equally strong or too weak overall (or the model simply isn't capturing them accurately enough). What is the range of the values?

You can also take a look at estimating confidence intervals; they might provide better insight:
https://www.pywhy.org/dowhy/v0.11.1/user_guide/modeling_gcm/estimating_confidence_intervals.html#conveniently-bootstrapping-graph-training-on-random-subsets-of-training-data
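Following that page, a minimal sketch (assuming data is the training DataFrame used to fit scm):

# Refit the model on random subsets of the data and compute arrow strengths
# each time; returns median strengths plus bootstrap confidence intervals.
median_strengths, intervals = gcm.confidence_intervals(
    gcm.fit_and_compute(gcm.arrow_strength, scm,
                        bootstrap_training_data=data,
                        target_node='Growth'))

If the intervals of two edges overlap heavily, their relative ranking is not meaningful, which would explain the instability you are seeing.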

ankur-tutlani (Author) commented

Thanks for sharing the link; this is helpful.
Does the library have any recommendations on the following?

  1. If the causal graph structure is not very certain: the "auto" option takes care of assigning causal mechanisms, but is there anything similar for the graph itself?
  2. What are the recommendations for improving this if we get, say, the following evaluation result?

The overall average KL divergence between the generated and observed distribution is 0.6444021604490836
The estimated KL divergence indicates a good representation of the data distribution, but might indicate some smaller mismatches between the distributions.
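(For reference, a summary like this comes from the model evaluation routine; a minimal sketch, assuming the scm and training data from my first comment:)

# Prints a summary that includes the average KL divergence between
# the generated and observed distributions.
print(gcm.evaluate_causal_model(scm, data))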

github-actions bot commented Jan 26, 2024

This issue is stale because it has been open for 14 days with no activity.
bloebp (Member) commented Jan 26, 2024

Sorry for the late reply!

1. If the causal graph structure is not very certain: the "auto" option takes care of assigning causal mechanisms, but is there anything similar for the graph itself?

You can take a look at https://github.com/py-why/causal-learn, a package for inferring the causal graph from data.
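For example, a minimal sketch using the PC algorithm from causal-learn (data stands for your observations as a pandas DataFrame; the choice of algorithm here is just an illustration):

from causallearn.search.ConstraintBased.PC import pc

# Run the PC algorithm on the raw observations to get a candidate graph,
# which you can then inspect and refine before handing it to DoWhy.
causal_graph = pc(data.to_numpy())
print(causal_graph.G)  # the estimated graph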

2. What are the recommendations for improving this if we get, say, the following evaluation result?

You could try setting the quality parameter in the auto assignment function to BETTER (see the docstring of that function). Let me know if this improves the results (i.e., gives a lower KL divergence). Otherwise, you might need to manually check which causal mechanisms can be improved; the per-node performance results may give some insight.
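For example, a minimal sketch (assuming the scm and data from before):

# BETTER searches over a larger set of candidate models per node than the
# default quality; remember to refit after reassigning mechanisms.
gcm.auto.assign_causal_mechanisms(scm, data,
                                  quality=gcm.auto.AssignmentQuality.BETTER,
                                  override_models=True)  # replace previously assigned mechanisms
gcm.fit(scm, data)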

github-actions bot commented Feb 10, 2024

This issue is stale because it has been open for 14 days with no activity.
github-actions bot commented Feb 18, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned on Feb 18, 2024.