Paco Nathan

Known as a “player/coach,” with core expertise in data science, natural language processing, machine learning, and cloud computing; 35+ years of tech industry experience, ranging from Bell Labs to early-stage start-ups. Co-chair of Rev and JupyterCon. Advisor for the NYU Coleridge Initiative, IBM Data Science Community, Amplify Partners, Recognai, and Primer. Formerly: Director of Community Evangelism for Apache Spark at Databricks. Cited in 2015 as one of the Top 30 People in Big Data and Analytics by Innovation Enterprise.

2021 Talk: Graph-Based Data Science: `kglab` open source integration of graph libraries with popular data science tooling

Python offers excellent libraries for working with graphs: semantic technologies, graph queries, interactive visualizations, graph algorithms, probabilistic graph inference, as well as embeddings and other integrations with deep learning. However, these approaches share little common ground, and few of them integrate effectively with popular data science tools (pandas, scikit-learn, spaCy, PyTorch) or efficiently with popular data engineering infrastructure such as Spark, RAPIDS, Ray, Parquet, fsspec, etc. This talk reviews `kglab` – an open source project that integrates nearly all of the above, and moreover provides ways to combine these disparate techniques so they complement each other, producing Hybrid AI solutions for industry use cases.
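To give a flavor of the kind of interoperability the talk describes – running graph algorithms, then handing results off to standard data science tooling – here is a minimal sketch using networkx and pandas directly. Note this is an illustration of the integration style, not `kglab`'s own API; the node names and the edge data are hypothetical.

```python
import networkx as nx
import pandas as pd

# Toy graph: nodes are datasets, edges link datasets that appeared
# together in the same publication (hypothetical example data).
G = nx.Graph()
G.add_edges_from([
    ("dataset_A", "dataset_B"),
    ("dataset_B", "dataset_C"),
    ("dataset_A", "dataset_C"),
    ("dataset_C", "dataset_D"),
])

# Run a graph algorithm (PageRank here) ...
ranks = nx.pagerank(G)

# ... then move the results into a pandas DataFrame, where the usual
# data science workflow (joins, plots, scikit-learn features) takes over.
df = (
    pd.DataFrame(ranks.items(), columns=["node", "pagerank"])
    .sort_values("pagerank", ascending=False)
    .reset_index(drop=True)
)
print(df)
```

A library such as `kglab` aims to make this hand-off seamless across many more representations (RDF, SPARQL results, property graphs, Parquet files) rather than requiring glue code for each pair of tools.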

2020 Talk: Rich Context: a knowledge graph for linking datasets with research outcomes

The Rich Context project at NYU Wagner is the knowledge graph complement to the ADRF platform for cross-agency social science research using sensitive data, currently used by 50+ agencies. Rich Context represents metadata about datasets and their use in research, which in turn influences public policy, with the goal of producing recommender systems for analysts and policymakers. Nearly all of the code is open source. This talk introduces the background for the project and our team's collaboration process, then covers several areas where machine learning is used to infer or clean metadata obtained from scholarly infrastructure and to support semi-automated graph construction, along with human-in-the-loop feedback mechanisms that let domain experts help improve the graph.

View the complete 2020 talk in the KGC media library.