
CONTEXT + DATASET

The CONTEXT + DATASET (C+DS) Library is a curated collection of datasets designed specifically for teaching responsible data science. Each dataset is paired with rich contextual information—such as background, stakeholder perspectives, considerations, and intended use—to encourage critical thinking and reflection in data-driven decision-making.

Purpose: To enable faculty to replace outdated or decontextualized training datasets with materials that support intentional, values-driven instruction.

Why It Matters: Responsible data science requires more than technical accuracy—it demands an understanding of why the data exists, who it impacts, and how it should be used. This library provides that missing layer of context to help students meaningfully engage with real-world implications.

Intended Use:

  • Faculty can integrate datasets directly into coursework, case studies, or project-based assignments.
  • Supports learning outcomes tied to context-driven value, such as stakeholder engagement, insight generation, and critical analysis.
  • Directly pairs context (introductory narrative, role, business objectives, and products) with datasets (data dictionary, original dataset, provenance, and derived datasets).
  • Suggests potential responsible data science opportunities for faculty to draw on, while leaving faculty the flexibility to use each dataset in a way that fits their course learning objectives.

See examples of how faculty have integrated these materials into their coursework here

Explore the Contexts and Datasets

  • Rising rates of diet-related health issues have led to growing public scrutiny of the fast food industry, particularly the nutritional content of menu items. EatWell Insights, a health-focused food analytics company, partners with public health organizations to evaluate and improve food choices in the fast food sector. In this scenario, learners will work with nutrition data from MenuStat, a publicly available database that tracks menu items and their nutrition information across major fast food chains in the U.S.

  • Declining civic participation, particularly in voting and volunteering, has raised concerns about democratic engagement and social trust in the U.S. Civic Pulse Analytics, a nonprofit research firm, partners with local governments and advocacy groups to analyze civic behaviors and develop targeted engagement strategies.

  • As artificial intelligence becomes more powerful and embedded across industries, policymakers and technologists face growing pressure to establish transparent, accountable, and globally coordinated AI governance frameworks.

  • The U.S. Conservation Reserve Program (CRP), intended to support sustainable farming and soil conservation, is struggling to meet its goals amid declining farmer participation and outdated targeting strategies. AgriPulse Analytics, a nonprofit research group focused on sustainable agriculture policy, provides data-driven insights to inform federal conservation program improvements. Learners will work with USDA Agricultural Census data accessed via an ArcGIS Feature Service, analyzing land use patterns and conservation enrollment across U.S. counties.

  • AI is playing an increasingly central role in disease prediction, helping healthcare providers shift from reactive to preventative care—but concerns remain about accuracy, fairness, and clinical integration. The National Human Genome Research Institute (NHGRI) is exploring how AI models trained on structured health data can assist clinicians in identifying individuals at risk for chronic conditions like heart disease.

  • Pittsburgh’s economy has shifted toward technology and healthcare, but manufacturing remains a critical sector, employing thousands and contributing significantly to regional GDP.

  • The consumer electronics industry faces ongoing competition as brands balance innovation, pricing, and promotional strategies to capture market share.

  • News.Health.002 Autism

    The field of public health is grappling with increasing autism spectrum disorder (ASD) prevalence and evolving demographic patterns, underscoring the need for equitable early diagnosis.

  • Gender-affirming care insurance coverage in the U.S. faces potential rollbacks under new federal health policy shifts. The policy changes are unfolding amid broader healthcare and LGBTQIA+ equity debates led by a federal agency.

  • Canada's automotive manufacturing sector is poised for transformation amid growing pressure to expand electric vehicle (EV) production and sales. The scenario centers on a Canadian automotive manufacturer grappling with evolving industry dynamics, regulatory EV mandates, and shifting consumer demand.

  • The banking industry faces a growing threat from AI-enhanced fraud schemes that bypass traditional detection systems. TrustNet Bank, a mid-sized financial institution, serves both personal and small business custom

  • The retail industry is undergoing a rapid shift as companies leverage AI and data analytics to optimize store layouts, product placement, and customer experiences. Lowe’s, a major U.S. home improvement retailer, is using AI, spatial intelligence, and digital twin simulations to redesign store layouts and align inventory with shifting customer trends.

  • In a competitive, data-rich retail landscape, modern analytics empower companies to make rapid, evidence-based decisions for inventory, marketing, and customer engagement.

  • The pharmaceutical industry faces mounting concerns over the safety and oversight of medications produced through global supply chains, especially with growing reliance on Chinese manufacturing.

    https://en.wikipedia.org/wiki/Pharmaceutical_industry_in_China?utm_source=chatgpt.com
    https://prosperousamerica.org/china-owns-americas-generic-drug-supply-chain-u-s-china-commission-says/?utm_source=chatgpt.com
    https://www.uscc.gov/sites/default/files/2019-11/Chapter%203%20Section%203%20-%20Growing%20U.S.%20Reliance%20on%20China%E2%80%99s%20Biotech%20and%20Pharmaceutical%20Products.pdf?utm_source=chatgpt.com

  • The consumer electronics industry—particularly wearables—faces growing scrutiny as device defects can lead to physical injuries and regulatory penalties.

  • The retail industry faces increasing pressure to optimize supply chains amid shifting consumer spending and growing demand for faster delivery. Target Corporation, a leading U.S. retailer, is investing heavily in expanding regional distribution centers to improve grocery supply chain efficiency and better serve customers.

  • Parkinson’s disease management faces challenges due to the need for frequent clinical assessments, especially for patients with mobility issues or living remotely. Leeds Teaching Hospitals has introduced a smartphone app to facilitate remote symptom tracking and improve patient care. The learner will work with a telemonitoring dataset containing speech recordings and clinical scores from Parkinson’s patients to explore remote symptom monitoring.

  • Esophageal cancer treatment outcomes remain a major challenge in oncology, with perioperative chemotherapy showing promise in improving survival rates.

  • The rapid growth of artificial intelligence has spurred a surge in global venture funding, reshaping the technology and innovation landscape.

  • The rise of sophisticated deepfake technology has increased risks of identity fraud in the financial services industry. A leading financial institution is working to strengthen its fraud detection capabilities to safeguard customer assets and maintain trust.

  • Urban wildlife management faces challenges balancing ecological health with human interaction in public parks. The Central Park Conservancy oversees the care and study of New York City's Central Park, including its diverse animal populations.

  • Motor vehicle fatalities have sharply increased in New York, raising urgent concerns about traffic safety and public health in the transportation industry.

  • Urban forestry plays a critical role in improving city livability and combating climate change within the environmental management industry.

  • Food insecurity remains a pressing social issue within urban public health and community development sectors. Philadelphia-based nonprofits and government agencies are collaborating to address youth hunger and improve access to nutritious foods across the city.

  • The banking industry faces rapid transformation driven by economic shifts, regulatory changes, and technological innovation.

  • Childcare accessibility and affordability remain critical issues within the early childhood education and social services sectors. The Pittsburgh Foundation plays a vital role in supporting community initiatives to improve childcare resources and outcomes in the region.

  • This scenario addresses trends in higher education enrollment within the community college industry. The source organization collects and analyzes enrollment data to track shifts and inform institutional planning.

  • This scenario focuses on crime data analysis within the public safety and law enforcement industry. The FBI, through its Uniform Crime Reporting (UCR) program, collects nationwide crime statistics to support crime monitoring and policy decisions.

  • This scenario addresses youth health and risk behaviors within the public health and education sectors. The CDC administers the Youth Risk Behavior Survey to monitor behaviors contributing to the leading causes of morbidity and mortality among adolescents.

  • This scenario explores disparities in pediatric health outcomes, focusing on inequities faced by children of color within the healthcare and public health sectors. The

  • Public health data reporting varies widely across states, impacting how disease trends are tracked and managed within the healthcare and government data management industry.

  • The recent outbreak of Sudan Ebola virus disease in Uganda highlights critical challenges in infectious disease control within the global public health sector.

  • This scenario addresses the challenges of disaster response and recovery within the humanitarian and geospatial data industries. The Harvard Humanitarian Initiative plays a crucial role by collecting, analyzing, and disseminating geospatial data to support relief efforts in disaster-affected regions.

  • The scenario focuses on the challenge of managing dilapidated and abandoned properties, which pose risks to community safety and strain city resources within the urban housing and municipal services industry.

  • In today’s data-driven world, the ability to analyze and interpret data is an essential skill across industries—from healthcare to finance to public policy. However, many high school and community college students face barriers to engaging with data science, including limited access to mentors, uncertainty in developing research questions, and challenges in working with datasets. DataJam was created to bridge this gap, fostering data literacy and critical thinking through real-world problem-solving. To enhance this experience, the Virtual DataJam Helper introduces AI-powered mentors, such as ChatGPT, to guide students through the research process. This tool supports learners in refining their hypotheses, identifying relevant datasets, and navigating data exploration. By embedding responsible data science principles, this initiative empowers students to ask meaningful questions, work with functional data, and develop confidence in using AI tools—preparing them for further education and careers in an increasingly data-centric world.

  • https://www.truthinaccounting.org/news/detail/falling-revenue-soaring-costs-threaten-pittsburghs-financial-future?utm_source=chatgpt.com

  • Affordable housing remains a critical issue in Pittsburgh, with initiatives focused on creating funding solutions and tracking housing stability.

  • The banking industry is undergoing a significant transformation driven by large investments in artificial intelligence and advanced technologies aimed at improving customer experience and operational efficiency.

  • Vacant and tax-delinquent properties pose challenges for Pittsburgh neighborhoods, complicated by outdated ownership records.

  • In the mid-1990s, the Group Insurance Commission (GIC) in Massachusetts released anonymized health records for research purposes, believing that removing explicit identifiers like names would protect individuals' privacy. However, by combining this data with publicly available voter registration information, individuals could be re-identified, exposing sensitive health information.

  • In 2006, AOL published a dataset containing 20 million search queries from 650,000 users, intending it for research use. Although the data was anonymized by replacing usernames with unique identification numbers, reporters from The New York Times were able to identify individuals based on the content of their search queries, revealing personal and sensitive information.

  • In 2006, Netflix released an anonymized dataset of movie ratings from approximately 500,000 subscribers as part of a public machine learning competition. Researchers Arvind Narayanan and Vitaly Shmatikov demonstrated that by cross-referencing this dataset with non-anonymous IMDb user reviews, they could re-identify specific individuals, uncovering their personal viewing histories.

  • During the 2010 Haiti earthquake, researchers utilized anonymized mobile phone data to track population movements. This information was crucial in understanding the spread of cholera and coordinating effective humanitarian responses. By analyzing call patterns, researchers could estimate the movement of people both in response to the earthquake and during the related cholera outbreak, enabling rapid and accurate interventions.

  • During the Ebola outbreak in West Africa, telecommunications companies shared aggregated mobile data to help track and predict the spread of the virus. This collaboration allowed health organizations to allocate resources more effectively and implement targeted containment strategies.

  • The manufacturing and industrial sector is increasingly embracing digital transformation to enhance productivity and innovation. General Electric (GE) is undergoing a company-wide digital overhaul to optimize operations and product development.

  • Digital marketing firms increasingly rely on user engagement data to optimize ad performance, but without observability into how data flows and changes across systems, ad targeting can become inefficient or biased.

  • A mysterious “coding issue” in Equifax’s AI-driven credit scoring system led to consumers receiving inaccurate credit scores in early 2022, illustrating risks within the financial services and data infrastructure industry.

  • The Bureau of Labor Statistics runs the Quarterly Census of Employment and Wages program, which collects and publishes statistics on the number of jobs, average weekly wages, and total wages aggregated by granular industry type, U.S. county, category of owner, and size of business by employee headcount. ((Some options for the narrative: Which industries are seeing a pattern of sustained growth in number of employees and average wages over the last five years? For those in the industry, this analysis could help them keep salaries competitive in the interest of employee retention. For those in local government, these trends could help them decide which industries to bring to their area to boost employment. Picking a given industry and looking at where the growth is strongest could help a job seeker decide where to move to.))

  • The available data table provides the dates of mortgage foreclosure filings in Allegheny County, the parcel identifiers (which allow linking to many other records, including assessed property value, sales history, and the nature of the property (e.g., home vs. apartment building vs. business location)), the amount that the lender is suing the property owner for, and the name of the lender. This data provides a history of foreclosure attempts from January of 2009 to the present day. ((Narratives to avoid: “Determine risk of mortgage based on various aggregate property attributes.” That way lies discriminatory redlining!))

  • Every year, the FDIC publishes an inventory of all U.S. banks and branches. At the institution level (aggregating over all branches), the data includes whether the bank is FDIC insured, the total assets of the bank, the amount of deposits in checking accounts (demand accounts), and the amount of deposits in time (e.g., CDs) and savings accounts. At the branch level, this data includes the address and geocoordinates of the branch, the date that the branch was established, and the total deposits at that branch. ((This data could support historical analyses, going back to 1994, of the number, locations, and holdings of bank branch offices. As routine bank operations such as withdrawing cash or depositing checks can increasingly be done without going to branches, the demand for branch offices has decreased and the total number of branches in Allegheny County has fallen by 25% over the last 30 years. One option here is to use Census data to predict population shifts over the next 10 years and then, taking the role of one of the banks with the most branches, decide how to redistribute branch offices to best serve Allegheny County’s population.))

  • Bella and her team at Healthcare Insurance are evaluating a new dataset that contains comprehensive patient and insurance payment information. This dataset is considered crucial for enhancing the company's predictive analytics models, which forecast healthcare trends, personalize insurance plans, and optimize claim processing. Bella Ramirez is the Procurement Team Lead, with 15 years of experience in the healthcare industry, specializing in procurement. Her responsibilities include evaluating and acquiring high-quality datasets to improve the company's analytical models; facilitating vendor reviews and ensuring all datasets comply with applicable provenance requirements, including metadata coverage, regulatory requirements, and transparent AI data usage; ensuring that procured data meets integration and operational needs; ensuring data usage aligns with healthcare regulations and company policies; and leveraging data insights for innovative marketing and improved customer trust.

  • Jordan Liu's current project involves curating a dataset that tracks media consumption habits across diverse platforms. This dataset aims to empower media buyers and sellers to target their audience segments accurately, facilitating personalized content strategies for industries ranging from consumer goods to tourism. Jordan Liu is the Data Strategy Director, with a decade of experience in media analytics and a deep understanding of the media consumption landscape. His responsibilities include overseeing the development and distribution of comprehensive media consumption datasets; ensuring datasets are relevant, reliable, and adherent to the latest data provenance standards for transparent AI data usage; collaborating with stakeholders across healthcare, consumer goods, and travel industries to tailor data offerings; guiding the integration of datasets into client systems to optimize targeted content delivery and marketing strategies; and advocating for data-driven decision-making within the company and among clients to foster industry innovation.

  • Minh is tasked with evaluating a new dataset for refining AI algorithms for customer credit card offerings. The dataset under consideration has been documented in accordance with the latest data provenance standards, ensuring transparency and compliance, especially under GDPR and the EU AI Act. Minh's evaluation process focuses on the detailed metadata provided for the dataset. Minh Quang Nguyen is a Data Architecture and Policy Analyst with over a decade of experience in data management and policy development. This role’s responsibilities include designing and implementing efficient data architectures that support ProForma’s business goals; working closely with IT teams to ensure that data structures are scalable, secure, and optimized for performance; and playing a crucial role in developing and enforcing data management policies, ensuring compliance with regulatory standards and protecting customer information.

  • The global nature of Navisphere Logistics, Ltd.'s operations means that the company must navigate a complex web of international tariffs and customs regulations. Efficiently managing these tariffs is critical to minimizing delivery times and costs. Dr. Hicks and her team are tasked with refining the company’s AI systems to accurately predict tariff costs across different countries and product categories. Dr. Maya Hicks is the Lead Data Scientist, with a specialization in artificial intelligence and machine learning and a keen interest in optimizing supply chain efficiency through innovative technologies. Her responsibilities are to lead the AI research and development team in refining and enhancing the company's AI-driven tariff prediction models; evaluate datasets for integrity and compliance with corporate policies and standards, which reflect international regulations and privacy considerations; collaborate with procurement and legal colleagues to ensure that the data and AI models are in line with global standards and regulations; train and optimize AI models to accurately predict tariffs, involving sophisticated algorithms and machine learning techniques; integrate the refined AI models into Navisphere Logistics' operational systems and conduct extensive testing to ensure accuracy and efficiency; establish and maintain a feedback loop for continuous monitoring and improvement of the AI models based on real-world application insights; ensure the responsible use of AI in accordance with Navisphere Logistics’ standards and privacy laws, particularly in the handling of sensitive data; communicate the progress and outcomes of the AI enhancements to stakeholders, including technical teams, management, and commercial clients; and stay updated with the latest developments in AI, machine learning, and international logistics practices to continually drive innovation within the company.
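
The three re-identification scenarios listed above (the GIC health records, the AOL search logs, and the Netflix ratings) rest on the same mechanism: a de-identified table is joined to a named one on attributes the two share. The sketch below is one possible classroom illustration of that linkage and is not the method used in any of those cases. Every record, field name, and value is made up, and the choice of ZIP code, birth date, and sex as the linking fields is an assumption for illustration only.

```python
from collections import defaultdict

# Synthetic records, invented for illustration. No real people or data.
health_records = [  # "anonymized": names removed, other fields kept
    {"zip": "00001", "birth_date": "1950-01-01", "sex": "M", "diagnosis": "condition A"},
    {"zip": "00002", "birth_date": "1962-03-14", "sex": "F", "diagnosis": "condition B"},
    {"zip": "00002", "birth_date": "1962-03-14", "sex": "F", "diagnosis": "condition C"},
]
voter_roll = [  # public list that still carries names
    {"name": "Person One", "zip": "00001", "birth_date": "1950-01-01", "sex": "M"},
    {"name": "Person Two", "zip": "00002", "birth_date": "1962-03-14", "sex": "F"},
    {"name": "Person Three", "zip": "00003", "birth_date": "1980-01-01", "sex": "F"},
]

# Assumed linking fields (quasi-identifiers) shared by both tables.
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def key(record):
    """Return the tuple of quasi-identifier values for one record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link_records(deidentified, identified):
    """Pair each named person with a de-identified row when the match is unique."""
    by_key = defaultdict(list)
    for row in deidentified:
        by_key[key(row)].append(row)

    matches = []
    for person in identified:
        candidates = by_key.get(key(person), [])
        if len(candidates) == 1:  # exactly one candidate: re-identified
            matches.append((person["name"], candidates[0]["diagnosis"]))
    return matches


def smallest_group_size(records):
    """Size of the smallest group sharing the same quasi-identifier values."""
    counts = defaultdict(int)
    for row in records:
        counts[key(row)] += 1
    return min(counts.values())


print(link_records(health_records, voter_roll))
print(smallest_group_size(health_records))
```

In this toy data, only the person whose combination of linking fields is unique in the health table is matched; the two rows that share the same values stay ambiguous. A smallest group size of 1 signals that at least one row is exposed to this kind of join, which gives learners a simple check to run before a dataset is released.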