ai-alignment

Here are 35 public repositories matching this topic...

MinghuiChen43 / awesome-trustworthy-deep-learning

A curated list of trustworthy deep learning papers. Daily updating...

Updated Nov 12, 2024

agencyenterprise / PromptInject

PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022

machine-learning agi language-models ai-safety adversarial-attacks ai-alignment ml-safety gpt-3 large-language-models prompt-engineering chain-of-thought agi-alignment

Updated Feb 26, 2024
Python

tomekkorbak / pretraining-with-human-feedback

Star

Code accompanying the paper Pretraining Language Models with Human Preferences

reinforcement-learning gpt language-models ai-safety ai-alignment pretraining decision-transformers rlhf

Updated Feb 13, 2024
Python

lets-make-safe-ai / make-safe-ai

Star

How to Make Safe AI? Let's Discuss! 💡|💬|🙌|📚

ai agi artificial-intelligence artificial-general-intelligence ai-safety ai-alignment

Updated Mar 29, 2023

Giskard-AI / awesome-ai-safety

Sponsor

Star

📚 A curated list of papers & technical articles on AI Quality & Safety

Updated Oct 13, 2023

EzgiKorkmaz / adversarial-reinforcement-learning

Star

Reading list for adversarial perspective and robustness in deep reinforcement learning.

deep-reinforcement-learning ai-safety adversarial-machine-learning multiagent-reinforcement-learning robust-machine-learning ai-alignment safe-reinforcement-learning robust-reinforcement-learning responsible-ai adversarial-reinforcement-learning meta-reinforcement-learning explainable-machine-learning adversarial-policies safe-rlhf machine-learning-safety reinforcement-learning-safety artificial-intelligence-alignment reinforcement-learning-alignment robust-deep-reinforcement-learning

Updated Jun 18, 2024

dit7ya / awesome-ai-alignment

Star

A curated list of awesome resources for Artificial Intelligence Alignment research

awesome awesome-list ai-safety ai-alignment

Updated Jul 14, 2023

AthenaCore / AwesomeResponsibleAI

Star

A curated list of awesome academic research, books, code of ethics, data sets, institutes, newsletters, principles, podcasts, reports, tools, regulations and standards related to Responsible, Trustworthy, and Human-Centered AI.

ai awesome-list ai-safety interpretable-ai explainable-ai xai ai-alignment fairness-ai responsible-ai ethical-ai trustworthy-ai ai-regulation ai-governance ai-standards

Updated Nov 16, 2024

RLHFlow / Directional-Preference-Alignment

Star

Directional Preference Alignment

ai-alignment large-language-models rlhf

Updated Sep 23, 2024

wesg52 / sparse-probing-paper

Star

Sparse probing paper full code.

ai-safety interpretability ai-alignment mechanistic-interpretability

Updated Dec 17, 2023
Jupyter Notebook

riceissa / aiwatch

Star

Website to track people, organizations, and products (tools, websites, etc.) in AI safety

mysql php database dataset ai-safety data-portal aisafety ai-alignment

Updated Nov 17, 2024
HTML

UCSC-VLAA / Sight-Beyond-Text

Star

This repository includes the official implementation of our paper "Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics"

alignment vlm ai-alignment vision-language vicuna llm mllm llava llama2

Updated Sep 15, 2023
Python

liondw / Signal-Alignment

Star

An initiative to create concise and widely shareable educational resources, infographics, and animated explainers on the latest contributions to the community AI alignment effort. Boosting the signal and moving the community towards finding and building solutions.

education design ai ai-alignment

Updated Jul 9, 2023

phelps-sg / llm-cooperation

Sponsor

Star

Code and materials for the paper S. Phelps and Y. I. Russell, Investigating Emergent Goal-Like Behaviour in Large Language Models Using Experimental Economics, working paper, arXiv:2305.07970, May 2023

economics ai-safety gametheory experimental-economics behavioral-economics prisoners-dilemma ai-alignment experimental-psychology social-dilemmas gpt-3 gpt-4 llm principal-agent-problem

Updated Mar 1, 2024
Python

IQTLabs / daisybell

Star

Scan your AI/ML models for problems before you put them into production.

cybersecurity ai-safety bias-correction bias-detection ai-alignment model-poison ai-assurance

Updated Nov 16, 2024
Python

ai-fail-safe / safe-reward

Star

a prototype for an AI safety library that allows an agent to maximize its reward by solving a puzzle in order to prevent the worst-case outcomes of perverse instantiation

failsafe ai-safety anomaly-detection ai-alignment fail-safe