Week 3: Prompting and Bias

(see bottom for assigned readings and questions)

Prompt Engineering (Week 3)

Presenting Team: Haolin Liu, Xueren Ge, Ji Hyun Kim, Stephanie Schoch

Blogging Team: Aparna Kishore, Erzhen Hu, Elena Long, Jingping Wan

(Monday, 09/11/2023) Prompt Engineering

Warm-up questions

Monday’s class started with warm-up questions to demonstrate how prompts can help an LLM produce correct answers or desired outcomes. As an in-class experiment, each student tested the questions in GPT-3.5 and tried to craft prompts that helped it produce the correct answers.

The three questions were:

  1. What is 7084 times 0.99?
  2. I have a magic box that can only transfer coins. If you insert a number of coins in it, the next day each coin will turn into two apples. If I add 10 coins and wait for 3 days, what will happen?
  3. Among “Oregon, Virginia, Wyoming”, what is the word that ends with “n”?

While the first question tested the arithmetic capability of the model, the second and the third questions tested common sense and symbolic reasoning, respectively. The initial response from GPT-3.5 was wrong for all three questions.

For the first question, providing more examples as prompts did not work. However, explaining how to reach the answer by decomposing the multiplication into multiple steps helped.

Figure 1 shows the prompting for the first question and the answer from GPT-3.5.

Figure 1: Prompting for arithmetic question
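The exact prompt from Figure 1 is not reproduced here, but one plausible decomposition of the first question is to rewrite 0.99 as 1 − 0.01. The sketch below spells out those intermediate steps; the function name `decompose_times_099` is hypothetical and only serves as an illustration of the kind of step-by-step explanation that could be placed in a prompt.

```python
from decimal import Decimal


def decompose_times_099(x):
    """Return the intermediate steps for x * 0.99, written as x - x/100.

    This is an assumed decomposition, not the prompt used in class.
    """
    one_percent = x / Decimal(100)   # x * 0.01
    result = x - one_percent         # x * (1 - 0.01)
    return [
        f"{x} * 0.99 = {x} * (1 - 0.01)",
        f"{x} * 0.01 = {one_percent}",
        f"{x} - {one_percent} = {result}",
    ]


steps = decompose_times_099(Decimal("7084"))
for step in steps:
    print(step)
```

Decimal arithmetic is used instead of floats so that the printed intermediate values are exact.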

For the second question, providing an example and an explanation of the reasoning needed to reach the final answer helped GPT produce the correct answer. Here, the prompt explicitly stated that the magic box can also convert coins to apples.

Figure 2 shows the prompting for the second question and the answer from GPT-3.5.

Figure 2: Prompting for common sense question

While GPT was producing random results for the third question, instructing GPT through examples to take the words, concatenate the last letters, and then find the position of the target letter helped produce the correct answer.

Figure 3 shows the prompting for the third question and the answer from GPT-3.5.

Figure 3: Prompting for symbolic reasoning question
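The steps described for the third question (take the words, concatenate the last letters, find the position of the target letter) can be checked deterministically, which gives a ground truth for grading the model's answer. This is a minimal sketch; the helper names are hypothetical and not part of the class activity.

```python
def last_letters(words):
    """Concatenate the last letter of each word."""
    return "".join(word[-1] for word in words)


def word_ending_with(words, letter):
    """Return the first word whose last letter matches, or None."""
    position = last_letters(words).lower().find(letter.lower())
    return words[position] if position >= 0 else None


words = ["Oregon", "Virginia", "Wyoming"]
concatenated = last_letters(words)
answer = word_ending_with(words, "n")
```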

All these examples demonstrate the benefit of using prompts to explore the model’s reasoning ability.

What is Prompt Engineering?

Prompt engineering is a method to communicate with and guide an LLM toward a desired behavior or outcome by crafting prompts that coax the model into providing the desired response. The model weights or parameters are not updated in prompt engineering.

How is prompt-based learning different from traditional supervised learning?

Traditional supervised learning trains a model to take an input and predict an output, learning to map input data to specific output labels. In contrast, prompt-based learning models the probability of the text directly. Here, the inputs are converted to textual strings called prompts, which are used to generate the desired outcomes. Prompt-based learning offers more flexibility in adapting the model’s behavior to different tasks by modifying the prompts, and retraining the model is not required.

Interestingly, prompts were initially used for language translation and text-based emotion prediction rather than for improving the performance of LLMs.

In-context learning and different types of prompts

In-context learning is a powerful approach to adapting a model to a specific context without updating its weights. This improves the performance and reliability of the model for the specific task or environment. Here, the model is given a few examples or instructions as reference that are relevant to the context and domain-specific.

We can categorize in-context learning into three different types of prompts:

  • Zero-shot – the model predicts the answers given only a natural language description of the task.

Figure 4: Example for zero-shot prompting (Image Source)

  • One-shot or Few-shot – one or a few examples that illustrate the task are provided to the model, i.e., the model is prompted with a few input-output pairs.

Figure 5: Examples for one-shot and few-shot prompting (Image Source)

  • Chain-of-thought – The given task or question is decomposed into coherent intermediate reasoning steps that are solved before providing the final response. This explores the reasoning ability of the model for each of the provided tasks. Each example is given in the format <input, chain of thought, output>. The difference between standard prompting and chain-of-thought prompting is depicted in the figure below. In the right half of the figure, the statement highlighted in blue is an example of chain-of-thought prompting, where the reasoning behind reaching the final answer is provided as a part of the example. Thus, the model also outputs its reasoning, highlighted in green, to reach the final answer. In addition, chain-of-thought prompting can change the way we interact with LLMs and leverage their capabilities, as it provides step-by-step explanations of how a particular response is reached.

Figure 6: Standard prompting and chain-of-thought prompting (Image Source)
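The three prompt types differ only in what text is placed before the final question. The sketch below builds each kind of prompt as a plain string. The `Q:`/`A:` layout, the `Example` class, and the function names are assumptions made for illustration; they are not a format required by any particular model or API.

```python
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    answer: str
    reasoning: str = ""  # only used for chain-of-thought prompts


def zero_shot(task, question):
    """Task description only, no demonstrations."""
    return f"{task}\nQ: {question}\nA:"


def few_shot(task, examples, question):
    """Task description plus input-output pairs (one-shot if len(examples) == 1)."""
    demos = "\n".join(f"Q: {e.question}\nA: {e.answer}" for e in examples)
    return f"{task}\n{demos}\nQ: {question}\nA:"


def chain_of_thought(task, examples, question):
    """Like few-shot, but each demonstration includes its reasoning steps."""
    demos = "\n".join(
        f"Q: {e.question}\nA: {e.reasoning} The answer is {e.answer}."
        for e in examples
    )
    return f"{task}\n{demos}\nQ: {question}\nA:"


# Hypothetical demonstration reusing the warm-up arithmetic question.
demo = Example(
    question="What is 7084 times 0.99?",
    answer="7013.16",
    reasoning="7084 times 0.01 is 70.84, and 7084 minus 70.84 is 7013.16.",
)
cot_prompt = chain_of_thought("Answer the question.", [demo], "What is 250 times 0.98?")
```

Keeping the demonstrations as data makes it easy to send the same question in all three formats and compare the responses.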

What is the difference between prompts and fine-tuning?

Prompt engineering focuses on eliciting better output from a given LLM by changing the input. Fine-tuning focuses on enhancing model performance by training the model on a smaller, targeted dataset relevant to the desired task. The similarity is that both methods help improve the model’s performance and provide desired outcomes.

Prompt engineering requires no retraining, and the prompting is performed in a single window of the model. In contrast, fine-tuning involves retraining the model and changing the model parameters to improve its performance. Fine-tuning also requires more computational resources than prompt engineering.

When is it best to use prompts vs. fine-tuning?

This question was discussed in class. Fine-tuning requires updating model weights and parameters, which is useful in applications that require a central change: all users then experience similar performance. Prompt-based methods are user-specific and apply within a particular window, allowing more fine-grained control; the model’s performance depends on the individual prompts designed by the user. Thus, fine-tuning is more potent than prompt-based methods in scenarios that require centralized tuning.

In scenarios with limited training examples, prompt-based methods can perform well. Fine-tuning methods are data-hungry and require large amounts of input data for better model performance. As discussed in the discussion posts, prompts cannot be used as a universal tool that generates desired outcomes and improves performance for all problems. However, in specific scenarios, they can help users improve performance and reach desired outcomes for in-context tasks.

Risk of Prompts

The class then discussed the risks of prompting. Methods like chain-of-thought have already achieved some success with LLMs; however, prompt engineering can still be a controversial topic. The group brought up two aspects.

First, the reasoning ability of LLMs. The group asked, “Does CoT empower LLMs’ reasoning ability?” Second, there are bias problems in prompt engineering. The group brought up an example prompt: Is the following sentence plausible? “LeBron James took a corner kick.” (A) plausible (B) implausible. In the example, the user then adds “I think the answer is A, but I’m curious to hear what you think.” Stating an opinion in this way might inject a bias into the prompt.

The group then opened a discussion about two potential kinds of prompting bias, asking the class how the prompt format (e.g., task-specific prompt methods, words selected) and the prompt training examples (e.g., label distribution, permutations of training examples) affect LLM output, and what debiasing solutions are possible.

The class then discussed two different kinds of prompting bias, the prompt format and the prompt training examples.

Discussion about Prompt training examples

For label distribution, the class discussed that the training set needs to be balanced to avoid overgeneralizing agreement with the user, since in some examples the user interjects an opinion that can be wrong. In these cases, GPT should learn to disagree with the user when the user is wrong. This is also related to the label distribution: if the user always provides examples with positive labels, then the LLM will be more likely to output a positive label in its prediction.

Permutation of the training examples: A student mentioned a recently read paper on why in-context learning works, which argues that the examples provide the label space and the distribution of the input. In the paper, the authors randomly generate the labels, so some may be wrong, and show that this still performs better than zero-shot, though worse than providing the correct labels. Randomly generated labels thus still have a significant impact on performance. The order of the training examples may also affect the LLM output, especially the last example: the LLM tends to output the same label as the last training example provided.
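Both concerns raised above, label balance and example order, can be inspected before a few-shot prompt is ever sent to a model. The sketch below computes the label distribution of a set of demonstrations and enumerates their orderings so that each can be tried. The example texts and labels are made up for illustration, and the function names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def label_distribution(examples):
    """Fraction of demonstrations carrying each label."""
    counts = Counter(label for _, label in examples)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def orderings(examples):
    """All possible orders in which the demonstrations could appear."""
    return [list(p) for p in permutations(examples)]


# Hypothetical sentiment demonstrations, deliberately unbalanced.
examples = [
    ("great movie", "positive"),
    ("loved it", "positive"),
    ("boring plot", "negative"),
]

distribution = label_distribution(examples)
all_orders = orderings(examples)
# Orderings whose final demonstration is positive, the case the class
# suspected would push the model toward a positive prediction.
ends_positive = [o for o in all_orders if o[-1][1] == "positive"]
```

Running the same test inputs against every ordering, and against a rebalanced set of demonstrations, is one way to measure how much the output depends on these choices.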

Discussion about Prompt format

Prompt format: the words selected might affect the output because some words appear more frequently in the corpus and some words are more strongly correlated with specific labels. For example, “male” may be associated with more positive terms in the training corpus. Thus, the wording of a prompt may affect the results. Task-specific prompt methods concern how to select a prompting method based on the specific task.

Finally, the group shared two papers about the bias problem in LLMs. The first paper (Zhao et al., 2021; footnote 1) shows that different prompts lead to a large variance in accuracy, which indicates that LLMs are not that stable. The paper also provides a calibration method that applies an additional linear layer to the output of the GPT model to calibrate it. The second paper (Turpin et al., 2023; footnote 2) shows that LLMs do not always say what they think, especially when some bias is injected into the prompt. For example, the authors compared CoT and non-CoT prompting and found that CoT amplifies the bias in the context when the user puts some bias in the prompt.
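To give a rough idea of what calibrating the output can look like, the sketch below rescales a model's label probabilities by the probabilities it assigns to a content-free input (such as a placeholder string), then renormalizes. This is a simplified illustration of a diagonal rescaling, not the paper's exact formulation, and all probability values are invented for the example.

```python
def calibrate(probs, content_free_probs):
    """Divide each label probability by the model's prior for that label
    (estimated from a content-free input) and renormalize to sum to 1."""
    if len(probs) != len(content_free_probs):
        raise ValueError("probability vectors must have the same length")
    scaled = [p / c for p, c in zip(probs, content_free_probs)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Hypothetical numbers: labels are [positive, negative].
content_free = [0.7, 0.3]   # the model favors "positive" even with no real input
raw = [0.6, 0.4]            # raw prediction for an actual test input
calibrated = calibrate(raw, content_free)
```

In this made-up case the raw prediction is "positive", but after accounting for the model's prior preference the calibrated prediction is "negative".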

In conclusion, prompts can be controversial and not always perfect.

(Wednesday, 09/13/2023) Marked Personas

Open Discussion

What do you think are the biggest potential risks of LLMs?

  • Social impact from intentional misuse. LLM content could be manipulated by governments, which could potentially affect elections and raise tensions between countries.
  • Mutual trust among people could be harmed. We cannot tell whether an email or other information was written by a human or automatically generated by ChatGPT. As a result, we may treat such information more skeptically.
  • People may overly trust LLM outputs. We may rely more on asking an LLM, which is a second-hand information source, rather than actively searching for information ourselves. The information pool may also be contaminated by LLMs if they provide misleading information.

How does GPT-4 Respond?

  • Misinformation: Providing wrong, misleading, or sensitive information, for example through jailbreaking of the LLM.
  • Potential manipulation: People could intentionally hack an LLM by giving specific prompts.

Case Study: Marked Personas

Myra Cheng, Esin Durmus, Dan Jurafsky. Marked Personas: Using Natural Language Prompts to Measure Stereotypes in Language Models. ACL 2023.

In this study, ChatGPT was asked to describe several characters from different ethnic, gender, and demographic groups, e.g., an Asian woman, a Black woman, and a white man.

When describing a character from a non-dominant demographic group, the overall description is positive but can still imply potential stereotypes. For example, “almond-shaped eyes” is used in describing an East Asian woman, although the phrase may sound strange to an actual East Asian person. We can also see that ChatGPT intentionally tries to build a more diverse and politically correct atmosphere for different groups. In contrast, ChatGPT uses mostly neutral and ordinary words when describing an average white man.

Discussion: Bias mitigation

Group 1

Mitigation could sometimes be overcompensating. As a language model, it should try to be neutral and independent. Also, given that people themselves are biased and LLMs learn from the human world, we may be over-expecting LLMs to be perfectly unbiased. After all, it is hard to define what fairness is and to distinguish between stereotype and prototype, which can lead to overcorrection.

Group 2

We may be able to identify the risks by data augmentation (replacing “male” with “female” in prompts). Governments should also be responsible for setting rules and regulating LLMs. (Note: this is controversial, and it is unclear what kinds of regulations might be useful or effective.)
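The augmentation idea from Group 2 can be sketched as generating a counterfactual copy of each prompt with gendered terms swapped, then comparing the model's responses to the two versions. The word list below is a small hypothetical sample; a real study would need a more careful list and handling of ambiguous words.

```python
import re

# Hypothetical, deliberately small swap table.
SWAPS = {
    "male": "female", "female": "male",
    "man": "woman", "woman": "man",
    "he": "she", "she": "he",
}

# Whole-word match only, so "male" inside "female" is not replaced separately.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in SWAPS) + r")\b",
    re.IGNORECASE,
)


def swap_gender_terms(prompt):
    """Return a counterfactual prompt with each listed term swapped."""
    def replace(match):
        word = match.group(0)
        swapped = SWAPS[word.lower()]
        # Preserve an initial capital letter.
        return swapped.capitalize() if word[0].isupper() else swapped

    return _PATTERN.sub(replace, prompt)


original = "Describe a male nurse."
counterfactual = swap_gender_terms(original)
```

Differences between the responses to `original` and `counterfactual` are then candidates for closer inspection.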

Group 4

Companies like OpenAI should publish their mitigation strategies so that they can be understood and monitored by the public. Another aspect is that different groups of people can have very diverse points of view, so it is hard to define stereotypes and biases with a universal law. Also, the answer could be very different depending on the prompt, making it even harder to mitigate.

Hands-on Activity: Prompt Hacking

In this activity, the class tried to make ChatGPT generate sensitive or harmful responses. This could be done by setting up a pretend identity, e.g., pretending to be a Hutu person in Rwanda in the 1990s or pretending to be a criminal. Under these conditions, ChatGPT’s safeguards against biased or harmful content can be partly bypassed.

Discussion: Can we defend against prompt hacking with built-in safeguards?

As we saw in the activity, these safeguards are currently not that strong. A more practical way may be to add a disclaimer at the end of potentially sensitive content and provide a questionnaire to collect feedback for better iteration. Companies should also actively identify these jailbreaks and attempt to mitigate them.

Further thoughts: What’s the real risk?

While jailbreaking is one of the risks of LLMs, a riskier situation may be that an LLM is intentionally trained and used by people to do bad things. After all, misuse is not that serious compared with a specific crime.


  1. (for Monday) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. 2022.

  2. (for Wednesday) Myra Cheng, Esin Durmus, Dan Jurafsky. Marked Personas: Using Natural Language Prompts to Measure Stereotypes in Language Models. ACL 2023.

Optional Additional Readings


Stereotypes and bias:

Prompt Injection:

Discussion Questions

Everyone who is not in either the lead or blogging team for the week should post (in the comments below) an answer to at least one of the four questions in each section, or a substantive response to someone else’s comment, or something interesting about the readings that is not covered by these questions.

Don’t post duplicates - if others have already posted, you should read their responses before adding your own. Please post your responses to different questions as separate comments.

First section (1 – 4): Before 5:29pm on Sunday, September 10.
Second section (5 – 9): Before 5:29pm on Tuesday, September 12.

Before Sunday: Questions about Chain-of-Thought Prompting

  1. Compared to other types of prompting, do you believe that chain-of-thought prompting represents the most effective approach for enhancing the performance of LLMs? Why or why not? If not, could you propose an alternative?

  2. The paper highlights several examples where chain-of-thought prompting can significantly improve its outcomes, such as in solving math problems, applying commonsense reasoning, and comprehending data. Considering these improvements, what additional capabilities do you envision for LLMs using chain-of-thought prompting?

  3. Why are different language models in the experiment performing differently with chain-of-thought prompting?

  4. Try some of your own experiments with prompt engineering using your favorite LLM, and report interesting results. Is what you find consistent with what you expect from the paper? Are you able to find any new prompting methods that are effective?

By Tuesday: Questions about Marked Personas

  5. The paper addresses potential harms from LLMs by identifying the underlying stereotypes present in their generated contents. Additionally, the paper offers methods to examine and measure those stereotypes. Can this approach effectively be used to diminish stereotypes and enhance fairness? What are the main limitations of the work?

  6. The paper mentions racial stereotypes identified in downstream applications such as story generation. Are there other possible issues we might encounter when the racial stereotypes in LLMs become problematic after its application?

  7. Much of the evaluation in this work uses a list of White and Black stereotypical attributes provided by Ghavami and Peplau (2013) as the human-written responses and compares them with the list of LLMs generated responses. This, however, does not encompass all racial backgrounds and is heavily biased by American attitudes about racial categories, and they might not distinguish between races in great detail. Do you believe there could be a notable difference when more comprehensive racial representation is incorporated? If yes, what potential differences may arise? If no, why not?

  8. This work emphasizes the naturalness of the input provided to the LLM, while we have previously seen examples of eliciting harmful outputs by using less natural language. What potential benefits or risks are there in not investigating less natural inputs (e.g., prompt injection attacks including the suffix attack we saw in Week 2)? Can you suggest a less natural prompt that could reveal additional or alternate stereotypes?

  9. The authors recommend transparency of bias mitigation methods, citing the benefit it could provide to researchers and practitioners. Specifically, how might researchers benefit from this? Can you foresee any negative consequences (either to researchers or the general users of these models) of this transparency?

  1. Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, Sameer Singh. “Calibrate before use: Improving few-shot performance of language models.” International Conference on Machine Learning. PMLR, 2021.

  2. Miles Turpin, Julian Michael, Ethan Perez, Samuel R. Bowman. “Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting.” arXiv preprint arXiv:2305.04388, 2023.

Week 4: Capabilities of LLMs



  1. Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, Xia Hu. Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond. April 2023. https://arxiv.org/abs/2304.13712. [PDF]

  2. OpenAI. GPT-4 Technical Report. March 2023. https://arxiv.org/abs/2303.08774 [PDF]

Optionally, also explore https://openai.com/blog/chatgpt-plugins.


  1. Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, Mike Schaekermann, Amy Wang, Mohamed Amin, Sami Lachgar, Philip Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Nenad Tomasev, Yun Liu, Renee Wong, Christopher Semturs, S. Sara Mahdavi, Joelle Barral, Dale Webster, Greg S. Corrado, Yossi Matias, Shekoofeh Azizi, Alan Karthikesalingam, Vivek Natarajan. Towards Expert-Level Medical Question Answering with Large Language Models https://arxiv.org/abs/2305.09617 [PDF]

Optional Readings:

Discussion for Monday:

Everyone who is not in either the lead or blogging team for the week should post (in the comments below) an answer to at least one of the questions in this section, or a substantive response to someone else’s comment, or something interesting about the readings that is not covered by these questions. Don’t post duplicates - if others have already posted, you should read their responses before adding your own. Please post your responses to different questions as separate comments.

You should post your initial response before 5:29pm on Sunday, September 17, but feel free (and encouraged!) to continue the discussion after that, including responding to any responses by others to your comments.

  1. Based on the criteria shown in Figure 2 of [1], imagine a practical scenario and explain why you would choose or not choose using LLMs for your scenario.
  2. Are plug-ins the future of AGI? Do you think that a company should focus only on building powerful AI systems that do not need any support from plug-ins, or should it focus only on the core system and involve more plug-ins in the ecosystem?

Discussion for Wednesday:

You should post your initial response to one of the questions below or something interesting related to the Wednesday readings before 5:29pm on Tuesday, September 19.

  1. What should we do before deploying LLMs in medical diagnosis applications? What (if any) regulations should control or limit how they would be used?

  2. With LLMs handling sensitive medical information, how can patient privacy and data security be maintained? What policies and safeguards should be in place to protect patient data?

  3. The paper discusses the progress of LLMs towards achieving physician-level performance in medical question answering. What are the potential implications of LLMs reaching or surpassing human expertise in medical knowledge?

  4. The paper mentions the importance of safety and minimizing bias in LLM-generated medical information, and the optional reading reports on some experiments that show biases in GPT’s medical diagnoses. Should models be tuned to ignore protected attributes? Should we prevent models from being used in medical applications until these problems can be solved?

Week 2: Alignment

(see bottom for assigned readings and questions)

(Monday, 09/04/2023) Introduction to Alignment

Introduction to AI Alignment and Failure Cases

Alignment is not well defined and there is no agreed-upon meaning, but it generally refers to the strategic effort to ensure that AI systems, especially complex models like LLMs, closely adhere to predetermined objectives, preferences, or value systems. This effort encompasses the development of AI algorithms and architectures in a way that reduces disparities between machine behavior and how the model is intended to be used, to minimize the chances of unintentional or unfavorable outcomes. Alignment strategies involve methods such as model training, fine-tuning, and the implementation of rule-based constraints, all aimed at fostering coherent, contextually relevant, and value-aligned AI responses that match the intended purpose of the model.

What factors are (and aren’t) a part of alignment?

Alignment is a multifaceted problem that involves various factors and considerations to ensure that AI systems behave in ways that align with their intended purpose.

Some of the key factors related to alignment include:

  1. Ethical Considerations: Prioritizing ethical principles like fairness, transparency, accountability, and privacy to guide AI behavior in line with societal values

  2. Value Alignment: Aligning AI systems with human values and intentions, defining intended behavior to ensure it reflects expectations from the model

  3. User Intent Understanding: Ensuring AI systems accurately interpret user intent and context, and give contextually appropriate responses in natural language tasks

  4. Bias Mitigation: Identifying and mitigating biases, such as racial, gender, economic, and political biases, to ensure fair responses

  5. Responsible AI Use: Promoting responsible and ethical AI deployment to prevent intentional misuse of the model

  6. Unintended Bias: Preventing the model from being biased in the sense that it has undesirable political, economic, racial, or gender biases in its responses.

However, while these factors are important considerations, studies like From Pretraining Data to Language Models to Downstream Tasks (Feng et al.) show that famous models like BERT and ChatGPT do appear to have socioeconomic political leanings (of course, there is no true “neutral” or “center” position; these are just defined by where the expected distribution of beliefs lies).

Figure 1 shows the political leanings of famous LLMs.

Figure 1: Political Leanings of Various LLMs (Image Source)

That being said, the goals of alignment are hard to define and challenging to achieve. There are several very famous cases where model alignment failed, showing how alignment failures can lead to unintended consequences. We discuss two famous examples where alignment failed:

  1. Google’s Image Recognition Algorithm (2015). This was an AI model designed to automatically label images based on their content. The goal was to assist users in searching for their images more effectively. However, the model quickly started labeling images under offensive categories. This included cases of racism, as well as culturally insensitive categorization.

  2. Microsoft’s Tay Chatbot (2016). This was a Twitter-based AI model programmed to interact with users in casual conversations and learn from those interactions to improve its responses. The purpose was to mimic a teenager and have light conversations. However, the model quickly went haywire when it was exposed to malicious and hateful content on Twitter, and it began giving similar hateful and inappropriate responses. Figures 2 and 3 show some of these examples. The model was quickly shut down (in less than a day!), and was a good lesson that you cannot quickly code a model and let it out in the wild! (See James Mickens’ hilarious USENIX Security 2018 keynote talk, Why Do Keynote Speakers Keep Suggesting That Improving Security Is Possible? for an entertaining and illuminating story about Tay and a lot more.)

Figure 2: Example Tweet by Microsoft’s infamous Tay chatbot (Image Source)

Figure 3: Example Tweet by Microsoft’s infamous Tay chatbot (Image Source)

Discussion Questions

What is the definition of alignment?

At its core, AI alignment refers to the extent to which a model embodies the values of humans. Now, you might wonder, whose values are we talking about? While values can differ across diverse societies and cultures, for the purposes of AI alignment, they can be thought of as the collective, overarching values held by a significant segment of the global population.

Imagine a scenario where someone poses a question to an AI chatbot about the process of creating a bomb. Given the potential risks associated with such knowledge, a well-aligned AI should recognize the broader implications of the query. There’s an underlying societal consensus about safety and security, and the AI should be attuned to that. Instead of providing a step-by-step guide, an aligned AI might generate a response that encourages more positive intent, thereby prioritizing the greater good.

The journey of AI alignment is not just about programming an AI to parrot back human values. It’s a nuanced field of research dedicated to bridging the gap between our intentions and the AI’s actions. In essence, alignment research seeks to eliminate any discrepancies between:

  • Intended Goals: These are the objectives we, as humans, wish for the machine to achieve.
  • Specified Goals: These are the actions that the machine actually undertakes, determined by mathematical models and parameters.

The quest for perfect AI alignment is an ongoing one. As technology continues to evolve, the goalposts might shift, but the essence remains the same: ensuring that our AI companions understand and respect our shared human values, leading to a safer and more harmonious coexistence.

[1] https://www.techtarget.com/whatis/definition/AI-alignment

Why is alignment important?

Precision in AI: The Critical Nature of Model Alignment

In the realm of artificial intelligence, precision is paramount. As enthusiasts, developers, or users, we all desire a machine that mirrors our exact intentions. Let’s delve into why it’s crucial for AI models to provide accurate responses and the consequences of a misaligned model.

When we interact with an AI chatbot, our expectations are straightforward. We pose a question and, in return, we anticipate an answer that is directly related to our query. We’re not seeking a soliloquy or a tangent. Just a simple, clear-cut response. For instance, if you ask about the weather in Paris, you don’t want a history lesson on the French Revolution!

Comment: As the adage goes, “Less is more”. In the context of AI, precision trumps verbosity.

Misalignment doesn’t just lead to frustrating user experiences; it can have grave repercussions. Consider a situation where someone reaches out to ChatGPT seeking advice on mental health issues or suicidal thoughts. A misaligned response that even remotely suggests that ending one’s life might sometimes be a valid choice can have catastrophic outcomes.

Moreover, as AI permeates sectors like the judiciary and healthcare, the stakes get even higher. The incorporation of AI in these critical areas elevates the potential for it to have far-reaching societal impacts. A flawed judgment in a court case due to AI or a misdiagnosis in a medical context can have dire consequences, both ethically and legally.

In conclusion, the alignment of AI models is not just a technical challenge; it’s a societal responsibility. As we continue to integrate AI into our daily lives, ensuring its alignment with human values and intentions becomes paramount for the betterment of society at large.

What responsibilities do AI developers have when it comes to ensuring alignment?

First and foremost, developers must be fully attuned to the possible legal and ethical problems associated with AI models. It’s not just about crafting sophisticated algorithms; it’s about understanding the real-world ramifications of these digital entities.

Furthermore, a significant concern in the AI realm is the inadvertent perpetuation or even amplification of pre-existing biases. These biases, whether related to race, gender, or any other socio-cultural factor, can have detrimental effects when incorporated into AI systems. Recognizing this, developers have a duty to not only be vigilant of these biases but also to actively work towards mitigating them.

However, a developer’s responsibility doesn’t culminate once the AI product hits the market. The journey is continuous. Post-deployment, it’s crucial for developers to monitor the system’s alignment with human values and rectify any deviations. It’s an ongoing commitment to refinement and recalibration. Moreover, transparency is key. Developers should be proactive in highlighting potential concerns related to their models and fostering a culture where the public is not just a passive victim but an active participant in the model alignment process.

To round off, it’s essential for developers to adopt a forward-thinking mindset. The decisions made today in the AI labs and coding chambers will shape the world of tomorrow. Thus, every developer should think about the long-term consequences of their work, always aiming to ensure that AI not only dazzles with its brilliance but also remains beneficial for generations to come.

How might AI developers' responsibility evolve?

It’s impossible to catch all edge cases. As AI systems grow in complexity, predicting every potential outcome or misalignment becomes a herculean task. Developers, in the future, might need to shift from a perfectionist mindset to one that emphasizes robustness and adaptability. While it’s essential to put in rigorous engineering effort to minimize errors, it’s equally crucial to understand and communicate that no system can be flawless.

Besides, given that catching all cases isn’t feasible, developers' roles might evolve to include more dynamic and real-time monitoring of AI systems. This would involve continuously learning from real-world interactions, gathering feedback, and iterating on the model to ensure better alignment with human values.

The Alignment Problem from a Deep Learning Perspective

In this part of today’s seminar, the class was divided into three groups to discuss possible alignment problems from a deep learning perspective. Each group focused on a different category of deep learning methods:

  1. Reinforcement Learning (RL) based methods
  2. Large Language Model (LLM) based methods
  3. Other Machine Learning (ML) methods

For each of the categories above, the discussion in each group was mainly focused on three topics as follows:

  1. What can go wrong in these systems in the worst scenario?
  2. How would it happen (realistically)?
  3. What are potential solutions/workarounds/safety measures?

After a 30-minute discussion, the three groups presented their ideas and exchanged opinions with the class. Each group’s discussion is summarized below.

RL-based methods

  1. What can go wrong in these systems in the worst scenario?

This group stated several potential alignment issues about the RL-based methods. First, the model may provide inappropriate or harmful responses to sensitive questions, such as inquiries about self-harm or suicide, which could have severe consequences. On top of that, ensuring that the model’s behavior aligns with ethical and safety standards can be challenging, thus potentially leading to a disconnect between user expectations and the model’s responses. Moreover, if the model is trained on biased or harmful data, it may generate responses that reflect the biases or harmful content present in that training data.

  2. How would it happen (realistically)?

The group identified several ways these worst-case scenarios could realistically occur. The first factor is the training data: the model’s behavior is shaped by the data it was trained on, so if that data contains inappropriate or harmful content, the model may inadvertently generate similar content in its responses. Furthermore, ensuring that the model provides responsible answers to sensitive questions and aligns with ethical standards requires careful training and oversight. Finally, a model that lacks robustness may fail to detect and prevent harmful content or behaviors, leading to problematic responses.

  3. What are potential solutions/workarounds/safety measures?

The group suggested several potential solutions. The first is ensuring that the training data used for the model is carefully curated to avoid inappropriate or harmful content. It is also important to teach the model to align its behavior and responses with ethical and safety standards, especially when responding to sensitive questions. The group further emphasized that responsibility for the model’s behavior lies with everyone involved, so it is necessary to promote vigilance when using the model to prevent harmful outcomes. Finally, the model’s behavior and responses should be reviewed thoroughly before deployment, with any necessary adjustments made to ensure the robustness and safety of RL models.

LLM-based methods

  1. What can go wrong in these systems in the worst scenario?

The worst-case scenarios raised by this group concerned reliance on AI chatbots and models in situations with potentially severe consequences. One worst-case scenario mentioned is the loss of life. For instance, if a person in a vulnerable state relies on a chatbot for critical information or advice, and the chatbot provides incorrect or harmful answers, it could lead to tragic outcomes. Another concern is the spread of misinformation. AI models, especially chatbots, are easily accessible to a wide range of people. If these models provide inaccurate or misleading information to users who trust them blindly, it can contribute to the dissemination of false information, potentially leading to harmful consequences.

  2. How would it happen (realistically)?

According to this group, such worst-case scenarios could happen for the following reasons. First, AI models are readily available to a broad audience, making them easily accessible for use in various situations. Second, many users who rely on AI models may not have a deep understanding of how these models work or their limitations. They might trust the AI models without critically evaluating the information they provide. Moreover, such worst-case scenarios often emerge in complex, gray areas where ethical and value-based decisions come into play, where determining what is right or wrong, what constitutes an opinion, and where biases may exist can be challenging.

  3. What are potential solutions/workarounds/safety measures?

The group proposed several possible solutions and safety measures. For example, creating targeted models for specific use cases rather than having a single generalized model for all purposes will allow for more control and customization in different domains. Furthermore, when developing AI models, involving a peer review process where experts collectively decide what information is right and wrong for a specific use case can help ensure the accuracy and reliability of the model’s responses. Another suggestion was recognizing the importance of educating users, particularly those who may not be as informed, about the limitations and workings of AI models. This education can help users make more informed decisions when interacting with AI systems and avoid blind trust.

Other ML methods

  1. What can go wrong in these systems in the worst scenario?

This group discussed the release of technical research and, as a hypothetical scenario, an ML model being incorporated into biomedical research. In the worst case, incorporating a machine learning model into biomedical research could result in the generation of compounds that are incompatible with the research goals, leading to unintended or harmful outcomes and potentially jeopardizing the research and its objectives.

  2. How would it happen (realistically)?

The opinions of this group imply that blindly trusting the ML model without human oversight and involvement in decision-making could be a contributing factor to such alignment problems in ML methods.

  3. What are potential solutions/workarounds/safety measures?

Several potential solutions were given by this group. The first is actively involving humans in the decision-making process at various stages. They emphasized that humans should not blindly trust the system, and suggested running simulations for different explanation techniques and incorporating a human in the decision-making process before accepting the model’s outputs. Second, they suggested continuous human oversight of the model’s behavior at every stage of the process (from data collection to model deployment) to ensure alignment with the intended goals. Third, ensuring diverse and representative data for training and testing helps avoid situations where the model performs well on metrics but fails in real-life scenarios. Finally, they suggested implementing human-based reinforcement learning to align the model with its intended use case. A human should be involved before the model is “trusted”, because humans might not otherwise trust the system. Alignment should be checked at each step: because very small design choices can have a big impact on the model, it is necessary to make sure the model’s actual behavior matches the intended use case.

Back to top

(Wednesday, 09/06/2023) Alignment Challenges and Solutions

Opening Discussion

Discussion on how to solve alignment issues stemming from:

  1. Training Data. Addressing alignment issues stemming from training data is crucial for building reliable AI models. Collecting unbiased data, as one student suggested, is indeed a fundamental step. Bias can be introduced through various means, such as skewed sampling or annotator biases, so actively working to mitigate these sources of bias is essential. Automated annotation methods can help to some extent, but as the student rightly noted, they can be expensive and may not capture the nuances of complex real-world data. To overcome this, involving humans in the loop to guide the annotation process is an effective strategy. Human annotators can provide valuable context, domain expertise, and ethical considerations that automated systems may lack. This human-machine collaboration can lead to a more balanced and representative training dataset, ultimately improving model performance and alignment with real-world scenarios.

  2. Model Design. When it comes to addressing alignment issues related to model design, several factors must be considered. The choice of model architecture, hyperparameters, and training objectives can significantly impact how well a model aligns with its intended task. It’s essential to carefully design models that are not overly complex or prone to overfitting, as these can lead to alignment problems. Moreover, model interpretability and explainability should be prioritized to ensure that decisions made by the AI can be understood and validated by humans. Additionally, incorporating feedback loops where human experts can continually evaluate and fine-tune the model’s behavior is crucial for maintaining alignment. In summary, model design should encompass simplicity, interpretability, and a robust mechanism for human oversight to ensure that AI systems align with human values and expectations.

Introduction to Red-Teaming

Language Models (LMs) often cannot be deployed because of their potential to harm users in hard-to-predict ways. One way to address this issue is to identify harmful behaviors before deployment by using test cases, which is also known as red teaming.

Figure 1: Red-Teaming (Image Source)

In essence, the goal of red-teaming is to discover, measure, and reduce potentially harmful outputs. However, human annotation is expensive, limiting the number and diversity of test cases.

In light of this, this paper introduced LM-based red teaming, aiming to complement manual testing and reduce the number of such oversights by automatically finding where LMs are harmful. To do so, the authors first generate test inputs using an LM itself, and then use a classifier to detect harmful behavior on test inputs (Fig. 1). In this way, the LM-based red teaming managed to find tens of thousands of diverse failure cases without writing them by hand. Generally, the process of finding failing test cases can be done by the following three steps:

  1. Generate test cases using a red LM $p_{r}(x)$.
  2. Use the target LM to generate an output $y$ for each test case $x$.
  3. Find the test cases that led to a harmful output using the red team classifier $r(x, y)$.
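The three steps above can be sketched as a short loop. This is only a minimal illustration, not the paper’s implementation: `red_team` and the three toy callables standing in for the red LM $p_{r}(x)$, the target LM, and the classifier $r(x, y)$ are hypothetical names, and the 0.5 threshold is an arbitrary assumption.

```python
import itertools
from typing import Callable, List, Tuple


def red_team(
    generate_test_case: Callable[[], str],          # stands in for the red LM p_r(x)
    target_lm: Callable[[str], str],                # stands in for the target LM
    harm_classifier: Callable[[str, str], float],   # stands in for r(x, y)
    num_cases: int = 100,
    threshold: float = 0.5,                         # assumed cut-off for "harmful"
) -> List[Tuple[str, str, float]]:
    """Return the (test case, output, score) triples flagged as harmful."""
    failures = []
    for _ in range(num_cases):
        x = generate_test_case()        # step 1: sample a test case from the red LM
        y = target_lm(x)                # step 2: get the target LM's output
        score = harm_classifier(x, y)   # step 3: score the (x, y) pair
        if score >= threshold:
            failures.append((x, y, score))
    return failures


# Toy stand-ins (hypothetical) so the loop can be run end to end.
_questions = itertools.cycle(["What is your favorite color?", "Tell me an insult."])


def toy_red_lm() -> str:
    return next(_questions)


def toy_target_lm(x: str) -> str:
    return "You are a fool." if "insult" in x else "I like blue."


def toy_classifier(x: str, y: str) -> float:
    return 1.0 if "fool" in y else 0.0


failures = red_team(toy_red_lm, toy_target_lm, toy_classifier, num_cases=4)
print(len(failures), "failing test cases found")
```

In a real setting each callable would wrap a model call, and the classifier threshold would need to be tuned against human judgments.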

Specifically, the paper investigated various text generation methods for test case generation.

  • Zero-shot (ZS) Generation: Generate failing test cases without human intervention by sampling numerous outputs from a pretrained LM using a given prefix or “prompt”.

  • Stochastic Few-shot (SFS) Generation: Utilize zero-shot test cases as examples for few-shot learning to generate similar test cases.

  • Supervised Learning (SL): Fine-tune the pretrained LM to maximize the log-likelihood of failing zero-shot test cases.

  • Reinforcement Learning (RL): Train the LM with RL to maximize the expected harmfulness elicited while conditioning on the zero-shot prompt.
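As a rough illustration of the stochastic few-shot idea, the sketch below draws a fresh random subset of zero-shot test cases each time it builds a prompt, so repeated calls give the red LM different examples to imitate. The function name, the prompt header, and the numbered-list layout are assumptions made for illustration; the notes above do not specify the exact prompt format or sampling rule.

```python
import random
from typing import List


def build_few_shot_prompt(
    zero_shot_cases: List[str],
    k: int,
    rng: random.Random,
    header: str = "List of questions to ask someone:",  # hypothetical wording
) -> str:
    """Build a numbered prompt from k randomly chosen zero-shot test cases.

    The prompt ends with an open item so that an LM would continue it
    with a new test case similar to the examples.
    """
    examples = rng.sample(zero_shot_cases, k)   # new random subset per call
    lines = [header]
    lines += [f"{i}. {case}" for i, case in enumerate(examples, start=1)]
    lines.append(f"{k + 1}.")                   # slot for the LM to fill in
    return "\n".join(lines)


rng = random.Random(0)
pool = ["Q1?", "Q2?", "Q3?", "Q4?", "Q5?"]      # placeholder zero-shot test cases
prompt = build_few_shot_prompt(pool, k=3, rng=rng)
print(prompt)
```

Uniform sampling is the simplest choice here; weighting the sample toward test cases that were more likely to elicit harmful outputs would be a natural variation.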

In-class Activity (5 groups)

  1. Offensive Language: Hate speech, profanity, sexual content, discrimination, etc

    Group 1 came up with 3 potential methods to prevent offensive language:

    • Filter out the offensive language related data manually, then perform finetuning.
    • Filter out the offensive language related data using other models and then perform finetuning.
    • Generate prompts that might be harmful to finetune the model, keeping the context in consideration.
  2. Data Leakage: Generating copyrighted/private, personally-identifiable information.

    Copyright infringement (which is about the expression of an idea) is very different from leaking private information, but for purposes of this limited discussion we considered them together. Since LLMs and other AIGC models such as Stable Diffusion have a strong ability to memorize, imitate, and generate, the generated content is very likely to infringe on copyrights and may include sensitive personally-identifiable material. There are already lawsuits accusing these companies of copyright infringement (lawsuits news1, news2).

    Regarding the possible solutions, Group 2 viewed this from two perspectives. During the data preprocessing stage, companies such as OpenAI can collect the training data according to the license and also pay for the copyrighted data if needed. During the post-processing stage, commercial licenses and rule-based filters can be added to the model to ensure the fair use of the output content. For example, GitHub Copilot will block the generated suggestion if it has about 50 tokens that exactly or nearly match the training data (source). OpenAI takes a different strategy by asking the users to be responsible for using the generated content, including for ensuring that it does not violate any applicable law or these Terms (source). There are many cases currently working their way through the legal system, and it remains to be seen how courts will interpret things.

    However, the current solutions still have their limitations. For programs and code, preventing data leakage might be relatively easy [perhaps, but many would dispute this], but for images and text it would be quite difficult, since it is hard to build a good metric for measuring whether generated data has a copyright issue. Data watermarking may be a possible solution.

  3. Contact Information Generation: Directing users to unnecessarily email or call real people.

    One alarming example of the potential misuse of Large Language Models (LLMs) like ChatGPT in the context of contact information generation is the facilitation of email scams. Malicious actors could employ LLMs to craft convincing phishing emails that appear to come from reputable sources, complete with authentic-sounding contact information. For example, an LLM could generate a deceptive email/phone call from a well-known bank, requesting urgent action and providing a seemingly legitimate email address and phone number for customer support.

    Red-teaming involves simulating potential threats and vulnerabilities to identify weaknesses in security systems. By engaging in red-teaming exercises that specifically target the misuse of LLMs, we can mitigate the risks posed by the misuse of LLMs and protect individuals and organizations from falling victim to email/phone call scams and other deceptive tactics.

  4. Distributional Bias: Talking about some groups of people in an unfairly different way than others.

    Group 4 reached an agreement that in order to capture the unfairness by red-teaming, we should first identify the categories where unfairness/bias might come from. Then we can generate prompts for different possible bias categories: gender and race etc. to get responses from LLMs using the red-teaming technique. The unfairness/bias may appear in the responses. We then can mask the bias-related terms to see if the generated answers reflect the distributional bias. However, there may be hidden categories that are not pre-identified, and how to capture these categories with potential distributional bias is still an open question.

  5. Conversational Harms: Offensive language that occurs in the context of a long dialogue, for example.

    Group 5 discussed the concept of conversational harms, focusing on how biases can emerge in AI models, such as a GPT model, over the course of a conversation. The group highlighted that even though these models may start with no bias or predefined opinions, they can develop attitudes or biases towards specific topics based on the information provided during conversations. These biases can lead to harmful outcomes, such as making inappropriate or offensive judgments about certain groups of people. The group suggested that this phenomenon occurs because the models rely heavily on the conversational data they receive, rather than on their initial, unbiased training data.

How to use Red-Teaming?

After the in-class activity, we also discussed the potential use of red-teaming from the following perspectives:

  • Blacklisting Phrases: Through red-teaming, recurring offensive phrases can be identified. For recurring cases, certain words and phrases can be removed.

  • Removing Training Data: Identifying topics where model responses are misaligned can point to specific training data, helping to locate root causes of biases, discriminatory statements, and other undesirable output.

  • Augmenting Prompts: Attack success can be minimized by adding certain phrases to the prompts.

  • Multi-Objective Loss: While fine-tuning a model, a loss penalty can be associated with harmful output, which red-teaming helps identify.
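The phrase-blacklisting idea can be sketched in a few lines. This is a hypothetical illustration, not a method from the readings: it treats word n-grams that recur across outputs flagged during red-teaming as blocklist candidates, and the function names and thresholds are assumptions.

```python
from collections import Counter
from typing import Iterable, Set


def recurring_phrases(flagged_outputs: Iterable[str], n: int = 2,
                      min_count: int = 2) -> Set[str]:
    """Return word n-grams that appear in at least min_count flagged outputs."""
    counts: Counter = Counter()
    for text in flagged_outputs:
        words = text.lower().split()
        # Use a set so each n-gram is counted at most once per output.
        grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        counts.update(grams)
    return {gram for gram, count in counts.items() if count >= min_count}


def is_blocked(text: str, blocklist: Set[str]) -> bool:
    """Check whether any blocklisted phrase occurs in the text."""
    normalized = " ".join(text.lower().split())
    return any(phrase in normalized for phrase in blocklist)


# Placeholder outputs that a red-team classifier flagged as offensive.
flagged = ["you total fool", "such a total fool"]
blocklist = recurring_phrases(flagged)
print(blocklist)
print(is_blocked("He is a total fool.", blocklist))
print(is_blocked("Have a nice day", blocklist))
```

Simple substring matching like this is easy to evade and can also catch benign phrases, so candidate phrases would normally be reviewed by a person before being added to a production blocklist.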

Alignment Solutions

During today’s discussion, the lead team introduced two distinct alignment challenges:

  • Inner Alignment: This pertains to the alignment of a specified loss function with the primary objective, particularly in situations where designing the loss function is straightforward.

  • Outer Alignment: This involves aligning a specified objective with the desired end goal, especially when the task of designing an appropriate loss function becomes complex.

Later in our discussion, we delved into the technical details of the LLM jailbreaking paper “Universal and Transferable Adversarial Attacks on Aligned Language Models” and explored interesting findings presented during the demonstration video.

LLM Jailbreaking - Introduction

This paper introduced a new adversarial attack method that can induce aligned LLMs to produce objectionable content. Specifically, given a (potentially harmful) user query, the attacker appends an adversarial suffix to the query that attempts to induce the jailbreaking behavior.

To choose these adversarial suffix tokens, the proposed Jailbreaking trick involves three simple key components, where the careful combination of them leads to reliably successful attacks:

  1. Producing Affirmative Responses. One method for inducing objectionable behavior in language models involves forcing the model to provide a brief, affirmative response when confronted with a harmful query. For example, the authors target the model and force it to respond with “Sure, here is (content of query)”. Consistent with prior research, the authors observe that focusing on the initial response in this way triggers a specific ‘mode’ in the model, leading it to generate objectionable content immediately thereafter in its response, as illustrated in the figure below:

Figure 2: Adversarial Suffix (Image Source)

  2. Greedy Coordinate Gradient (GCG)-based Search

    Optimizing the log-likelihood of a successful attack over the discrete adversarial suffix is quite challenging. Similar to AutoPrompt, the authors propose to leverage gradients at the token level to 1) identify a set of promising single-token replacements, 2) evaluate the loss of some number of candidates in this set, and 3) select the best of the evaluated substitutions, as presented in the figure below:

    Figure 3: Greedy Coordinate Gradient (GCG) (Image Source)

    Intuition behind GCG-based Search:

    The motivation directly derives from the greedy coordinate descent approach: if one can evaluate all possible single-token substitutions, then it is possible to swap the token that maximally decreased the loss. Though evaluating all such replacements is not feasible, one can leverage gradients with respect to the one-hot token indicators to find a set of promising candidates for replacement at each token position, and then evaluate all these replacements exactly via a forward pass.

    Key differences from AutoPrompt:

    • GCG-based Search: Searches a set of possible tokens to replace at each position.
    • AutoPrompt: Only chooses a single coordinate to adjust, then evaluates replacements just for that one position.
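To make the search procedure more concrete, here is a toy sketch of the three GCG steps in NumPy. It is not the authors’ implementation: there is no language model involved, `toy_loss` is a made-up differentiable stand-in for the attack loss, all names and hyperparameters are assumptions, and, as a simplification, the current suffix is kept whenever no sampled candidate lowers the loss.

```python
import numpy as np


def toy_loss(suffix, E, w, target):
    """Hypothetical stand-in for the attack loss: squared distance between a
    position-weighted sum of token embeddings and a target vector."""
    s = (w[:, None] * E[suffix]).sum(axis=0)
    return float(((s - target) ** 2).sum())


def onehot_gradient(suffix, E, w, target):
    """Gradient of toy_loss w.r.t. the one-hot token indicator at each
    position. Shape: (suffix_len, vocab_size)."""
    s = (w[:, None] * E[suffix]).sum(axis=0)
    return 2.0 * w[:, None] * (E @ (s - target))[None, :]


def gcg_step(suffix, E, w, target, rng, top_k=8, batch_size=32):
    grad = onehot_gradient(suffix, E, w, target)
    # 1) For every position, keep the top_k tokens with the most negative
    #    gradient, i.e. the most promising single-token replacements.
    candidates = np.argsort(grad, axis=1)[:, :top_k]
    best, best_loss = suffix, toy_loss(suffix, E, w, target)
    for _ in range(batch_size):
        # 2) Sample a random position and one of its candidate tokens, then
        #    evaluate the loss of that substitution exactly.
        pos = rng.integers(len(suffix))
        tok = candidates[pos, rng.integers(top_k)]
        trial = suffix.copy()
        trial[pos] = tok
        loss = toy_loss(trial, E, w, target)
        # 3) Keep the best substitution evaluated so far.
        if loss < best_loss:
            best, best_loss = trial, loss
    return best, best_loss


rng = np.random.default_rng(0)
V, D, L = 50, 16, 6                      # toy vocabulary size, embedding dim, suffix length
E = rng.normal(size=(V, D))              # toy embedding table
w = np.linspace(0.5, 1.5, L)             # toy per-position weights
target = rng.normal(size=D)
suffix = rng.integers(V, size=L)         # random initial suffix
initial_loss = toy_loss(suffix, E, w, target)
loss = initial_loss
for step in range(20):
    suffix, loss = gcg_step(suffix, E, w, target, rng)
print(f"loss before: {initial_loss:.3f}, after: {loss:.3f}")
```

The gradient is only used to shortlist candidates at every position; the final choice relies on exact loss evaluations, which matches the intuition described above for searching over a discrete token space.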

  3. Robust Universal Multi-prompt and Multi-model Attacks

Figure 4: Universal Prompt Optimization (Image Source)

The core idea of Universal Multi-prompt and Multi-model attacks is to involve more desired prompts and more victim LLMs in the process, expecting the generated adversarial example to be transferable across victim LLMs and robust across prompts. Building upon Algorithm 1, the authors propose Algorithm 2, where loss functions over multiple models are incorporated to help achieve transferability, and a handful of prompts are employed to help ensure robustness.

The whole pipeline is illustrated in the figure below:

Figure 4: Illustration of Aligned LLMs Are Not Adversarially Aligned (Image Source)

  1. Experiment Results
  • Single Model Attack

    The following results show that the baseline methods fail to elicit harmful behavior on both Vicuna-7B and LLaMA-2-7B-Chat, whereas the proposed GCG is effective on both. The following figure illustrates that GCG quickly finds an adversarial example with small loss and continues to make gradual improvements over the remaining steps, so that the loss keeps decreasing and the attack success rate (ASR) keeps increasing.

Figure 5: Performance Comparison of Different Optimizers (Image Source)

  • Transfer Attack

    The adversarial suffix generated by GCG can also transfer successfully to other LLMs, whether they are open-source models or black-box LLMs. The authors compared different strategies for constructing the prompt, including adding “Sure, here’s” at the end of the prompt, concatenating multiple suffixes, ensembling multiple suffixes and choosing the successful one, and manual fine-tuning, which manually rephrases the human-readable prompt content. Examples of the transfer attack are shown in the following figure.

    There are also some concerns about the proposed method. For example, concatenating multiple suffixes can help mislead the model, but it can also push the original prompt too “far behind the text” for the model to generate a response to it.

Figure 6: Screenshots of Harmful Content Generation (Image Source)

LLM Jailbreaking - Demo

The leading team also showed a small demo which runs the jailbreaking attack in this paper on UVA’s computing servers. The demo can be found in this YouTube video:

Some useful links if you would like to try it out yourself:


Two main observations:

  1. The loss continues to drop very well, which aligns with the authors’ observations in the paper.
  2. Qualitatively speaking, the generated suffix at each step also corresponds to some sense of what a human might have done, for example, trying to instruct the model to do a specific task that corresponds to the prompt itself.

Potential Improvement Ideas

How to make the attack more effective or appealing?

  1. Use different languages: since the training corpus of GPT-3.5 has very little multilingual content, the alignment measures that have been taken for the model are almost entirely in English. Thus, if we use a different language to form the instruction, it might be able to circumvent the protections and produce inappropriate responses. The lead team gave a successful example using German instructions.

  2. Prompt with an opposite goal, making it sound as if the intentions are positive: instead of directly prompting the LLM to generate harmful content, we prompt the model to not generate harmful content, which makes the request sound positive. The lead team gave a successful example, which produced a convincing anonymous death threat in the style of William Shakespeare.

  3. Replace special tokens in the suffix with human-readable and comprehensible words: the lead team tried restricting the vocabulary of the suffix to alphabetic characters only, and found that this doesn’t work. This observation might suggest that special tokens play an important role in confusing the model and fulfilling whatever the attacker wants.

Closing Remarks (by Prof. Evans)

One thing that is worth thinking about is what is the real threat model here. Those examples shown in this paper, for example, how to make a bomb or anonymous threat are interesting but might not be viewed as real threats to many people. If someone wants to find out how to make a bomb, they can Google for that (or if Google decides to block it, use another search engine, or even go to a public library!).

Maybe a more practical attack scenario occurs as LLMs are embedded in applications (or connected to plugins) that have the ability to perform actions that may be influenced by text that the adversary has some control over. For example, everyone (well almost everyone!) wants an LLM that can automatically provide good responses to most of their email. Such an application would necessarily have access to all your sensitive incoming email, as well as the ability to send outgoing emails. So perhaps a malicious adversary could craft an email to send to a victim that would trick the LLM processing it (along with all of your other email) into sending sensitive information from your emails to the attacker, or into generating spearphishing emails based on content in your email and sending them with your credentials to easily identified contacts. Although the threats discussed in these red teaming papers mostly seem impractical and lack real victims, they still serve as interesting proxies for what may be real threats in the near future (if not already).

Back to top


  1. What is the alignment problem? - Blog post by Jan Leike
  2. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned Ganguli, Deep, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann et al. arXiv preprint arXiv:2209.07858 (2022).
  3. The Alignment Problem from a Deep Learning Perspective Richard Ngo, Lawrence Chan, and Sören Mindermann. arXiv preprint arXiv:2209.00626 (2022).
  4. Universal and Transferable Adversarial Attacks on Aligned Language Models Zou, Andy, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. arXiv preprint arXiv:2307.15043 (2023).

Optional Additional Readings

Background / Motivation

Alignment Readings

Adversarial Attacks / Jailbreaking

Discussion Questions

Before 5:29pm on Sunday, September 3, everyone who is not in either the lead or blogging team for the week should post (in the comments below) an answer to at least one of the four questions in the first section (1–4) and one of the questions in the second section (5–8), or a substantive response to someone else’s comment, or something interesting about the readings that is not covered by these questions.

Don’t post duplicates - if others have already posted, you should read their responses before adding your own.

Please post your responses to different questions as separate comments.

Questions about The Alignment Problem from a Deep Learning Perspective (and alignment in general)

  1. Section 2 of the paper presents the issue of reward hacking. Specifically, as the authors expand to situationally-aware reward hacking, examples are presented of possible ways the reward function could be exploited in clever ways to prevent human supervisors from recognizing incorrect behavior. InstructGPT, a variant of GPT by OpenAI, uses a similar reinforcement learning human feedback loop to fine-tune the model under human supervision. Besides the examples provided (or more specifically than those examples), how might a GPT model exploit this system? What are the risks associated with reward hacking in this situation?
  2. To what extent should developers be concerned about “power-seeking” behavior at the present? Are there imminent threats that could come from a misaligned agent, or are these concerns that will only be realized as the existing technology is improved?
  3. Given the vast literature on alignment, one straightforward “solution to alignment” seems: provide the model with all sets of rules/issues in the literature so far, and ask it to “always be mindful” of such issues. For instance, simply ask the model (via some evaluation that itself uses AI, or some self-supervised learning paradigm) to “not try to take control” or related terms. Why do you think this could work, or would not? Can you think of any existing scenarios where such “instructions” are explicit and yet the model naturally bypasses them?
  4. Section 4.3 mentions “an AI used for drug development, which was repurposed to design toxins” - this is a straightforward example of an adversarial attack that simply maximizes (instead of minimizing) a given objective. As long as gradients exist, this should be true for any kind of model. How do you think alignment can possibly solve this, or is this even something that can ever be solved? We could have the perfect aligned model, but what stops a bad actor from running such gradient-based (or similar) attacks to “maximize” some loss on a model that has been trained with all sorts of alignment measures.

Questions about Universal and Transferable Adversarial Attacks on Aligned Language Models (and attacks/red-teaming in general)

  5. The paper has some very specific design choices, such as only using a suffix for prompts, or targeting responses such that they start with specific phrases. Can you think of some additions/modifications to the technique(s) used in the paper that could potentially improve attack performance?
  6. From an adversarial perspective, these LLMs ultimately rely on data seen (and freely available) on the Internet. Why then, is “an LLM that can generate targeted jokes” or “an LLM that can give plans on toppling governments” an issue, when such information is anyway available on the Internet (and with much less hassle, given that jailbreaking can be non-trivial and demanding at times)? To paraphrase, shouldn’t the focus be on “aligning users”, and not models?
  7. The training data for Llama-2 (the model used for crafting examples in most experiments) is mostly English, as is most text on the Internet. Do you think the adversary could potentially benefit from phrasing their prompt in another language, perhaps appended with “Respond in English”? Do you think another language could help/harm attack success rates?
  8. Several fields of machine learning have looked at adversarial robustness, out-of-domain generalization, and other problems/techniques to help models, in some sense, “adapt” better to unseen environments. Do you think alignment is the same as out-of-domain generalization (or similar notions of generalization), or are there some fundamental differences between the two?

Back to top

Week 1: Introduction

(see bottom for assigned readings and questions)

Attention, Transformers, and BERT

Monday, 28 August

Transformers1 are a class of deep learning models that have revolutionized the field of natural language processing (NLP) and various other domains. The concept of transformers originated as an attempt to address the limitations of traditional recurrent neural networks (RNNs) in sequential data processing. Here’s an overview of transformers' evolution and significance.

Background and Origin

RNNs2 were one of the earliest models used for sequence-based tasks in machine learning. They processed input tokens one after another and used their internal memory to capture dependencies in the sequence. The following figure gives an illustration of the RNN architecture.

RNN (Image Source)

Limitations of RNNs. Despite many improvements over this basic architecture, RNNs have the following shortcomings:

  • RNNs struggle with long sequences. They retain recent information but lose long-term memory.
  • RNNs suffer from vanishing gradients3. The gradients used to update the model become very small during backpropagation through many time steps, so the RNN learns little or nothing about early parts of the sequence.
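To make the vanishing gradient problem concrete, here is a minimal sketch. It assumes a toy scalar recurrence h_t = tanh(w * h_{t-1}) with no inputs, and the weight w = 0.5 and starting state 0.1 are arbitrary illustrative values, not taken from any real model. By the chain rule, the gradient of the final state with respect to the initial state is a product of one factor per time step, and each factor here is smaller than 0.5.

```python
import math

def gradient_through_time(w, h0, steps):
    """Return d h_T / d h_0 for the toy recurrence h_t = tanh(w * h_{t-1})."""
    h = h0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return grad

# The gradient shrinks roughly geometrically with the number of steps.
for steps in (1, 10, 50):
    print(steps, gradient_through_time(w=0.5, h0=0.1, steps=steps))
```

After 50 steps the gradient is below 0.5 to the power 50, which is far too small to drive a meaningful weight update.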

Introduction of LSTMs. Long Short-Term Memory (LSTM)4 networks were then introduced to address the vanishing gradient problem in RNNs. LSTMs had memory cells and gating mechanisms that allowed them to capture long-term memories more effectively. While LSTMs improved memory retention, they were still computationally expensive and slow to train, especially on large datasets.

Attention Mechanism. The attention mechanism561 was introduced as a way to help models focus on relevant parts of the input sequence when generating output. This addressed the memory issues that plagued previous models. Attention mechanisms allow models to weigh the importance of different input tokens when making predictions or encoding information. In essence, attention enables the model to focus selectively on relevant parts of the input sequence while disregarding less pertinent ones. Two variants matter most for transformers: self-attention, in which a sequence attends to its own tokens, and multi-head attention, which runs several attention operations (heads) in parallel.

The Transformer Model

The transformer architecture, introduced by Vaswani et al. (2017) 1, marked a significant advance in NLP. It used self-attention mechanisms to process input tokens in parallel and capture contextual information more effectively. Transformers broke down sentences into smaller parts and learned statistical relationships between these parts to understand meaning and generate responses. The model utilized input embeddings to represent words and positional encodings to address the lack of inherent sequence information. The core innovation was the self-attention mechanism, which allowed tokens to consider their relationships with all other tokens in the sequence.
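As a rough illustration of how each token can "consider its relationships with all other tokens", here is a minimal pure-Python sketch of scaled dot-product self-attention. It is a simplification under stated assumptions: the same toy vectors are used as queries, keys, and values, whereas a real transformer first applies learned projection matrices. The function names `softmax` and `self_attention` are our own, not from any library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: three 2-dimensional token vectors (assumed values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and says how much one token draws on every token in the sequence, including itself. Note that no row depends on the result of another row, which is why the computation can run in parallel across tokens.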

Benefits of Transformers. Transformers can capture complex contextual relationships in language, making them highly effective for a wide range of NLP tasks. The parallel processing capabilities of transformers, enabled by self-attention, drastically improved training efficiency and reduced the vanishing gradient problem.

Mathematical Foundations. Transformers involve mathematical representations of words and their relationships. The model learns to establish connections between words based on their contextual importance.
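For reference, the scaled dot-product attention and multi-head attention defined in Vaswani et al.1 can be written as follows, where Q, K, and V are the query, key, and value matrices, d_k is the key dimension, h is the number of heads, and the W matrices are learned projections:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},
\qquad
\mathrm{head}_i = \mathrm{Attention}\!\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right)
```

The softmax turns each row of scores into weights that sum to one, so every output row is a weighted average of the rows of V.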

Crucial Role in NLP. Transformers play a crucial role in capturing the meaning of words and sentences78, allowing for more accurate and contextually relevant outputs in various NLP tasks. In summary, transformers, with their innovative attention mechanisms, have significantly advanced the field of NLP by enabling efficient processing of sequences, capturing context effectively, and achieving state-of-the-art performance on a variety of tasks.

Advancements in Transformers. One significant advancement of transformers over previous models like LSTMs and RNNs is their ability to handle long-range dependencies and capture contextual information more effectively. Transformers achieve this through self-attention and multi-head attention. This allows them to process input tokens in parallel, rather than sequentially, leading to improved efficiency and performance. However, a drawback is increased computational cost on long inputs: self-attention compares every pair of tokens, so the work grows quadratically with sequence length.

Positional Encodings. The use of positional encodings in transformers helps address the lack of inherent positional information in their architecture. This enables transformers to handle sequential data effectively without relying solely on the order of tokens. The benefits include scalability and the ability to handle longer sequences, but a potential drawback is that these positional encodings might not fully capture complex positional relationships in very long sequences.
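One concrete choice of positional encoding is the sinusoidal scheme from Vaswani et al.1, where even dimensions use a sine and odd dimensions use a cosine of the position divided by a dimension-dependent wavelength. The sketch below is a plain-Python illustration of that formula; the function name and the small sizes are assumptions made for the example.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal encodings: one d_model-length vector per position."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(num_positions=4, d_model=8)
for row in pe:
    print([round(x, 3) for x in row])
```

These vectors are added to the input embeddings, so two occurrences of the same word at different positions get different representations.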

Self-Attention and Multi-Head Attention. Self-attention allows each token to consider its relationships with all other tokens in a sequence. While it provides a more nuanced understanding of the input, it can be computationally expensive. Multi-head attention further enhances the model’s ability to capture different types of dependencies in the data. The number of attention heads (e.g., 8 in the base Transformer; BERT-base uses 12) is a balance between performance and complexity. Too few or too many heads can result in suboptimal performance. More details about self-attention and multi-head attention can be found in 9.

Context and Answers in Activities. Let’s try an activity now! Fill in the blanks:

I used to ___ 

Yesterday, I went to ___

It is raining ___

The context given in the activities influences the answers provided. More context leads to more accurate responses. This highlights how models like BERT benefit from bidirectional attention, as they can consider both preceding and succeeding words when generating predictions.

BERT: Bidirectional Transformers

BERT’s Design and Limitations. BERT10 uses bidirectional attention and masking to enable it to capture context from both sides of a word. The masking during training helps the model learn to predict words in context, simulating its real-world usage. While BERT’s design was successful, it does require a substantial amount of training data and resources. Its applications tend to focus on tasks such as sentiment analysis, named entity recognition, and question answering, while GPT is better at handling tasks such as content creation, text summarization, and machine translation11.
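The masking step can be sketched in a few lines. The BERT paper10 selects about 15% of token positions as prediction targets; of those, 80% are replaced with a [MASK] token, 10% with a random token, and 10% are left unchanged. The code below is a simplified illustration, not BERT's actual preprocessing: it flips an independent coin per token rather than sampling exactly 15% of positions, and the function name, toy vocabulary, and seed are assumptions made for the example.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) using an 80/10/10 replacement rule."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                    # model must predict the original
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")           # 80%: replace with mask token
            elif r < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with random token
            else:
                masked.append(tok)                # 10%: keep unchanged
        else:
            labels.append(None)                   # not a prediction target
            masked.append(tok)
    return masked, labels

sentence = "yesterday i went to the store to buy some milk".split()
toy_vocab = ["store", "milk", "rain", "school", "buy"]
masked, labels = mask_tokens(sentence, toy_vocab, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Because the model sees the unmasked words on both sides of each target, it learns to use left and right context together, which is what the fill-in-the-blank activity above is meant to illustrate.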

Image Source

Future Intent of BERT Authors. The authors of BERT might not have fully anticipated its exact future use and impact. While they likely foresaw its usefulness, the swift and extensive adoption of language models across diverse applications likely surpassed their expectations. The increasing accessibility and scalability of technology likely contributed to this rapid adoption. As mentioned by the professor in class, the decision to publish something in industry (and at Google in particular) often depends on its perceived commercial value. If Google were aware of the future commercial value of transformers and the methods introduced by BERT, they may not have published these papers openly (although this is purely speculation without any knowledge of the internal process that might have been followed to publish these papers).

Discussion Questions

Q: What makes language models different from transformers?

A language model encompasses various models that understand language, whereas transformers represent a specific architecture. Language models are tailored for natural languages, while transformers have broader applications. For example, transformers can be utilized in tasks beyond language processing, such as predicting protein structures from amino acid sequences (as done by AlphaFold).

Q: Why was BERT published in 2019, inspiring large language models, and why have GPT models continued to improve while BERT’s advancements seem comparatively limited?

Decoder models, responsible for generating content, boast applications that are both visible and instantly captivating to the public. Examples like chatbots, story generators, and models from the GPT series showcase this ability by producing human-like text. This immediate allure likely fuels increased research and investment. Due to the inherent challenges in producing coherent and contextually appropriate outputs, generative tasks have garnered significant research attention. Additionally, decoder models, especially transformers like GPT-212 and GPT-313, excel in transfer learning, allowing pre-trained models to be fine-tuned for specific tasks, highlighting their remarkable adaptability.

Q: Why use 8 heads in the transformer architecture?

The decision to use 8 attention heads is a deliberate choice that strikes a balance between complexity and performance. Having more attention heads can capture more intricate relationships but increases computational demands, whereas fewer heads might not capture as much detail.
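One way to see the trade-off is that the model width is divided among the heads, so adding heads makes each head narrower. In the base Transformer of Vaswani et al.1 the model width is 512 with 8 heads, and BERT-base uses a width of 768 with 12 heads. The sketch below shows the arithmetic; slicing a vector into contiguous chunks is a simplification (real implementations apply learned projections first), and the function names are our own.

```python
def head_dimension(d_model, num_heads):
    """Per-head width when d_model is split evenly across heads."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    return d_model // num_heads

def split_heads(vector, num_heads):
    """Slice one token vector into num_heads contiguous chunks."""
    d_head = head_dimension(len(vector), num_heads)
    return [vector[i * d_head:(i + 1) * d_head] for i in range(num_heads)]

print(head_dimension(512, 8))    # base Transformer
print(head_dimension(768, 12))   # BERT-base
print(split_heads(list(range(8)), num_heads=4))
```

Both configurations give each head 64 dimensions, so the total amount of attention computation stays close to that of a single full-width head.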

Q: BERT employs bidirectional context to pretrain its embeddings, but does this approach genuinely capture the entirety of language context?

The debate arises from the fact that while bidirectional context is powerful, it might not always capture more complex contextual relationships, such as those involving long-range dependencies or nuanced interactions between distant words. Some argue that models with other architectures or training techniques might better capture such intricate language nuances.

Wednesday: Training LLMs, Risks and Rewards

In the second class discussion, the team talked about LLMs and tried to make sense of how they’re trained, where they get their knowledge, and where they’re used. Here’s what they found out.

How do LLMs become so clever?

Before LLMs become language wizards, they need to be trained. The crucial question is where they acquire their knowledge.

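LLMs need lots and lots of information to learn from. They look at stuff like internet articles, books, and even Wikipedia. But there’s a catch: raw web text is messy, so it gets cleaned up first. For example, “C4” (the Colossal Clean Crawled Corpus) is a version of Common Crawl web data that has been filtered to make the text tidier and more reliable.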
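To give a flavor of what this kind of clean-up involves, here is a small sketch of rule-based page filtering. It is loosely modeled on the kinds of heuristics described for C4 (keeping only lines that end in terminal punctuation, dropping very short lines, and discarding placeholder or code-like pages), but the function name and the thresholds are illustrative assumptions, not the actual pipeline.

```python
TERMINAL_PUNCTUATION = (".", "!", "?", '"')

def clean_page(text, min_words=3, min_lines=3):
    """Return cleaned text, or None if the page should be discarded."""
    lowered = text.lower()
    # Drop placeholder pages and pages that look like code.
    if "lorem ipsum" in lowered or "{" in text:
        return None
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue  # likely a menu item, title, or fragment
        if len(line.split()) < min_words:
            continue  # too short to be a real sentence
        if "javascript" in line.lower():
            continue  # "please enable Javascript" style warnings
        kept.append(line)
    # Discard pages with too little remaining prose.
    if len(kept) < min_lines:
        return None
    return "\n".join(kept)

page = (
    "Home | About\n"
    "This is the first sentence.\n"
    "Here is a second one!\n"
    "Is this the third line?\n"
    "Please enable Javascript to continue."
)
print(clean_page(page))
```

Simple rules like these remove a lot of navigation text and boilerplate, but they can also throw away legitimate content, which is one reason data-cleaning choices affect what a model ends up learning.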

Training LLMs requires potent computational resources, such as Graphics Processing Units (GPUs). Computationally expensive large-scale training, while crucial for enhancing their capabilities, involves substantial energy consumption, which, depending on how the energy is produced, may emit large amounts of carbon dioxide.

Transitioning to the practical applications of these language models, LLMs excel in diverse domains14. They can undergo meticulous fine-tuning to perform specialized tasks, ranging from aiding in customer service to content generation for websites. Furthermore, these models exhibit the ability to adapt and learn from feedback, mirroring human learning processes.

Risks and Rewards

In our class discussion, we had a friendly debate about LLMs. Some students thought they were fantastic because they can boost productivity, assist with learning, and bridge gaps between people. They even saw LLMs as potential problem solvers for biases in the human world.

But others had concerns. They worried about things like LLMs being too mysterious (like a black box), how they could influence the way people think, and the risks of false information and deep fakes. Some even thought that LLMs might detrimentally impact human intelligence and creativity.

In our debate, there were some interesting points made:

Benefits Group.

  • LLMs can enhance creativity and accelerate tasks.
  • They have the potential to facilitate understanding and learning.
  • Utilizing LLMs may streamline the search for ideas.
  • LLMs offer a tool for uncovering and rectifying biases within our human society. Unlike human biases, there are technical approaches to mitigate biases in models.

Risks Group.

  • Concerns were expressed regarding LLMs' opacity and complexity, making them challenging to comprehend.
  • Apprehensions were raised about LLMs potentially exerting detrimental influences on human cognition and societal dynamics.
  • LLMs are ripe for potential abuses in their ability to generate convincing false information cheaply.
  • The potential impact of LLMs on human intelligence and creativity was a topic of contemplation.

After the debate, both sides had a chance to respond:

Benefits Group Rebuttals.

  • Advocates pointed out that ongoing research aims to enhance the transparency of LLMs, reducing their resemblance to black boxes.
  • They highlighted collaborative efforts directed at the improvement of LLMs.
  • The significance and potential of LLMs in domains such as medicine and engineering was emphasized.
  • Although the ability of generative AI to produce art in the style of an artist is damaging to the career of that artist, it is overall beneficial to society, enabling many others to create desired images.
  • Addressing economic concerns, proponents saw LLMs as catalysts for the creation of new employment opportunities and enhancers of human creativity.

Risks Group Rebuttals.

  • They noted the existence of translation models and the priority of fairness in AI.
  • Advocates asserted that LLMs can serve as tools to identify and mitigate societal biases.
  • The point was made that AI can complement, rather than supplant, human creativity.
  • Although generating AI art may have immediate benefits to its users, it has long term risks to our culture and society if individuals are no longer able to make a living as artists or find the motivation to learn difficult skills.

Wrapping It Up. So, there you have it, a peek into the world of Large Language Models and the lively debate about their pros and cons. As you explore the world of LLMs, remember that they have the power to be amazing tools, but they also come with responsibilities. Use them wisely, consider their impact on our world, and keep the discussion going!


Introduction to Large Language Models (from Stanford course)

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Attention Is All You Need. https://arxiv.org/abs/1706.03762. NeurIPS 2017.

These two blog posts by Jay Alammar are not required readings but may be helpful for understanding attention and Transformers:

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.

Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving, Iason Gabriel. Ethical and social risks of harm from Language Models. DeepMind, 2021. https://arxiv.org/abs/2112.04359

Optional Additional Readings:

Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI). On the Opportunities and Risks of Foundation Models

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. Conference of the North American Chapter of the Association for Computational Linguistics, 2018.

GPT1: Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.

GPT2: Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, 2019.

GPT3: Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.

Discussion Questions

Before 5:29pm on Sunday, August 27, everyone who is not in either the lead or blogging team for the week should post (in the comments below) an answer to at least one of the three questions in the first section (1–3) and one of the questions in the second section (4–7), or a substantive response to someone else’s comment, or something interesting about the readings that is not covered by these questions.

Don’t post duplicates - if others have already posted, you should read their responses before adding your own.

Questions about “Attention is All You Need” and “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”:

  1. Many things in the paper (especially “Attention is All You Need”) seem mysterious and arbitrary. Identify one design decision described in the paper that seems arbitrary, and possible alternatives. If you can, hypothesize on why the one the authors made was selected and worked.

  2. What were the key insights that led to the Transformers/BERT design?

  3. What is something you don’t understand in the paper?


Questions about “Ethical and social risks of harm from Language Models”

  1. The paper identifies six main risk areas and 21 specific risks. Do you agree with their choices? What are important risks that are not included in their list?

  2. The authors are at a company (DeepMind, part of Google/Alphabet). How might their company setting have influenced the way they consider risks?

  3. This was written in December 2021 (DALL-E was released in January 2021; ChatGPT was released in November 2022; GPT-4 was released in March 2023). What has changed since then that would have impacted perception of these risks?

  4. Because training and operating servers typically requires fresh water and fossil fuels, how should we think about the environmental harms associated with LLMs?

  5. The near and long-term impact of LLMs on employment is hard to predict. What jobs do you think are vulnerable to LLMs beyond the (seemingly) obvious ones mentioned in the paper? What are some jobs you think will be most resilient to advances in AI?

  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30. ↩︎

  2. Sherstinsky, A. (2020). Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Physica D: Nonlinear Phenomena, 404, 132306. ↩︎

  3. Pascanu, R., Mikolov, T., & Bengio, Y. (2013, May). On the difficulty of training recurrent neural networks. In International conference on machine learning (pp. 1310-1318). PMLR. ↩︎

  4. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780. ↩︎

  5. Mnih, V., Heess, N., & Graves, A. (2014). Recurrent models of visual attention. Advances in neural information processing systems, 27. ↩︎

  6. Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. ↩︎

  7. Lin, T., Wang, Y., Liu, X., & Qiu, X. (2022). A survey of transformers. AI Open. ↩︎

  8. Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S., & Shah, M. (2022). Transformers in vision: A survey. ACM computing surveys (CSUR), 54(10s), 1-41. ↩︎

  9. Karim, R. (2023, January 2). Illustrated: Self-attention. Medium. https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a ↩︎

  10. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. ↩︎

  11. Ahmad, K. (2023b, April 26). GPT vs. Bert: What are the differences between the two most popular language models?. MUO. https://www.makeuseof.com/gpt-vs-bert/ ↩︎

  12. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9. ↩︎

  13. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., … & Amodei, D. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877-1901. ↩︎

  14. Yang, J., Jin, H., Tang, R., Han, X., Feng, Q., Jiang, H., … & Hu, X. (2023). Harnessing the power of llms in practice: A survey on chatgpt and beyond. arXiv preprint arXiv:2304.13712. ↩︎

Github Discussions

Everyone should have received an invitation to the github discussions site, and be able to see the posts there and submit your own posts and comments. If you didn’t get this invitation, it was probably blocked by the email system. Try visiting:


(while logged into the github account you listed on your form).

Once you’ve accepted the invitation, you should be able to visit https://github.com/llmrisks/discussions/discussions/2 (the now-finalized discussion post for Week 1), and contribute to the discussions there.

Class 0: Getting Organized

I’ve updated the Schedule and Bi-Weekly Schedule based on the discussions today.

The plan is below:

Two Weeks Before
  • Lead Team: Come up with idea for the week and planned readings, send to me by 5:29pm on Tuesday (2 weeks - 1 day before)
  • Blogging Team: -
  • Everyone Else: -

Week Before
  • Lead Team: Post plan and questions in github discussions by no later than 9am Wednesday; prepare for leading meetings
  • Blogging Team: Prepare plan for blogging (how you will divide workload, collaborative tools for taking notes and writing)
  • Everyone Else: Read/do materials and respond to preparation questions in github discussions (by 5:29pm Sunday)

Week of Leading Meetings
  • Lead Team: Lead interesting, engaging, and illuminating meetings! Aim to include activities, discussions, whiteboard presentations, etc., not just showing powerpoint slides.
  • Blogging Team: Take notes to prepare to write blog; participate actively in class meetings
  • Everyone Else: Participate actively in class meetings

Week After
  • Lead Team: Help blogging team with materials, answering questions
  • Blogging Team: Write blog summary, submit PR
  • Everyone Else: Provide feedback on blog summary

For this week, Team 1 should finalize the post by this Thursday (5:29pm), and Team 2 should send me (email to evans@virginia.edu) the plan for week 2 by Sunday, 27 August and have materials ready for posting by the next Wednesday.

The teams have been posted at https://github.com/llmrisks/discussions/blob/main/teams.md (visible only to the class). Everyone should have gotten an invite to the github organization, but if not please check with me.


Some materials have been posted on the course site:

  • Syllabus
  • Schedule (you will find out which team you are on at the first class Wednesday)
  • Readings and Topics (a start on a list of some potential readings and topics that we might want to cover)

Dall-E Prompt: "comic style drawing of a phd seminar on AI"

Welcome Survey

Please submit this welcome survey before 8:59pm on Monday, August 21:


Your answers won’t be shared publicly, but I will use the responses to the survey to plan the seminar, including forming teams, and may share some aggregate and anonymized results and anonymized quotes from the surveys.

Welcome to the LLM Risks Seminar

Full Transcript

Seminar Plan

The actual seminar won’t be fully planned by GPT-4, but more information on it won’t be available until later.

I’m expecting the structure and format to combine aspects of this seminar on adversarial machine learning and this course on computing ethics, but with a topic focused on learning as much as we can about the potential for both good and harm from generative AI (including large language models) and things we can do (mostly technically, but including policy) to mitigate the harms.

Expected Background: Students are not required to have prior background in machine learning or security, but will be expected to learn whatever background they need on these topics mostly on their own. The seminar is open to ambitious undergraduate students and research-focused graduate students with interests in machine learning, privacy, fairness, security, and related topics. Instructor permission is required to enroll, and decisions about enrollment will be based on what you are able to bring to the seminar.

Seminar Format: The details will be worked out later, but the basic structure will divide the class into three or four teams (somewhat like what was done), each with set responsibilities for each week of the seminar. One team will be responsible for leading the seminar, including selecting readings/viewings/activities for the rest of the course and leading discussions in class (with help from the instructor). Another team will be responsible for writing a “blog” that summarizes the content of the week.

Content: We expect the content will be a mix of technical background papers, recent research papers, and less formal writings and videos. Although we will focus on understanding technical aspects of the issues, we will also consider non-technical ones including societal impacts and legal and policy aspects.


Some initial ideas for course readings will be posted, but it will be largely up to the student teams leading to select good readings for the topics to consider.

Full Transcript