model.generate single CPU core bottleneck

### System Info

I'm running inference on a GPU EC2 instance using CUDA.  After doing a little profiling I noticed the model.generate method was the clear bottleneck.  Upon closer inspection running htop showed that during this method call only a single cpu core is used and is maxed out to 100%.  I've made sure sure all of my weights, biases and activations are all on the gpu.  Running nvidia-smi shows the proper amount of VRAM usage.  My question is, is there a way to speed this method up using multiple CPU cores?  I haven't dug deep into the method to see exactly what is causing the issue yet.  Just curious if there's something obvious I'm missing.

(Basic sample code)
```
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig
import torch

tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")

model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    load_in_8bit=True,
    device_map="auto",
)

PROMPT = f"""### Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
### Input: "What is the difference between a llama and a vicuna?"
### Response:"""

inputs = tokenizer(
    PROMPT,
    return_tensors="pt",
)
input_ids = inputs["input_ids"].cuda()

generation_config = GenerationConfig(
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.15,
)

  generation_output = model.generate(
      input_ids=input_ids,
      generation_config=generation_config,
      return_dict_in_generate=True,
      output_scores=True,
      max_new_tokens=128,
  )

for s in generation_output.sequences:
    print(tokenizer.decode(s))
```

### Who can help?

_No response_

### Information

- [X] The official example scripts
- [X] My own modified scripts

### Tasks

- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

Run this method with any model and profile the cpu cores to see it's only using a single core:
generation_output = model.generate(
      input_ids=input_ids,
      generation_config=generation_config,
      return_dict_in_generate=True,
      output_scores=True,
      max_new_tokens=128,
  )

### Expected behavior

Multiprocessing to help balance the load.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

model.generate single CPU core bottleneck #24524

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

model.generate single CPU core bottleneck #24524

Description

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions