[Updated issue] Prompt + Generation is limited to n_ctx. #331
Comments
Yeah, I am having the same issue. |
I tricked it into working by increasing n_ctx to 2400
Also 2049 wtf. |
Happily running with a
|
Do context sizes beyond 2048 make any sense for llama-based models that have only been trained up to a context size of 2048? |
I couldn't get the |
Don't get me wrong. I'm not trying to go beyond 2048; I'm trying to force the model to run within the 2048 context size. Its current self-imposed limit is around 1650 due to some bug. The workaround only tricks it into working "as intended". |
Since your prompt processing was just 88 tokens, I'm not sure I'm getting your point here; this has nothing to do with the discussion. Try sending a large first prompt (around 1800 tokens but below 2048, with n_ctx=2048). Then it will generate 0 tokens. Judging by your data, you were just sending small prompts to the model, which was not our point at all. Our point is that when you send a large prompt, even one below the 2048 ctx, the AI will not generate anything. And when chatting with the model, the max ctx is around 1600 instead of 2048. |
Can you send a reproducible example? This has not been my experience using long prompts and a context size of 8192. It is possible of course that you're hitting an edge case. |
I use ooba here (actually it's SillyTavern, and ooba works as the API), but ooba itself isn't responsible for anything.
These are the results with the workaround, where I load the model with n_ctx = 2400. As you can see, it generates smoothly at a context close to 2k. Now I'll change nothing, literally the same prompt, but I'll reload the model with n_ctx = 2048.
Here we go:
Fails successfully, as expected. Now I will load the larger n_ctx back,
and now it works again as expected:
Windows 11. CuBLAS, latest version of everything: |
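For anyone who wants to reproduce this outside of ooba/SillyTavern, here is a minimal sketch of the failing case and the workaround using llama-cpp-python directly. The model path and the filler prompt are placeholders, not taken from this thread; any prompt of roughly 1900 tokens should show the same behaviour.

```python
# Sketch of the behaviour described above: with n_ctx=2048 and a prompt close to
# the limit, generation reportedly returns 0 tokens; reloading the model with
# n_ctx=2400 makes the same prompt generate normally.
from llama_cpp import Llama

def run(n_ctx: int) -> None:
    # Placeholder model path; use any local GGML model you have.
    llm = Llama(model_path="./models/llama-13b-ggml-q5_1.bin", n_ctx=n_ctx)
    # Filler text of roughly 1900 tokens, just to get close to the 2048 limit.
    long_prompt = "Please complete the following text: " + "the steam engine " * 600
    out = llm(long_prompt, max_tokens=200)
    choice = out["choices"][0]
    print(f"n_ctx={n_ctx}: {len(choice['text'])} chars generated, "
          f"finish_reason={choice['finish_reason']}")

run(2048)  # reported failure: 0 tokens generated
run(2400)  # reported workaround: generates normally
```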
Admittedly the perplexity isn't at all good, but as per @jmtatsch that's likely due to llama's designed context of 2048:
|
And god saw it wasn't good so he gave us https://huggingface.co/epfml/landmark-attention-llama7b-wdiff |
Now let's pray to the llama gods of Apache 2.0 that they also attend to our pleas for large contexts. |
Sure. I'm using Ooba, but according to Priestru it's a general issue with llama-cpp-python. I've loaded a ggml 5_1 13B model in Ooba with a max context of 2048 and max new tokens 200 (this is the default and it's important; when trying to reproduce the issue, please do not use 8K context). Then I send this prompt to the model, which is just a bit over 1900 tokens: Please complete the following text:
"In most reciprocating piston engines, the steam reverses its direction of flow at each stroke (counterflow), entering and exhausting from the same end of the cylinder. The complete engine cycle occupies one rotation of the crank and two piston strokes; the cycle also comprises four events – admission, expansion, exhaust, compression. These events are controlled by valves often working inside a steam chest adjacent to the cylinder; the valves distribute the steam by opening and closing steam ports communicating with the cylinder end(s) and are driven by valve gear, of which there are many types.[citation needed]
The simplest valve gears give events of fixed length during the engine cycle and often make the engine rotate in only one direction. Many however have a reversing mechanism which additionally can provide means for saving steam as speed and momentum are gained by gradually "shortening the cutoff" or rather, shortening the admission event; this in turn proportionately lengthens the expansion period. However, as one and the same valve usually controls both steam flows, a short cutoff at admission adversely affects the exhaust and compression periods which should ideally always be kept fairly constant; if the exhaust event is too brief, the totality of the exhaust steam cannot evacuate the cylinder, choking it and giving excessive compression ("kick back").[60]
In the 1840s and 1850s, there were attempts to overcome this problem by means of various patent valve gears with a separate, variable cutoff expansion valve riding on the back of the main slide valve; the latter usually had fixed or limited cutoff. The combined setup gave a fair approximation of the ideal events, at the expense of increased friction and wear, and the mechanism tended to be complicated. The usual compromise solution has been to provide lap by lengthening rubbing surfaces of the valve in such a way as to overlap the port on the admission side, with the effect that the exhaust side remains open for a longer period after cut-off on the admission side has occurred. This expedient has since been generally considered satisfactory for most purposes and makes possible the use of the simpler Stephenson, Joy and Walschaerts motions. Corliss, and later, poppet valve gears had separate admission and exhaust valves driven by trip mechanisms or cams profiled so as to give ideal events; most of these gears never succeeded outside of the stationary marketplace due to various other issues including leakage and more delicate mechanisms.[58][61]
Compression
Before the exhaust phase is quite complete, the exhaust side of the valve closes, shutting a portion of the exhaust steam inside the cylinder. This determines the compression phase where a cushion of steam is formed against which the piston does work whilst its velocity is rapidly decreasing; it moreover obviates the pressure and temperature shock, which would otherwise be caused by the sudden admission of the high-pressure steam at the beginning of the following cycle.[citation needed]
Lead in the valve timing
The above effects are further enhanced by providing lead: as was later discovered with the internal combustion engine, it has been found advantageous since the late 1830s to advance the admission phase, giving the valve lead so that admission occurs a little before the end of the exhaust stroke in order to fill the clearance volume comprising the ports and the cylinder ends (not part of the piston-swept volume) before the steam begins to exert effort on the piston.[62]
Uniflow (or unaflow) engine
Main article: Uniflow steam engine
Animation of a uniflow steam engine.
The poppet valves are controlled by the rotating camshaft at the top. High-pressure steam enters, red, and exhausts, yellow.
Uniflow engines attempt to remedy the difficulties arising from the usual counterflow cycle where, during each stroke, the port and the cylinder walls will be cooled by the passing exhaust steam, whilst the hotter incoming admission steam will waste some of its energy in restoring the working temperature. The aim of the uniflow is to remedy this defect and improve efficiency by providing an additional port uncovered by the piston at the end of each stroke making the steam flow only in one direction. By this means, the simple-expansion uniflow engine gives efficiency equivalent to that of classic compound systems with the added advantage of superior part-load performance, and comparable efficiency to turbines for smaller engines below one thousand horsepower. However, the thermal expansion gradient uniflow engines produce along the cylinder wall gives practical difficulties.[citation needed].
Turbine engines
Main article: Steam turbine
A rotor of a modern steam turbine, used in a power plant
A steam turbine consists of one or more rotors (rotating discs) mounted on a drive shaft, alternating with a series of stators (static discs) fixed to the turbine casing. The rotors have a propeller-like arrangement of blades at the outer edge. Steam acts upon these blades, producing rotary motion. The stator consists of a similar, but fixed, series of blades that serve to redirect the steam flow onto the next rotor stage. A steam turbine often exhausts into a surface condenser that provides a vacuum. The stages of a steam turbine are typically arranged to extract the maximum potential work from a specific velocity and pressure of steam, giving rise to a series of variably sized high- and low-pressure stages. Turbines are only efficient if they rotate at relatively high speed, therefore they are usually connected to reduction gearing to drive lower speed applications, such as a ship's propeller. In the vast majority of large electric generating stations, turbines are directly connected to generators with no reduction gearing. Typical speeds are 3600 revolutions per minute (RPM) in the United States with 60 Hertz power, and 3000 RPM in Europe and other countries with 50 Hertz electric power systems. In nuclear power applications, the turbines typically run at half these speeds, 1800 RPM and 1500 RPM. A turbine rotor is also only capable of providing power when rotating in one direction. Therefore, a reversing stage or gearbox is usually required where power is required in the opposite direction.[citation needed]
Steam turbines provide direct rotational force and therefore do not require a linkage mechanism to convert reciprocating to rotary motion. Thus, they produce smoother rotational forces on the output shaft. This contributes to a lower maintenance requirement and less wear on the machinery they power than a comparable reciprocating engine.[citation needed]
Turbinia – the first steam turbine-powered ship
The main use for steam turbines is in electricity generation (in the 1990s about 90% of the world's electric production was by use of steam turbines)[3] however the recent widespread application of large gas turbine units and typical combined cycle power plants has resulted in reduction of this percentage to the 80% regime for steam turbines. In electricity production, the high speed of turbine rotation matches well with the speed of modern electric generators, which are typically direct connected to their driving turbines. In marine service, (pioneered on the Turbinia), steam turbines with reduction gearing (although the Turbinia has direct turbines to propellers with no reduction gearbox) dominated large ship propulsion throughout the late 20th century, being more efficient (and requiring far less maintenance) than reciprocating steam engines. In recent decades, reciprocating Diesel engines, and gas turbines, have almost entirely supplanted steam propulsion for marine applications.[citation needed]
Virtually all nuclear power plants generate electricity by heating water to provide steam that drives a turbine connected to an electrical generator. Nuclear-powered ships and submarines either use a steam turbine directly for main propulsion, with generators providing auxiliary power, or else employ turbo-electric transmission, where the steam"
The generation immediately stops. This is how it looks in the WebUI: In the command line: "Output generated in 0.31 seconds (0.00 tokens/s, 0 tokens, context 1933, seed 1692940365)". Please refer to #307; this is the exact same issue. We don't want a longer context than 2048 (at least right now), we want to send long prompts within the 2048-token window without the generation stopping entirely. |
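As a cross-check on the WebUI's "context 1933" figure, the prompt can be tokenized with llama-cpp-python itself; a small sketch, with the model path as a placeholder:

```python
# Sketch: count the prompt's tokens with the library's own tokenizer to confirm
# the prompt really is below n_ctx before blaming the generation step.
from llama_cpp import Llama

n_ctx = 2048
llm = Llama(model_path="./models/llama-13b-ggml-q5_1.bin", n_ctx=n_ctx)  # placeholder path

prompt = 'Please complete the following text:\n"In most reciprocating piston engines, ...'  # the prompt above
n_prompt = len(llm.tokenize(prompt.encode("utf-8")))
print(f"prompt tokens: {n_prompt} / n_ctx: {n_ctx} "
      f"-> room for {n_ctx - n_prompt} generated tokens")
```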
Again, I don't see the problem with |
Which OS are you running? I've noticed Priestru and I are using the same OS (Windows 11). BTW, just because you can't reproduce it doesn't mean the issue is invalid. |
I encountered this issue on Ubuntu 22.04 (GeForce 1080 ti, if that matters). |
Can you post your code, please? |
This happened with text-generation-webui. Sorry for not mentioning that. |
No problem. I'm sure any text-generation-webui developer reading this issue will jump in and fix it immediately. |
Sarcasm aside, text-generation-webui uses this library for text generation for llama based models, which is why OP opened this issue in the first place. |
I get short responses that are cut off when I use stream completions in server mode; is this related? |
Are you sure you're just not hitting the generation limit? That's usually the case when that happens to me. |
It could well be. Do you have a curl request to easily reproduce the problem? I spent an hour trying to reproduce the problem from the OP's limited description, but without the specifics of exactly how llama-cpp-python is being called the issue is likely not going to get identified and fixed. |
I get very long answers to the same query with curl when I don't stream. Can I even stream with curl, though? I'm streaming with the OpenAI Python API. |
Knowing that it is an issue with the streaming API helps, thanks. It explains why I couldn't reproduce it with the high level API example. |
When this happens, the finish_reason is "length", by the way. It happens with both streaming and non-streaming Python clients, so maybe it's just me. With curl the response is nice and finish_reason is "stop". |
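To make that comparison concrete, here is a sketch that checks finish_reason for both a non-streamed and a streamed request against the llama-cpp-python server. The URL, port and prompt are assumptions, and the streaming part assumes the usual OpenAI-style "data: ..." SSE framing.

```python
# Sketch: compare finish_reason between a plain and a streamed completion request.
import json
import requests

url = "http://localhost:8000/v1/completions"  # assumed default server address
payload = {"prompt": "Q: Name the planets in the solar system. A:", "max_tokens": 256}

# Non-streamed: the whole completion and its finish_reason arrive in one JSON body.
r = requests.post(url, json=payload)
print("non-streamed finish_reason:", r.json()["choices"][0]["finish_reason"])

# Streamed: chunks arrive as server-sent events; the finish_reason is only set
# on the final chunk before the stream ends.
with requests.post(url, json={**payload, "stream": True}, stream=True) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        reason = chunk["choices"][0].get("finish_reason")
        if reason is not None:
            print("streamed finish_reason:", reason)
```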
Earlier today I asked @abetlen to look into this more. Can I please confirm the versions of |
I used |
|
Using 0.1.59 as well. But I don't know how to check the version for Textgen. |
0.1.59 for llama-cpp-python, but the bug has been present in previous versions too. As for ooba, I can only say that I use the latest one. Also, in ooba there is another issue of a somewhat similar kind that seems likely to be dismissed as a llama-cpp-python problem, so I feel somewhat hesitant to create new issues at this point. |
I hit this again, and I disabled text streaming in text-generation-webui to understand what's happening.
|
Valuable info. Thx. |
In the issue you linked to, there is a stack trace that directly points to a |
I think this narrows it down; create_completion would throw an error when |
@Priestru this is related to #183 actually, but thanks for reporting; I'll try to implement a fix that works outside of the server too. The issue is that ooba is likely using a single Llama object in memory: when you click regenerate, the previous request is still running when a new one comes in, and this causes inconsistencies in the underlying library. The best workaround at the moment is what the llama-cpp-python server does, wrapping it in a lock, but this is not a good solution as it doesn't allow for easy generation interruption. |
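A sketch of that lock-based workaround, in case it helps anyone patching this locally; the model path and function shape are placeholders, not ooba's or this library's actual code.

```python
# Sketch: serialize access to a single shared Llama object so overlapping
# requests (e.g. clicking "regenerate" while a generation is still running)
# cannot interleave and corrupt its internal state.
import threading
from llama_cpp import Llama

_llm = Llama(model_path="./models/llama-13b-ggml-q5_1.bin", n_ctx=2048)  # placeholder
_llm_lock = threading.Lock()

def generate(prompt: str, max_tokens: int = 200) -> str:
    # Only one request may drive the model at a time; as noted above, the
    # downside is that a running generation cannot easily be interrupted.
    with _llm_lock:
        out = _llm(prompt, max_tokens=max_tokens)
        return out["choices"][0]["text"]
```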
@Priestru I've created an |
0-token generation for larger prompts has been fixed in the newest update; n_ctx is the default (2048).
It now only generates until it hits a total of 2048 tokens, the sum of the initial prompt + output. The previously discovered workaround saves the day once again because it allows it to generate normally. I set n_ctx to 2500. It results in:
|
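In other words, the generation budget now appears to be whatever is left of n_ctx after the prompt. A sketch of that arithmetic (placeholder model path and prompt), which also shows why raising n_ctx to 2500 looks like a fix:

```python
# Sketch: with prompt + output capped at n_ctx, the effective generation budget
# is n_ctx minus the prompt length.
from llama_cpp import Llama

n_ctx = 2048
llm = Llama(model_path="./models/llama-13b-ggml-q5_1.bin", n_ctx=n_ctx)  # placeholder

prompt = "..."  # a ~1900-token prompt, as in the repro above
n_prompt = len(llm.tokenize(prompt.encode("utf-8")))
budget = n_ctx - n_prompt
print(f"prompt: {n_prompt} tokens, room left for generation: {budget}")

# Requesting more than the budget gets cut off at n_ctx total tokens; raising
# n_ctx (e.g. to 2500) simply enlarges the budget for the same prompt.
out = llm(prompt, max_tokens=min(200, budget))
```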
1.62 or so doesn't fix it.
Should I make a new issue to add visibility? |
The llama 7B model is giving me very short responses; input and output are below. Endpoint: http://localhost:PORT/v1/chat/completions
Response body:
|
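One thing worth ruling out here (per the earlier comment about the generation limit) is the default max_tokens. A sketch of the same request with max_tokens set explicitly; the port and message content are placeholders.

```python
# Sketch: send a chat completion request with an explicit max_tokens and check
# the finish_reason to see whether the short replies are just the token limit.
import requests

port = 8000  # substitute the port your server is actually running on
url = f"http://localhost:{port}/v1/chat/completions"

payload = {
    "messages": [{"role": "user", "content": "Explain how a uniflow steam engine works."}],
    "max_tokens": 512,
}
r = requests.post(url, json=payload)
choice = r.json()["choices"][0]
print(choice["message"]["content"])
print("finish_reason:", choice["finish_reason"])
```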
[Update] The issue below is fixed, with a new bug emerging from the fix. See #331 (comment)