C code or compiler built-ins are preferable over inline assembler for
byte-swaps as it allows for better optimisations (e.g. instruction
scheduling) which would otherwise be impossible.
As with f64c2e710f for x86 and Arm,
this removes the inline assembler on GCC (and Clang) since we now
require recent enough compiler versions. This indeed seems to work on
AArch64, SuperH and, if Zbb is enabled, RISC-V. (AVR32 was not tested
since it has no known working compilers at this time.)
Signed-off-by: Paul B Mahol <onemda@gmail.com>
Since the C11 support is required, those GCC versions can no longer be
supported anyhow. (Clang pretends to be GCC 4.4, but it looks like the
code was intended for old GCC specifically.)
Signed-off-by: Paul B Mahol <onemda@gmail.com>
Since the C11 support is required, those GCC versions can no longer be
supported anyhow. (Clang pretends to be GCC 4.4, but the removed code
does not seem to have been intended for Clang.)
Signed-off-by: Paul B Mahol <onemda@gmail.com>
Otherwise nothing is written into the destination when a write mapping
is requested.
For example, a vulkan frame mapped from a drm frame (which is wrapped as
a vaapi frame in the example) is used as the output of scale_vulkan
filter, it always gets a green screen without this patch.
ffmpeg -init_hw_device vaapi=va -init_hw_device vulkan=vulkan@va
-filter_hw_device vulkan -f lavfi -i testsrc=size=352x288,format=nv12
-vf
"hwupload,scale_vulkan,hwmap=derive_device=vaapi:reverse=1,format=vaapi,hwdownload,format=nv12"
-f nut - | ffplay -
Signed-off-by: Haihao Xiang <haihao.xiang@intel.com>
Signed-off-by: Paul B Mahol <onemda@gmail.com>
Zbb static Zbb dynamic I baseline
popcount 1.336129286 3.469067758 20.146362909
popcountl 1.336322291 3.340292968 20.224829821
(seconds for 1 billion iterations on a SiFive-U74 core)
Signed-off-by: Paul B Mahol <onemda@gmail.com>
This adds runtime support to use Zbb REV8 for 32- and 64-bit byte-wise
swaps. The result is about five times slower than if targetting Zbb
statically, but still a lot faster than the default bespoke C code or a
call to GCC run-time functions.
For 16-bit swap, this is however unsurprisingly a lot worse, and so this
sticks to the baseline. In fact, even using REV8 statically does not
seem to be beneficial in that case.
Zbb static Zbb dynamic I baseline
bswap16: 0.668184765 3.340764069 0.668029012
bswap32: 0.668174014 3.340763319 9.353855435
bswap64: 0.668221765 3.340496313 14.698672283
(seconds for 1 billion iterations on a SiFive-U74 core)
Signed-off-by: Paul B Mahol <onemda@gmail.com>
Due to hysterical raisins, most RISC-V Linux distributions target a
RV64GC baseline excluding the Bit-manipulation ISA extensions, most
notably:
- Zba: address generation extension and
- Zbb: basic bit manipulation extension.
Most CPUs that would make sense to run FFmpeg on support Zba and Zbb
(including the current FATE runner), so it makes sense to optimise for
them. In fact a large chunk of existing assembler optimisations relies
on Zba and/or Zbb.
Since we cannot patch shared library code, the next best thing is to
carry a flag initialised at load-time and check it on need basis.
This results in 3 instructions overhead on isolated use, e.g.:
1: AUIPC rd, %pcrel_hi(ff_rv_zbb_supported)
LBU rd, %pcrel_lo(1b)(rd)
BEQZ rd, non_Zbb_fallback_code
// Zbb code here
The C compiler will typically load the flag ahead of time to reducing
latency, and can also keep it around if Zbb is used multiple times in a
single optimisation scope. For this to work, the flag symbol must be
hidden; otherwise the optimisation degrades with a GOT look-up to
support interposition:
1: AUIPC rd, GOT_OFFSET_HI
LD rd, GOT_OFFSET_LO(rd)
LBU rd, (rd)
BEQZ rd, non_Zbb_fallback_code
// Zbb code here
This patch adds code to provision the flag in libraries using bit
manipulation functions from libavutil: byte-swap, bit-weight and
counting leading or trailing zeroes.
Signed-off-by: Paul B Mahol <onemda@gmail.com>
It will fallback to mach_absolute_time inside libavutil/timer.h
Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
Signed-off-by: Paul B Mahol <onemda@gmail.com>
done by accident in 6a7c4d60a1498929c2a366f2ef4ccc35621a4358.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Paul B Mahol <onemda@gmail.com>
The function pointer is appended to the structure for backward binary
compatibility. Fortunately, this is allocated by libavutil, not by the
user, so increasing the structure size is safe.
Signed-off-by: Paul B Mahol <onemda@gmail.com>
A constant (-1) is added to the length value, so we can have an added
for free, and optimise the addition away if the addend is exactly 1.
Signed-off-by: Paul B Mahol <onemda@gmail.com>
This is test code after all so it should test things
Fixes: CID1518990 Unchecked return value
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Paul B Mahol <onemda@gmail.com>
Failure is possible due to strdup()
Fixes: CID1516764 Dereference null return value
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Paul B Mahol <onemda@gmail.com>
Fixes: CID1538296 Structurally dead code
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Paul B Mahol <onemda@gmail.com>
Fixes: The warnings from CID1598553 Uninitialized scalar variable
Passing partly initialized structs is ugly and asking for hard to rieproduce bugs,
The uninitialized fields where not used
Reviewed-by: "Xiang, Haihao" <haihao.xiang-at-intel.com@ffmpeg.org>
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Paul B Mahol <onemda@gmail.com>
Make it work with the source which has a dynamic frame pool.
Signed-off-by: Haihao Xiang <haihao.xiang@intel.com>
Signed-off-by: Paul B Mahol <onemda@gmail.com>
Make it work with the source which has a dynamic frame pool.
Signed-off-by: Haihao Xiang <haihao.xiang@intel.com>
Signed-off-by: Paul B Mahol <onemda@gmail.com>
When AVHWFramesContext.initial_pool_size is 0, a dynamic frame pool is
required. We may support this under certain conditions, e.g. oneVPL 2.9+
support dynamic frame allocation, we needn't provide a fixed frame pool
in the mfxFrameAllocator.Alloc callback.
Signed-off-by: Haihao Xiang <haihao.xiang@intel.com>
Signed-off-by: Paul B Mahol <onemda@gmail.com>
Add AVQSVFramesContext.info and update the description.
Signed-off-by: Haihao Xiang <haihao.xiang@intel.com>
Signed-off-by: Paul B Mahol <onemda@gmail.com>
vtype_vli computes the VTYPE value with the optimal LMUL for a given
element width, tail and mask policies and a run-time vector length.
vtype_ivli does the same, but with the compile-time constant vector
length.
vwtypei and vntypei can be used to widen or narrow a VTYPE value for
use in mixed-width vector-optimised functions.
Signed-off-by: Paul B Mahol <onemda@gmail.com>
Use the machdep.altivec sysctl on NetBSD for AltiVec detection
as is done with OpenBSD.
Signed-off-by: Brad Smith <brad@comstyle.com>
Signed-off-by: Paul B Mahol <onemda@gmail.com>
The B Bit manipulation extension was not defined to this day, and
probably never will. Instead it was broken down into Zba, Zbb, Zbc and
Zbs with no particular blessed set to make up B.
This removes the bogus field test. Linux never set this bit, nor
(AFAICT) did FreeBSD or any other OS. We can always add it back in the
unlikely event that it gets taken into use.
Signed-off-by: Paul B Mahol <onemda@gmail.com>
Not all C run-times support this, and even then, it will be a while
before distributions provide recent enough versions thereof.
Since this is a trivial system call wrapper, we might just as well call
the corresponding kernel system call directly where the C run-time lacks
support but the kernel headers are new enough (as is the case on Debian
Unstable at the time of writing). In doing so, we need to add a few more
guards as the first suitable kernel (headers) release did not expose the
V, Zba and Zbb extensions.
Signed-off-by: Paul B Mahol <onemda@gmail.com>
This inline function checks that the vector length is at least a given
value. With this, most run-time VLEN checks can be optimised away.
Signed-off-by: Paul B Mahol <onemda@gmail.com>