There are multiple use cases for pose recognition: sports analysis apps, AR, gesture control. In any of them you are likely to need a pose recognition model. We certainly did at bform.
Disclaimer: We haven't run extensive benchmarks with a scientific approach. The opinions expressed are based on testing the models exclusively for our use-case.
tl;dr
We tried PoseNet on TensorFlow Lite, OpenPose, Google's MLKit and Apple's native Vision API (available from iOS 14) to perform pose recognition on-device on iOS. For our use case Apple's Vision API came out as the most accurate and performant of the bunch.
Our requirements
We process a video of a runner, filmed from the side, and are especially interested in accuracy and performance. We also wanted to perform the analysis on-device in our iOS app, i.e. without relying on a server.
This is how our requirements are ordered:
Accuracy - we must provide good results. Otherwise the app would not be useful.
Performance - we analyse a few-second-long video while the user is waiting. This typically means analysing ~200 poses in a matter of seconds.
Size - it's embedded in the app, so size is also an important criterion.
All tests referenced below are performed on an iPhone XS, using the latest versions of the models at the time of writing.
PoseNet on TensorFlow Lite
PoseNet seems to be one of the first models one comes across while searching for pose recognition. We ran it with TensorflowLiteSwift (a short inference sketch follows the ratings below):
- Accuracy - 5/10 - for videos filmed from the side. It seems to do well when the camera is facing the subject, but on side views it often doesn't catch the legs or puts both legs in the same position.
- Performance - 5/10 - it takes a couple of seconds per frame, meaning users could need to wait minutes for a longer video.
- Size - 9/10 - The starter model is only 12MB in size, which is great for embedding in an app.
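For reference, this is roughly what an inference pass looks like with TensorflowLiteSwift. It's a simplified sketch, not our exact code: the 9x9x17 heatmap layout matches the standard single-pose PoseNet MobileNet model, input preprocessing and the offset refinement step are omitted, and the model path is an assumption.

```swift
import TensorFlowLite  // from the TensorFlowLiteSwift pod

final class PoseNetRunner {
    private let interpreter: Interpreter

    init(modelPath: String) throws {
        interpreter = try Interpreter(modelPath: modelPath)
        try interpreter.allocateTensors()
    }

    /// Returns the best (row, col, score) per keypoint from the 9x9x17 heatmap output.
    /// `rgbData` is expected to be the model's float32 RGB input (257x257 for PoseNet).
    func estimatePose(rgbData: Data) throws -> [(row: Int, col: Int, score: Float)] {
        try interpreter.copy(rgbData, toInputAt: 0)
        try interpreter.invoke()

        let heatmaps = try interpreter.output(at: 0).data
            .withUnsafeBytes { Array($0.bindMemory(to: Float32.self)) }

        let gridSize = 9, keypointCount = 17
        var keypoints: [(row: Int, col: Int, score: Float)] = []
        for k in 0..<keypointCount {
            var best: (row: Int, col: Int, score: Float) = (0, 0, -.infinity)
            for row in 0..<gridSize {
                for col in 0..<gridSize {
                    // Output layout is [1, row, col, keypoint].
                    let score = heatmaps[(row * gridSize + col) * keypointCount + k]
                    if score > best.score { best = (row, col, score) }
                }
            }
            keypoints.append(best)
        }
        return keypoints
    }
}
```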
OpenPose with BODY_25
OpenPose doesn't offer a native iOS implementation. We ran ours by leveraging the work in SwiftOpenPose and modifying it to support the BODY_25 variant of the model.
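For context, once the BODY_25 weights are converted to Core ML, the inference itself can be driven through Vision. This is only a sketch: the OpenPoseBody25 class name is hypothetical (it stands for whatever class Xcode generates for the converted model), and the decoding of heatmaps and part-affinity fields into 25 points per person is done by SwiftOpenPose's estimator, which is omitted here.

```swift
import CoreML
import Vision

// `cgImage` is assumed to be a frame extracted from the video.
func runOpenPose(on cgImage: CGImage) throws {
    // "OpenPoseBody25" is a hypothetical name for the Xcode-generated model class.
    let coreMLModel = try OpenPoseBody25(configuration: MLModelConfiguration()).model
    let model = try VNCoreMLModel(for: coreMLModel)

    let request = VNCoreMLRequest(model: model) { request, _ in
        guard let result = request.results?.first as? VNCoreMLFeatureValueObservation,
              let output = result.featureValue.multiArrayValue else { return }
        // `output` holds the concatenated heatmaps and part-affinity fields;
        // SwiftOpenPose's estimator turns these into 25 body points per person.
        print("Raw output shape:", output.shape)
    }
    request.imageCropAndScaleOption = .scaleFill

    try VNImageRequestHandler(cgImage: cgImage).perform([request])
}
```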
- Accuracy - 7.5/10 - it is more accurate than PoseNet, though it would also often switch the left and right legs of the subject. A large bonus for it is that it recognises more points than most models (25 vs 17). The feet are especially important for us.
- Performance - 6/10 - slightly better than PoseNet, but it would still require about 2-3s per second of video (at 24FPS).
- Size - 3/10 - The model we created after conversions ended up at 100MB - a substantial increase in size for the app.
Note: OpenPose requires purchasing a license if it's used for commercial purposes.
Google MLKit
Google's MLKit is tailored for use by mobile developers and as such is exceptionally easy to use. We used the PoseDetectionAccurate variant.
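Setting it up only takes a few lines. The sketch below follows ML Kit's documented iOS pose-detection API; the image parameter, the reuse of a single detector and the landmark we read out are just illustrative choices.

```swift
import MLKitVision
import MLKitPoseDetectionAccurate

// Created once and reused across frames.
let options = AccuratePoseDetectorOptions()
let poseDetector: PoseDetector = {
    options.detectorMode = .singleImage
    return PoseDetector.poseDetector(options: options)
}()

// `image` is assumed to be the UIImage of the current video frame.
func detectPose(in image: UIImage) {
    let visionImage = VisionImage(image: image)
    visionImage.orientation = image.imageOrientation

    poseDetector.process(visionImage) { poses, error in
        guard error == nil, let poses = poses, !poses.isEmpty else { return }
        for pose in poses {
            // 33 landmarks are available; here we just read the left ankle.
            let leftAnkle = pose.landmark(ofType: .leftAnkle)
            print("Left ankle:", leftAnkle.position, "likelihood:", leftAnkle.inFrameLikelihood)
        }
    }
}
```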
- Accuracy - 6/10 - it sits somewhere between PoseNet and OpenPose in this regard. It includes the most points of all - 33 - but for videos filmed from the side it falls short in accuracy.
- Performance - 6/10 - about the same as OpenPose.
- Size - 8/10 - The model is about 20MB in size which isn't a substantial increase.
Apple's Vision API
With iOS 14 Apple introduced native pose recognition in its Vision API.
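Here is a minimal sketch of a single-frame request (iOS 14+); the confidence threshold and the joint we read out are just examples.

```swift
import Vision

// `cgImage` is assumed to be a frame extracted from the video.
func detectBodyPose(in cgImage: CGImage) throws {
    let request = VNDetectHumanBodyPoseRequest()
    let handler = VNImageRequestHandler(cgImage: cgImage, orientation: .up)
    try handler.perform([request])

    guard let observation = request.results?.first as? VNHumanBodyPoseObservation else { return }

    // 19 joints, in normalized coordinates with the origin in the lower-left corner.
    let points = try observation.recognizedPoints(.all)
    if let ankle = points[.leftAnkle], ankle.confidence > 0.3 {
        print("Left ankle:", ankle.location)
    }
}
```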
- Accuracy - 8.5/10 - it rarely misses key points, regardless of the angle the video is filmed from. It even estimates the runner's arm on the far side from the camera fairly accurately. The only downside compared to the rest is that it includes only 19 points (no feet).
- Performance - 10/10 - it's definitely the fastest one we tried. We reckon it could be used to estimate poses on a live video accurately as well.
- Size - 10/10 - it's a native API, so you shouldn't expect a size increase.
Conclusion
We ended up relying on Apple's Vision API for on-device processing because of its accuracy and performance. Stay tuned for details on our server-side implementation.
Please get in touch if you think we missed a key thing in our observations or if we should try something else.