
Placeholder — replace with: professional security researcher at screens, code/HTTP traffic visible, dark office setting | Alt text as above | Recommended size: 1200×500px
Most web application penetration testing services are scoped, priced, and delivered as if the application being tested doesn't matter. You get a standard quote based on the number of endpoints. A team you've never spoken to logs into a test environment. Two weeks later, a report arrives. It has CVSS scores, a severity heat map, and thirty findings that look thorough - until your developers start trying to reproduce them. Some can't be. The business logic vulnerability that should have been the centerpiece of the engagement isn't in the report at all - because the tester never understood your application well enough to look through it manually.
This is not an edge case. It is the default state of the market. The gap between a test that checks compliance boxes and a test that actually reduces your risk is wide, and the things that create that gap are not visible in a proposal. They show up in the questions a provider asks before the work starts, in what their methodology does when it encounters something unexpected, and in what their report requires a developer to do after the work ends.
This post is written for security leaders who are at the comparison stage - evaluating providers, preparing questions for vendor calls, or trying to figure out why the last test felt unsatisfying. It covers what real web application penetration testing looks like in practice, what a report should contain, and what questions separate providers who have done this at serious depth from those who have not.
What Gets Sold vs. What Gets Delivered

Placeholder — replace with: split-screen visual contrasting generic scanner findings vs manual tester findings with context | Alt text as above | Size: 1000×420px | Search: "vulnerability scanner vs manual pentest comparison"
There is a well-documented pattern in the penetration testing market: an automated vulnerability scanner runs against the target, the output is cleaned up and reformatted, and the result is delivered as a penetration test report. Buyers generally can't tell this has happened until they try to act on the findings.
A scanner-only report has specific characteristics. Every finding traces back to a CVE with a known signature. The reproduction steps say something like "tool X flagged this endpoint for Y." The remediation guidance is a copy of the NVD advisory: "upgrade to version Z or apply vendor patch." There are no findings that require understanding what the application does - what data it processes, who its different user roles are, or what happens when you manipulate a sequence of requests that individually look valid but together expose something the application's author never intended. Those findings don't appear in scanner output because scanners cannot reason about application logic. They can only pattern-match against known signatures.
Manual testing produces a different kind of output entirely. A finding that comes from manual work has a specific request and a specific response. It explains what the tester did to trigger the condition, not what tool reported it. The CVSS score is calculated with context: if a vulnerability requires an authenticated low-privileged account to exploit, that changes the score relative to an unauthenticated one, and a manual tester will account for this. The remediation guidance tells a developer what to change in the code - not to "implement input validation" (every scanner says this) but to validate the specific parameter on the specific endpoint using a specific approach.
A false positive is the clearest sign of a scanner-only test. If a report flags a TLS configuration issue on an endpoint that your WAF terminates before it reaches the application, that's a finding that any tester who understood your architecture would have excluded. Its presence tells you either that no one verified the output manually, or that no one understood the environment well enough to verify it. Either way, that finding is wasted space - and wasted developer time when someone has to triage it.
What Manual Web Application Penetration Testing Covers

Placeholder — replace with: diagram showing two user accounts, unauthorized object access between them (IDOR/BOLA concept) | Alt text as above | Size: 1000×400px | Search: "access control security testing diagram" or "IDOR vulnerability concept"
Access control is where manual testing earns its cost. Most applications implement authorization at the endpoint level - they check whether a logged-in user can reach a resource. What they often fail to do is check whether User A can reach User B's resources if User A already knows (or can guess) the right identifier. Horizontal privilege escalation of this kind - CWE-639, Authorization Bypass Through User-Controlled Key - is invisible to scanners because scanners don't have accounts with meaningfully different privilege levels, and they don't hold state across requests the way a human tester does. A manual tester creates two accounts, maps what account A can access, then tests whether account B can access the same objects by substituting identifiers. They do this across every resource type in the application, not just the obvious ones.
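The two-account substitution test described above can be sketched in a few lines. Everything here is a hypothetical stand-in - the `fetch` callable, the object IDs, and the stub backend all represent what a tester would wire up against a real application:

```python
# Minimal sketch of a horizontal privilege escalation (IDOR / CWE-639) check.
# fetch(session, object_id) and the object IDs are hypothetical; in a real
# engagement they come from crawling the application with two accounts.

def find_idor(fetch, session_a, session_b, ids_owned_by_a):
    """Return the IDs that account B can read even though A owns them."""
    leaked = []
    for object_id in ids_owned_by_a:
        # Confirm the object really is reachable by its owner first,
        # so a 403 for B is meaningful.
        if fetch(session_a, object_id) != 200:
            continue
        if fetch(session_b, object_id) == 200:  # B should get 403/404 here
            leaked.append(object_id)
    return leaked

# Stub backend standing in for the target: documents 101 and 102 belong to
# account A, and the (deliberately broken) authorization check forgets to
# verify ownership for document 102.
OWNERS = {101: "A", 102: "A"}
OWNERSHIP_ENFORCED = {101: True, 102: False}

def stub_fetch(session, object_id):
    if object_id not in OWNERS:
        return 404
    if OWNERSHIP_ENFORCED[object_id] and OWNERS[object_id] != session:
        return 403
    return 200

print(find_idor(stub_fetch, "A", "B", [101, 102]))  # -> [102]
```

The point of the owner-side check in the loop is to avoid false positives: a 403 for account B only means something if account A can actually reach the object.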
Vertical escalation is a different problem but equally common. It's not just about whether a regular user can reach an admin endpoint - it's about whether the authorization check happens consistently. Some applications check roles at the UI layer but not at the API layer. Others check on read operations but skip it on write. A tester who doesn't map the complete authorization matrix - every operation, every role, every data object - will miss the gaps that matter.
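The authorization-matrix mapping described above reduces to an expected-vs-observed diff. The roles, operations, and observed results below are illustrative stand-ins, not output from a real engagement:

```python
# Sketch of an authorization-matrix diff: expected permissions per role vs
# what the application actually returned during testing. All values are
# hypothetical examples.

EXPECTED = {
    ("user",  "report:read"):  True,
    ("user",  "report:write"): False,
    ("admin", "report:read"):  True,
    ("admin", "report:write"): True,
}

# Observed behaviour (True = the request succeeded). Here the API layer
# skipped the role check on writes: the classic "checked in the UI,
# not at the API" gap.
OBSERVED = {
    ("user",  "report:read"):  True,
    ("user",  "report:write"): True,
    ("admin", "report:read"):  True,
    ("admin", "report:write"): True,
}

gaps = [key for key, allowed in EXPECTED.items()
        if not allowed and OBSERVED[key]]
print(gaps)  # -> [('user', 'report:write')]
```

A tester who builds this matrix for every operation, role, and data object makes the inconsistent checks visible; a tester who only pokes at the obvious admin endpoints does not.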
Business logic vulnerabilities require understanding the application before you can find them. Consider a loan application platform where a user can upload a document, have it reviewed, and then move to the next step in the approval workflow. A tester who understands the process can ask: what happens if you submit step three before step two is complete? What happens if you re-submit a document after the review is already approved? What happens if you modify the application's internal state identifier between steps? None of these attack vectors have CVE signatures. They require a tester to understand what the application is supposed to do, then probe the boundaries where that logic can be manipulated.
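The out-of-order submission probe described above can be modeled as a check against the intended step sequence. The workflow names and the enforcement stub are invented for illustration, assuming a simplified loan platform:

```python
# Sketch of a workflow-sequencing probe: attempt each step without completing
# its predecessors and record which out-of-order transitions the server
# accepts. server_accepts() is a stand-in for real HTTP responses.

INTENDED_ORDER = ["upload_document", "review", "approve", "disburse"]

def server_accepts(step, completed):
    """Stub target. Flaw: 'approve' only checks that *some* prior step
    happened, not that 'review' specifically did."""
    idx = INTENDED_ORDER.index(step)
    if step == "approve":
        return len(completed) >= 1                # broken check
    return completed == INTENDED_ORDER[:idx]      # correct check

# Probe: attempt the later steps with only the first step completed.
findings = [step for step in INTENDED_ORDER[2:]
            if server_accepts(step, ["upload_document"])]
print(findings)  # -> ['approve']
```

No scanner generates this probe, because generating it requires knowing what the intended order is in the first place.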
Authentication testing goes well beyond rate limits on login endpoints. It includes session token entropy and predictability, the behavior of tokens after logout, concurrent session handling, password reset flow logic (a historically rich source of takeover vulnerabilities), OAuth implementation correctness, and the interaction between authentication state and application-level caching. A tester who asks only "is there rate limiting on the login page?" has tested maybe 10% of the authentication surface. The rest requires understanding how the application manages identity across its full lifecycle.
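One of the checks in that list - token entropy - can be screened with a rough statistical estimate. Shannon entropy over a token sample is a coarse filter, not a proof: a token with predictable structure (say, an encoded user ID plus a timestamp) can still pass it, which is why the manual inspection described above remains necessary:

```python
# Rough sketch of a session-token entropy screen. A low score flags a token
# for manual review; a high score does NOT prove unpredictability.
import math
from collections import Counter

def shannon_entropy_bits(token: str) -> float:
    """Estimated bits of entropy: per-character entropy times length."""
    counts = Counter(token)
    n = len(token)
    per_char = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return per_char * n

weak   = "session00000001"                    # counter-style token
strong = "f3a9c1d47b2e8065a1b2c3d4e5f60718"   # random-looking hex
print(shannon_entropy_bits(weak) < shannon_entropy_bits(strong))  # -> True
```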
DOM-based XSS is a practical example of why scanner coverage is partial even for well-understood vulnerability classes. Reflected and stored XSS can be found by scanners with reasonable reliability because the injection point and the reflection point are both in the server's response. DOM-based XSS is different - the injection happens in the browser, driven by client-side JavaScript that reads from a source (like location.hash or document.URL) and writes to a sink (like innerHTML or eval). A scanner that doesn't execute JavaScript in a real browser context will miss these entirely. Manual testers read the application's JavaScript, identify sources and sinks, and trace the data flow. This takes time and judgment, which is exactly why it finds vulnerabilities that automated tools cannot.
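The source-to-sink tracing described above usually starts with a simple triage pass over the client-side code. A sketch of that first pass, using a non-exhaustive list of common sources and sinks - a hit means "read this code path by hand", not "vulnerability confirmed":

```python
# Minimal DOM XSS triage: grep client-side JavaScript for common sources and
# sinks. Real data-flow tracing needs a browser and a human; this only
# shortlists files worth reading.
import re

SOURCES = [r"location\.hash", r"location\.search", r"document\.URL",
           r"document\.referrer", r"window\.name"]
SINKS   = [r"\.innerHTML\s*=", r"document\.write\s*\(", r"\beval\s*\(",
           r"\.insertAdjacentHTML\s*\("]

def dom_xss_candidates(js: str):
    sources = [p for p in SOURCES if re.search(p, js)]
    sinks   = [p for p in SINKS if re.search(p, js)]
    return sources, sinks

# Illustrative vulnerable pattern: attacker-controlled fragment -> innerHTML.
sample_js = """
var q = location.hash.slice(1);
document.getElementById('out').innerHTML = decodeURIComponent(q);
"""
srcs, snks = dom_xss_candidates(sample_js)
print(bool(srcs and snks))  # -> True: a source and a sink in the same file
```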
Insecure deserialization is another class of vulnerability that scanners rarely surface correctly. As our technical research on Java deserialization exploitation demonstrates, finding and chaining gadgets requires understanding the application's dependencies at a level that no automated tool approaches. The same principle applies to template injection, SSRF with internal service enumeration, and second-order injection - vulnerabilities where the payload is stored and executed in a different context than where it was injected.
Why the Scoping Call Determines Everything That Follows

Placeholder — replace with: small team at whiteboard mapping application architecture, trust model visible, professional security setting | Alt text as above | Size: 1000×400px | Search: "threat modeling whiteboard security team" or "pentest scoping session"
A threat modeling session before testing starts is not a formality. It is the point where a tester either earns the trust you're about to extend or reveals that they intend to run a standard playbook regardless of what they learn about your application.
The questions a provider asks before scoping a web application test tell you almost everything. Are they asking what data the application processes and who has access to it? Are they asking which user roles exist and what each is permitted to do? Are they asking whether there are integration points with other systems - APIs, payment processors, identity providers - and whether those need to be in scope? Are they asking what a successful attack would look like from your perspective? Are they asking whether you want black-box, grey-box, or white-box testing, and explaining what each changes about the coverage they can achieve?
If the pre-engagement conversation is primarily about IP ranges and whether you have a test environment, that is not threat modeling. That is scheduling. A provider who understands web application security will want to understand the application's purpose, its trust model, and the business context of its most sensitive operations before their first request touches the target.
Scope definition is also where the engagement either succeeds or fails for your developers. If the scope is defined as "all endpoints on domain X" without any mapping of user roles, authenticated flows, and critical business functions, the resulting report will be a list of technical issues without any indication of which ones represent real risk to real business operations. The best reports come from engagements where the tester understood what they were trying to break, not just what they were authorized to touch.
What a Web Application Pentest Report Must Contain

Placeholder — replace with: side-by-side pentest report comparison: vague scanner output vs detailed manual finding with specific steps, CVSS, CWE | Alt text as above | Size: 1000×400px | Search: "penetration test report quality" or "security audit report documentation"
A well-structured finding contains five things: a clear description of the vulnerability, the exact steps to reproduce it (request, parameters, expected behavior, observed behavior), the CVSS 3.1 base score with manual context applied, the CWE classification, and the OWASP ASVS control reference for the test level scoped. If any of these are missing, the finding is incomplete for the purposes of actual remediation.
CVSS scores without manual context are actively misleading. An automated scanner will assign a CVSS 3.1 score based on the theoretical worst case for a vulnerability type. A tester working in your specific environment will adjust that score based on what they actually observed: does exploitation require authentication? Does it require user interaction? Does it affect other systems beyond the one being tested? A finding scored CRITICAL by a scanner may be HIGH in your environment because exploitation requires a logged-in administrator. Or it may remain CRITICAL because you have 10,000 end users with sufficient privilege. The difference matters for prioritization, and the difference is only visible when a human tester applies judgment.
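The PR:N-vs-PR:L adjustment above can be made concrete with the CVSS 3.1 base-score formula itself. This is a minimal sketch covering only the scope-unchanged case (the full specification also defines a changed-scope branch):

```python
# Scope-unchanged CVSS 3.1 base score, per the formulas in the specification.
import math

def roundup(x: float) -> float:
    """CVSS 3.1 Appendix A rounding: ceiling to one decimal, float-safe."""
    i = round(x * 100000)
    return i / 100000 if i % 10000 == 0 else (math.floor(i / 10000) + 1) / 10

def base_score_unchanged(av, ac, pr, ui, c, i, a):
    iss = 1 - (1 - c) * (1 - i) * (1 - a)
    impact = 6.42 * iss
    exploitability = 8.22 * av * ac * pr * ui
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))

# Metric weights from the spec.
H, AV_N, AC_L, UI_N = 0.56, 0.85, 0.77, 0.85
PR_N, PR_L = 0.85, 0.62  # scope-unchanged values

# AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H - the scanner's theoretical worst case.
print(base_score_unchanged(AV_N, AC_L, PR_N, UI_N, H, H, H))  # -> 9.8
# The same flaw when exploitation needs a low-privileged account.
print(base_score_unchanged(AV_N, AC_L, PR_L, UI_N, H, H, H))  # -> 8.8
```

One metric changed by a tester who verified the authentication requirement moves the finding from CRITICAL (9.8) to HIGH (8.8) - exactly the kind of adjustment a scanner never makes.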
CWE classification does more than satisfy a reporting template. It gives your developers a searchable reference to the class of weakness they need to fix, and it gives your security team a way to track whether the same type of issue recurs across engagements. A finding labeled CWE-639 tells a developer to look at their object-level authorization, not just the specific endpoint that was exploited.
An OWASP ASVS coverage table in the report tells you exactly what was tested and at what depth. In ASVS 5.0 - the current standard - Level 1 is a surface-level check using automated and manual testing sufficient for low-risk applications. Level 2 is what most commercial applications processing sensitive data should target: it covers authentication thoroughly, access control at the function and object level, cryptographic implementation, session management, and input validation in depth. Level 3 is high-assurance testing for applications where compromise has severe consequences - it goes into architectural review and static analysis alongside penetration testing. When a report includes the ASVS controls table mapped to the test level you scoped, you can see which controls were verified, which were outside scope, and which produced findings. That table is evidence of what was tested. Without it, you have a report with no way to verify coverage.
Remediation guidance should be specific enough that a developer can act on it without additional research. "Implement input validation" is not remediation guidance - it is a description of a category of fix. "Validate the document_id parameter server-side against the authenticated user's permitted document list before executing the database query on line 247 of DocumentController.php" is remediation guidance. The difference between these two is the difference between a useful report and a report that generates another meeting.
Questions to Ask a Web Application Penetration Testing Provider Before You Sign
These questions are designed to produce answers that reveal real things about how a provider works. For each one, a good answer and a bad answer are described.
- Walk me through exactly what happens during your pre-engagement phase. What information do you need from us before testing starts, and what do you do with it?
- Good answer: describes threat modeling, a conversation about user roles, critical business functions, integration points, and preferred test methodology. It involves the tester asking you questions, not just collecting login credentials and an IP range.
- Bad answer: is primarily logistical - they ask for a scope document, test environment access, and a point of contact for blockers. If their pre-engagement process ends there, the test will not be tailored to your application.
- Describe a business logic vulnerability you found in a recent web application engagement. How did you find it and what did it allow?
- Good answer: is specific and concrete. The tester should be able to describe the application's intended workflow, the step they manipulated, and what the impact was.
- Bad answer: is generic: "we test for business logic vulnerabilities as part of our methodology." If they cannot describe a real finding, they have not been finding these vulnerabilities in practice.
- How do you handle false positives? What does your QA process look like before report delivery?
- Good answer: describes an internal review process where findings are verified before the report leaves the team. The tester should be able to say that every finding in the report has been manually confirmed.
- Bad answer: treats false positives as the client's problem: "we flag everything and let you decide what's relevant." This is a sign that the report is largely unverified scanner output.
- What certifications do the testers actually assigned to this engagement hold? Will the same person who runs the test write the report?
- Good answer: names the certifications that are relevant to web application testing (OSCP, OSWE, eWPTX, BSCP at minimum) and confirms that the tester who runs the engagement is accountable for the report. Certifications are not a guarantee of skill, but OSWE and eWPTX specifically require demonstrating manual exploitation of web vulnerabilities, which is a meaningful filter.
- Bad answer: references company-level certifications like ISO 27001 without naming individual tester credentials, or describes a process where senior staff scope the work and junior staff execute it without any senior review.
- Does your report include an OWASP ASVS coverage table? What ASVS level do you recommend for an application like ours, and why?
- Good answer: explains the ASVS levels in terms of your application's risk profile and includes the coverage table in every delivery. The tester should have an opinion about which level fits your situation, and that opinion should be grounded in what the levels require.
- Bad answer: says they "follow OWASP guidelines" without being able to describe what that means in terms of specific controls tested. OWASP ASVS as a name to drop and OWASP ASVS as a structured test framework are very different things.
- Do you use freelancers, or is the entire team in-house? What does your bench look like for this type of engagement?
- Good answer: describes a stable, employed team where continuity of personnel is controlled. Freelancers introduce knowledge gaps and accountability problems - if the provider used a contractor for part of an engagement and no one on staff can answer questions about a finding three months later, that's a real operational problem.
- Bad answer: either doesn't address this directly or confirms freelance use without any explanation of how quality is controlled across different contractors.
Why AFINE
AFINE has been doing offensive security since 2015. The team is fully in-house - no freelancers - with every consultant holding at minimum an OSCP or eWPTX, and the team collectively holding OSCE, OSWE, OSED, OSEP, CRTO, BSCP, and CISSP among others. Every report goes through an internal QA process before delivery. Every web application report includes an OWASP ASVS coverage table for the scoped test level, CVSS 3.1 scores with manual context applied, and CWE classification per finding.
The team has disclosed over 150 CVEs in software from vendors and products including SAP, Microsoft, IBM, F5, Check Point, CyberArk, Rapid7, BMC, and OpenShift - work that requires finding vulnerabilities in hardened enterprise products that have already survived internal security review.
Frequently Asked Questions
The questions below reflect what security leaders and procurement teams most commonly ask when evaluating web application penetration testing services.
What is the difference between a web application penetration test and a vulnerability scan?
A vulnerability scan is an automated process that pattern-matches against a database of known CVEs and configuration issues. It runs fast, costs less, and produces output that requires significant interpretation to be useful - including a meaningful rate of false positives. A web application penetration test is a manual, adversarial assessment where a tester actively tries to compromise the application using the same techniques an attacker would. The key difference is that a penetration test can find business logic vulnerabilities, broken access control, and authentication flaws that have no CVE signature and will never appear in scanner output.
How much do web application penetration testing services cost?
For a typical commercial web application, a professional manual penetration test runs between EUR 8,000 and EUR 50,000+ depending on scope, complexity, number of user roles, and the depth of testing required (ASVS Level 1 vs. Level 2 vs. Level 3). Unusually low pricing is a reliable signal that the test is primarily automated - the most expensive component of any real penetration test is the time of experienced testers doing manual work. Framework agreements for organizations running multiple tests per year come with different economics and are worth exploring if recurring testing is part of your security program.
How long does a web application penetration test take from scoping to report delivery?
Plan for four to six weeks total. The pre-engagement phase - scoping call, threat modeling, contract execution - typically takes two to three weeks depending on your organization's procurement process. Active testing runs one to two weeks for most commercial applications, sometimes longer for complex multi-role applications or when API surfaces are extensive. Report writing and QA adds another week. Timelines can be compressed for urgent needs, but doing so can reduce the depth of coverage.
What is OWASP ASVS and which level does my application need?
The OWASP Application Security Verification Standard (ASVS) is a framework of security requirements and controls for web applications. Version 5.0 defines three levels. Level 1 is a surface check suitable for low-risk applications. Level 2 covers authentication, access control, session management, cryptography, and input validation in depth - this is the right target for any commercial application processing sensitive user data, financial transactions, or health records. Level 3 adds architectural review and is appropriate for applications where a breach has severe regulatory or operational consequences. The ASVS level determines what gets tested and provides the coverage table that appears in the report.
How often should web application penetration testing be performed?
At minimum, annually - and after any significant architectural change, new authentication system, new payment integration, or major feature addition. Organizations under PCI DSS are required to test after significant changes and at least annually. Regulated financial institutions and healthcare organizations typically run tests twice per year or more, with framework agreements that cover the application portfolio across the calendar year. The argument for more frequent testing is simple: the gap between when a vulnerability is introduced and when it's found is a window of exposure. Quarterly testing with a good provider closes that window faster than an annual engagement does.
What certifications should a web application penetration tester hold?
The certifications with the most signal for web application testing are OSWE (Offensive Security Web Expert), eWPTX (eLearnSecurity Web Application Penetration Tester eXtreme), and BSCP (Burp Suite Certified Practitioner). These require demonstrated manual exploitation of web vulnerabilities - they're not multiple-choice exams. OSCP is a foundational credential that demonstrates general offensive capability. Treat ISO 27001 or CEH as company-level credentials, not individual tester credentials.
Can automated tools replace manual web application penetration testing?
No, and any provider claiming otherwise is selling something other than penetration testing. Automated scanners are good at finding known vulnerability signatures - outdated libraries, certain injection patterns, common misconfigurations. They cannot reason about application logic, cannot hold state across a multi-step authentication flow the way a human tester does, and cannot ask "what is this application supposed to do and what happens if I break that assumption?" DOM-based XSS, business logic flaws, broken object-level authorization (BOLA/IDOR), and chained vulnerabilities that require multiple steps to exploit are consistently missed by automated tools. This is not a deficiency that better tooling will fix - it's a structural limitation of pattern-matching against known signatures versus adversarial reasoning about a specific application.
Talk to the Team
If you're evaluating providers and want a direct conversation about your specific application and what a serious test of it would look like, book a consultation directly. You can also review AFINE's full CVE research record and learn more about our team and approach before reaching out.



