Testing and Verification --- Quality Assurance in a Domain with Multiple Right Answers
When "Correct" Has More Than One Definition
In most software, testing is straightforward. Check that 1 + 1 equals 2. There is one right answer --- it is either correct or it is not. Testing a Saju app is fundamentally different.
For the same birth date and time, whether you apply True Solar Time correction, how you handle the Late Night Hour (Yajasi), and which school's rules you follow can each change what counts as the "right answer." When professional App A and professional App B produce different results, neither is necessarily wrong --- they may simply be applying different settings. In this situation, how do you prove that "our app is accurate"? That question was the starting point for the testing strategy.
Cross-Verification Against Professional Apps
The first step was comparing our results against existing professional apps. I revisited the apps benchmarked in Part 1 --- including popular Manselyeok tools and well-known Saju platforms --- and entered identical birth data across all of them, then cross-referenced with our app's output.
Verification test cases were organized into three types. First, "safe" cases far from any boundary. Dates and times comfortably distant from Solar Term boundaries, where any combination of settings should yield the same result. These verify basic Manselyeok accuracy. Second, Solar Term boundary cases. Births on the exact day of the Start of Spring, Awakening of Insects, and similar boundary dates. These test Solar Term time precision. Third, correction-sensitive cases. Births during the Late Night Hour window, DST periods, or near Hour Pillar boundaries. These are cases where results may vary depending on settings.
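The three case types above can be captured as a small typed catalog. This is a hypothetical sketch: the names (`VerificationCase`, `caseType`) and the sample birth data are illustrative, not the app's actual code or verified test data.

```typescript
// Hypothetical shape for the three verification case types described above.
type CaseType = "safe" | "solar-term-boundary" | "correction-sensitive";

interface VerificationCase {
  caseType: CaseType;
  birth: string;   // local birth date-time, ISO-style
  place: string;   // city used for True Solar Time correction
  note: string;    // why this case was chosen
}

const cases: VerificationCase[] = [
  { caseType: "safe",
    birth: "1990-06-15T14:00", place: "Seoul",
    note: "far from any Solar Term or Hour Pillar boundary" },
  { caseType: "solar-term-boundary",
    birth: "1990-02-04T10:00", place: "Seoul",  // illustrative boundary-day date
    note: "around the Start of Spring; tests minute-level term timing" },
  { caseType: "correction-sensitive",
    birth: "1987-07-15T23:30", place: "Seoul",
    note: "DST year, Late Night Hour window, near an Hour Pillar boundary" },
];
```

Tagging each case with its type also makes it easy to report accuracy per category rather than as one aggregate number.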
If the first type produces discrepancies, something is seriously wrong --- the core Manselyeok logic is broken. In actual verification, all apps agreed on these cases, and ours did too, confirming that the basic Manselyeok pipeline works correctly.
The second type produced interesting findings. For births on the day of the Start of Spring, apps that process Solar Term timing to the minute diverged from apps that only track it to the day. Our app, powered by KASI minute-level data, aligned with the precise professional apps.
How True Solar Time and Late Night Hour Settings Shift Results
The third type yielded the most meaningful discoveries. A test case of July 15, 1987, 23:30, Seoul produced different Hour Pillars across different apps. Some showed the Hour of the Rat (Ja-si); others showed the Hour of the Pig (Hae-si).
Analyzing the root cause made the answer clear. This case involves all three corrections. 1987 means DST is active (subtract 1 hour). 23:30 falls within the Late Night Hour zone. Seoul requires True Solar Time correction. Which subset of the three corrections each app applies determines the result.
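The arithmetic of the three corrections can be sketched directly. This is a simplified illustration, not the app's implementation: the True Solar Time offset here is the common longitude-based approximation (4 minutes per degree from the KST meridian at 135°E, giving roughly −32 minutes for Seoul), and the equation-of-time term is omitted.

```typescript
// Sketch: how the corrections stack for 1987-07-15 23:30 Seoul.
const KST_MERIDIAN = 135.0;     // KST reference meridian (°E)
const SEOUL_LONGITUDE = 126.98; // approximate longitude of Seoul (°E)

interface Corrections { dst: boolean; trueSolarTime: boolean; }

// Minutes since midnight after applying the selected corrections.
function correctedMinutes(h: number, m: number, c: Corrections): number {
  let t = h * 60 + m;
  if (c.dst) t -= 60; // DST was in effect in July 1987: subtract 1 hour
  if (c.trueSolarTime) {
    // 4 minutes per degree west of the reference meridian ≈ −32 min for Seoul
    t += Math.round((SEOUL_LONGITUDE - KST_MERIDIAN) * 4);
  }
  return t;
}

// No corrections: 23:30 → 1410 min, inside the Late Night Hour zone (Ja-si).
// Both corrections: 1410 − 60 − 32 = 1318 min → 21:58, the Hour of the Pig.
const raw  = correctedMinutes(23, 30, { dst: false, trueSolarTime: false });
const both = correctedMinutes(23, 30, { dst: true,  trueSolarTime: true  });
```

Depending on which corrections an app applies, the same clock time lands on different sides of the 23:00 boundary, which is exactly the Ja-si versus Hae-si split observed above.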
This discovery yielded two important lessons. First, "our result differs from another app" does not necessarily mean "we are wrong." If the settings are different, different results are normal. Second, users need to see "which settings are applied" transparently. If settings are hidden, users will be confused when they compare with another app.
As a result, I added a "correction settings display" to the app. The results screen explicitly shows "True Solar Time: Applied / DST: Auto-detected / Late Night Hour: Late Night Hour theory." When any setting is toggled, the result recalculates immediately so users can see the difference firsthand.
Writing Unit Tests with Vitest
After cross-verification confirmed overall accuracy, I wrote automated unit tests with Vitest. Manual cross-verification is useful for early-stage direction-setting, but repeating it every time code changes is impractical.
The test suite has four areas.
First, Manselyeok conversion tests. Given a specific Gregorian date, is the Lunar conversion correct? Is the Solar Term for that date accurate? Focused testing on dates with intercalary months, year-end/year-start, and Solar Term boundaries.
Second, pillar calculation tests. This is the most critical area. Given a specific birth date and time, do the Year, Month, Day, and Hour Pillars exactly match expected values? Expected values are manually verified against professional Manselyeok sites. At least 20 birth dates serve as test cases, with all four pillars checked for each.
Third, analysis result tests. Does the Five Elements ratio total 100%? Are Ten Gods classifications consistent with theoretical rules? Does the Favorable Element logic correctly classify Strong Self versus Weak Self? Spirit Indicator tests verify that expected indicators appear for known charts.
Fourth, Major Fate Cycle and Annual Fate tests. Is the starting age correct? Does the forward/reverse progression match gender and Year Stem polarity? Professional app results serve as expected values here as well.
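The simplest invariant from the third area (the Five Elements ratio summing to 100%) can be sketched self-contained. `fiveElementRatios` and the stem/branch-to-element table are illustrative stand-ins for the app's real analysis code; in the actual suite this check would be a Vitest assertion such as `expect(total).toBeCloseTo(100)`.

```typescript
// Stand-in element table: each Heavenly Stem and Earthly Branch maps to
// one of the Five Elements.
type Element = "Wood" | "Fire" | "Earth" | "Metal" | "Water";

const ELEMENT_OF: Record<string, Element> = {
  "甲": "Wood", "乙": "Wood", "丙": "Fire", "丁": "Fire",
  "戊": "Earth", "己": "Earth", "庚": "Metal", "辛": "Metal",
  "壬": "Water", "癸": "Water",
  "子": "Water", "丑": "Earth", "寅": "Wood", "卯": "Wood",
  "辰": "Earth", "巳": "Fire", "午": "Fire", "未": "Earth",
  "申": "Metal", "酉": "Metal", "戌": "Earth", "亥": "Water",
};

// Percentage of each element across the eight characters of a chart.
function fiveElementRatios(chars: string[]): Record<Element, number> {
  const out: Record<Element, number> = { Wood: 0, Fire: 0, Earth: 0, Metal: 0, Water: 0 };
  for (const ch of chars) out[ELEMENT_OF[ch]] += 1;
  for (const k of Object.keys(out) as Element[]) out[k] = (out[k] / chars.length) * 100;
  return out;
}

// Illustrative chart (not real verified data): the ratios must total 100.
const sample = ["庚", "午", "壬", "午", "甲", "子", "辛", "未"];
const ratios = fiveElementRatios(sample);
const total = Object.values(ratios).reduce((a, b) => a + b, 0);
```

Invariant tests like this are cheap to run across every test chart, so a single broken mapping in the element table fails loudly instead of skewing one analysis silently.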
Testing Strategy for "Multiple Right Answers"
The deepest deliberation went into test strategy for cases where results legitimately differ based on settings. A single expected value per test case is insufficient.
The adopted strategy: an expected-value matrix per settings combination. For one test case, define the expected result for every possible settings combination. For example, for a given birth: expected chart with "True Solar Time ON + Late Night Hour theory ON"; expected chart with "True Solar Time ON + Late Night Hour theory OFF"; expected chart with "True Solar Time OFF + Late Night Hour theory ON"; expected chart with "True Solar Time OFF + Late Night Hour theory OFF." Run the test under all four configurations.
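The matrix can be sketched as a map from settings combination to expected result. The expected Hour Pillar values below are illustrative placeholders (consistent with the 1987 case discussed earlier), not real verified data; in the actual suite this structure would drive a Vitest `test.each` over all four configurations.

```typescript
// One expected-value matrix for a single birth: 1987-07-15 23:30 Seoul.
interface Settings { trueSolarTime: boolean; lateNightHour: boolean; }

const key = (s: Settings) =>
  `${s.trueSolarTime ? "TST:on" : "TST:off"}|${s.lateNightHour ? "LNH:on" : "LNH:off"}`;

// Illustrative expected Hour Pillars — in practice each entry is verified
// by hand against professional Manselyeok sites.
const expectedHourPillar: Record<string, string> = {
  "TST:on|LNH:on":   "Hae-si",
  "TST:on|LNH:off":  "Hae-si",
  "TST:off|LNH:on":  "Ja-si",
  "TST:off|LNH:off": "Ja-si",
};

const allSettings: Settings[] = [
  { trueSolarTime: true,  lateNightHour: true  },
  { trueSolarTime: true,  lateNightHour: false },
  { trueSolarTime: false, lateNightHour: true  },
  { trueSolarTime: false, lateNightHour: false },
];

// Sanity check: every combination must have exactly one expected value,
// so adding a new setting forces the matrix to be extended.
const covered = allSettings.every((s) => key(s) in expectedHourPillar);
```

Encoding the combinations as keys also documents, in the test file itself, which settings each expected value assumes.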
The advantage: this verifies not just "correct result is produced" but also "changing setting X produces the expected change in output." It tests responsiveness to settings, not just correctness in isolation.
The downside: test case volume explodes. One birth date times 4 settings combinations times 4 pillars to verify equals 16 assertions. Across 20 birth dates, that is 320 assertions. But this cost is an investment in guaranteeing the app's core value: accuracy.
Using AI for Test Case Generation
Writing 320+ assertions by hand is impractical. AI collaboration was effective here. Asking Claude "Generate test cases meeting these conditions: Start of Spring boundary, Late Night Hour window, DST period, Hour Pillar boundary" produced appropriate birth dates and times for each scenario.
But a critical principle applies: never use AI-generated expected values directly in tests. The purpose is to verify whether AI's Saju calculations are correct --- using AI's own calculations as the expected values creates circular reasoning. AI suggests the "shape" of the test case (which dates and times to test). The expected values must come from an independent source (professional Manselyeok sites), manually verified.
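One way to make the principle visible in code review is to record the provenance of every expected value in the test-case record itself. The shape below is hypothetical (`expectedSource` and the source tags are invented for illustration), and the Day Pillar value is a placeholder, not verified data.

```typescript
// Hypothetical record shape: AI may propose the input (which date and time
// to test), but the expected value must carry an independent, human-verified
// source — never the AI's own calculation.
interface PillarTestCase {
  input: { birth: string; place: string };  // scenario shape may come from AI
  expected: { day: string };                // never AI-generated
  expectedSource: "manselyeok-site-A" | "manselyeok-site-B"; // hypothetical tags
  verifiedBy: string;                       // who checked it by hand
}

const example: PillarTestCase = {
  input: { birth: "1987-07-15T23:30", place: "Seoul" },
  expected: { day: "丁酉" },               // illustrative value only
  expectedSource: "manselyeok-site-A",
  verifiedBy: "author",
};
```

Because `expectedSource` is a required field, a test case with no independent source simply does not type-check.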
This principle should always be kept in mind when writing tests with AI. If AI writes the code and AI writes the tests, the same error can enter both. At minimum, test expected values must come from an independent source.
Manselyeok Accuracy Determines Everything
The lesson I felt most acutely through testing: Manselyeok accuracy determines the entire app's value. This is not an abstract principle --- it is a reality experienced through actual verification.
Early in development, there was an instance where the Day Pillar calculation had a subtle error. For a certain date range, the Day Pillar was off by one. This single error cascaded into Hour Pillar determination (the Hour Stem depends on the Day Stem), Ten Gods analysis (the Day Stem is the reference point), Favorable Element determination (Day Stem strength is the core question), and Major Fate calculation (based on the Month Pillar). One wrong pillar made every analysis on top of it meaningless.
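The Hour Pillar dependency in that cascade comes from the classical "five rats" rule: the stem of the Rat hour is determined by the Day Stem (甲/己 days start at 甲子, 乙/庚 at 丙子, and so on). A minimal sketch with 0-based indices (0 = 甲 … 9 = 癸 for stems, 0 = 子 … 11 = 亥 for branches) shows how one wrong Day Stem corrupts every Hour Stem of that day:

```typescript
// Classical rule: hour stem index = (day stem index mod 5) * 2 + hour
// branch index, all mod 10.
function hourStemIndex(dayStemIndex: number, hourBranchIndex: number): number {
  return ((dayStemIndex % 5) * 2 + hourBranchIndex) % 10;
}

// A 甲 day (index 0) at the Rat hour (branch 0) gives stem 0 = 甲 (甲子).
const correct = hourStemIndex(0, 0);
// Shift the day stem by one — a single Day Pillar error — and the same
// hour's stem becomes 丙 (index 2): the Hour Pillar is now wrong too.
const offByOne = hourStemIndex(1, 0);
```

The same reference-point role of the Day Stem is what propagates the error into Ten Gods analysis and Favorable Element determination.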
This experience cemented the conviction: "Test the Manselyeok first, and test it the most thoroughly." A 1--2% inaccuracy in the analysis engine's Five Elements ratio barely affects interpretation. But a wrong pillar itself renders everything downstream invalid.
Think of it as a building. The Manselyeok is the foundation. If the foundation is off by 1 centimeter, the 10th floor is off by dozens of centimeters. The analysis engine is the structural framework from floors 1 to 10. The AI interpretation is the interior finishing. No matter how beautiful the finishes are, they are meaningless in a tilted building.
The Value of Regression Testing
Another value of unit tests: regression prevention. Every time the analysis engine is improved, a new Spirit Indicator is added, or weighting is adjusted, you can automatically verify that previously correct results have not broken.
There was an actual case where fine-tuning the Favorable Element weighting caused one existing test case to fail. The weight change flipped a specific chart's Strong/Weak classification. Without that test failure, the regression would have gone unnoticed --- and the user who eventually reported "this result seems wrong" would have triggered a far longer debugging cycle.
Integrating tests into the CI (Continuous Integration) pipeline ensures the full test suite runs automatically on every code change. In a Saju app, the guarantee that "modifying code preserves existing accuracy" contributes enormously to both development velocity and psychological confidence.
What I Learned Along the Way
First, in a domain with multiple right answers, an expected-value matrix per settings combination is the effective testing strategy. Single expected-value tests cannot verify the differences between settings.
Second, "differing from another app" in cross-verification does not necessarily mean an error. Suspect settings differences first. If results still differ under identical settings, then suspect an error.
Third, AI-generated test expected values must always be verified against an independent external source. When AI writes both the code and the tests, the same error can propagate to both.
Fourth, Manselyeok accuracy is the foundation of the entire app. If the pillars are wrong, the analysis is meaningless, and if the analysis is meaningless, the interpretation is meaningless. Testing priority should be: Manselyeok > pillar calculation > analysis > interpretation.
Fifth, regression testing is a safety net that provides confidence to "improve while preserving existing accuracy." In an app where accuracy is the core value, this safety net is essential.
Next Up
Manselyeok, analysis engine, and testing are all complete. Now the structured analysis data needs to be transformed into "human-readable interpretations." The hybrid architecture combining rule-based analysis with AI interpretation, the dramatic quality improvement when Gungtong Bogam (궁통보감, a classic reference on seasonal Element needs) data was added to prompts, and the strategy of "never let AI calculate --- only let AI interpret" --- Part 11 will cover it all.