This article is a detailed record of how I went from logging haphazardly to building a complete logging standard. It covers hands-on experience with choosing log levels, MDC-based full-chain tracing, structured logging, and log collection and monitoring, plus a real case of pinpointing a production issue quickly through logs.

1. From System.out.println to a Professional Logging Setup

The chaotic early days:

// Code smell: System.out and the logger mixed at random
public void processOrder(Order order) {
    System.out.println("Start processing order: " + order.getId()); // 1. uses System.out
    log.info("Order amount: " + order.getAmount()); // 2. questionable level, string concatenation
    System.out.println("userId: " + order.getUserId()); // 3. key information scattered
    
    try {
        orderService.validate(order);
        System.out.println("Validation passed"); // 4. meaningless log line
        orderDao.save(order);
        log.info("Order saved"); // 5. no context at all
    } catch (Exception e) {
        e.printStackTrace(); // 6. wrong way to record an exception
    }
}

What's wrong here:

  • Inconsistent log levels make it impossible to filter for what matters
  • No request tracing, so logs from concurrent requests are interleaved
  • Key business information is missing, which makes troubleshooting painful
  • Exceptions go to stderr via e.printStackTrace(), so the stack trace never reaches the log files
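
For contrast, here is the same method after cleanup: a minimal sketch assuming Lombok's @Slf4j on the class, with one logging facade, parameterized messages, and the exception passed as the final argument so the stack trace lands in the log files:

public void processOrder(Order order) {
    // one INFO line carrying all the key identifiers instead of three scattered prints
    log.info("Processing order, orderId: {}, userId: {}, amount: {}",
            order.getId(), order.getUserId(), order.getAmount());
    try {
        orderService.validate(order);
        orderDao.save(order);
        log.info("Order saved, orderId: {}", order.getId());
    } catch (Exception e) {
        // the throwable as the last argument preserves the full stack trace
        log.error("Order processing failed, orderId: {}", order.getId(), e);
    }
}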

2. Establishing Log Level Conventions

A guide to choosing levels:

@Service
@Slf4j
public class OrderService {
    
    // ERROR: problems that need immediate attention
    public void processPayment(Order order) {
        try {
            paymentService.charge(order);
        } catch (PaymentFailedException e) {
            log.error("Payment failed, orderId: {}, userId: {}, amount: {}, reason: {}", 
                     order.getId(), order.getUserId(), order.getAmount(), e.getMessage(), e);
            // send an alert; human intervention required
            alertService.sendPaymentAlert(order, e);
        }
    }
    
    // WARN: abnormal, but no immediate action required
    public void checkInventory(Order order) {
        if (!inventoryService.hasStock(order.getSku(), order.getQuantity())) {
            log.warn("Insufficient stock, orderId: {}, SKU: {}, requested: {}, available: {}", 
                    order.getId(), order.getSku(), order.getQuantity(), 
                    inventoryService.getCurrentStock(order.getSku()));
            // may trigger replenishment, but no urgent alert needed
        }
    }
    
    // INFO: key business milestones
    public void createOrder(Order order) {
        log.info("Creating order, userId: {}, item count: {}", 
                order.getUserId(), order.getItems().size());
        
        // business logic...
        
        log.info("Order created, orderId: {}, status: {}", order.getId(), order.getStatus());
    }
    
    // DEBUG: diagnostic detail, normally disabled in production
    public void calculatePrice(Order order) {
        log.debug("Calculating order price, items: {}", order.getItems());
        // complex pricing logic computes originalPrice, discount and finalPrice
        log.debug("Price calculated, original: {}, discount: {}, final: {}", 
                 originalPrice, discount, finalPrice);
    }
    
    // TRACE: the most fine-grained tracing information
    public void validateOrder(Order order) {
        log.trace("Validating order, details: {}", JsonUtil.toJson(order));
        // detailed validation steps set addressValid and paymentMethodValid
        log.trace("Address validation passed: {}", addressValid);
        log.trace("Payment method validation passed: {}", paymentMethodValid);
    }
}
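
One caveat on the TRACE example above: parameterized logging defers message formatting, but the argument expressions themselves are evaluated eagerly, so JsonUtil.toJson(order) runs even when TRACE is disabled. A level guard avoids the wasted serialization (a minimal sketch reusing the document's own JsonUtil helper):

if (log.isTraceEnabled()) {
    // toJson only runs when TRACE is actually enabled
    log.trace("Validating order, details: {}", JsonUtil.toJson(order));
}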

3. Full-Chain Request Tracing with MDC

MDC configuration and filter:

/**
 * MDC filter: assigns a unique trace ID to every request.
 */
@Component
@Slf4j
public class MDCFilter implements Filter {
    
    private static final String TRACE_ID = "traceId";
    
    @Override
    public void doFilter(ServletRequest request, ServletResponse response, 
                        FilterChain chain) throws IOException, ServletException {
        HttpServletRequest httpRequest = (HttpServletRequest) request;
        
        // reuse the trace ID from the request header if present, otherwise generate one
        String traceId = httpRequest.getHeader("X-Trace-Id");
        if (StringUtils.isEmpty(traceId)) {
            traceId = generateTraceId();
        }
        
        // store it in the MDC
        MDC.put(TRACE_ID, traceId);
        
        // echo it in the response header so the frontend can correlate
        HttpServletResponse httpResponse = (HttpServletResponse) response;
        httpResponse.setHeader("X-Trace-Id", traceId);
        
        long startTime = System.currentTimeMillis();
        try {
            log.info("Request started: {} {}, client IP: {}, User-Agent: {}", 
                    httpRequest.getMethod(), httpRequest.getRequestURI(),
                    getClientIp(httpRequest), httpRequest.getHeader("User-Agent"));
            
            chain.doFilter(request, response);
            
        } finally {
            long duration = System.currentTimeMillis() - startTime;
            log.info("Request finished: {} {}, status: {}, took: {}ms", 
                    httpRequest.getMethod(), httpRequest.getRequestURI(),
                    ((HttpServletResponse) response).getStatus(), duration);
            
            // clear the MDC so stale context never leaks into pooled threads
            MDC.clear();
        }
    }
    
    private String generateTraceId() {
        return UUID.randomUUID().toString().replace("-", "").substring(0, 16);
    }
    
    private String getClientIp(HttpServletRequest request) {
        // resolve the real client IP behind proxies
        String ip = request.getHeader("X-Forwarded-For");
        if (StringUtils.isEmpty(ip) || "unknown".equalsIgnoreCase(ip)) {
            ip = request.getHeader("Proxy-Client-IP");
        }
        if (StringUtils.isEmpty(ip) || "unknown".equalsIgnoreCase(ip)) {
            ip = request.getHeader("WL-Proxy-Client-IP");
        }
        if (StringUtils.isEmpty(ip) || "unknown".equalsIgnoreCase(ip)) {
            ip = request.getRemoteAddr();
        }
        // X-Forwarded-For may hold a chain "client, proxy1, proxy2"; the first entry is the client
        if (ip != null && ip.contains(",")) {
            ip = ip.split(",")[0].trim();
        }
        return ip;
    }
}
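
Filter ordering matters: if another filter or interceptor logs before MDCFilter runs, those lines carry no trace ID. A minimal sketch of an explicit registration that puts the filter first, using Spring Boot's standard FilterRegistrationBean (the FilterConfig class name is illustrative; how this interacts with the @Component auto-registration depends on your setup):

@Configuration
public class FilterConfig {
    
    @Bean
    public FilterRegistrationBean<MDCFilter> mdcFilterRegistration(MDCFilter mdcFilter) {
        FilterRegistrationBean<MDCFilter> registration = new FilterRegistrationBean<>(mdcFilter);
        registration.addUrlPatterns("/*");
        // highest precedence so the trace ID exists before anything else logs
        registration.setOrder(Ordered.HIGHEST_PRECEDENCE);
        return registration;
    }
}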

// Propagating the MDC into async tasks
@Configuration
public class MDCTaskDecoratorConfig {
    
    @Bean
    public TaskDecorator mdcTaskDecorator() {
        return runnable -> {
            // capture the caller's MDC at submission time
            Map<String, String> context = MDC.getCopyOfContextMap();
            return () -> {
                try {
                    if (context != null) {
                        MDC.setContextMap(context); // restore it on the worker thread
                    }
                    runnable.run();
                } finally {
                    MDC.clear();
                }
            };
        };
    }
}

// Wiring the decorator into the async executor
@Bean("taskExecutor")
public ThreadPoolTaskExecutor taskExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    // ... other pool settings
    executor.setTaskDecorator(mdcTaskDecorator());
    return executor;
}
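
The chain also has to survive hops to downstream services. A minimal sketch of forwarding the trace ID on outgoing HTTP calls with a RestTemplate interceptor (the X-Trace-Id header name matches what MDCFilter reads on the receiving side):

@Bean
public RestTemplate restTemplate() {
    RestTemplate restTemplate = new RestTemplate();
    restTemplate.getInterceptors().add((request, body, execution) -> {
        // copy the current thread's trace ID onto the outgoing request
        String traceId = MDC.get("traceId");
        if (traceId != null) {
            request.getHeaders().set("X-Trace-Id", traceId);
        }
        return execution.execute(request, body);
    });
    return restTemplate;
}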

4. Tuning the Log Format

logback-spring.xml:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <!-- Environment properties: dev/test/prod -->
    <springProperty scope="context" name="APP_NAME" source="spring.application.name" defaultValue="myapp"/>
    <springProperty scope="context" name="PROFILE" source="spring.profiles.active" defaultValue="dev"/>
    
    <!-- Console output (meant for the dev profile; see the root logger below) -->
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder class="ch.qos.logback.classic.encoder.PatternLayoutEncoder">
            <!-- structured line format carrying the MDC trace ID -->
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} [%X{traceId}] - %msg%n</pattern>
        </encoder>
        <!-- Note: a LevelFilter with ACCEPT/DENY matches exactly one level and would
             suppress everything except DEBUG here; restricting the console to the
             dev environment is handled with <springProfile> in the root logger -->
    </appender>
    
    <!-- File output (all environments) -->
    <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>logs/${APP_NAME}.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <fileNamePattern>logs/${APP_NAME}.%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
            <maxFileSize>100MB</maxFileSize>
            <maxHistory>30</maxHistory>
            <totalSizeCap>5GB</totalSizeCap>
        </rollingPolicy>
        <encoder>
            <!-- JSON-ish output via pattern for ELK ingestion. Caveat: %msg and %ex
                 are not JSON-escaped, so quotes or newlines in a message break the
                 document; see the LogstashEncoder alternative after this config -->
            <pattern>
                {
                    "timestamp": "%d{yyyy-MM-dd HH:mm:ss.SSS}",
                    "level": "%-5level",
                    "thread": "%thread",
                    "logger": "%logger{36}",
                    "traceId": "%X{traceId}",
                    "message": "%msg",
                    "exception": "%ex"
                }%n
            </pattern>
        </encoder>
    </appender>
    
    <!-- Errors in a dedicated file -->
    <appender name="ERROR_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>logs/${APP_NAME}-error.log</file>
        <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
            <level>ERROR</level>
        </filter>
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <fileNamePattern>logs/${APP_NAME}-error.%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
            <maxFileSize>50MB</maxFileSize>
            <maxHistory>60</maxHistory>
        </rollingPolicy>
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} [%X{traceId}] - %msg%n</pattern>
        </encoder>
    </appender>
    
    <!-- Async appender keeps file I/O off business threads -->
    <appender name="ASYNC_FILE" class="ch.qos.logback.classic.AsyncAppender">
        <discardingThreshold>0</discardingThreshold><!-- 0 = never drop events, even when the queue fills -->
        <queueSize>1024</queueSize>
        <appender-ref ref="FILE" />
    </appender>
    
    <!-- Root logger: console only in dev, files in every environment -->
    <springProfile name="dev">
        <root level="INFO">
            <appender-ref ref="CONSOLE" />
            <appender-ref ref="ASYNC_FILE" />
            <appender-ref ref="ERROR_FILE" />
        </root>
    </springProfile>
    <springProfile name="!dev">
        <root level="INFO">
            <appender-ref ref="ASYNC_FILE" />
            <appender-ref ref="ERROR_FILE" />
        </root>
    </springProfile>
    
    <!-- Per-package levels -->
    <logger name="com.example.service" level="DEBUG" />
    <logger name="org.springframework" level="WARN" />
    <logger name="org.hibernate" level="WARN" />
    <logger name="org.apache.kafka" level="WARN" />
</configuration>
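
As flagged in the config, building JSON with a pattern breaks as soon as a message contains a quote or newline. If adding the logstash-logback-encoder dependency (net.logstash.logback:logstash-logback-encoder) is an option, its LogstashEncoder emits properly escaped JSON, MDC fields included, with no hand-written pattern. A minimal sketch of such an appender:

<appender name="JSON_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>logs/${APP_NAME}-json.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
        <fileNamePattern>logs/${APP_NAME}-json.%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
        <maxFileSize>100MB</maxFileSize>
        <maxHistory>30</maxHistory>
    </rollingPolicy>
    <!-- escapes quotes/newlines and includes MDC entries such as traceId -->
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
</appender>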

5. Log Monitoring and Alerting

A custom appender for monitoring:

/**
 * Monitoring appender: counts error frequency per logger and raises alerts.
 */
public class MonitoringLogAppender extends AppenderBase<ILoggingEvent> {
    
    // Logback instantiates appenders outside the Spring context, so an
    // @Autowired field would stay null; the AlarmService is handed in
    // statically once the context is up (see the initializer below)
    private static volatile AlarmService alarmService;
    
    public static void setAlarmService(AlarmService service) {
        alarmService = service;
    }
    
    private final Map<String, AtomicInteger> errorCounts = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
    
    @Override
    public void start() {
        super.start();
        // check thresholds and reset the counters once a minute
        scheduler.scheduleAtFixedRate(this::checkAndAlert, 1, 1, TimeUnit.MINUTES);
    }
    
    @Override
    public void stop() {
        scheduler.shutdown();
        super.stop();
    }
    
    @Override
    protected void append(ILoggingEvent event) {
        if (event.getLevel().isGreaterOrEqual(Level.ERROR)) {
            String loggerName = event.getLoggerName();
            errorCounts.computeIfAbsent(loggerName, k -> new AtomicInteger()).incrementAndGet();
            
            // critical errors are alerted immediately
            if (isCriticalError(event)) {
                sendImmediateAlert(event);
            }
        }
    }
    
    private void checkAndAlert() {
        if (alarmService == null) {
            return; // Spring context not ready yet
        }
        // the map is bounded by the number of logger names, so entries can stay
        errorCounts.forEach((logger, counter) -> {
            int count = counter.getAndSet(0); // reset for the next one-minute window
            if (count > 10) { // more than 10 errors per minute
                alarmService.sendAlert("HIGH_ERROR_RATE", 
                    String.format("Logger[%s] error rate too high: %d/min", logger, count));
            }
        });
    }
    
    private boolean isCriticalError(ILoggingEvent event) {
        String message = event.getFormattedMessage();
        return message != null
            && (message.contains("OutOfMemory") 
                || message.contains("Connection refused")
                || message.contains("Timeout"));
    }
    
    private void sendImmediateAlert(ILoggingEvent event) {
        if (alarmService == null) {
            return;
        }
        String message = String.format("Critical error alert: [%s] %s", 
            event.getLoggerName(), event.getFormattedMessage());
        alarmService.sendUrgentAlert("CRITICAL_ERROR", message);
    }
}
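
Because Logback builds the appender before (and outside) the Spring context, the AlarmService must be handed over after startup. A minimal sketch of the wiring, using the static setter above (the initializer class name is illustrative):

@Component
public class MonitoringAppenderInitializer {
    
    public MonitoringAppenderInitializer(AlarmService alarmService) {
        // runs once the Spring context creates this bean
        MonitoringLogAppender.setAlarmService(alarmService);
    }
}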

6. Case Study: Pinpointing a Production Issue Fast

Scenario: a user reports that an order's status was not updated after payment. TraceID: abc123

Troubleshooting steps:

# 1. Filter the logs by trace ID
grep "abc123" application.log

# Output:
# 2024-01-15 14:30:12.123 [http-nio-8080-exec-1] INFO  c.e.OrderController [abc123] - Payment callback received
# 2024-01-15 14:30:12.234 [http-nio-8080-exec-1] INFO  c.e.PaymentService [abc123] - Payment signature verified
# 2024-01-15 14:30:12.345 [http-nio-8080-exec-1] INFO  c.e.OrderService [abc123] - Updating order status, orderId: 1001
# 2024-01-15 14:30:12.456 [http-nio-8080-exec-1] ERROR c.e.OrderService [abc123] - Failed to update order status, orderId: 1001
# 2024-01-15 14:30:12.567 [http-nio-8080-exec-1] INFO  c.e.OrderController [abc123] - Payment callback handled

# 2. The ERROR line has no stack trace, so check the dedicated error log.
#    Stack trace lines don't carry the trace ID, so include trailing context:
grep -A 20 "abc123" application-error.log

# Detailed exception:
# java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
# at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:127)

# 3. Correlate with the database logs to analyze the lock contention

7. Best Practices Summary

  1. Standardize levels: ERROR > WARN > INFO > DEBUG > TRACE
  2. Trace every request: each request carries a unique TraceID
  3. Structure the logs: JSON output is far easier to analyze
  4. Log asynchronously: keep I/O from blocking business threads
  5. Monitor and alert: detect abnormal error patterns automatically
  6. Rotate the files: cap file size and retention window
  7. Mask sensitive data: never let passwords or tokens leak (see the sketch below)
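
For item 7, masking can be centralized in Logback instead of scattered across call sites. A minimal sketch of a custom converter (the class name and regex patterns are illustrative; ClassicConverter is Logback's extension point), registered in logback-spring.xml with a <conversionRule> entry mapping %maskedMsg to the class, then used in place of %msg in the patterns:

import ch.qos.logback.classic.pattern.ClassicConverter;
import ch.qos.logback.classic.spi.ILoggingEvent;
import java.util.regex.Pattern;

public class MaskingConverter extends ClassicConverter {
    
    // naive demonstration patterns; tune them to your own payload formats
    private static final Pattern SECRET = Pattern.compile("(password|token)=\\S+");
    private static final Pattern PHONE = Pattern.compile("(1\\d{2})\\d{4}(\\d{4})");
    
    @Override
    public String convert(ILoggingEvent event) {
        String msg = event.getFormattedMessage();
        if (msg == null) {
            return "";
        }
        msg = SECRET.matcher(msg).replaceAll("$1=***");  // password=*** / token=***
        msg = PHONE.matcher(msg).replaceAll("$1****$2"); // keep first 3 and last 4 digits
        return msg;
    }
}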

Wrapping up:

Two years took me from a beginner who thought "logging means print" to an engineer who can design a complete logging stack. A good logging system is like an aircraft's black box: unremarkable day to day, but it saves you when it counts. Build a proper logging standard once, and it pays off for the life of the system.
